Unfortunately, the ACOPOST project seems to be dead as of March 2007: its site has not been updated since August 2002, when the author stated that he would not be able to keep working on it and asked for maintainers. I tried to contact the author and all the members of the SourceForge project, but unfortunately every message either bounced or went unanswered. Thus, consider my modifications an unauthorised fork.
If you are or know one of the maintainers of ACOPOST, please drop me an email.
Regarding my upgrades: in March 2007 I released version 1.8.5-tresoldi, which compiles silently with gcc 4.1 and the -Wall option (version 1.8.4 was probably written for gcc 2.95 and issued some warnings).
The latest version is 1.8.6-tresoldi, which compiles silently with gcc 4.1 and both the -Wall and -ansi options. It also compiles (though with some warnings) with -Wall -ansi -pedantic. You can download my unauthorised version 1.8.6-tresoldi here.
The following information is taken and adapted from ACOPOST's home page.
Here is an English example of a tagged sentence, taken from the Wall Street Journal of the Penn Treebank:
Measures NNS of IN manufacturing VBG activity NN fell VBD more RBR than IN the DT overall JJ measures NNS . .

ACOPOST is a set of freely available POS taggers that Ingo Schröder modelled after well-known techniques. The programs are written in C and run under various UNIX flavours and under Windows when using a suitable compiler such as MinGW. ACOPOST currently consists of four taggers, each based on a different framework: T3, a trigram tagger based on Hidden Markov Models; MET, a maximum entropy tagger; TBT, a transformation-based error-driven tagger; and ET, an example-based tagger.
For the various trainings you need a cooked file, i.e. a manually tagged corpus. The cooked file format used by ACOPOST requires one sentence per line, with the tokens' text and tags separated by white space. For example, the first two lines of the NILC corpus of Brazilian Portuguese would be:
$ head -2 nilc.tt
Antes LPREP de LPREP iniciarmos VTD o ART estudo N da PREP+ART origem N da PREP+ART vida N , , é VLIG necessário ADJ conhecer VTD alguns ADJ caracteres N que PR distinguem VBI os ART seres N vivos ADJ dos PREP+ART seres N brutos ADJ . .
Dentre PREP+PREP esses PD caracteres N , , os ART mais ADV importantes ADJ são VLIG : : presença N de PREP ácido N nucléico ADJ , , reprodução N , , evolução N , , metabolismo N , , organização N celular ADJ , , movimento N e CONJCOORD crescimento N . .

The first necessary step is to generate a file of lexical counts with cooked2lex.pl:
$ cooked2lex.pl < nilc.tt > nilc.lex
4415 sentences
82 tags
17185 types
104963 tokens
 1  15562  90.556%  43527  41.469%
 2   1232   7.169%  22292  21.238%
 3    239   1.391%  11135  10.609%
 4    103   0.599%  14162  13.492%
 5     35   0.204%   4434   4.224%
 6      6   0.035%   1081   1.030%
 7      4   0.023%   2928   2.790%
 8      2   0.012%   3370   3.211%
 9      1   0.006%    102   0.097%
10      0   0.000%      0   0.000%
11      0   0.000%      0   0.000%
12      0   0.000%      0   0.000%
13      1   0.006%   1932   1.841%
Mean ambiguity A=2.670560
Entropy H(p)=4.280438

Inspecting the last lines (as the first ones will usually include only punctuation):
$ tail nilc.lex
últimas ADJ 9
último ADJ 20 ORD 1
últimos ADJ 13
úmida ADJ 2
úmido ADJ 1
única ADJ 14
únicas ADJ 4
único ADJ 13
útero N 4
útil ADJ 2

For training the HMM Tagger (T3), we use:
$ cooked2ngram.pl < nilc.tt > nilc.ngrams

For training the Transformation-based Tagger (TBT), we use:
$ tbt -l nilc.lex -m 4 -n 1 -o 2 -t nilc.templates nilc.rules < nilc.tt
Transformation-based Tagger (c) Ingo Schröder, ingo@nats.informatik.uni-hamburg.de
[      4 ms::1] "nilc.rules" seems to be a new file, good
[   2284 ms::1] initially generated 118328 rules
rule NP rare[0] tag[0]=N cap[0]=some 992 - 99 == 893
[ 109894 ms::1] best rule is NP rare[0] tag[0]=N cap[0]=some delta 893 good 992 - bad 99 == 893
rule ADJ tag[-1]=N tag[0]=N 1053 - 173 == 880
[ 231050 ms::1] best rule is ADJ tag[-1]=N tag[0]=N delta 880 good 1053 - bad 173 == 880
rule VTD tag[0]=N tag[1]=ART 568 - 47 == 521
[ 349809 ms::1] best rule is VTD tag[0]=N tag[1]=ART delta 521 good 568 - bad 47 == 521
(...)

Some notes on training a transformation-based model:
Training can take a long time, so it is convenient to run tbt in the background, redirecting its progress messages to a log file:

$ tbt -l nilc.lex -m 4 -n 1 -o 2 -t nilc.templates nilc.rules < nilc.tt 2> nilc.log &
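Incidentally, the mean ambiguity figure that cooked2lex.pl reports can be recomputed directly from the lexicon file itself, which is a handy sanity check when preparing your own corpora. This is only a sketch, assuming the lexicon format shown above (one type per line, followed by tag/count pairs):

```shell
# Recompute the mean ambiguity from a lexicon file in the format
# "word TAG1 count1 [TAG2 count2 ...]", one type per line:
# A = (sum over types of token_count * number_of_tags) / total tokens.
awk '{ tags = (NF - 1) / 2
       n = 0
       for (i = 3; i <= NF; i += 2) n += $i
       tokens += n
       weighted += n * tags }
     END { printf "Mean ambiguity A=%f\n", weighted / tokens }' nilc.lex
```

On the NILC lexicon this should agree with the value printed by cooked2lex.pl.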
For training the Example-based Tagger (ET), we generate decision trees for known and for unknown words:

$ cooked2wtree.pl -a 3 nilc.known.etf < nilc.tt > nilc.known.tree
No. of features: 5 (from "nilc.known.etf")
No. of sentences: 4415
No of words: 104963
Most frequent words: 7739 "," 4414 "." 4075 "de" 2852 "a" 2398 "o"
Word at rank 100: "outros" (73 occurances)
Frequent word threshold: 73
Entropy: 4.280438
Features:
0 TAG[-2] H==3.432 IG==0.849 S==3.864 GR==0.220
1 TAG[-1] H==2.816 IG==1.464 S==3.829 GR==0.382
2 WORD[0] H==1.341 IG==2.940 S==3.665 GR==0.802
3 CLASS[0] H==0.302 IG==3.979 S==5.364 GR==0.742
4 CLASS[1] H==2.511 IG==1.770 S==5.094 GR==0.347
Permutation: 2 3 1 4 0

$ cooked2wtree.pl -b 2 -e nilc.exclude_tags nilc.unknown.etf < nilc.tt > nilc.unknown.tree
No. of features: 10 (from "nilc.unknown.etf")
No. of sentences: 4415
No of words: 104963
Most frequent words: 7739 "," 4414 "." 4075 "de" 2852 "a" 2398 "o"
Word at rank 100: "outros" (73 occurances)
Frequent word threshold: 73
Entropy: 4.280438
Features:
0 TAG[-1] H==0.314 IG==3.966 S==1.027 GR==3.861
1 CAP[0] H==0.384 IG==3.897 S==0.498 GR==7.818
2 NUMBER[0] H==0.418 IG==3.862 S==0.429 GR==9.009
3 HYPHEN[0] H==0.399 IG==3.881 S==0.452 GR==8.579
4 LETTER[0,1] H==0.349 IG==3.932 S==1.107 GR==3.553
5 LETTER[0,-4] H==0.381 IG==3.900 S==1.058 GR==3.686
6 LETTER[0,-3] H==0.357 IG==3.924 S==1.051 GR==3.732
7 LETTER[0,-2] H==0.358 IG==3.922 S==1.008 GR==3.891
8 LETTER[0,-1] H==0.352 IG==3.928 S==0.868 GR==4.528
9 CLASS[1] H==0.326 IG==3.954 S==1.210 GR==3.267
Permutation: 2 3 1 8 7 0 6 5 4 9
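If your own corpus is not yet in the cooked format, the conversion is usually trivial. As a hypothetical example (the corpus.vertical filename is invented), the following awk one-liner joins a vertical file, with one "word tag" pair per line and a blank line between sentences, into one sentence per line:

```shell
# Join a vertical corpus (one "word tag" pair per line, blank line
# between sentences) into the cooked format: one sentence per line,
# tokens and tags separated by single spaces.
awk 'NF == 0 { if (s != "") print s; s = "" }
     NF == 2 { s = s (s == "" ? "" : " ") $1 " " $2 }
     END     { if (s != "") print s }' corpus.vertical > corpus.tt
```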
Once the models are trained, a new corpus can be tagged with any of the taggers, or with combinations of them (the last command runs TBT on top of T3's output):

$ t3 nilc.ngrams nilc.lex < corpus > corpus.tagged_t3
$ tbt -l nilc.lex -r nilc.rules < corpus > corpus.tagged_tbt
$ et nilc.known.tree nilc.unknown.tree nilc.lex < corpus > corpus.tagged_et
$ tbt -l nilc.lex nilc.rules < corpus.tagged_t3 > corpus.tagged_t3_tbt
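None of these commands reports accuracy by itself; if a manually tagged version of the same corpus is available, the agreement can be estimated by comparing the two cooked files position by position. This is only a rough sketch that assumes identical tokenisation in both files; gold.tt is a hypothetical gold-standard file:

```shell
# Compare the tags (fields 2, 4, 6, ...) of two parallel cooked files
# and print the fraction of positions where they agree.
awk 'NR == FNR { for (i = 2; i <= NF; i += 2) gold[++g] = $i; next }
     { for (i = 2; i <= NF; i += 2) { t++; if ($i == gold[t]) ok++ } }
     END { printf "accuracy: %.3f\n", ok / t }' gold.tt corpus.tagged_t3
```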
You can download them here.