Unfortunately, the ACOPOST project seems to be dead as of March 2007: its site has not been updated since August 2002, when the author stated that he would not be able to keep working on it and asked for maintainers. I tried to contact the author and all the members of the SourceForge project, but unfortunately every message either bounced or went unanswered. Thus, consider my modifications an unauthorised fork.
If you are or know one of the maintainers of ACOPOST, please drop me an email.
Regarding my upgrades: in March 2007 I released version 1.8.5-tresoldi, which compiles silently with gcc 4.1 and the -Wall option (version 1.8.4 was probably written for gcc 2.95 and issued some warnings).
The latest version is 1.8.6-tresoldi, which compiles silently with gcc 4.1 and both the -Wall and -ansi options. It also compiles (though with some warnings) with -Wall -ansi -pedantic. You can download my unauthorised version 1.8.6-tresoldi here.
The following information is taken and adapted from ACOPOST's home page.
Here is an English example of a tagged sentence, taken from the Wall Street Journal of the Penn Treebank:
Measures NNS of IN manufacturing VBG activity NN fell VBD more RBR than IN the DT overall JJ measures NNS . .

ACOPOST is a set of freely available POS taggers that Ingo Schröder modelled after well-known techniques. The programs are written in C and run under various UNIX flavours and under Windows when using a suitable compiler such as MinGW. ACOPOST currently consists of four taggers, each based on a different framework: T3, a trigram tagger based on Hidden Markov Models; MET, a maximum entropy tagger; TBT, a transformation-based error-driven tagger; and ET, an example-based tagger.
For the various trainings you need a cooked file, i.e. a manually tagged corpus. The cooked file format used by ACOPOST requires one sentence per line, with the tokens' text and tags separated by white space. For example, the first two lines of the NILC corpus of Brazilian Portuguese would be:
$ head -2 nilc.tt
Antes LPREP de LPREP iniciarmos VTD o ART estudo N da PREP+ART origem N da PREP+ART vida N , , é VLIG necessário ADJ conhecer VTD alguns ADJ caracteres N que PR distinguem VBI os ART seres N vivos ADJ dos PREP+ART seres N brutos ADJ . .
Dentre PREP+PREP esses PD caracteres N , , os ART mais ADV importantes ADJ são VLIG : : presença N de PREP ácido N nucléico ADJ , , reprodução N , , evolução N , , metabolismo N , , organização N celular ADJ , , movimento N e CONJCOORD crescimento N . .

The first necessary step is to generate a file of lexical counts with cooked2lex.pl:
$ cooked2lex.pl < nilc.tt > nilc.lex
4415 sentences
82 tags
17185 types
104963 tokens
 1  15562  90.556%  43527  41.469%
 2   1232   7.169%  22292  21.238%
 3    239   1.391%  11135  10.609%
 4    103   0.599%  14162  13.492%
 5     35   0.204%   4434   4.224%
 6      6   0.035%   1081   1.030%
 7      4   0.023%   2928   2.790%
 8      2   0.012%   3370   3.211%
 9      1   0.006%    102   0.097%
10      0   0.000%      0   0.000%
11      0   0.000%      0   0.000%
12      0   0.000%      0   0.000%
13      1   0.006%   1932   1.841%
Mean ambiguity A=2.670560
Entropy H(p)=4.280438

Inspecting the last lines (as the first ones will usually include only punctuation):
$ tail nilc.lex
últimas ADJ 9
último ADJ 20 ORD 1
últimos ADJ 13
úmida ADJ 2
úmido ADJ 1
única ADJ 14
únicas ADJ 4
único ADJ 13
útero N 4
útil ADJ 2

For training the HMM Tagger (T3), we use:
$ cooked2ngram.pl < nilc.tt > nilc.ngrams

For training the Transformation-based Tagger (TBT), we use:
$ tbt -l nilc.lex -m 4 -n 1 -o 2 -t nilc.templates nilc.rules < nilc.tt
Transformation-based Tagger (c) Ingo Schröder, ingo@nats.informatik.uni-hamburg.de
[      4 ms::1] "nilc.rules" seems to be a new file, good
[   2284 ms::1] initially generated 118328 rules
rule NP rare[0] tag[0]=N cap[0]=some 992 - 99 == 893
[ 109894 ms::1] best rule is NP rare[0] tag[0]=N cap[0]=some delta 893 good 992 - bad 99 == 893
rule ADJ tag[-1]=N tag[0]=N 1053 - 173 == 880
[ 231050 ms::1] best rule is ADJ tag[-1]=N tag[0]=N delta 880 good 1053 - bad 173 == 880
rule VTD tag[0]=N tag[1]=ART 568 - 47 == 521
[ 349809 ms::1] best rule is VTD tag[0]=N tag[1]=ART delta 521 good 568 - bad 47 == 521
(...)

Some notes on training a transformation-based model:
Training can take a long time, so it is convenient to run tbt in the background, redirecting its progress messages to a log file:

$ tbt -l nilc.lex -m 4 -n 1 -o 2 -t nilc.templates nilc.rules < nilc.tt 2> nilc.log &
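Incidentally, the mean ambiguity figure that cooked2lex.pl reports can be recomputed directly from the lexicon file itself, which is a handy sanity check when preparing your own corpora. This is only a sketch, assuming the lexicon format shown above (one type per line, followed by tag/count pairs):

```shell
# Recompute the mean ambiguity from a lexicon file in the format
# "word TAG1 count1 [TAG2 count2 ...]", one type per line:
# A = (sum over types of token_count * number_of_tags) / total tokens.
awk '{ tags = (NF - 1) / 2
       n = 0
       for (i = 3; i <= NF; i += 2) n += $i
       tokens += n
       weighted += n * tags }
     END { printf "Mean ambiguity A=%f\n", weighted / tokens }' nilc.lex
```

On the NILC lexicon this should agree with the value printed by cooked2lex.pl.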
For training the Example-based Tagger (ET), we generate decision trees for known and for unknown words:

$ cooked2wtree.pl -a 3 nilc.known.etf < nilc.tt > nilc.known.tree
No. of features: 5 (from "nilc.known.etf")
No. of sentences: 4415
No of words: 104963
Most frequent words: 7739 "," 4414 "." 4075 "de" 2852 "a" 2398 "o"
Word at rank 100: "outros" (73 occurances)
Frequent word threshold: 73
Entropy: 4.280438
Features:
0 TAG[-2] H==3.432 IG==0.849 S==3.864 GR==0.220
1 TAG[-1] H==2.816 IG==1.464 S==3.829 GR==0.382
2 WORD[0] H==1.341 IG==2.940 S==3.665 GR==0.802
3 CLASS[0] H==0.302 IG==3.979 S==5.364 GR==0.742
4 CLASS[1] H==2.511 IG==1.770 S==5.094 GR==0.347
Permutation: 2 3 1 4 0

$ cooked2wtree.pl -b 2 -e nilc.exclude_tags nilc.unknown.etf < nilc.tt > nilc.unknown.tree
No. of features: 10 (from "nilc.unknown.etf")
No. of sentences: 4415
No of words: 104963
Most frequent words: 7739 "," 4414 "." 4075 "de" 2852 "a" 2398 "o"
Word at rank 100: "outros" (73 occurances)
Frequent word threshold: 73
Entropy: 4.280438
Features:
0 TAG[-1] H==0.314 IG==3.966 S==1.027 GR==3.861
1 CAP[0] H==0.384 IG==3.897 S==0.498 GR==7.818
2 NUMBER[0] H==0.418 IG==3.862 S==0.429 GR==9.009
3 HYPHEN[0] H==0.399 IG==3.881 S==0.452 GR==8.579
4 LETTER[0,1] H==0.349 IG==3.932 S==1.107 GR==3.553
5 LETTER[0,-4] H==0.381 IG==3.900 S==1.058 GR==3.686
6 LETTER[0,-3] H==0.357 IG==3.924 S==1.051 GR==3.732
7 LETTER[0,-2] H==0.358 IG==3.922 S==1.008 GR==3.891
8 LETTER[0,-1] H==0.352 IG==3.928 S==0.868 GR==4.528
9 CLASS[1] H==0.326 IG==3.954 S==1.210 GR==3.267
Permutation: 2 3 1 8 7 0 6 5 4 9
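If your own corpus is not yet in the cooked format, the conversion is usually trivial. As a hypothetical example (the corpus.vertical filename is invented), the following awk one-liner joins a vertical file, with one "word tag" pair per line and a blank line between sentences, into one sentence per line:

```shell
# Join a vertical corpus (one "word tag" pair per line, blank line
# between sentences) into the cooked format: one sentence per line,
# tokens and tags separated by single spaces.
awk 'NF == 0 { if (s != "") print s; s = "" }
     NF == 2 { s = s (s == "" ? "" : " ") $1 " " $2 }
     END     { if (s != "") print s }' corpus.vertical > corpus.tt
```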
Once the models are trained, a new corpus can be tagged with any of the taggers, or with combinations of them (the last command runs TBT on top of T3's output):

$ t3 nilc.ngrams nilc.lex < corpus > corpus.tagged_t3
$ tbt -l nilc.lex -r nilc.rules < corpus > corpus.tagged_tbt
$ et nilc.known.tree nilc.unknown.tree nilc.lex < corpus > corpus.tagged_et
$ tbt -l nilc.lex nilc.rules < corpus.tagged_t3 > corpus.tagged_t3_tbt
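None of these commands reports accuracy by itself; if a manually tagged version of the same corpus is available, the agreement can be estimated by comparing the two cooked files position by position. This is only a rough sketch that assumes identical tokenisation in both files; gold.tt is a hypothetical gold-standard file:

```shell
# Compare the tags (fields 2, 4, 6, ...) of two parallel cooked files
# and print the fraction of positions where they agree.
awk 'NR == FNR { for (i = 2; i <= NF; i += 2) gold[++g] = $i; next }
     { for (i = 2; i <= NF; i += 2) { t++; if ($i == gold[t]) ok++ } }
     END { printf "accuracy: %.3f\n", ok / t }' gold.tt corpus.tagged_t3
```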
You can download them here.