Welcome to the page of the Hermes Computational Linguistics Project. This page is a repository for computational linguistics (CL) software that I have developed and for other material that I wanted to modify or mirror.


ACOPOST upgrade

My development work on the (probably) abandoned ACOPOST (A COllection of Part Of Speech Taggers), by Ingo Schröder. Check my page for further information. (Last update: 2007/03/23)

NLTK

I occasionally work with the amazing Natural Language ToolKit (NLTK); the AffixTagger was first posted here as a module.

Corpus and Language Models

In December 2009, I decided to start working on a project that had been in my head for some time: building complete and free corpora and language models, particularly for smaller languages, using data from Wikipedia. The number of uses for these data would be enormous, and they might end up helping Wikipedia itself with machine translation (think of Moses, the best decoder for statistical machine translation around).

In the following table you can download the ones that are already available; all language models currently include 1-, 2- and 3-grams. All the files, as derivatives of Wikipedia, are licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License, which, in short, allows you to share (copy, distribute and transmit) and remix (adapt) the work for all your needs, provided that you attribute the work and share it under the same license. I would love to know if you use this data for any kind of project or research, so please let me know.
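To make the 1-, 2- and 3-gram terminology concrete, here is a minimal Python sketch of how n-grams of those orders can be counted from raw text. It is only an illustration, not the IRSTLM pipeline that produced the downloads: the function names `ngrams` and `count_ngrams` are hypothetical, and the naive whitespace tokenization is an assumption that a real corpus would likely need to refine.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(sentences, max_order=3):
    """Count n-grams of order 1..max_order over a list of sentences.

    Tokenization is a plain whitespace split (an assumption made
    for this sketch, not the method used for the released models).
    """
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for sentence in sentences:
        tokens = sentence.split()
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    # Toy input; a real run would read the downloaded corpus instead.
    counts = count_ngrams(["a b a b c"])
    for order, counter in counts.items():
        print(order, dict(counter))
```

A language model toolkit turns counts like these into smoothed probabilities; the sketch stops at the counting step.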

I am planning to write a better page, explaining advantages, shortcomings and alternatives to using Wikipedia as a source for this.

The language models are in the iARPA format and were compiled with IRSTLM.

Language code (ISO-639) | Language name (local) | Language name (English) | Compilation date | Wikipedia date & link | Corpus (size) | Language model (size)
lmo | Lumbard | Lombard | 2009/12/07 | 2008/06 | corpus (818.8 KB, gz) | lm (3.7 MB, gz)
nap | Napulitano | Neapolitan | 2009/12/06 | 2008/06 | corpus (398.9 KB, gz) | lm (2.1 MB, gz)

Updates

2009/12/07 - New corpora and language models from Wikipedia available for download.
2007/03/23 - New, much simpler page.
2007/03/23 - Added my ACOPOST update.


For contact: tresoldi at gmail dot com