Welcome to the page of the
Hermes Computational Linguistics Project. This page serves as a repository
for CL-related software that I have developed and for other material that I
wanted to modify or mirror.
ACOPOST upgrade
This is my development work on the (probably) abandoned ACOPOST,
A COllection of
Part Of Speech Taggers, by Ingo Schröder. Check my
page for further information. (Last update: 2007/03/23)
NLTK
I occasionally work with the amazing Natural
Language ToolKit (NLTK); the AffixTagger started out as a module posted
here.
Corpus and Language Models
In December 2009, I decided to start working on a project that had been in
my head for some time: building complete and free corpora and language models,
particularly for smaller languages, using data from Wikipedia. The
number of uses for these data would be enormous, and they might even end up
helping Wikipedia itself with machine translation (think of
Moses, the best decoder for
statistical machine translation around).
The ones that are already available can be downloaded from the following
table; all language models currently include 1-, 2- and 3-grams.
All the files, as derivatives of Wikipedia, are licensed
under the Creative
Commons Attribution-Share Alike 3.0 Unported License, which, in short,
allows you to share (copy, distribute and transmit) and remix (adapt) the
work for all your needs, provided that you attribute the work and share
it under the same license. However, I would love to know if you
use these data for any kind of project or research, so please let me know.
I am planning to write a better page explaining the advantages and
shortcomings of using Wikipedia as a source for this, as well as the
alternatives to it.
The language models are in the iARPA format and were compiled with
IRSTLM.
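As a rough illustration of how one of these files could be inspected, the sketch below reads only the header of a gzip-compressed language model and reports how many n-grams it declares for each order. It assumes that the iARPA file carries an ARPA-style `\data\` section with `ngram N=count` lines before the first n-gram block; that layout is an assumption here and should be checked against the IRSTLM documentation. The function name `read_ngram_counts` is hypothetical and not part of IRSTLM.

```python
import gzip
import re


def read_ngram_counts(path):
    """Return {order: count} parsed from an ARPA-style \\data\\ header.

    Assumption: the header lists lines such as "ngram 1=1234" after a
    "\\data\\" marker and before the first "\\1-grams:" section.
    """
    counts = {}
    in_header = False
    pattern = re.compile(r"^ngram\s+(\d+)\s*=\s*(\d+)")
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header:
                match = pattern.match(line)
                if match:
                    counts[int(match.group(1))] = int(match.group(2))
                elif line.startswith("\\"):
                    break  # first n-gram section reached, header is over
    return counts
```

Calling `read_ngram_counts("lmo.lm.gz")` (a hypothetical local file name) would return a dictionary keyed by n-gram order; for the models offered here one would expect the keys 1, 2 and 3.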
| Language code (ISO-639) | Language name (local) | Language name (English) | Compilation date | Wikipedia date & link | Corpus (size) | Language model (size) |
| --- | --- | --- | --- | --- | --- | --- |
| lmo | Lumbard | Lombard | 2009/12/07 | 2008/06 | corpus (818.8 KB, gz) | lm (3.7 MB, gz) |
| nap | Napulitano | Neapolitan | 2009/12/06 | 2008/06 | corpus (398.9 KB, gz) | lm (2.1 MB, gz) |
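For anyone who wants a quick look at a downloaded corpus before using the language model, the following sketch counts raw 1-, 2- and 3-grams straight from the gzip-compressed file. It assumes the corpus is plain UTF-8 text that can be split on whitespace, which is not confirmed above, and it is only a plain frequency count: it does no smoothing and adds no sentence-boundary markers, so it is not a substitute for what IRSTLM does. The function name `count_ngrams` is hypothetical.

```python
import gzip
from collections import Counter


def count_ngrams(path, max_order=3):
    """Count whitespace-token n-grams up to max_order, one line at a time.

    Assumption: the corpus is gzip-compressed UTF-8 text; n-grams never
    span a line break.
    """
    counts = {n: Counter() for n in range(1, max_order + 1)}
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            tokens = line.split()
            for n in range(1, max_order + 1):
                for i in range(len(tokens) - n + 1):
                    counts[n][tuple(tokens[i:i + n])] += 1
    return counts
```

With a hypothetical local file name, `count_ngrams("nap.corpus.gz")[2].most_common(10)` would list the ten most frequent bigrams.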
Updates
2009/12/07 - New corpora and language models from Wikipedia available for download.
2007/03/23 - New, much simpler page.
2007/03/23 - Added my ACOPOST update.
Contact: tresoldi at gmail dot com