Overview of the documentation

This directory contains the documentation for the Hunglish CDROM.


The papers are available in both ps and pdf formats. The main paper describing the corpus collection and the alignment method is a draft submitted to RANLP05 (pdf ps). The Hunglish project is described in Hungarian in a paper presented at the 2nd Hungarian Computational Linguistics Conference in December 2004 (pdf ps).

The entire NLP toolchain used in creating the corpus began with extending and improving Hunspell, the Hungarian open source spellchecker used in OpenOffice (pdf ps), as part of the WordSword (SzóSzablya) project. The project is described here both in Hungarian, for the 1st Hungarian Computational Linguistics Conference held in November 2003 (pdf ps), and in English, for LREC04 (pdf ps). The LREC04 paper provides details on the frequency count included on this CD and the gigaword Hungarian corpus it is based on.

The tools, which include a stemmer, a morphological analyzer, and a generator, as well as character-set detection, normalization, and sentence-level tokenization utilities, are described in a paper published at the SALTMIL 2004 workshop (pdf ps) and in a draft (pdf ps) to be presented at the ACL05 Software Workshop. For now, the system of morphological codes used in the morphological analyzer (pdf ps) and some other low-level aspects of the tools (pdf ps) are described only in Hungarian.

Other documentation

Some Unix-style man pages and GNU texinfo-style info pages are available with the distribution tarballs. The main directories, bin, data, lr, and src, each have their own readme file, readme.html (English), and sometimes also olvassel.html (Hungarian). Some online documentation is also available, in both English and Hungarian.