This is the data directory of the Hunglish CDROM, containing the parallel corpus as sentence pair files and, where copyright permits, the original monolingual files as well. The corpus is sorted by genre: the main categories are as follows.


Movie subtitles. The raw files were provided to MOKK for research only and cannot be republished on this CD. The sentence-aligned files are given in "shuffled" format: one sentence pair per line (Hungarian sentence TAB English sentence), with the lines sorted alphabetically so as to make republishing of the original subtitle files impossible. This data segment has many spelling errors owing to OCR text extraction; as a special aid for selecting a higher-quality subset, sentence-by-sentence figures of alignment merit are provided in the file quality.
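As an illustration of the shuffled format described above, the following minimal Python sketch splits each line into a (Hungarian, English) pair at the first TAB. The function name read_pairs is our own, and skipping blank lines and lines without a TAB is an assumption about how malformed input should be handled, not something the corpus specifies.

```python
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (hungarian, english) pairs from TAB-separated lines.

    Assumption: blank lines and lines without a TAB are skipped
    rather than reported as errors.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split at the first TAB only, so any later TAB stays in the English side.
        hu, sep, en = line.partition("\t")
        if not sep:
            continue  # no TAB found: not a sentence pair
        yield hu, en


if __name__ == "__main__":
    sample = ["Jó napot.\tGood day.\n", "\n", "Köszönöm.\tThank you.\n"]
    for hu, en in read_pairs(sample):
        print(hu, "|", en)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly; the file name and character encoding are not stated here and must be supplied by the reader.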


Legal texts. The raw files came from CELEX and are reproduced in full under the law/en (English) and law/hu (Hungarian) subdirectories. The aligned files are in the law/bi directory.


Literature. For "classical" material no longer under copyright, the raw files came from Project Gutenberg and the Hungarian Electronic Library. For these, both raw (en, hu) and aligned (bi) files are available. For "modern" material still under copyright and made available to MOKK for research purposes, the sentence pair (bi) files are shuffled together in one bi/Shuffle.


Magazines and news. This material is still largely in preparation as the CD goes to press; please visit the Hunglish website for more.


Software documentation. The raw files come from Mozilla, Gnome, KDE, and other major FOSS (Free Open Source Software) projects.


Monolingual Hungarian files taken from the Hungarian National Corpus: parliament and city council minutes, laws, regulations, and other "official" material, as well as the archives of the chat rooms of a major Hungarian internet portal. Please note that the tokenization of these files differs from the rest of the corpus inasmuch as both sentence-final punctuation and trailing periods after abbreviations are given as whitespace-separated tokens. This explains the discrepancy between the number of words and sentences quoted in the summary below and the number coming from wc.
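The effect of this tokenization on word counts can be sketched as follows. The snippet counts whitespace-separated tokens, as wc -w would, and then the tokens that remain once punctuation-only tokens are set aside. Treating a token as punctuation when it consists solely of ASCII punctuation characters is our simplifying assumption; the function name count_tokens is likewise only illustrative, and this is not necessarily how the figures in the summary were computed.

```python
import string


def count_tokens(text: str) -> tuple:
    """Return (all_tokens, tokens_without_punctuation) for a text.

    all_tokens matches a whitespace-based count such as `wc -w`.
    Assumption: a token made up only of ASCII punctuation characters
    is a separated punctuation mark, not a word.
    """
    tokens = text.split()
    punct_only = sum(
        1 for tok in tokens if all(ch in string.punctuation for ch in tok)
    )
    return len(tokens), len(tokens) - punct_only


if __name__ == "__main__":
    # "stb ." is an abbreviation with its period split off as a separate token.
    print(count_tokens("A ház stb . nagy ."))
```

For the sample line, the whitespace count is 6 tokens, of which 4 remain after the two separated periods are excluded.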


directory  size (MB)  words    contents                raw text
film       18         3.27 m   movie subtitles         never
law        233        31.53 m  EU law (CELEX)          full
lit        85         17.24 m  literature              when (C) lapsed
mag        5          0.36 m   magazines, news         yes, research only
swdoc      8          1.27 m   software documentation  full
mono       8          46.4 m   monolingual Hungarian   full

A more detailed inventory is available in the form of a catalog and wc files in each directory, the former providing author and title information (in both languages) and the latter containing the output of wc for each file. A file conf which summarizes our confidence in the alignment is also provided for each directory so that users can select a high-confidence subset of the corpus if they wish.
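Selecting a high-confidence subset might look like the sketch below. The layout of the conf files is not described here, so the code assumes a purely hypothetical one: each line holds a file name and a numeric score separated by whitespace, with higher scores meaning better alignment. Both the layout and the function name select_confident are assumptions and should be checked against the actual conf files before use.

```python
from typing import Iterable, List


def select_confident(conf_lines: Iterable[str], threshold: float) -> List[str]:
    """Return the names whose score is at least `threshold`.

    Hypothetical layout (an assumption, not the documented format):
    "<file name> <score>" per line, higher score = more reliable alignment.
    Lines that do not fit this layout are skipped.
    """
    selected = []
    for line in conf_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not a "<name> <score>" line
        name, score_text = parts
        try:
            score = float(score_text)
        except ValueError:
            continue  # second field is not a number
        if score >= threshold:
            selected.append(name)
    return selected


if __name__ == "__main__":
    sample = ["a.bi 0.9", "b.bi 0.4", "not a score line here"]
    print(select_confident(sample, 0.5))
```

The threshold is left to the caller, since a suitable cutoff depends on how the confidence figures are scaled.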


The raw files are either public domain or provided for research purposes with the consent of the copyright holder, Diplomacy and Trade Magazine (see mag/*/DTM*).


Further details on corpus preparation and alignment can be found in this draft and this paper.


We gratefully acknowledge CELEX, Project Gutenberg, the Hungarian Electronic Library, Typotex, and Diplomacy and Trade Magazine.