This is the source directory of the Hunglish CDROM, containing all the tools required for creating the parallel sentence-aligned files from the raw text files. All software here is licensed under the CC-GNU LGPL.


The key tool is the hunalign aligner. Before you can run it, you generally need to preprocess your input files in several steps, so you should add the following tools to your path. hunnorm provides filters for character-set detection and normalization. huntoken provides sentence-level chunking and tokenization, both for English and Hungarian. The stemmer/morphological analyzer framework has online documentation. For creating the morphological resources we used hunlex, a morphological description framework and precompilation tool.

Various aspects of the hun* toolchain are described at a higher level in our drafts and published papers. To see how these tools are combined to create proper input for the aligner and to postprocess the aligner's output, consider the Hungarian and English raw text in examples/hu.raw and examples/en.raw.

If the tokenizer 'huntoken' is in the path, the very simple scripts in scripts/ and scripts/ turn raw latin1 or latin2 text into sentence-segmented text:

  scripts/ < examples/hu.raw > examples/hu.sen
  scripts/ < examples/en.raw > examples/en.sen
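
The segmentation scripts wrap huntoken, so the real rules are more refined than anything shown here; purely as an illustration of what sentence-level chunking means (this naive splitter is our stand-in, not huntoken's algorithm):

```python
import re

def naive_split_sentences(text):
    """Very rough stand-in for huntoken: split after sentence-final
    punctuation that is followed by whitespace and an uppercase letter.
    This toy rule mishandles abbreviations, numbers, quotes, etc."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]

print(naive_split_sentences("Ez egy mondat. Ez a masodik! Vege."))
```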

The following shell script performs a very simple tokenization, which is more appropriate for us than huntoken's more sophisticated tokenization. Most importantly, unlike huntoken, it does not allow whitespace characters inside tokens:

  scripts/ < examples/hu.sen > examples/hu.tok
  scripts/ < examples/en.sen > examples/en.tok
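
We do not reproduce the script's exact rules here, but a minimal tokenizer in its spirit (split punctuation off words, and never allow whitespace inside a token) could be sketched as:

```python
import re

def simple_tokenize(sentence):
    """Toy tokenizer: put spaces around punctuation, then split on
    whitespace, so no token can contain a whitespace character."""
    spaced = re.sub(r'([.,;:!?()"])', r' \1 ', sentence)
    return spaced.split()

print(simple_tokenize('A kutya ugat, a karavan halad.'))
```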

The following scripts do Hungarian and English stemming, calling the stemtool program. The stemtool source can be found in the hunmorph package; running make builds a stemtool binary. In the following example we assume that this binary is copied or softlinked to tools/stemtool.

  tools/stemtool data/hungarian.aff data/hungarian.dic < examples/hu.tok > examples/hu.stem
  tools/stemtool data/english.aff data/english.dic < examples/en.tok > examples/en.stem
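
Conceptually, stemming replaces each surface token with its stem, using the affix (.aff) and dictionary (.dic) resources. As a purely illustrative stand-in (this lookup table is our invention, not hunmorph's actual analysis):

```python
def stem_tokens(tokens, stem_table):
    """Toy stemmer: look each token up in a stem table, falling back
    to the surface form when it is unknown (illustration only)."""
    return [stem_table.get(tok.lower(), tok) for tok in tokens]

print(stem_tokens(['dogs', 'barked'], {'dogs': 'dog', 'barked': 'bark'}))
```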

Now we can finally align the texts:

  hunalign data/hu-en.stem.dic examples/hu.stem examples/en.stem > examples/example.ladder

The ladder output format is not very human-readable, but another simple script converts it to readable text:

  scripts/ examples/example.ladder examples/hu.sen examples/en.sen > examples/example.align.sen.text
  scripts/ examples/example.ladder examples/hu.tok examples/en.tok > examples/example.align.tok.text
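
A ladder line holds an alignment rung: a sentence index into the Hungarian text, one into the English text, and a confidence score; the aligned segments lie between consecutive rungs. A sketch of the pairing step these scripts perform, under the assumption of tab-separated, 0-based rungs (the ' ~~~ ' joiner is our arbitrary choice):

```python
def ladder_to_text(ladder_lines, hu_sents, en_sents):
    """Pair sentence groups between consecutive ladder rungs.
    Assumes each ladder line is 'hu_index<TAB>en_index<TAB>score'."""
    rungs = [tuple(map(int, line.split('\t')[:2])) for line in ladder_lines]
    pairs = []
    for (h1, e1), (h2, e2) in zip(rungs, rungs[1:]):
        pairs.append((' ~~~ '.join(hu_sents[h1:h2]),
                      ' ~~~ '.join(en_sents[e1:e2])))
    return pairs
```

For example, rungs (0,0), (1,1), (3,3) yield a 1-to-1 segment followed by a 2-to-2 segment.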

The workflow used for the Hunglish parallel corpus

The Hunglish parallel corpus uses a simple directory structure for its workflow. The Hunglish scripts run the previously described pre- and postprocessing steps on this data in batch mode.

The main characteristic of this directory structure is that every relevant metadatum is encoded in the pathless filenames, but the directory hierarchy is also used to structure the data. So full pathnames encode metadata twice: e.g. Hungarian/Steinbeck/raw/

More specifically, input data is expected in the following pattern: Hungarian/Author/raw/ English/Author/raw/Author_BookNumber.en.raw

An inventory of the raw data is expected in a catalog file, listing Author and BookNumber pairs. An example of the file format:

Goethe	1
Goethe	3
Steinbeck	1
Steinbeck	2
Steinbeck	3
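
Each catalog line is an Author and a BookNumber separated by a tab. A small sketch (ours, not part of the distribution) of parsing it and deriving the expected English raw-file path from the pattern above:

```python
def read_catalog(lines):
    """Parse tab-separated Author/BookNumber catalog lines into pairs."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        author, num = line.split('\t')
        entries.append((author, int(num)))
    return entries

def en_raw_path(author, num):
    # Mirrors the English/Author/raw/Author_BookNumber.en.raw pattern.
    return f"English/{author}/raw/{author}_{num}.en.raw"
```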

The root path of the corpus should be exported into the BICDIR shell variable, e.g.,

    export BICDIR=/work/bicorpus

The name of the catalog file should be exported into the CATALOG shell variable, e.g.,

    export CATALOG=$BICDIR/catalog.txt

The path to the scripts and tools should be exported into the BINDIR shell variable, e.g.,
    export BINDIR=$BICDIR/scripts

After this, the scripts/ shell script can be called.

Sentence-segmented data:

Tokenized data:

Stemmed data:

Alignment is stored in an Align/ directory, in various formats:

Ladder format:

Text format, based on the sentence-aligned, untokenized text:

Text format, based on the sentence-aligned, untokenized text, filtered for quality. (In this step, we throw away all segments that are not one-to-one. This may be a bad idea for some domains and applications.):
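
In ladder terms, a segment is one-to-one exactly when both indices advance by one between consecutive rungs; a sketch of the filtering idea (rungs as (hu_index, en_index) pairs, our representation):

```python
def one_to_one_rungs(rungs):
    """Keep only the starting rungs of segments where both texts
    advance by exactly one sentence (i.e. 1-to-1 alignments)."""
    kept = []
    for (h1, e1), (h2, e2) in zip(rungs, rungs[1:]):
        if h2 - h1 == 1 and e2 - e1 == 1:
            kept.append((h1, e1))
    return kept
```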

The text file text.qf contains the sentences lexicographically ordered, in order to avoid reconstructibility (copyright) problems:
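
Sorting the aligned pair lines lexicographically destroys the original sentence order, so the running texts cannot be reconstructed, while the pairs themselves are preserved; in essence:

```python
def scramble_order(pair_lines):
    """Lexicographically sort aligned pair lines so the original
    running-text order cannot be recovered from the corpus."""
    return sorted(pair_lines)

print(scramble_order(['second hu\tsecond en', 'first hu\tfirst en']))
```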