This is the source directory of the Hunglish CDROM, containing all the tools required for creating the parallel sentence-aligned files from the raw text files. All software here is
Various aspects of the hun* toolchain are described at a higher level in our drafts and published papers. To see how these tools are combined to create proper input for the aligner and to postprocess its output, consider the Hungarian and English raw text in examples/hu.raw and examples/en.raw.
If the tokenizer 'huntoken' is in the path, the very simple scripts in scripts/ and scripts/ turn raw latin1 or latin2 text to sentence-segmented text:
scripts/ < examples/hu.raw > examples/hu.sen
scripts/ < examples/en.raw > examples/en.sen
The following shell script performs a very simple tokenization, which is more appropriate for us than huntoken's more sophisticated tokenization. Most importantly, unlike in huntoken, whitespace characters are not allowed inside tokens:
scripts/ < examples/hu.sen > examples/hu.tok
scripts/ < examples/en.sen > examples/en.tok
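To illustrate what "no whitespace inside tokens" means in practice, here is a minimal sketch of a tokenizer of this kind. It is not the script shipped in scripts/; the function name simple_tokenize and the exact punctuation set are assumptions made for illustration. It reads one sentence per line and writes the same sentence with tokens separated by single spaces.

```shell
# Hypothetical stand-in for a simple tokenizer, NOT the script in scripts/.
# LC_ALL=C makes sed treat the input as bytes, so latin1/latin2 letters
# pass through untouched.
simple_tokenize() {
    LC_ALL=C sed \
        -e 's/\([][(){}.,;:!?"]\)/ \1 /g' \
        -e 's/[[:space:]]\{1,\}/ /g' \
        -e 's/^ //' \
        -e 's/ $//'
}
```

The first expression pads each punctuation character with spaces, the second squeezes every whitespace run to a single space, and the last two trim the line, so no token can contain whitespace.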
The following commands do Hungarian and English stemming by calling the stemtool program. The stemtool source can be found in the hunmorph package; running make there builds a stemtool binary. In the following example we assume that this binary is copied or softlinked to tools/stemtool.
tools/stemtool data/hungarian.aff data/hungarian.dic < examples/hu.tok > examples/hu.stem
tools/stemtool data/english.aff data/english.dic < examples/en.tok > examples/en.stem
Now we can finally align the texts:
hunalign data/hu-en.stem.dic examples/hu.stem examples/en.stem > examples/example.ladder
The ladder output format is not very human-readable, but another simple script converts it into aligned text, based on either the sentence-segmented or the tokenized files:
scripts/ examples/example.ladder examples/hu.sen examples/en.sen > examples/example.align.sen.text
scripts/ examples/example.ladder examples/hu.tok examples/en.tok > examples/example.align.tok.text
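The conversion from ladder to text can be sketched as follows. This is not the script shipped in scripts/; it rests on assumptions about the formats: each ladder line is taken to hold a 0-based source sentence index, a 0-based target sentence index and a score, separated by tabs; the segment between two consecutive rungs is taken to cover the sentences from one rung up to, but not including, the next; and sentences merged into one segment are joined with " ~~~ ". The function name ladder_to_text is hypothetical.

```shell
# Hypothetical sketch, NOT the script in scripts/.
# usage: ladder_to_text LADDER SOURCE_SENTENCES TARGET_SENTENCES
ladder_to_text() {
    awk -F'\t' '
        # First file: the ladder (assumed: src index, tgt index, score).
        FILENAME == ARGV[1] { n++; si[n] = $1 + 0; ti[n] = $2 + 0; score[n] = $3; next }
        # Second file: source sentences, stored with 0-based indices.
        FILENAME == ARGV[2] { src[ns++] = $0; next }
        # Third file: target sentences, stored with 0-based indices.
        { tgt[nt++] = $0 }
        END {
            # Each pair of consecutive rungs delimits one aligned segment.
            for (k = 1; k < n; k++) {
                s = ""; t = ""
                for (i = si[k]; i < si[k + 1]; i++)
                    s = s (s == "" ? "" : " ~~~ ") src[i]
                for (j = ti[k]; j < ti[k + 1]; j++)
                    t = t (t == "" ? "" : " ~~~ ") tgt[j]
                print s "\t" t "\t" score[k]
            }
        }' "$1" "$2" "$3"
}
```

Under these assumptions the output has one segment per line, with the source side, the target side and the score of the opening rung in three tab-separated columns.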
The Hunglish parallel corpus uses a simple directory structure for its workflow. The Hunglish scripts run the previously described pre- and postprocessing steps on this data in batch mode.
The main characteristic of this directory structure is that every relevant piece of metadata is encoded in the bare filenames, while the directory hierarchy is also used to structure the data. Full pathnames therefore encode metadata twice: e.g. Hungarian/Steinbeck/raw/
More specifically, input data is expected in the following pattern:
Hungarian/Author/raw/
English/Author/raw/Author_BookNumber.en.raw
The raw data files must be listed in a catalog file, identified by Author and BookNumber. Example for the file format:
Goethe 1
Goethe 3
Steinbeck 1
Steinbeck 2
Steinbeck 3
(...)
The root path of the corpus should be exported into the BICDIR shell variable, e.g.,
export BICDIR=/work/bicorpus
The name of the catalog file should be exported into the CATALOG shell variable, e.g.,
export CATALOG=$BICDIR/catalog.txt
The path to the scripts and tools should be exported into the BINDIR shell variable, e.g.,
export BINDIR=$BICDIR/scripts
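Before running the batch, it can be useful to check that the catalog and the directory tree agree. The following is a minimal sketch, not part of the Hunglish scripts: the function name check_catalog is hypothetical, catalog lines are assumed to be "Author BookNumber" separated by whitespace, and only the English raw file pattern given above is checked.

```shell
# Hypothetical helper, NOT part of the Hunglish scripts.
# Assumes BICDIR and CATALOG are set as described above.
check_catalog() {
    while read -r author number; do
        # Skip blank lines.
        [ -n "$author" ] || continue
        file="$BICDIR/English/$author/raw/${author}_${number}.en.raw"
        # Report every catalog entry that has no English raw file.
        [ -f "$file" ] || echo "missing: $file"
    done < "$CATALOG"
}
```

Calling check_catalog prints one "missing:" line per catalog entry without a matching file, and prints nothing if the tree is complete.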
After this, the scripts/ shell script can be called.
Sentence-segmented data: Hungarian/Author/sen/
Tokenized data: Hungarian/Author/sen.tok/
Stemmed data: Hungarian/Author/sen.tok.stem/
Alignment is stored in an Align/ directory, in various formats:
Ladder format: Align/Author/ladder/Author_Number.ladder
Text format, based on the sentence-aligned, untokenized text: Align/Author/text/Author_Number.text
Text format, based on the sentence-aligned, untokenized text, filtered for quality. (In this step, we throw away all segments that are not one-to-one. This may be a bad idea for some domains and applications.): Align/Author/text.qf/Author_Number.text.qf
Shuffled format, containing the sentence pairs of text.qf in lexicographic order, so that the original texts cannot be reconstructed (this avoids copyright problems): Align/Author/shuffled/Author_Number.shuffled
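The quality filtering and shuffling steps can be sketched as two small filters. These are not the Hunglish scripts themselves; the function names are hypothetical, and the text format is assumed to hold one segment per line with tab-separated source and target columns, where sentences merged into one segment are joined with " ~~~ ".

```shell
# Hypothetical sketches, NOT the Hunglish scripts.

# Keep only one-to-one segments: both sides non-empty and neither side
# containing the (assumed) " ~~~ " merge separator.
quality_filter() {
    awk -F'\t' '$1 != "" && $2 != "" &&
                index($1, " ~~~ ") == 0 && index($2, " ~~~ ") == 0'
}

# Order lines lexicographically (bytewise), which discards the original
# sentence order of the book.
shuffle_by_sorting() {
    LC_ALL=C sort
}
```

A usage example would be: quality_filter < Author_Number.text | shuffle_by_sorting > Author_Number.shuffled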