This document presents the HunLex morphological resource specification framework and precompilation tool, which is being developed as part of the HunTools Natural Language Processing Toolkit at the Budapest Institute of Technology's Media Education and Research Center (http://lab.mokk.bme.hu).
HunLex offers a description language, i.e., a formalism for specifying a base lexicon and the morphological rules which describe a language's morphology. This description, stored in textual format, serves as your primary resource: it represents your knowledge about the morphology and lexicon of the language in question.
Now, providing a resource-specification language is rather useless in itself. Hunlex is able to process these primary resources and create the type of resources that are used by real-time word-level analysis tools. Since you create these from your primary resources, you might call them secondary resources. These provide the language-specific knowledge to a variety of word-level analysis tools.
At present, most importantly, Hunlex provides the language-specific resources for the HunTools word-level analysis toolkit (see Huntools). This package contains the MorphBase library of word-analysis routines, such as a spell-checker, stemmer, and morphological analyzer/generator, and their standalone executable wrappers. Therefore, a single Hunlex description of your favourite language will enable you to perform spell-checking, stemming, and morphological analysis for that language, which is more than useful.
In addition to the HunTools routines, other software which uses ispell-type resources will be able to use Hunlex's output. Among these are myspell, an open-source spell-checker (also used in Open Office, http://www.openoffice.org, see Myspell), and jmorph, a superfast Java morphological analyzer (see Jmorph).
This document describes how you can create your primary resources and what you can (make Hunlex) do with them.
Note: This document is not intended to describe how to use any of these real-time tools or what they are good for. See the above links to learn more about them.
In particular, this document provides you with:
TODO: not yet
The motivation behind HunLex came from two opposing types of requirements lexical resources are supposed to fulfill:
The constraints in (i) favour one central, redundancy-free, abstract, but transparent specification, while the ones in (ii) require possibly multiple application-specific, potentially redundant, optimized formats.
In order to reconcile these two opposing requirements, HunLex introduces an offline layer into the word-analysis workflow, which mediates between two levels of resources:
The primary resources are supposed to be reasonably designed to help human maintenance, while the secondary ones are supposed to optimize very different things, ranging from file size, performance with the tool that uses them, coverage, robustness, and verbosity to normative strictness, depending on who uses them for what purpose.
HunLex is used to compile the primary resources into a particular application-specific format (see Output Resources). This resource compilation phase is an offline process which is highly configurable, so that users can fine-tune the output resources according to their needs.
By introducing this layer of offline resource compilation, maintenance, extendability, and portability of lexical resources become possible without compromising performance on specific word-analysis tasks.
Providing the environment for a sensible primary resource specification framework and managing the offline precompilation process are the raison d'être behind Hunlex.
Configuration allows you to adjust the compilation of resources along various dimensions:
Hunlex is licensed under the LGPL, which roughly means the following.
There are no restrictions on downloading it other than your bandwidth and our slothful ways of making things available.
There are no restrictions on use either, other than its deficiencies, clumsy features and outrageous bugs. However, this can be amended, because there are no restrictions on modifying it either. See also Contribution.
Freedom of use implies that any resource that you created and compiled with the mediation of Hunlex is yours, and you hold the right to distribute it in any way. Consider telling us about this great news (see Contact).
What is more, there are no restrictions on redistributing this software or any modified version of it.
For some legalese telling you the same, read the License http://creativecommons.org/licenses/LGPL/2.1/
Todo: Shall we not include the License?
See License.
If you find a bug or an undesirable feature or anything that is worth a couple of lines ranting at the authors, please go ahead and file a bug report on the MOKK Lab bugzilla page at http://lab.mokk.bme.hu or send a mail to me (see Contact).
So you are using hunlex and find yourself realizing that you desperately need a certain feature which happens not to be implemented. Go ahead and request it from the authors (see Contact) or sit silently and hope!
So you found hunlex cool and/or useful and would like the authors to hear about that. How nice is that! See Contact.
Hunlex is an open-source development, so developers are welcome to contribute to make it better in any imaginable way. Contact us (see Contact) to work out the details of how and what you would want to contribute to Hunlex.
For the context of the whole huntools kit, use
@InProceedings{szoszablya_saltmil:04,
  author = {L\'aszl\'o N\'emeth and Viktor Tr\'on and P\'eter Hal\'acsy and Andr\'as Kornai and Andr\'as Rung and Istv\'an Szakad\'at},
  title = {Leveraging the open-source ispell codebase for minority language analysis},
  booktitle = {Proceedings of SALTMIL 2004},
  year = 2004,
  organization = {European Language Resources Association},
  url = {http://lab.mokk.bme.hu/}
}
A very brief intro to hunlex with a one-page English resumé.
@InProceedings{hunlex_mszny:04,
  author = {Tr\'on, Viktor},
  title = {HunLex - a description framework and resource compilation tool for morphological dictionaries},
  booktitle = {II. Magyar Sz\'am\'it\'og\'epes Nyelv\'eszeti Konferencia},
  institution = {Szegedi Tudom\'anyegyetem},
  address = {Szeged, Hungary},
  year = 2004
}
These and other papers can be downloaded from the MOKK Lab publications page at http://lab.mokk.bme.hu
The author of hunlex and this document is Viktor Trón. He can be reached at
v.tron@ed.ac.uk
Hopefully more can be found on MOKK Lab's pages at http://lab.mokk.bme.hu.
So you want to install the hunlex toolkit (see Introduction) from the hunlex source distribution. This document describes what and how you can install with this distribution.
The latest version of the hunlex source distribution is always available from the MOKK LAB website at http://lab.mokk.bme.hu or, if all else fails, by mailing to me v.tron@ed.ac.uk.
The hunlex executable in principle runs on any platform for which there is an ocaml compiler (see Prerequisites). This includes all Linuxes, unices, MS Windows, etc.
Warning: This package has not been tested on platforms other than Linux.
Hunlex is written in the ocaml programming language http://www.ocaml.org/. OCaml compilers are extremely easy to install and are available for various platforms and downloadable in various package formats for free from http://caml.inria.fr/ocaml/distrib.html.
You will need ocaml version >=3.08 to compile hunlex.
ocaml-make (OCamlMakefile) is needed for the installation of hunlex and is available from Markus Mottl's homepage at http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html#OCamlMakefile (I used version 6.19, writing on 8.1.2004). For OCamlMakefile you will need ocaml and GNU make (for ocaml-make version 6.19 you will need GNU make version >= 3.80).
NB: Most probably earlier versions of ocaml-make and GNU make should also work, but they have not been tested yet.
You don't need anything else to use hunlex (but a little patience).
Hunlex is installed in the good old way, i.e., by typing
$ make && sudo make install
in the toplevel directory of the unpacked distribution. Read no further if you know what I am talking about or if you trust some God.
The hunlex distribution is available in a source tarball called hunlex.tgz. First you have to unpack it by typing
$ tar xzvf hunlex.tgz
Then, you enter the toplevel directory of the unpacked distribution with
$ cd hunlex
To compile it, simply type
$ make
in the toplevel directory of the distribution.
To install it (on what gets installed, see Installed Files), type
$ make install
Well, by default this would want to install things under /usr/local, so you have to have admin permissions. If you are not root but you are in the sudoers file with the appropriate rights, you type:
$ sudo make install
You can change the location of the installation by changing the install prefix path with
$ sudo make PREFIX=/my/favourite/path install
Changing the location of installation for individual install targets individually is not recommended but easy-peasy if you have a clue about make and Makefile-s. To do this you have to change the relevant Makefile-s in the subdirectories of the distribution. See Installed Files.
If it works, great! Go ahead to Bootstrapping.
If you have problems, double-check that you have the prerequisites (see Prerequisites). If you think you followed the instructions but still have problems, submit a bug report (see Submitting a Bug Report).
If you are upgrading an earlier version of hunlex, you may want to uninstall the earlier one first (see Uninstall and Reinstall).
The install prefix is remembered in the source distribution in the file install_prefix. So after you cd into the toplevel directory of the distribution, you can uninstall hunlex by typing
$ make uninstall
You can reinstall it with
$ make reinstall
at any time if you make modifications to the code or compile options.
Warning: Note that if you fiddle with changing the location of individual install targets, uninstall and reinstall will not work correctly.
The following files and directories are installed, paths are relative to the install prefix (see Install):
the executable which can be run on the command line (see Command-line Control)
is the Makefile that defines the toplevel control of hunlex (see Toplevel Control). This file is to be include-ed into your local Makefile to give you a Makefile-style wrapper for calling hunlex (see Bootstrapping and Toplevel Control).
Note that HunlexMakefile will assume that the hunlex executable is found in your path. Make sure that install-prefix/bin is in the path (usually /usr/local/bin is in the PATH).
is a directory containing hunlex documentation. Various documents in various formats are found under this directory including a replica of this document.
TODO: this is not yet the case
is the hunlex man page, which describes the command-line use of hunlex (also see Command-line Control). Command-line use of hunlex is not the recommended way of using it for the general user. Instead, use hunlex through the toplevel control described in a later chapter (see Toplevel Control).
Todo: there is no man page yet
So you have installed hunlex and it's running smoothly.
This section leads you through the first steps and gives you hints on how you set out working with hunlex.
Create your sandbox directory.
Change to it.
Create your own local Makefile. This will be your connection to the hunlex toplevel control. For your Makefile to understand hunlex predefined toplevel targets (see Targets), you have to include (not insert) the hunlex systemwide Makefile. So you create a Makefile with the following content:
-include /path/to/HunlexMakefile
where /path/to/HunlexMakefile is the path to HunlexMakefile which is supposed to be installed on your system (see Installed Files), by default under /usr/local/lib/HunlexMakefile.
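Putting this together, a minimal sandbox Makefile is just the include line (the path below is the default install location; adjust it to your install prefix):

```makefile
# minimal local Makefile for a hunlex sandbox;
# -include fails silently if the path is wrong,
# so double-check it if make seems to do nothing
-include /usr/local/lib/HunlexMakefile
```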
Now, you are ready to test things for yourself. In order to see if all is well, type
$ make
at your prompt in the same sandbox directory.
In fact, you will always type the make command to control hunlex. If you don't give arguments to make, a so-called default action (target, see Targets) is assumed. The default target is resources which creates the output resources according to the default settings (see Options). Toplevel control assumes by default that all its necessary resources are found in the current directory (see Input File Options). If this is not the case, because the files do not exist, the compulsory ones are created and the compilation runs creating the output resources.
Surely, the missing files are created without contents and your output resources will be empty as well. However, this vacuous run will test whether hunlex (and toplevel control) is working properly.
Now if you list your directory, you should see:
$ ls
affix.aff       grammar   Makefile    phono.conf
dictionary.dic  lexicon   morph.conf  usage.conf
If this is not the case, go to see Troubleshooting.
The meaning of these files in your directory are explained in detail in another chapter (see Files).
If you type make (or the equivalent make resources) again, your resources will not be compiled again, since the input resources did not change. If you still want to compile your resources again, you type
$ make new resources
which forces toplevel to recompile although no input files changed (see Special Targets).
Now.
If you want to develop (toy around with) your own data and create resources, the next step is to fill in the input files. Read on to learn more about files (see Files) and then about the hunlex morphological resource specification language (see Description Language). Since you want to test your creation, you ultimately have to learn about toplevel control (see Toplevel Control) and gradually about the advanced issues in the chapters that follow these.
If you already have your hunlex-resources describing your favourite language ready and you want to compile specific output resources from it with hunlex, you better read about toplevel control with special attention to the options (see Toplevel Control). If you want to fiddle around with more advanced optimization, such as levels and tags, you may end up having to read everything, sorry.
You typically want to use hunlex through its toplevel control interface. Toplevel control means that you invoke hunlex indirectly through a Makefile to compile your resources.
We envisage typical users of hunlex developing their lexical resources in an input directory and occasionally dump output resources for their analyser into specific target directories for various applications.
If you don't like Makefiles or your system does not have make (how did you compile hunlex, then?), you will invoke hunlex from a shell and use it via the command-line interface. This is non-typical use and is not recommended. The command-line interface, which is almost equivalent in functionality to the Makefile interface, is described only for completeness and for people developing alternative wrappers (see Command-line Control).
In fact, you don't actually need to know much about make and Makefile-s to use hunlex. Just follow the steps described in Bootstrapping. We assume that you have a project directory with a Makefile sitting in it in order to try out what is described here.
This document is more like a reference manual that details what you can do with your resources and how you can do it through the Makefile interface. What the resources are and how you can develop your own is described in other chapters (see Files and see Description Language).
First of all, you need to know how to make your compilation process more verbose.
In order to see what the toplevel Makefile wrapper is doing, you have to unset the QUIET option. For instance, typing
$ make QUIET= new resources
will tell you what the Makefile is doing, i.e., what programs it invokes, etc. Unless you are debugging the toplevel control interface of hunlex, you don't want the toplevel to be verbose about what it is doing. So just don't do this.
What you want instead is to make the resource compilation process more verbose, probably because you want to debug your grammar or want hunlex to give you hints what went wrong with your resource compilation.
Verbosity of the hunlex resource compilation can be set with the DEBUG_LEVEL option. Typing
$ make DEBUG_LEVEL=1
in your sandbox (with empty primary resources) will give you something like this (see Bootstrapping):
Reading morpheme declarations and levels...0 morphemes declared.
Reading phono features...0 phono features declared.
Reading usage qualifiers...0 usage qualifiers declared.
Parsing the grammar...ok
Parsing the lexicon and performing closure on levels... 0 entries read.
Dynamically allocating flags; dumping affix file...ok
Dumping precompiled stems to dictionary file...ok
0.00user 0.00system 0:00.02elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+329minor)pagefaults 0swaps
The first couple of lines give you information about the stages of compilation and are described elsewhere.
The enigmatic last two lines give you information about the time it took hunlex to compile your resources. If you are not interested in this information, you can unset the TIME option. Typing, say,
$ make TIME= new resources
will not measure and display the duration of compiling.
Your favourite settings can be remembered by adding them to your local Makefile in a rather obvious way. Let us assume you want your DEBUG_LEVEL to be set to 1 by default and also that you couldn't care less about the time of compilation. In this case you want to have the following in your Makefile:
DEBUG_LEVEL=1
TIME=
You can also define your default target (see Targets), i.e., the 'task' that make will carry out if you invoke it without an explicit target. For instance, if you always want to recompile your resources each time you invoke make, irrespective of whether your primary resources and/or compile configuration changed, you can add the following line at the top of the file:
default: new resources
Now, your Makefile looks something like this:
# comments are introduced by a '#'

# my favourite target
default: new resources

# my favourite settings
DEBUG_LEVEL=1
TIME=

-include /path/to/HunlexMakefile
The functionality of hunlex is accessed through targets. Targets are arguments of the make command, which reads your local Makefile and ultimately consults the systemwide hunlex toplevel Makefile called HunlexMakefile (see Installed Files).
Usually, you will control hunlex through make by typing:
make options target
where options is a sequence of variable assignments which set the options described below (see Options) and where target is a sequence of targets. For more on variables and targets, you may consult the manual of make.
The available toplevel targets are detailed below:
compiles the output resources given the input resources and configuration files. The necessary file locations and options are defined by the relevant variables described below (see Input File Options). This target creates the dictionary and the affix files (by default dictionary.dic and affix.aff, see Output Resources).
By setting MIN_LEVEL to a big number, this call generates resources that contain all words of the language precompiled into the dictionary. The stems of the dictionary, without their output annotation (see Annotation), are found in the file *wordlist*.
pretends that the base resources have changed. You need this directive if you want to recompile the resources although no primary resource has changed. This might happen because you are using different configuration options. (If the base resources are unchanged, no compilation would take place; you have to force it with 'new', see make.)
make MIN_LEVEL=3 new resources
removes all intermediate temporary files, so that only lexicon, grammar, and the configuration files, and the output resources (affix and dictionary) remain.
Todo: This is not implemented yet.
removes all non-primary resources, so that only lexicon, grammar, and the configuration files remain.
Additional targets for testing are available; these all presuppose that huntools (see Huntools) is installed and that the executable hunmorph is found in the path. An alternative hunmorph can be used by setting the HUNMORPH option (see Executable Path Options).
tests the resources by making hunmorph read them (the dic and aff files) and analyze the contents of the file that is the value of TEST (see Input File Options). TEST is by default set to the standard input, so after saying
$ make test

you have to type in words in the terminal window (exiting with C-d).
If you want to test by analyzing a file, you have to set the value of TEST.
$ make TEST=my/favourite/testfile test

Test outputs go to stdout, so just pipe it to a file:
$ make TEST=my/favourite/testfile test > test.out 2> test.log
will run hunmorph on the wordlist file (see Resource Compilation Targets, generate) and output the result on the standard output (so you may want to pipe the result to a file).
puts hunlex and the analyzer to the test by creating the resources according to the settings of your makefile and then running hunmorph on the whole generated wordlist.
Warning: Note that this target first generates all words and then creates the resources again. Running this on huge databases is probably not a good idea. The way to test a bigger database instead is to create a set of words that your ideal analyzer has to recognize or correctly analyze, and test on that (with test). Realtest is just a quick and dirty shorthand for toy databases to check if everybody is with us.
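If you settle on such a fixed test word list, you can wire the cycle into your local Makefile as a custom target (the target name check and the file name testwords are hypothetical; this is just a sketch):

```makefile
# analyze a fixed word list with hunmorph via the test target;
# analyses go to test.out, debug messages to test.log
check:
	$(MAKE) TEST=testwords test > test.out 2> test.log
```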
Options of the toplevel are in effect Makefile variables that can be set at the user's will.
(All the command-line options of hunlex can be accessed through the toplevel: the options are passed on to hunlex to regulate the compilation process. The documentation of the command-line options is found in Command-line Control, but only for the record. Hunlex Makefile options are in all capital letters (LEXICON), while the corresponding command-line options begin with a dash and are in all small letters (-lexicon); otherwise they are the same.)
All options can be set or reset in your local Makefile (and remembered, see Storing your Settings). These will override the system defaults. Both the system defaults and your local defaults can be overridden by direct command-line variable assignments passed to make, such as:
$ make QUIET= DEBUG_LEVEL=3 OUTPUTDIR=/my/favourite/outputdir
Listed and explained below are all the hunlex options (all public Makefile variables) that the toplevel control provides for the user to manipulate.
When you see something like variable (value), it means that the default value of variable is value.
The hunlex executable is by default assumed to be found in the path under the name hunlex. By default, installation installs hunlex into /usr/local/bin (see Installation). If you want to use (i) an alternative version of hunlex that is not the one found in the path, (ii) an uninstalled version of hunlex, or (iii) an installed version whose directory you don't want to include in your path, then you should set which hunlex to use with this variable.
HUNLEX=/my/favourite/version/of/hunlex
You need the executable hunmorph from the Huntools package (see Huntools) only for testing, if you don't want to test with direct analysis (just want to compile the resources), you don't need to bother.
When used, however, the hunmorph executable is assumed to be found in the path with name hunmorph. If this is not the case, update your path or provide the path to hunmorph with the line
HUNMORPH=/my/favourite/version/of/hunmorph
Quiet mode is set by default which means that the workings of the Makefile toplevel won't bore you to death. The compilation debug messages that Hunlex blurps when running can still be displayed independently (see the DEBUG_LEVEL option below). The QUIET option only refers to what the toplevel wrapper invokes (this way of handling Makefile verbosity is an idea nicked from OCamlMakefile by Markus Mottl).
sets the verbosity of hunlex itself. By default debug level is set to 0. Debug messages are sensitive to the debug level in the range from 0 to 6-ish: the higher the number the more verbose hunlex is about its doings.
0 is non-verbose mode, which means that it only displays (fatal) error messages. If you set DEBUG_LEVEL to say -1, even error messages will be suppressed (only an uncaught exception will be reported in case of fatal errors).
It is typically a good idea to set DEBUG_LEVEL to 2 or 3 and request more if we really want to see what is happening.
Caveat: In fact you won't understand the messages anyway, so the debug blurps just give you an idea of the context where something went wrong with your grammar/lexicon, etc.

Todo: This shouldn't be so: debug messages pertaining to grammar development should be self-evident, well designed, and documented. Especially parsing errors and/or compile warnings about the grammar and lexicon should be clear.

Usually you want to create a log by piping the debug output of make (standard error), together with your debug messages, to a file. This can be done, for instance, by
$ make DEBUG_LEVEL=5 resources 2> log
By default, with every run of hunlex, the time it takes to compile the resources is measured (with the unix shell's time command) and this information is displayed. Surely, this is only interesting with big lexicons. If you (i) don't have a time command, (ii) have a different time command, or (iii) don't want time measured and displayed, just reset the TIME variable. The option can be unset by the line

TIME=

in your local Makefile.
The type and use of hunlex input resource files are described in detail elsewhere (see Input Resources). The options by which their locations can be (re)set are listed below:
They can all be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable GRAMMARDIR:
the directory for the hunlex primary input resource files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.
There are three further input resources which need to be present for a hunlex compilation. These are the compilation configuration files.
the configuration file (see Configuration Files) for morphophonological and morpho-orthographic features
There are two optional configuration files: the signature file and the flags file. By default, the options corresponding to these files are set to the empty string, which tells hunlex not to use feature structures (see Feature Structures) or custom output flags (see Flags).
The location of the signature file used to process and validate feature structures (see Feature Structures, see Configuration Files). If it is set to the empty string (the default), hunlex does not use feature structures.
If you use this file, it makes sense to call it something like fs.conf or signature.conf and store it in confdir with your other configuration files, so the assignment
SIGNATURE=$(CONFDIR)/fs.conf

is an appropriate setting.
The location of the custom output flags file (see Configuration Files) used to decide which flags are used in the output resources (see Flags). If it is set to the empty string (default), hunlex will use a built-in flagset to determine flaggable characters (see Flags).
If you use this file, it makes sense to call it something like flags.conf and store it in confdir with your other configuration files, so the assignment
FLAGS=$(CONFDIR)/flags.conf

is an appropriate setting.
All configuration files can be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable CONFDIR:
the directory for the hunlex compilation configuration files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.
As explained, all input files can be set to alternative paths individually, or the primary resources together and the configuration files together. If all input resources (primary and configuration) are in the same directory, this directory path can also be set via the variable INPUTDIR:
the directory for all hunlex input resource files, which is, by default, set to the current directory.
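For instance, a project that keeps its primary resources and its configuration files in separate subdirectories could set, in its local Makefile (the directory names input and conf are arbitrary):

```makefile
# primary resources (lexicon, grammar) under input/,
# configuration files (morph.conf, phono.conf, usage.conf) under conf/
GRAMMARDIR=input
CONFDIR=conf
```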
A special test file is only used with the Test targets:
The value of TEST is a file (well, a file descriptor, to be precise), the contents of which is tested whenever the toplevel test target is called (see Test Targets). By default it is set to the standard input, so testing with test will expect you to type in words in your terminal window.
Hunlex's output resources are the affix and the dictionary files (see Output Resources). The options by which their locations can be (re)set are listed below:
The wordlist generated by the generate target (see Resource Compilation Targets).
where outputdir (the default directory of the files) is the value of the variable OUTPUTDIR:
the directory for the hunlex output resource files, which is, by default, set to the current directory.
As you can see, the default setting is that all input and output files are located in the current directory under their recommended canonical names. Putting the output resources in the same directory as the primary resources might not be a good idea if you want to compile various types of output resources.
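For example, to dump differently configured output resources into separate target directories, you might say something like the following (the output paths are arbitrary, and the assumption is that STEMINFO accepts the values listed further below, such as NoSteminfo for spell-checker resources):

```
$ make OUTPUTDIR=out/analyzer new resources
$ make OUTPUTDIR=out/spell STEMINFO=NoSteminfo new resources
```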
if set, hunlex uses double flags (two-character flags) in the output resources (see Flags).
The following two options regulate the level of morphemes. You find more details about levels in a separate chapter (see Levels).
Morphemes of level below MIN_LEVEL are treated as lexical, i.e., are precompiled with the appropriate stems into the dictionary file. By default, only morphemes of level 0 or below are precompiled into the dictionary.
Morphemes with levels higher than the value of MAX_LEVEL are, on the other hand, treated as being on the same (non-lexical) level. By default, only morphemes of level above 10000 are treated as having the same level.
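For example, under the interpretation above, the following call treats morphemes up to level 1 as lexical and collapses everything above level 3 onto one level:

```
# precompile morphemes of level <= 1 into the dictionary;
# treat morphemes above level 3 as being on the same level
$ make MIN_LEVEL=1 MAX_LEVEL=3 new resources
```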
The options below regulate the format of output resources in detail:
determines the delimiter hunlex puts between individual tags of affixes when tags are merged.
This is interesting if you have a tagging scheme where a morpheme is tagged with a label like MORPH1, but in the output you want the labels clearly delimited, like:
wordtoanalyze > lemma_MORPH1_MORPH2

The above is possible if you set
TAG_DELIM='_'
sets the delimiter to put between the fields of the affix and dictionary files, respectively. By default it is set to a single space for the affix file and set to <TAB> in the dictionary.
NB: A tab might allow better postprocessing in the affix file and would even allow spaces in the tags, which might be useful. At the time of writing, the huntools reader only allowed a TAB, not a space, as the delimiter in the dictionary file, so change this with caution.
the major output mode regulates what information gets output in the affix and dictionary files and how affix entries are conflated.
Warning: This option is not effective at the moment due to the lack of a clear functional specification, and it is also unclear how this option should interact with the option STEMINFO (below).

Todo: Clarify this. See warning.

The possible values at the moment are:
all without effect (see warning).
- Spellchecker
- Stemmer
- Analyzer
- NoMode
regulates what info the analyzer should output about a word.
This option can take the following values:
- Tag only output the tag of a lexical stem (to output the pos tag of the stem)
- Lemma only output the lemma of a lexical stem (for stemmers doing lexical indexing)
- Stem output the stem allomorph of the stem (e.g., for counting stem variant occurrences?)
- LemmaWithTag output lemma with the tag (default, for morphological analysis)
- StemWithTag output stem (allomorph) with the tag (?)
- NoSteminfo no output for the dictionary (for spell-checker resources).
regulates whether feature structure annotations (see Feature Structures) should be output along with the normal (string type) tags (see Tags). This is extremely useful for debugging purposes. If the manually supplied tag chunks are supposed to yield well-formed feature structures in the output annotation of the analyzer, it is a good idea to check whether this is the case. If this option is set to -fs_info (the corresponding command-line option), the feature structures resulting from unification are output along with the tags in the dictionary and the affix file. Typically, this option is used with the generate target (see Resource Compilation Targets) and the second and the third columns of the dictionary file are compared (they are supposed to be identical).
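A quick way to perform that comparison, assuming the default dictionary file name and the default <TAB> field delimiter described above (both are assumptions about your settings), is to list the lines where the two columns disagree:

```
$ make generate
$ awk -F'\t' '$2 != $3' dictionary.dic
```

Any line printed indicates a mismatch between the string tag and the feature structure.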
Todo: This process should be added to the set of toplevel test targets.
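A spot-check in the spirit of the comparison just described might look as follows. The file name and the tab-separated three-column layout (word, tag, feature structure) are assumptions for illustration only:

```shell
# Hypothetical check: with fs_info set, the tag (field 2) and the feature
# structure (field 3) of each dictionary line are supposed to be identical;
# lines where they differ are suspects.
printf 'go\t[VERB]\t[VERB]\nwent\t[VERB]\t[PAST]\n' > /tmp/dictionary.dic
awk -F'\t' '$2 != $3 { print NR ": " $0 }' /tmp/dictionary.dic
```

Only the second line, where the two columns disagree, is reported.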
The affix file specifies a lot of variables to be read by the morphbase routines. Some of these are metadata but some are crucial for suggestions and accent replacement for automatic error correction, see below.
Warning: This part is a disastrously underdeveloped part of hunlex and an outrageously ad-hoc part of morphbase as well.
Preambles can be generated into the hunlex output files; these are meant to be 'official' comment headers containing copyright information, etc.
are the files to be included as preambles in the affix and dictionary output resources, respectively. By default, they are unset, i.e., no preambles will be included in the output resources.
NB: This feature is only available on toplevel control and will never be integral part of the hunlex executable.
These are the character-conversion table and replacement table to be included into morphbase resources if alternatives (e.g., for spellchecking) or robust error correction is required (see Huntools). These features are documented in the huntools documentation (hopefully, but certainly not here, see Output Resources, see Huntools).
NB: This feature of including these extra files into the affix file is only available through toplevel control and will never be integral part of the hunlex executable.
Identifies the character-set for the analyzer reading the affix file. By default, this is set to ISO8859-2, i.e., Eastern European. Maybe this is the 'hun' in hunlex...
is the file from which settings for some affix variables are read. If it doesn't exist, no affix variables other than the directly managed ones are dumped into the affix file.
Todo: Need to sort these things out.
Some affix file variables are managed by hunlex internally but dumped to the affix file by the toplevel routines.
Todo: This is done at the moment by the toplevel Makefile, but should be integrated into the hunlex executable itself.
These two flags will be attached to (i) bound stems and (ii) affix entries which can not be stripped first (i.e., suffixes which cannot end a word, see Flags).
If this flag is present, it indicates to the stemmer/analyzer whether the stem string is to be output as part of the annotation. For instance (if the STEM_GIVEN flag is 'x'), the following dic file
go/ [VERB]
went/x go[VERB]

will result in the following stemming:
> go
go[VERB]
> went
go[VERB]

This makes more compact dictionaries. What information one wants the stemmer and analyzer to output can be configured through hunlex options (see below).
Todo: This flag is not implemented yet (it is not implemented in morphbase either), and probably never will be, since the treatment of special flags shouldn't be user-customizable beyond the choice of flaggable characters.
Warning: Make sure the flags given here are consistent with the double flags option and the custom flags file (see the FLAGS variable above, and see Flags). These options are superfluous and should be automatically managed by hunlex, which would write them into the affix file. Very likely to be deprecated soon.
Todo: This needs to be implemented.
Warning: There are additional settings that are to be included in the affix file and are a crucial part of the resources (some should be set by hunlex itself), such as compound flags. I have no idea what to do with these at the moment. The ones I know of are listed here just for the record.
Todo: This needs to be sorted out.
Some of these data are actually global and could even go to the settings preamble (AFF_SETTINGS):
These ones should be dynamic metadata
The ones below should clearly be controlled and output by hunlex itself (also ONLYROOT and FORBIDDENWORD, but those are at least handled by the toplevel).
Ones relating to compounding (compounding is handled very differently by myspell, morphbase and jmorph):
Warning: Compounding is as yet unsupported by hunlex and should be worked on with high priority.
I have really no idea about the following ones:
This chapter is about the framework that allows you to describe the morphology and lexicon of a language. Below we specify the syntax and semantics of this description language. The files written in this language (the lexicon and grammar) are the primary resources of hunlex (see Input Resources) and the basis for all compiled output (how this works is described in another chapter, see Toplevel Control).
There are three kinds of statement in this language:
Only the grammar file can contain macro definitions (see Macros) and metadata definitions (see Metadata), while both the lexicon and the grammar file can contain morph definitions, which describe morphological units (affix morphemes, lexemes and their paradigms). In this respect, the syntax of the lexicon and grammar files is identical and is therefore discussed together (see Morphs) rather than separately, although the usefulness (and sometimes even the semantics) of certain expressions might differ between the lexicon and the grammar.
Morphs are the central entities in the description language. They stand for morphological units of any size and abstractness including affix morphemes, lexemes, paradigms, etc. and are not what linguists call morphs (i.e., a particular occurrence of one morpheme). Morphs are meant to describe an affix morpheme or a lexeme, but in fact, it is up to you what level of abstractness you find useful in your grammar, so you can have individual morphs describing each allomorph of a morpheme or each stem variant of a lexeme. But the point is that morphs support description of variants or allomorphs. Anyway, a morph is basically a collection of rules, variants, etc. that somehow belong together. Ideally, a variant of an affix morpheme is actually an affix allomorph, a concrete affixation rule, while a variant of a lexeme is a stem variant or an exceptional form of the lexeme's paradigm.
A morph statement is introduced by an optional MORPH: keyword. It is a good idea to drop it and start the statement directly with the preamble (in fact, the name of the morph), which is compulsory.
A morph description has a preamble, i.e., a header describing the global properties of the morph, the properties which characterize all of its variants/allomorphs.
After the preamble, one finds the variants one after the other. The preamble and the variants are delimited by a comma.
Finally, the morph definition like all other statements is closed by a semicolon.
The preamble starts with the name of the morph. The name of the morph can be any arbitrary id, a mnemonic string that ideally uniquely identifies the morph. Referring to other morphs is an important part of describing how morphemes can be combined: in order for these references to be reliable, the names in the grammar are supposed to be unique. This is not important in the lexicon, where homophonous lemmas can have identical names (however, this is not recommended, since, in such a case, for instance, morphological synthesis would be unable to distinguish two senses, especially if they are of the same morphosyntactic category).
The rest of the preamble as well as each individual variant is composed of blocks. Blocks are the ingredients of the description, they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.
In sum, then, morphs have the following structure:
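A sketch of this overall shape, reconstructed from the description above (BLOCK stands for any keyword-led block described below; the MORPH: keyword is optional):

```
[MORPH:] name BLOCK ... BLOCK ,
    BLOCK ... BLOCK ,
    BLOCK ... BLOCK
;
```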
Blocks are explained in detail in the next subsection.
Blocks are the ingredients of the description, they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.
Blocks all have a leading keyword followed by some expressions (arguments) and last till the next keyword or the end of the variant:
Blocks can come in any order within a variant and can be repeated any number of times. So writing
KEYWORD: argument0 argument1 argument2 ...
has the same effect as when it is written like
KEYWORD: argument0 KEYWORD: argument1 KEYWORD: argument2 ...
or even
KEYWORD: argument0 SOME-OTHER-BLOCKS KEYWORD: argument1 SOME-OTHER-BLOCKS KEYWORD: argument2 ...
or when it is 'included' with a macro (see Macros).
Certain blocks specify information in a cumulative way, so every time they are specified the information is added to the info specified so far. For instance an IF block is cumulative, all the arguments of all the IF blocks of a variant cumulate to give the conditions of rule application, i.e., the rule applies only if all conditions on features are satisfied by the input (see IF block below).
However, other blocks do not specify information that can be interpreted cumulatively, so it does not make sense to have more than one argument with them or specify them more than once for a variant. (They, however, may still be specified in the preamble and overriden in a variant, for instance).
In every case, out of contradictory information, the one given last "has the last word", overriding previous ones.
So if you write
CLIP: 1 CLIP: 2
it is the same as
CLIP: 2
In what follows, blocks are listed and explained one by one.
default morphs are used to assign features to inputs unspecified for some features. A morph with a default block just adds extra rules that leave alone inputs which are specified for any of the features to be defaulted. The variants of a morph having a default block in their preamble will assume that neither of the features to be defaulted is present in the input.
So

morph DEFAULT: feature0 feature1 , MATCH: x OUT: feature0 ;
is equivalent to
morph ,
IF: !feature0 feature1 OUT: feature1 ,
IF: feature0 !feature1 OUT: feature0 ,
IF: feature0 feature1 OUT: feature0 feature1 ,
IF: !feature0 !feature1 MATCH: x OUT: feature0
;
Filters typically want to pass on their whole input by default.
This block defines the actual affix or lexis.
The exact shape of the variant determines what type of affix or lexis the variant describes:
- +aff describes a suffix: when the rule applies, aff is appended to the end of the input (after possibly clipping some characters)
- aff+ describes a prefix: when the rule applies, aff is prepended to the beginning of the input (after possibly clipping some characters)
- pref+suff describes a circumfix: when the rule applies, pref is prepended to the beginning of the input and suff is appended to the end of the input (after possibly clipping some characters at either end)
- lexis defines a lexis. This is typically used in the lexicon and serves as input to the rules. If the VARIANT keyword is left out, the lexis has to come as the first block of the rule (after the comma closing the preamble or the preceding rule).
If a lexis is used in the grammar, it is meant to stand for a suppletive form. Since it may well be a typo, a warning is given. We encourage the policy of putting suppletive paradigmatic exceptions as variants of the lexeme in the lexicon file, especially since matches are ineffective for lexis rules; conditions on the suppletion should therefore be expressed with features, which is much safer anyway.
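For instance, a hypothetical English-style plural suffix morph with two allomorphic variants might be written along these lines (the morph name and the feature names are invented for illustration):

```
PLUR OUT: plur ,
    +s  IF: !sibilant ,
    +es IF: sibilant
;
```

Here the OUT block in the preamble characterizes both variants, while each variant's IF block restricts where its allomorph applies.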
All the lexis and affix strings can contain any character except the hash mark, whitespace, semicolon, comma, exclamation mark, slash, plus sign and tilde: [^'#' ' ' '\t' '\n' ';' ',' '\r' '!' '/' '+' '~']
Todo: There should be a way to allow escapes.
Substitutions (which are special kind of rules) are specified by REPLACE/WITH blocks.
This block specifies the number of characters that need to be clipped from one end of the input.
It has no effect if the variant is a lexis or a substitution, so you don't use this block in the lexicon.
If no CLIP block is given, no characters are clipped (the integer defaults to zero).
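For instance, a hypothetical suffix variant that strips a word-final y before appending ies could be written as (feature name invented):

```
, +ies CLIP: 1 MATCH: y$ OUT: plur
```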
These blocks specify a substitution.
pattern is a hunlex regular expression.
template is a replacement string which can contain the special symbols '\1', '\2', etc., which reference the bracketed subpatterns in pattern.
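A hypothetical substitution rewriting a word-final ay as ai might look like this (the grouping-bracket syntax of hunlex regular expressions is assumed here for illustration):

```
, REPLACE: (a)y$ WITH: \1i
```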
specifies a match condition on rule application. The rule only applies if the input matches pattern, which is a hunlex regular expression. So you don't use this in the lexicon.
The matched expression defines a match at the edge of the word, the beginning for prefixes and the end for suffixes. You may include special symbols like ^ and $, to make this more explicit.
Match blocks are non-cumulative, but circumfixes allow two matches (one beginning with a ^ and one ending in a $).
If blocks specify the conditions of rule application. Conditions are either positive conditions (feature name) or negative conditions (NOT feature-name).
The rule only applies if the input has the positive features specified in the IF blocks and doesn't have the negative features specified in the IF block.
IF blocks are therefore cumulative and the conditions are understood conjunctively.
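For instance (feature names hypothetical), a variant restricted to back-harmony, non-sibilant inputs, with the cumulative conditions spread over two IF blocks:

```
, +ban IF: back IF: !sibilant OUT: iness
```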
specify the output of the variant (affix rule or lexis). An output can be a feature or a morph.
Features can be restricted to particular morphs.
specifies usage qualifiers describing the variant.
Cumulative (conjunctive)
tells that the morph in question is a filter which defines fallback rules for lexical features.
This means that the variants are meant to apply only if the input has none of the filtered features.
Has no effect within individual variants or in the lexicon. Only relevant in a morph preamble in the grammar.
Cumulative (conjunctive on the rule conditions)
Defines 'inheritance' of features: a feature mentioned in the KEEP block is an output feature of the result of rule application if and only if the input has the feature (provided, of course, that the particular variant applies to the input).
If output features and keep features overlap, output features are meant to override inheritance.
Features which are restricted by the input condition (IF block) are inherited normally, but since they are known, can also be mentioned in the OUT block for clarity.
NB: The thingies following KEEP in a KEEP block are features. They cannot be macro names. Don't trick yourself by 'abbreviating' a sequence of phono-features with a macro and then referring to that in a KEEP block. Don't forget that macros abbreviate (a series of) blocks, so clearly they can't be nested within a KEEP block.
Cumulative
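For instance (feature names hypothetical), a derivational suffix that passes the input's back-harmony feature on to its output while adding its own feature:

```
, +sag KEEP: back OUT: nominal
```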
specifies if the rule application gives a full form. For bound stems or non-closing affixes, it has to be set to false.
By default, variants in the lexicon are not free, while variants in the grammar are free. Todo: verify that this is correct.
specifies the feature structure graph to be merged when the rule applies. feature-structure is a KR-style feature structure description string.
defines a macro named macro-name. Later (any time after this definition), whenever macro-name is encountered, it is understood as if it read blocks. blocks is a sequence of any blocks, including (other) macro-names. A macro-name appearing anywhere other than its definition has to be already defined.
Todo: What happens if a macro-name coincides with a declared morph name? Or with a declared feature?
binds regexp-name to a hunlex regular expression, i.e., a regular expression that can contain regular expression macro-names in angle-brackets. regexp-name can be referenced within any regular expression later. An expression is resolved by replacing the substring <regexp-name> with the resolved regexp.
This means that you have to escape your <-s and >-s if they do not delimit regexp names.
As said, you can define regexp macros using other macros; the only restriction is that at the time of use a regexp-name has to be defined already (its definition should come earlier in the file), so that it can be resolved at the time the definition is read.
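For instance, once a regexp macro named vowel has been bound to a pattern like [aeiou], it can be reused inside any later pattern via angle brackets (example hypothetical):

```
MATCH: <vowel>t$
```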
There are various files that hunlex processes. Input as well as output files are described in this chapter. The file names used in this section are just nicknames (which happen to be the default filenames assumed) and can be changed at will by setting toplevel options (see Options).
There are several types of files hunlex considers and they will all be discussed in turn.
Lexicon and grammar are the two files which are considered the primary resources. These files contain the description of the language's morphology with all the rules for affixation, lexical entries, specifying morphological output annotation (tags), etc., see Primary Resources. Secondly, there are configuration files, which declare the morphemes and features that are considered active by hunlex for a particular compilation. By choosing and adjusting parameters of these features, one can manipulate under- and over-generation of the analyzer (see Resource Compilation Options) and, most importantly, regulate which affixes are merged together to yield the affix-cluster rules dumped into the affix file. The way affixes are merged is crucial for the efficiency of real-time analyzers (see Levels). These files are also described below (see Configuration Files).
Primary resources are the files that you are supposed to develop, maintain, extend and that describe your morphology (see Motivation). There are two primary resources: the grammar and the lexicon. These files are described below.
The lexicon file (the file name of which is lexicon by default, but can be set through options, see Input File Options) is the repository of lexical entries, containing information about:
The syntax of the lexicon file is basically the same as that of the grammar, except that it cannot contain macro definitions (see Macros). This syntax of describing morphology is explained in detail in another chapter (see Description Language).
For examples of lexicons, have a look at the zillion examples in the Examples directory that comes with the distribution (see Installed Files).
The grammar file is the other primary resource and also absolutely necessary to describe the morphology of your language. Its name is grammar by default but can be changed by setting toplevel options (see Input File Options). The grammar file specifies:
The syntax of the grammar file is the same as the one used for the lexicon except that the grammar file can contain macro definitions (see Macros). The syntax and semantics of this description language is explained in detail in another chapter, see Description Language.
For examples of grammar files, have a look at the zillion examples in the Examples directory that comes with the distribution (see Installed Files).
Configuration files are the files which mediate between primary resources describing a language and a particular resource created for a particular method, routine, application.
There are three configuration files which tell hunlex which units and features should be included into the output resource from among the ones mentioned in the primary resources. The units (morphemes, features) not declared in these configuration files are considered ineffective by hunlex while reading the primary resources.
The format of these three definition files is the same: each declares one unit per line (with some parameters) and accepts comments starting with '#' and lasting till the end of the line.
They are discussed in turn below.
The morph.conf file is one of the compilation configuration files that determine how hunlex compiles its output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
It declares the affix morphemes and the filters that are to be used from among the ones that are in the grammar.
Warning: the affix morphemes not listed (or commented out) in this file are ineffective for the compilation (as if they were not in the grammar).
Each line in this file contains the affix morpheme's name and optionally a second field, which gives the level of the morpheme. If no level is given, the affix is assumed to be of level maximum_level (the value of the option MAX_LEVEL, see Resource Compilation Options). Very briefly, levels regulate which affixes will be merged with which other affixes to yield the affix clusters that are dumped as affix rules into the affix file. The odds and ends of levels are described in detail in another chapter (see Levels).
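A morph.conf along these lines might look as follows (the morph names and levels are invented for illustration):

```
# affix morphemes and their levels
PLUR    1
INESS   1
DIMIN   0     # below MIN_LEVEL: precompiled into the dictionary
SUPERL        # no level given: defaults to MAX_LEVEL
```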
For examples of the rather dull morph.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
If you have a grammar and you want to declare all the (undeclared) morphs defined in it by including them in morph.conf, all you have to do is type
make DEBUG_LEVEL=1 new resources 2>&1 | grep '(morph skipped)' | cut -d' ' -f1 >> in/morph.conf
in the directory where your local Makefile resides. This will append all the undeclared morphs (one per line) to the morph.conf file. Note, the morphs so declared will be of level maximum_level (see above).
The phono.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
The phono.conf file is the file simply listing all the features that we want used from among the ones used in the grammar and the lexicon. Very briefly, features are attributes of affixes and lexical entries the presence or absence of which can be a condition on applying an affix rule.
Warning: Features used in the grammar but not mentioned (or commented out) in the phono.conf file will be ignored (as if they were never there) for the present compilation by hunlex when reading the primary resources.
Warning: Features mentioned in phono.conf but never used in the grammar or the lexicon are allowed; they maybe should generate a warning, but they don't. This may cause a lot of trouble.
So, phono.conf simply declares the features, one on each line, and allows the usual comments (with a '#').
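A minimal phono.conf in this format might look like this (feature names invented):

```
# active features, one per line
sibilant
back       # back vowel harmony
```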
For examples of phono.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
If you have a grammar and you want to declare all the (undeclared) features referred to in conditions in the grammar by including them in phono.conf, all you have to do is type
make DEBUG_LEVEL=1 new resources 2>&1 | grep '(feature skipped)' | cut -d' ' -f1 | sort -u >> in/phono.conf
The usage.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
usage.conf in particular determines which usage qualifiers are allowed for the input units (lexical entries, affixes, filters and the variants thereof) that are included into the resource to be compiled. Units having a usage qualifier that is not listed in this file are ignored for the compilation (as if they were not there).
NB: Usage qualifiers are not first class features. They can not be negated or used as conditions on rule application. They are simply used to categorize rules (affixes and stems) in certain dimensions such as etymology, register, usage domain, normative status, formality, etc.
In addition to declaring allowed usage qualifiers, this file has another function as well. Each line containing the usage qualifier may contain a second field which is a tag associated with that usage feature. If this field is missing, the name of the usage qualifier string is assumed to be its tag. Usage qualifier tags can be output by the analyzer if they are compiled into the resources by hunlex.
This can be configured with the output info option (see Resource Compilation Options).
Warning: This option is not implemented yet.
Todo: This is not implemented yet. I don't even know if this is fine like this. The problem is that they cannot really be just intermixed with the ordinary morphological tags.
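A hypothetical usage.conf illustrating both the one-field and the two-field form described above:

```
# qualifier [tag]
archaic  arch    # qualifier 'archaic', tagged as 'arch'
slang            # tag defaults to 'slang'
```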
Various dimensions of usage information can be made effective by introducing expressions with arbitrary leading keywords (see Description Language). Redefining each of the wanted usage dimensions in the parsing_common.ml file will result in making any one or more of them effective as usage qualifiers. The point is that you can keep a lot of information in the same lexical database. When the keywords it contains are hunlex-ineffective, the expressions they lead are simply ignored.
Caveat: At the moment, for these alternatives, you have to recompile hunlex, with the new keyword associations, see Description Language.
Todo: This could be done online but has very low priority.
For examples of usage.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
The output of a hunlex resource compilation is an affix file and a dictionary file. In brief, the affix file contains the description of the affix (cluster) rules of the language we analyze, while the dictionary contains the stems the affix rules can apply to. They have more or less the same role as the grammar and lexicon files, the primary resources of hunlex (see Primary Resources). But the affix and dictionary files are resources that are used by real-time word-analysis routines (such as morphbase, myspell or jmorph, see Related Software and Resources). They share commonalities of format, with minor idiosyncrasies, some of which are still changing.
Hunlex reads a transparent human-maintainable non-redundant morphological grammar description with the lexicon of a language and creates affix and dictionary files tailored to your needs (see Introduction). The ultimate purpose of hunlex is that these output resource files could at last be considered a binary-like secondary (automatically compiled) format, not a primary (maintained) lexical resource.
Therefore the technical specification of these output formats should only concern you here if you want to compile affix and dictionary files for your own (or modified versions of our own) word-analysis software which also reads the aff/dic files. In such a case, however, you know that format better than I do. All I can say is that the parameters along which the format can be manipulated are supposed to conform with the formats of the software listed in Software that can use the output of Hunlex as input. If you develop some such software as well and would like your format to be supported, take a deep breath and consider requesting a feature from the authors (see Requesting a New Feature).
In sum, the format of these output resource files are not detailed. Anyway, they are (probably) well documented elsewhere (e.g., myspell manual page). See especially the documentation of huntools and the morphbase library (see Huntools).
This chapter is a verbatim include of the hunlex manpage. Command-line control is not the recommended interface to use hunlex, see toplevel control (see Toplevel Control).
HUNLEX(1)                        User Commands                        HUNLEX(1)

NAME
       hunlex - manual page for hunlex 0.3

SYNOPSIS
       hunlex <options>

DESCRIPTION
       Options: (for more see manpage)

       option           description (default settings)
       ------------------------------------------
       -synthesis       morphological synthesis (no)
       -synth_in        synthesize from file (stdin)
       -synth_out       synthesize to file (stdout)
       -lexicon         lexicon (lexicon)
       -grammar         morphological grammar file (grammar)
       -phono           phono features file (phono.include)
       -morph           morph declarations and levels file (morph.include)
       -usage           usage qualifiers and their tags (usage.include)
       -signature       feature structure signature (None)
       -aff             output affix file (affix.aff)
       -dic             output dictionary file (dictionary.dic)
       -mode            output mode [Spellchecker|Stemmer|Analyzer|NoMode] (NoMode)
       -steminfo        info to output about a word's stem
                        [Tag|Lemma|Stem|LemmaWithTag|StemWithTag|NoSteminfo] (NoSteminfo)
       -fs_info         output fs in the dictionary for testing purposes (no)
       -tag_delim       tag delimiter ('_')
       -out_delim       output delimiter (<space>)
       -out_delim_dic   output delimiter for dic file (<tab>)
       -double_flags    [0-9][^0-9] type double-char flags (no, single char)
       -flags           legitimate flag characters file (none, use predefined flags)
       -min_level       minimum morph level (0)
       -max_level       maximum morph level (1000)
       -debug_level     debug level (0)
       -test            testable output (0)
       --version        Display version info and exit
       -help            Display this list of options
       --help           Display this list of options

SEE ALSO
       The full documentation for hunlex is maintained as a Texinfo manual. If
       the info and hunlex programs are properly installed at your site, the
       command

              info hunlex

       should give you access to the complete manual.

hunlex 0.3                        May 2005                            HUNLEX(1)
Levels index morphemes and are assigned to morphemes in the morph.conf file (see Morpheme Configuration File).
Levels govern which affixes will be merged together into complex affixes (or affix clusters) and will constitute an affix rule (linguistically correctly, and affix-cluster rule) in the output affix file (see Output Resources). Affix rules in the affix file will be stripped from the analyzed words by the analysis routines in one step (i.e., by one rule-application).
Levels, then, regulate the output resources of hunlex and have no role to play in how you design your grammars. There are no levels in the hunlex grammar and lexicon, the files which describe the morphology of the language (see Primary Resources). Levels make sense only in relation to the compilation process.
This chapter describes why you would want levels, how you manipulate them and what consequences it has on analysis.
Imagine a word has several affixes like dalokban (= dal 'song' + ok 'plural' + ban 'inessive'). Assume that your hunlex grammar correctly describes the plural and inessive morphemes and their combination rules. If you assign these morphemes to different levels, the output resource will contain affix rules expressing the morphemes separately. This means that these affixes are not stripped in one go by the analysis routines using the affix file as their resource.
Some affixes, however, may need to be stripped as a cluster in one go, because some analysis algorithms do not allow an arbitrary number of consecutive affix-stripping operations, or because stripping them in one go is just more optimal for your purposes (see Levels and Optimizing Performance). Therefore the separate affix rules in the input grammar should be merged when they are dumped by hunlex as rules into the affix file. Well, levels regulate which morphemes should be merged with which other morphemes. (To be more precise, they regulate which affix rules expressing which morphemes should be merged with which other affix rules expressing which other morphemes.)
Since merged affix rules are highly redundant and tedious to maintain, one of the main purposes of hunlex is actually to allow for high flexibility in your choice of merging affixes to create resources optimized for your needs, while at the same time also allow for transparent and non-redundant description for easy maintenance and scalability (see Introduction).
Levels do not only regulate which affixes are compiled into one affix cluster (an affix rule in the output affix file, see Levels and Affix Rules). They also determine which stems are precompiled into the dictionary (see Output Resources). In particular, all affixes below a so called minimal lexical level (see Levels and Ordering) are precompiled with the stems of the lexicon into the output dictionary.
For instance, taking the example of the previous section, if both the plural and the inessive morpheme are below the minimal level (of on-line-ness), the whole morphologically complex word dalokban will be included in the dictionary file. To learn why you would want to do such a thing, see also Manipulating Levels with Options.
The word 'level' is actually rather misleading, since the notion of level we have here has only a very restricted sense of ordering. There is no sense in which (rules expressing) a morpheme of level i can not be applied after (rules expressing) another morpheme of level j where i > j.
There is a sense in which levels do have ordering, however. There is always a minimal level (that is, the value of the MIN_LEVEL option, see Resource Compilation Options) below which all morphemes are compiled into the dictionary (i.e., they are merged with the absolute stems in the lexicon and dumped as stems into the dictionary). The default lexical level is 1, meaning that (affix rules expressing) morphemes of level 0 or less are merged with the appropriate stems and the resulting (morphologically complex) words will be entries in the dictionary file (see Levels and Stems).
Since the dictionary file entries are the 'practical' stems of the analysis routines, configuring the level of morphemes gives you the option to adjust the depth of stemming. For instance, if you do not want your stemmer to analyze some derivational affix (which you otherwise describe productively with a rule in the grammar), all you have to do is assign a lexical level to this morpheme in morph.conf. Recompiling with this configuration will result in resources with the corresponding entries precompiled into the dictionary file.
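For illustration, here is a hypothetical morph.conf fragment (each line is a morpheme name followed by its level, separated by a single space; the morpheme names are invented):

```
PLUR 2
INE 2
DIMIN 0
```

With the default minimal level of 1, the rules expressing DIMIN would be merged with the appropriate stems and precompiled into the dictionary, while PLUR and INE would remain affix rules in the affix file.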
See also the MAX_LEVEL option, see Resource Compilation Options.
You don't always have to fiddle manually with assigning alternative levels to each morpheme. For some special cases, hunlex provides an option. For example, it is very common to want to generate all the words your grammar accepts. All you have to do is set the minimal level, via the MIN_LEVEL option (see Resource Compilation Options), to a value higher than any of the levels you have assigned to morphemes in the morph.conf file (see Morpheme Configuration File). This tells hunlex that the rules expressing all the morphemes are to be compiled into the dictionary, which amounts to deriving all the words of the language. This functionality is also provided as the generate toplevel target (see Targets); in fact
make generate
is just a shorthand for
make MIN_LEVEL=100000 new resources
In order to create an output in which no two affix rules are merged, it is enough to assign every morpheme to a different level, for instance by using the following unix shell commands:
$ cp morph.conf morph.conf.orig
$ cut -d' ' -f1 morph.conf.orig | nl -nln -s' ' | sed 's/\(.*\) \(.*\)$/\2 \1/g' > morph.conf
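As a sketch of what this pipeline does, here it is run on a made-up three-line morph.conf (the morpheme names are invented): every morpheme's level is replaced by its line number, so all levels come out distinct.

```shell
# create a toy morph.conf (format: morpheme name, space, level)
printf 'PLUR 2\nINE 2\nDIMIN 2\n' > morph.conf
cp morph.conf morph.conf.orig
# number the morpheme names and swap the columns, so each morpheme
# gets its line number as a unique level
cut -d' ' -f1 morph.conf.orig | nl -nln -s' ' \
  | sed 's/\(.*\) \(.*\)$/\2 \1/g' > morph.conf
# first field is the morpheme, second its new, unique level
awk '{print $1, $2}' morph.conf
```

The result assigns PLUR level 1, INE level 2 and DIMIN level 3, so no two affix rules end up merged.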
Todo: I should provide an option that does this.
With a routine that supports any number of affix-stripping operations, such a resource will allow correct analysis; with routines that allow only a bounded number of rule applications, it will not.
Todo: write on recursion
Geeky note: If rule application monotonically increases the size of the input, potential recursion is never unbounded recursion, since all analysis routines have a fixed buffer size anyway. If, however, empty strings or clippings make rule application non-monotonic in size, potential recursion may cause actual infinite loops in some incautious implementations. Boundedness of recursion due to buffer-size restrictions is only one sense in which the full intended (implied) generative power of an arbitrary hunlex grammar is not reflected in the analyzer's actual analysis potential.
For myspell style resources where you want only one stage of affix stripping, you should use one lexical and one non-lexical level. Without having to create your alternative morph.conf file, this can easily be done with the combination of the MIN_LEVEL and the MAX_LEVEL options (see Options).
You just set these two options to the same value l; all morphemes with level equal to or smaller than l will then be compiled into the dictionary, and all the other morphemes (i.e., affix morphemes with level greater than l) will be merged into clusters (and these affixes will be dumped to the affix file as rules). Implementations like myspell (see Myspell), which allow only one step of suffix stripping, can only run correctly with such resources.
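For instance, assuming a lexical cutoff level of 1 and the same Makefile-variable mechanism as in the generate shorthand above, such a one-stage resource could be compiled with something like:

```
make MIN_LEVEL=1 MAX_LEVEL=1 new resources
```

(The value 1 and the target name here merely mirror the earlier make example; substitute whatever cutoff the levels in your morph.conf call for.)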
Todo: This is slightly more complicated because prefixes and suffixes are stripped separately. We should clarify this. And this whole myspell business is actually not tested.
Myspell supports only one stage of affix stripping, the morphbase routines support two, and jmorph supports any number (it is truly recursive).
With an affix file in which there are separate affix rules for these affixes, the analyzer would have to perform two suffix-stripping operations to recognize the word dalokban. Using such a resource, myspell will therefore not recognize this word at all. The morphbase routines will be able to analyze it, since they allow two stages of suffix stripping, which is just enough; so will jmorph, since it allows any number of suffix-stripping steps.
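The difference can be sketched with plain shell suffix-stripping (a toy simulation, not the real analyzers): with separate rules for -ok and -ban, reaching the stem dal from dalokban takes two stripping steps, while a merged -okban cluster rule takes one.

```shell
word=dalokban
# separate rules: two consecutive stripping operations
step1=${word%ban}            # dalokban -> dalok
step2=${step1%ok}            # dalok    -> dal
echo "two steps: $step2"
# merged cluster rule: a single stripping operation
echo "one step: ${word%okban}"
```

A one-stage analyzer like myspell only ever performs the second, single-step variant, which is why it needs the merged rule to recognize the word.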
So, when you configure which affixes are merged, make sure you have considered the generative capacity of the target analysis routine (how many suffix strippings it can make).
What to merge and precompile into the dictionary?
Which affix rules you want to merge and precompile into clusters is entirely up to you and usually a question of optimization. If you choose not to precompile anything, then your affix file will be small, but your analysis may not be optimal for runtime (if it generates the correct analyses at all, see Levels and Steps of Affix Stripping).
If, on the other hand, you precompile all affixes into affix clusters, you might end up with an affix file of hundreds of megabytes, which will compromise the memory load of your runtime analysis (though it may be faster for the analysis algorithm than recursive calls). This last realization led the author of hunspell (see Huntools) to introduce a second step of suffix stripping into the algorithm, whose single level of affix stripping was a legacy of the original myspell code.
Finally, compiling everything into the lexicon is not a very good idea for complex morphologies and big lexicons, although it may be indispensable for testing on smaller fragments of lexicons/grammars or for creating wordlists (see Test Targets).
Some special affix rules should always be precompiled into the dictionary and never output as affix rules. These are the rules that cannot be interpreted as affix rules at all, for instance, rules of substitution or suppletion, which are beyond the descriptive capacity of affix files. Therefore all substitutions and suppletions are precompiled (merged with the rules or stems they can be applied to) irrespective of their level. Find more about this.
We call the information that a morphological analyzer is expected to output for an analyzed word a piece of morphological annotation. In more general terms, however, when we talk about any kind of word-analysis routine, such as a spell-checker or stemmer, we call the output information these routines associate with words tags. We want to emphasize here that this piece of output information 'tags' the whole word that is analyzed. The tag is used to annotate words in a corpus by decorating a raw text with useful extra information.
NB: Tagging as we use the term in no way implies a constituent structure, segmentation, etc. of the input word form.
This document describes the ways in which you can associate tags with your morphemes (or with individual stem variants and affix rules). These tags should be thought of as ingredients of the output tag that an analyzed word containing that morpheme receives. Certainly, not all analysis software can, or is supposed to, output any useful information about the morphological makeup of the word. For instance, a spell-checker is typically required only to recognize whether a word is correct (usually in a strict normative sense), but a morphological analyzer or a stemmer is supposed to output some information. Since the huntools routines are able to perform full morphological analysis, not just recognition (REFERENCE), adding morphological tags to your rules is worth your while. Nevertheless, if you never want to output any useful information (because you only care about spell-checking), you don't really need to read on.
The output tag associated with a successful analysis of a word is defined, rather primitively, as the concatenation of the tags assigned to the rules and the stem which constituted the parse of the word.
NB: It is not clear whether the order of prefixes, stems and suffixes should matter in some cases. In the usual case we assume that the analyzer concatenates the tags in the order of affix-rule stripping.
Todo: What the analyzers do with the tags should be clarified. In fact, both huntools and jmorph do something that is smarter for particular purposes but not reasonably generalizable or even incorrect for the general case.
For this to work, you can assign tags (chunks of output annotation) to any affix variant and stem variant in your grammar and lexicon. This is done with TAG expressions (see Description Language, TAG keyword).
As hunlex merges affixes, it merges their tags accordingly, as expected. There are a number of formatting options with which you can influence the way tags are put together. One is the TAG_DELIM option (see Resource Compilation Options), which sets the delimiter between any two tags. If multiple tags are given by TAG expressions, they are also concatenated with this delimiter in the order of their appearance within the rule block.
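As a toy illustration of this concatenation (the tag names and the delimiter character are invented; the real tags come from your TAG expressions):

```shell
TAG_DELIM='+'
stem_tag='NOUN'
plur_tag='PLUR'
ine_tag='INE'
# tags are glued together with the delimiter, in stripping order
echo "${stem_tag}${TAG_DELIM}${plur_tag}${TAG_DELIM}${ine_tag}"
```

This prints NOUN+PLUR+INE; when hunlex merges the plural and inessive rules into one cluster, the cluster's tag is the corresponding delimited concatenation.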
Depending on the main output mode of hunlex, various pieces of information can be chosen to be output as tags. This is important if you want to configure your resources so that they give you a stemmer, a tagger or an analyzer; various other options are also available, see Resource Compilation Options.
Hunlex also supports feature structures as a kind of annotation scheme. This is extremely useful for cross-checking the correctness of your tags. Tags can be quite messy, and since they are just pieces of strings, they are difficult to check.
Feature structures are structured objects which are checked against a signature (given in the signature file, which is the value of the option SIGNATURE, see Input File Options) and are merged with graph-unification. As annotations to give complex morphological information, they are more expressive and adequate than pieces of tags that are concatenated. Also, feature structures, unlike just arbitrary strings in the tags, are interpretable data structures which one can directly calculate with, say, in a syntactic analyzer using the output of the morphological analyzer.
That said, it has to be added that the analyzers themselves do not support these feature structures. This means that they still manipulate pieces of feature-structure descriptions as strings and glue them together. If you use the feature structures within your hunlex description, however, you can be certain that, even if the analyzer just concatenates them, the resulting analyses describe valid FS-s according to your signature (see also the SIGNATURE option under Input File Options).
Todo: Include a proper description of the extended KR framework of FS-s.
Todo: No support for derivations is implemented yet (it is on the way).
Flags are used in the output resources (see Output Resources) to index affix rules. Each entry in the dictionary file has a set of flags indicating which affix rules can be applied to it.
So, flags are given by hunlex and written in the affix and dictionary files. There is no such thing as a flag in the hunlex input grammar or lexicon, the files which describe your morphology.
You can specify some aspects of what flags hunlex will assign to affix classes and how. This is what the present chapter is about.
Flags can be one or two characters long.
Myspell (and legacy xspell implementations) can only handle single-character flags. For the general case this is fine, and it is the default. If you are dealing with languages of sensible complexity, this default is enough and you don't need to read this chapter any further.
Double flags are composed of a number as first character and a flaggable character (see Flaggable Characters) as the second, such as '3f' or '9t'. In order to use double flags, use the DOUBLE_FLAGS option (see Resource Compilation Options). Read on to learn why you would use double flags (see Limit on the Number of Flags).
Flaggable characters are characters that hunlex can use as flags (in the case of single-character flags) or as the second character of a double flag (see Two Forms of Flags). All non-whitespace characters are in principle flaggable.
The actual choice of flaggable characters is by default a set of 132 characters which are hard-wired in hunlex. I am not in a position to list them here (I bet my hundred forints that some of the characters would be displayed completely differently for you than for me, in any format and on any display, ranging from your terminal through your browser to acroread). They are, however, included as a comment in the original texinfo version of this document (see file doc/texinfo/flags.texinfo, or, to be sure, the source code, src/hunlex_wrapper.ml).
Flaggable characters can, however, be customized through hunlex's FLAGS option (see Input File Options). This option takes a filename. The contents of the file is the sequence of characters to be used as flags, without any delimiters.
Warning: Make sure you do not include any whitespace in this file (other than a trailing newline), and do not include any character twice. Since the characters are not checked for sanity, doing otherwise may result in ill-formed affix files or conflated affix classes. If you use double flags, do not include digits among the flaggables.
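A minimal sketch of creating such a custom flags file (the character choice is arbitrary; note there is no whitespace apart from the trailing newline, no digits, and no character occurs twice):

```shell
# write the flaggable characters as one unbroken run
printf 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\n' > flags.conf
# sanity check: split into one character per line and list duplicates
# (prints nothing if every character is unique)
fold -w1 flags.conf | sort | uniq -d
```

Since hunlex does not check the file itself, a quick duplicate check like the one above can save you from silently conflated affix classes.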
Todo: Why don't we bloody check this? Checking of flaggable characters should be amended in a future version.
The association of flags with affix classes takes flags from left to right. This means that if the output requires 35 flags, the first 35 flaggable characters will be used. This is, however, all that can be said: which actual flag is assigned to which affix class cannot be further specified.
Warning: This last sentence is a warning in itself. For people who are used to fiddling with affix files that were manually created (in fact almost all ispell resources), it has to be stressed: hunlex-generated affix files are not to be read by humans and should be considered binary. Associations of flags with particular affix rules/classes are not permanent across various configurations/resource compilations. If you want to post-process affix files, never assume particular flags are meaningful. This is rather obvious once you realize that the affix rules/classes themselves are not consistent across different parametrizations, either (see e.g., levels). This policy is called dynamic flagging. An exception to dynamic flagging might be special flags, which are fairly consistent, since their expression can be customized to particular strings (see Resource Compilation Options, Affix file variables). But this feature will soon cease to exist, so just wipe the tears off your face, be happy that you have a hunlex resource, and forget your old flags.
If you use single-character flags (see Two Forms of Flags), the number of flags equals the number of flaggable characters, i.e., the length of the custom flag file (see Resource Compilation Options, flags.conf) or 132, by default (see Flaggable Characters). This is also the maximum number of affix classes you can have in your output resources.
(Since some flaggable characters are reserved to express the special flags (see Special Flags), the number of possible affix classes is the number of flaggables minus the number of special flags needed.)
Sometimes this is not enough: for languages with hugely complex and lexically idiosyncratic morphology, one has to use double flags (see Two Forms of Flags). You can tell that you really need this if hunlex resource compilation stops, complaining that there are not enough flags (exception Not_enough_flags).
You tell hunlex to use double flags with the appropriately named DOUBLE_FLAGS option (see Resource Compilation Options). With double flags you can have 10 times more affix classes than flaggable characters, i.e., 1320 (from '0a' to '9z' or whatever) with the default flaggables (see Flaggable Characters).
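The arithmetic behind the figure: a double flag pairs one of the ten digits with one flaggable character, so with the default 132 flaggables the number of available flags is:

```shell
digits=10
flaggables=132
# each (digit, flaggable) pair is a distinct double flag
echo $((digits * flaggables))   # 1320
```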
Warning: Double flags are only understood by the morphbase implementations (see Huntools), but not by ispell, myspell or (yet) jmorph. This is a reason why one might use the huntools package. Please tell us if this is the sole reason you are using huntools.
Caveat: The use of double flags for the morphbase routines is a compile-time option at the moment.
Caveat: When customizing your flaggables (see Flaggable Characters) and using double flags, you can have up to about two thousand affix classes. If this is not enough for you (you get the exception Not_enough_flags), you are likely to have a problem in your grammar (see Troubleshooting). If you are sure it is not a grammar problem, you had better choose another language. At any rate, please notify us (see Contact) about this extraordinary case and we might extend the support for flags even in morphbase on one of our free afternoons (see Requesting a New Feature).
There are special flags in the affix file (the full documentation of special flags is (hopefully) found in the morphbase and jmorph docs). Special flags are special because they do not index affixes, but encode other sorts of information needed by analysis routines. There are a number of flags one can configure through the options (see Options). These are
Todo: Include the ones below in the implementation and uncomment them from the texinfo document. These are:
WARNING, CAVEAT: if you use two-character flags (the -double_flags option), you have to make sure that the special flags are also two characters long; otherwise this will lead to ill-formed affix files. If you set flags through the toplevel Makefile's variables, make sure your flags are quoted (otherwise make will resolve the flag '~' to, say, '/home/tron' and you won't understand what went wrong...).
TODO: Special flags should NOT be user-configurable at all. They should be assigned the first possible flags.
If hunlex won't install, check Prerequisites carefully, with special attention to the versions.
There are some hints hidden among the lines of Install which you may have missed.
If you upgraded from an earlier version, make sure you uninstall the earlier version first (see Uninstall and Reinstall).
If you use hunlex through the toplevel control with Makefile (see Toplevel Control), the hunlex executable is by default assumed to be found in the path under the name hunlex.
By default, installation installs hunlex into /usr/local/bin (see Installed Files) unless you set another install prefix.
Find out whether the hunlex executable is found in the path by typing
$ which hunlex
If it is not found, check again where you installed it by looking into the file install_prefix in the toplevel directory of your source distribution. If this file is not there, your installation was not successful.
If you have found your install prefix, see if install-prefix/bin/hunlex exists. If it does, you can do one of the following:
PATH=install-prefix/bin:${PATH}
or set the HUNLEX option to install-prefix/bin/hunlex (see Executable Path Options).
If your grammar seems to overgenerate, the first thing to do is to check whether you have declared, in the phono.conf file, the features that your grammar relies on.
You may have misspelled some phono feature; this can be traced by peeping into the debug messages. Ideally you do this by redirecting the output into a log file (with the debug level set sufficiently high) and searching the file for the term 'skipped'. This is the warning hunlex gives you to let you know that an entity has been skipped.
The HunLex framework is being used in the development of an open-source morphological database (lexicon and grammar) for the Hungarian language in a collaboration between the Hungarian Academy of Sciences, Research Institute for Linguistics and the Budapest Institute of Technology, Media Education and Research Center Natural Language Processing Lab. This database aspires to be the most complete and accurate account of Hungarian morphology published so far, and is the result of merging several well-respected electronic resources http://lab.mokk.bme.hu.
http://www.xrce.xerox.com/competencies/content-analysis/fst/home.en.html
or
http://www.stanford.edu/~laurik/fsmbook/home.html
AFF (outputdir/affix.aff): Output File Options
AFF_FORBIDDENWORD (!): Resource Compilation Options
AFF_ONLYROOT (~): Resource Compilation Options
AFF_PREAMBLE (): Resource Compilation Options
AFF_SET (ISO8859-2): Resource Compilation Options
AFF_SETTINGS (confdir/affix_vars.conf): Resource Compilation Options
CHAR_CONVERSION_TABLE (): Resource Compilation Options
CONFDIR (inputdir): Input File Options
DEBUG_LEVEL: Verbosity and Debugging
DEBUG_LEVEL (0): Verbosity and Debug Options
DIC (outputdir/dictionary.dic): Output File Options
DIC_PREAMBLE (): Resource Compilation Options
DOUBLE_FLAGS (): Resource Compilation Options
FLAGS (''): Input File Options
FS_INFO (): Resource Compilation Options
GRAMMAR (grammardir/grammar): Input File Options
GRAMMARDIR (inputdir): Input File Options
HUNLEX (hunlex): Executable Path Options
HUNMORPH (hunmorph): Executable Path Options
INPUTDIR (. = current directory): Input File Options
LEXICON (grammardir/lexicon): Input File Options
MAX_LEVEL (10000): Resource Compilation Options
MIN_LEVEL (1): Resource Compilation Options
MODE (Analyzer): Resource Compilation Options
MORPH (confdir/morph.conf): Input File Options
OUT_DELIM (' '): Resource Compilation Options
OUT_DELIM_DIC (<TAB>): Resource Compilation Options
OUTPUTDIR (. = current directory): Output File Options
PHONO (confdir/phono.conf): Input File Options
QUIET: Verbosity and Debugging
QUIET (@ = quiet): Verbosity and Debug Options
REPLACEMENT_TABLE (): Resource Compilation Options
SIGNATURE (''): Input File Options
STEM_GIVEN (): Resource Compilation Options
STEMINFO (LemmaWithTag): Resource Compilation Options
TAG_DELIM (''): Resource Compilation Options
TEST (/dev/stdin): Input File Options
TIME: Verbosity and Debugging
TIME (time): Verbosity and Debug Options
USAGE (confdir/usage.conf): Input File Options
WORDLIST (outputdir/wordlist): Output File Options