This document presents the HunLex morphological resource specification framework and precompilation tool, which is being developed as part of the HunTools Natural Language Processing Toolkit at the Budapest Institute of Technology's Media Education and Research Center (http://lab.mokk.bme.hu).
HunLex offers a description language, i.e., a formalism for specifying a base lexicon and the morphological rules which describe a language's morphology. This description, stored in textual format, serves as your primary resource: it represents your knowledge about the morphology and lexicon of the language in question.
Now, providing a resource-specification language is rather useless in itself. Hunlex is able to process these primary resources and create the type of resources that are used by real-time word-level analysis tools. Since you create these from your primary resources, you might call them secondary resources. These provide the language-specific knowledge to a variety of word-level analysis tools.
At present, most importantly, Hunlex provides the language-specific resources for the HunTools word-level analysis toolkit (see Huntools). This package contains the MorphBase library of word-analysis routines, such as a spell-checker, stemmer, and morphological analyzer/generator, and their standalone executable wrappers. Therefore, a single Hunlex description of your favourite language will enable you to perform spell-checking, stemming, and morphological analysis for that language, which is more than useful.
In addition to the HunTools routines, other software which uses ispell-type resources will be able to use Hunlex's output. Among these are myspell, an open-source spell-checker (also used in Open Office, http://www.openoffice.org, see Myspell), and jmorph, a superfast Java morphological analyzer (see Jmorph).
This document describes how you can create your primary resources and what you can (make Hunlex) do with them.
Note: This document is not intended to describe how to use any of these real-time tools or what they are good for. See the above links to learn more about them.
In particular, this document provides you with:
TODO: not yet
The motivation behind HunLex came from two opposing types of requirements lexical resources are supposed to fulfill:
The constraints in (i) favour one central, redundancy-free, abstract, but transparent specification, while the ones in (ii) require possibly multiple application-specific, potentially redundant, optimized formats.
In order to reconcile these two opposing requirements, HunLex introduces an offline layer into the word-analysis workflow, which mediates between two levels of resources:
The primary resources are supposed to be reasonably designed to help human maintenance, while the secondary ones are supposed to optimize very different things, ranging from file size, performance with the tool that uses them, coverage, robustness, and verbosity to normative strictness, depending on who uses them for what purpose.
HunLex is used to compile the primary resources into a particular application-specific format (see Output Resources). This resource compilation phase is an offline process which is highly configurable, so that users can fine-tune the output resources according to their needs.
By introducing this layer of offline resource compilation, maintenance, extendability, and portability of lexical resources become possible without compromising performance on specific word-analysis tasks.
Providing the environment for a sensible primary resource specification framework and managing the offline precompilation process are the raison d'être behind Hunlex.
Configuration allows you to adjust the compilation of resources along various dimensions:
Hunlex is licensed under the LGPL, which roughly means the following.
There are no restrictions on downloading it other than your bandwidth and our slothful ways of making things available.
There are no restrictions on use either, other than its deficiencies, clumsy features and outrageous bugs. However, this can be amended, because there are no restrictions on modifying it either. See also Contribution.
Freedom of use implies that any resource that you created and compiled with the mediation of Hunlex is yours, and you hold the right to distribute it in any way. Consider telling us about this great news (see Contact).
What is more, there are no restrictions on redistributing this software or any modified version of it.
For some legalese telling you the same, read the License http://creativecommons.org/licenses/LGPL/2.1/
Todo: Shall we not include the License?
See License.
If you find a bug or an undesirable feature or anything that is worth a couple of lines ranting at the authors, please go ahead and file a bug report on the MOKK Lab bugzilla page at http://lab.mokk.bme.hu or send a mail to me (see Contact).
So you are using hunlex and find yourself realizing that you desperately need a certain feature which happens not to be implemented. Go ahead and request it from the authors (see Contact) or sit silently and hope!
So you found hunlex cool and/or useful and would like the authors to hear about that. How nice is that! See Contact.
Hunlex is an open-source development, so developers are welcome to contribute to make it better in any imaginable way. Contact us (see Contact) to work out the details of how and what you would want to contribute to Hunlex.
For the context of the whole huntools kit, use
@InProceedings{szoszablya_saltmil:04,
  author = {L\'aszl\'o N\'emeth and Viktor Tr\'on and P\'eter Hal\'acsy and Andr\'as Kornai and Andr\'as Rung and Istv\'an Szakad\'at},
  title = {Leveraging the open-source ispell codebase for minority language analysis},
  booktitle = {Proceedings of SALTMIL 2004},
  year = 2004,
  organization = {European Language Resources Association},
  url = {http://lab.mokk.bme.hu/}
}
A very brief intro to hunlex with a one-page English resumé.
@InProceedings{hunlex_mszny:04,
  author = {Tr\'on, Viktor},
  title = {HunLex - a description framework and resource compilation tool for morphological dictionaries},
  booktitle = {II. Magyar Sz\'am\'it\'og\'epes Nyelv\'eszeti Konferencia},
  institution = {Szegedi Tudom\'anyegyetem},
  address = {Szeged, Hungary},
  year = 2004
}
These and other papers can be downloaded from the MOKK Lab publications page at http://lab.mokk.bme.hu
The author of hunlex and this document is Viktor Trón. He can be reached at
v.tron@ed.ac.uk
Hopefully more can be found on MOKK Lab's pages at http://lab.mokk.bme.hu.
So you want to install the hunlex toolkit (see Introduction) from the hunlex source distribution. This document describes what and how you can install with this distribution.
The latest version of the hunlex source distribution is always available from the MOKK LAB website at http://lab.mokk.bme.hu or, if all else fails, by mailing to me v.tron@ed.ac.uk.
The hunlex executable in principle runs on any platform for which there is an ocaml compiler (see Prerequisites). This includes all Linuxes, unices, MS Windows, etc.
Warning: This package has not been tested on platforms other than Linux.
Hunlex is written in the ocaml programming language http://www.ocaml.org/. OCaml compilers are extremely easy to install and are available for various platforms and downloadable in various package formats for free from http://caml.inria.fr/ocaml/distrib.html.
You will need ocaml version >=3.08 to compile hunlex.
ocaml-make (OCamlMakefile) is needed for the installation of hunlex and is available from Markus Mottl's homepage at http://www.ai.univie.ac.at/~markus/home/ocaml_sources.html#OCamlMakefile (I used version 6.19, writing on 8.1.2004). For OCamlMakefile you will need ocaml and GNU make (for ocaml-make version 6.19 you will need GNU make version >= 3.80).
NB: Most probably earlier versions of ocaml-make and GNU make should also work, but they have not been tested yet.
You don't need anything else to use hunlex (but a little patience).
Hunlex is installed in the good old way, i.e., by typing
$ make && sudo make install
in the toplevel directory of the unpacked distribution. Read no further if you know what I am talking about or if you trust some God.
The hunlex distribution is available in a source tarball called hunlex.tgz. First you have to unpack it by typing
$ tar xzvf hunlex.tgz
Then, you enter the toplevel directory of the unpacked distribution with
$ cd hunlex
To compile it, simply type
$ make
in the toplevel directory of the distribution.
To install it (on what gets installed, see Installed Files), type
$ make install
Well, by default this would want to install things under /usr/local, so you have to have admin permissions. If you are not root but you are in the sudoers file with the appropriate rights, you type:
$ sudo make install
You can change the location of the installation by changing the install prefix path with
$ sudo make PREFIX=/my/favourite/path install
Changing the location of installation for individual install targets individually is not recommended but easy-peasy if you have a clue about make and Makefile-s. To do this you have to change the relevant Makefile-s in the subdirectories of the distribution. See Installed Files.
If it works, great! Go ahead to Bootstrapping.
If you have problems, double-check that you have the prerequisites (see Prerequisites). If you think you followed the instructions but still have problems, submit a bug report (see Submitting a Bug Report).
If you are upgrading an earlier version of hunlex, you may want to uninstall the earlier one first (see Uninstall and Reinstall).
The install prefix is remembered in the source distribution in the file install_prefix. So after you cd into the toplevel directory of the distribution, you can uninstall hunlex by typing
$ make uninstall
You can reinstall it with
$ make reinstall
at any time if you make modifications to the code or compile options.
Warning: Note that if you fiddle with changing the location of individual install targets, uninstall and reinstall will not work correctly.
The following files and directories are installed, paths are relative to the install prefix (see Install):
the executable which can be run on the command line (see Command-line Control)
is the Makefile that defines the toplevel control of hunlex (see Toplevel Control). This file is to be include-ed into your local Makefile to give you a Makefile-style wrapper for calling hunlex (see Bootstrapping and Toplevel Control).
Note that HunlexMakefile will assume that the hunlex executable is found in your path. Make sure that install-prefix/bin is in the path (usually /usr/local/bin is in the PATH).
is a directory containing hunlex documentation. Various documents in various formats are found under this directory including a replica of this document.
TODO: this is not yet the case
is the hunlex man page, which describes the command-line use of hunlex (also see Command-line Control). Command-line use of hunlex is not the recommended way of using it for the general user. Instead, use hunlex through the toplevel control described in a later chapter (see Toplevel Control).
Todo: there is no man page yet
So you have installed hunlex and it's running smoothly.
This section leads you through the first steps and gives you hints on how you set out working with hunlex.
Create your sandbox directory.
Change to it.
Create your own local Makefile. This will be your connection to the hunlex toplevel control. For your Makefile to understand hunlex predefined toplevel targets (see Targets), you have to include (not insert) the hunlex systemwide Makefile. So you create a Makefile with the following content:
-include /path/to/HunlexMakefile
where /path/to/HunlexMakefile is the path to HunlexMakefile which is supposed to be installed on your system (see Installed Files), by default under /usr/local/lib/HunlexMakefile.
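Putting this together, a minimal sandbox Makefile is just the include line (the path below is the default install location; adjust it to your install prefix):

```makefile
# minimal local Makefile for a hunlex sandbox;
# -include fails silently if the path is wrong,
# so double-check it if make seems to do nothing
-include /usr/local/lib/HunlexMakefile
```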
Now, you are ready to test things for yourself. In order to see if all is well, type
$ make
at your prompt in the same sandbox directory.
In fact, you will always type the make command to control hunlex. If you don't give arguments to make, a so-called default action (target, see Targets) is assumed. The default target is resources which creates the output resources according to the default settings (see Options). Toplevel control assumes by default that all its necessary resources are found in the current directory (see Input File Options). If this is not the case, because the files do not exist, the compulsory ones are created and the compilation runs creating the output resources.
Surely, the missing files are created without contents and your output resources will be empty as well. However, this vacuous run will test whether hunlex (and toplevel control) is working properly.
Now if you list your directory, you should see:
$ ls
affix.aff       grammar   Makefile    phono.conf
dictionary.dic  lexicon   morph.conf  usage.conf
If this is not the case, go to see Troubleshooting.
The meaning of these files in your directory are explained in detail in another chapter (see Files).
If you type make (or the equivalent make resources) again, your resources will not be compiled again, since the input resources did not change. If you still want to compile your resources again, you type
$ make new resources
which forces toplevel to recompile although no input files changed (see Special Targets).
Now.
If you want to develop (toy around with) your own data and create resources, the next step is to fill in the input files. Read on to learn more about files (see Files) and then about the hunlex morphological resource specification language (see Description Language). Since you want to test your creation, you ultimately have to learn about toplevel control (see Toplevel Control) and gradually about the advanced issues in the chapters that follow these.
If you already have your hunlex-resources describing your favourite language ready and you want to compile specific output resources from it with hunlex, you better read about toplevel control with special attention to the options (see Toplevel Control). If you want to fiddle around with more advanced optimization, such as levels and tags, you may end up having to read everything, sorry.
You typically want to use hunlex through its toplevel control interface. Toplevel control means that you invoke hunlex indirectly through a Makefile to compile your resources.
We envisage typical users of hunlex developing their lexical resources in an input directory and occasionally dump output resources for their analyser into specific target directories for various applications.
If you don't like Makefiles or your system does not have make (how did you compile hunlex, then?), you will invoke hunlex from a shell and use it via the command-line interface. This is non-typical use and is not recommended. The command-line interface, which is almost equivalent in functionality to the Makefile interface, is described only for completeness and for people developing alternative wrappers (see Command-line Control).
In fact, you don't actually need to know much about make and Makefile-s to use hunlex. Just follow the steps described in Bootstrapping. We assume that you have a project directory with a Makefile sitting in it in order to try out what is described here.
This document is more like a reference manual that details what you can do with your resources and how you can do it through the Makefile interface. What the resources are and how you can develop your own is described in other chapters (see Files and see Description Language).
First of all, you need to know how to make your compilation process more verbose.
In order to see what the toplevel Makefile wrapper is doing, you have to unset the QUIET option. For instance, typing
$ make QUIET= new resources
will tell you what the Makefile is doing, i.e., what programs it invokes, etc. Unless you are debugging the toplevel control interface of hunlex, you don't want the toplevel to be verbose about what it is doing. So just don't do this.
What you want instead is to make the resource compilation process more verbose, probably because you want to debug your grammar or want hunlex to give you hints what went wrong with your resource compilation.
Verbosity of the hunlex resource compilation can be set with the DEBUG_LEVEL option. Typing
$ make DEBUG_LEVEL=1
in your sandbox (with empty primary resources) will give you something like this (see Bootstrapping):
Reading morpheme declarations and levels...0 morphemes declared.
Reading phono features...0 phono features declared.
Reading usage qualifiers...0 usage qualifiers declared.
Parsing the grammar...ok
Parsing the lexicon and performing closure on levels... 0 entries read.
Dynamically allocating flags; dumping affix file...ok
Dumping precompiled stems to dictionary file...ok
0.00user 0.00system 0:00.02elapsed 12%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+329minor)pagefaults 0swaps
The first couple of lines give you information about the stages of compilation and are described elsewhere.
The enigmatic last two lines give you information about the time it took hunlex to compile your resources. If you are not interested in this information, you can unset the TIME option. Typing, say,
$ make TIME= new resources
will not measure and display the duration of compiling.
Your favourite settings can be remembered by adding them to your local Makefile in a rather obvious way. Let us assume you want your DEBUG_LEVEL to be set to 1 by default and also that you couldn't care less about the time of compilation. In this case you want to have the following in your Makefile:
DEBUG_LEVEL=1
TIME=
You can also define your default target (see Targets), i.e., the 'task' that make will carry out if you invoke it without an explicit target. For instance, if you always want to recompile your resources each time you invoke make, irrespective of whether your primary resources and/or compile configuration changed, you can add the following line at the top of the file:
default: new resources
Now, your Makefile looks something like this:
# comments are introduced by a '#'

# my favourite target
default: new resources

# my favourite settings
DEBUG_LEVEL=1
TIME=

-include /path/to/HunlexMakefile
The functionality of hunlex is accessed through targets. Targets are arguments of the make command, which reads your local Makefile and ultimately consults the systemwide hunlex toplevel Makefile called HunlexMakefile (see Installed Files).
Usually, you will control hunlex through make by typing:
make options target
where options is a sequence of variable assignments which set the options described below (see Options) and where target is a sequence of targets. For more on variables and targets, you may consult the manual of make.
The available toplevel targets are detailed below:
compiles the output resources given the input resources and configuration files. The necessary file locations and options are defined by the relevant variables described below (see Input File Options). This target creates the dictionary and the affix files (by default dictionary.dic and affix.aff, see Output Resources).
By setting MIN_LEVEL to a big number, this call generates resources that contain all words of the language precompiled into the dictionary. The stems of the dictionary, without their output annotation (see Annotation), are found in the file *wordlist*.
pretends that the base resources have changed. You need this directive if you want to recompile the resources although no primary resource has changed. This might happen because you are using different configuration options. (If the base resources are unchanged, no compilation would take place; you have to force it with 'new', see make.)
make MIN_LEVEL=3 new resources
removes all intermediate temporary files, so that only lexicon, grammar, and the configuration files, and the output resources (affix and dictionary) remain.
Todo: This is not implemented yet.
removes all non-primary resources, so that only lexicon, grammar, and the configuration files remain.
Additional targets for testing are available; these all presuppose that huntools (see Huntools) is installed and that the executable hunmorph is found in the path. An alternative hunmorph can be used by setting the HUNMORPH option (see Executable Path Options).
tests the resources by making hunmorph read them (the dic and aff files) and analyze the contents of the file that is the value of TEST (see Input File Options). TEST is by default set to the standard input, so after saying
$ make test

you have to type in words in the terminal window (exiting with C-d).
If you want to test by analyzing a file, you have to set the value of TEST.
$ make TEST=my/favourite/testfile test

Test outputs go to stdout, so just pipe it to a file:
$ make TEST=my/favourite/testfile test > test.out 2> test.log
will run hunmorph on the wordlist file (see Resource Compilation Targets, generate) and output the result on the standard output (so you may want to pipe the result to a file).
puts hunlex and the analyzer to the test by creating the resources according to the settings of your makefile and then running hunmorph on the whole generated wordlist.
Warning: Note that this target first generates all words and then creates the resources again. Running this on huge databases is probably not a good idea. The way to test a bigger database instead is to create a set of words that your ideal analyzer has to recognize or correctly analyze, and test on that (with test). Realtest is just a quick and dirty shorthand for toy databases to check if everybody is with us.
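If you settle on such a fixed test word list, you can wire the cycle into your local Makefile as a custom target (the target name check and the file name testwords are hypothetical; this is just a sketch):

```makefile
# analyze a fixed word list with hunmorph via the test target;
# analyses go to test.out, debug messages to test.log
check:
	$(MAKE) TEST=testwords test > test.out 2> test.log
```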
Options of the toplevel are in effect Makefile variables that can be set at the user's will.
(All the command-line options of hunlex can be accessed through the toplevel: the options are passed on to hunlex to regulate the compilation process. The documentation of the command-line options is found in Command-line Control, but only for the record. Hunlex Makefile options are in all capital letters (LEXICON), while the corresponding command-line options begin with a dash and are in all small letters (-lexicon); otherwise they are the same.)
All options can be set or reset in your local Makefile (and remembered, see Storing your Settings). These will override the system defaults. Both the system defaults and your local defaults can be overridden by direct command-line variable assignments passed to make, such as:
$ make QUIET= DEBUG_LEVEL=3 OUTPUTDIR=/my/favourite/outputdir
Listed and explained below are all the hunlex options (all public Makefile variables) that the toplevel control provides for the user to manipulate.
When you see something like variable (value), it means that the default value of variable is value.
The hunlex executable is by default assumed to be found in the path under the name hunlex. By default, installation installs hunlex into /usr/local/bin (see Installation). If you want to use (i) an alternative version of hunlex that is not the one found in the path, (ii) an uninstalled version of hunlex, or (iii) an installed version whose directory you don't want to include in your path, then you should set which hunlex to use with this variable.
HUNLEX=/my/favourite/version/of/hunlex
You need the executable hunmorph from the Huntools package (see Huntools) only for testing, if you don't want to test with direct analysis (just want to compile the resources), you don't need to bother.
When used, however, the hunmorph executable is assumed to be found in the path with name hunmorph. If this is not the case, update your path or provide the path to hunmorph with the line
HUNMORPH=/my/favourite/version/of/hunmorph
Quiet mode is set by default which means that the workings of the Makefile toplevel won't bore you to death. The compilation debug messages that Hunlex blurps when running can still be displayed independently (see the DEBUG_LEVEL option below). The QUIET option only refers to what the toplevel wrapper invokes (this way of handling Makefile verbosity is an idea nicked from OCamlMakefile by Markus Mottl).
sets the verbosity of hunlex itself. By default debug level is set to 0. Debug messages are sensitive to the debug level in the range from 0 to 6-ish: the higher the number the more verbose hunlex is about its doings.
0 is non-verbose mode, which means that it only displays (fatal) error messages. If you set DEBUG_LEVEL to say -1, even error messages will be suppressed (only an uncaught exception will be reported in case of fatal errors).
It is typically a good idea to set DEBUG_LEVEL to 2 or 3 and request more if we really want to see what is happening.
Caveat: In fact you won't understand the messages anyway, so the debug blurps just give you an idea of the context where something went wrong with your grammar/lexicon, etc.

Todo: This shouldn't be so: debug messages pertaining to grammar development should be self-evident, well designed, and documented. Especially parsing errors and/or compile warnings about the grammar and lexicon should be clear.

Usually you want to create a log by piping the debug output of make (standard error), together with your debug messages, to a file. This can be done, for instance, by
$ make DEBUG_LEVEL=5 resources 2> log
By default, with every run of hunlex, the time it takes to compile the resources is measured (with the unix shell's time command) and this information is displayed. Surely, this is only interesting with big lexicons. If you (i) don't have a time command, (ii) have a different time command, or (iii) don't want time measured and displayed, just reset the TIME variable. The option can be unset by the line

TIME=

in your local Makefile.
The type and use of hunlex input resource files are described in detail elsewhere (see Input Resources). The options by which their locations can be (re)set are listed below:
They can all be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable GRAMMARDIR:
the directory for the hunlex primary input resource files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.
There are three further input resources which need to be present for a hunlex compilation. These are the compilation configuration files.
the configuration file (see Configuration Files) for morphophonological and morpho-orthographic features
There are two optional configuration files: the signature file and the flags file. By default, the options corresponding to these files are set to the empty string, which tells hunlex not to use feature structures (see Feature Structures) or custom output flags (see Flags).
The location of the signature file used to process and validate feature structures (see Feature Structures, see Configuration Files). If it is set to the empty string (the default), hunlex does not use feature structures.
If you use this file, it makes sense to call it something like fs.conf or signature.conf and store it in confdir with your other configuration files, so the assignment
SIGNATURE=$(CONFDIR)/fs.conf

is an appropriate setting.
The location of the custom output flags file (see Configuration Files) used to decide which flags are used in the output resources (see Flags). If it is set to the empty string (default), hunlex will use a built-in flagset to determine flaggable characters (see Flags).
If you use this file, it makes sense to call it something like flags.conf and store it in confdir with your other configuration files, so the assignment
FLAGS=$(CONFDIR)/flags.conf

is an appropriate setting.
All configuration files can be set to alternative paths individually. If they are in the same directory, the directory path can also be set via the variable CONFDIR:
the directory for the hunlex compilation configuration files, which is, by default, set to inputdir, the value of the variable INPUTDIR, see below.
As explained, all input files can be set to alternative paths individually, or the primary resources together and the configuration files together. If all input resources (primary and configuration) are in the same directory, this directory path can also be set via the variable INPUTDIR:
the directory for all hunlex input resource files, which is, by default, set to the current directory.
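For instance, a project that keeps its primary resources and its configuration files in separate subdirectories could set, in its local Makefile (the directory names input and conf are arbitrary):

```makefile
# primary resources (lexicon, grammar) under input/,
# configuration files (morph.conf, phono.conf, usage.conf) under conf/
GRAMMARDIR=input
CONFDIR=conf
```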
A special test file is only used with the Test targets:
The value of TEST is a file (well, a file descriptor, to be precise), the contents of which is tested whenever the toplevel test target is called (see Test Targets). By default it is set to the standard input, so testing with test will expect you to type in words in your terminal window.
Hunlex's output resources are the affix and the dictionary files (see Output Resources). The options by which their locations can be (re)set are listed below:
The wordlist generated by the generate target (see Resource Compilation Targets).
where outputdir (the default directory of the files) is the value of the variable OUTPUTDIR:
the directory for the hunlex output resource files, which is, by default, set to the current directory.
As you can see, the default setting is that all input and output files are located in the current directory under their recommended canonical names. Putting the output resources in the same directory as the primary resources might not be a good idea if you want to compile various types of output resources.
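For example, to dump differently configured output resources into separate target directories, you might say something like the following (the output paths are arbitrary, and the assumption is that STEMINFO accepts the values listed further below, such as NoSteminfo for spell-checker resources):

```
$ make OUTPUTDIR=out/analyzer new resources
$ make OUTPUTDIR=out/spell STEMINFO=NoSteminfo new resources
```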
if set, hunlex uses double flags (two-character flags) in the output resources (see Flags).
The following two options regulate the level of morphemes. You find more details about levels in a separate chapter (see Levels).
Morphemes of level below MIN_LEVEL are treated as lexical, i.e., are precompiled with the appropriate stems into the dictionary file. By default, only morphemes of level 0 or below are precompiled into the dictionary.
Morphemes with levels higher than the value of MAX_LEVEL are, on the other hand, treated as being on the same (non-lexical) level. By default, only morphemes of level above 10000 are treated as having the same level.
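For example, under the interpretation above, the following call treats morphemes up to level 1 as lexical and collapses everything above level 3 onto one level:

```
# precompile morphemes of level <= 1 into the dictionary;
# treat morphemes above level 3 as being on the same level
$ make MIN_LEVEL=1 MAX_LEVEL=3 new resources
```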
The options below regulate the format of output resources in detail:
determines the delimiter hunlex puts between individual tags of affixes when tags are merged.
This is interesting if you have a tagging scheme where a morpheme is tagged with a label like MORPH1, but in the output you want the labels clearly delimited, like:
wordtoanalyze > lemma_MORPH1_MORPH2

The above is possible if you set
TAG_DELIM='_'
sets the delimiter to put between the fields of the affix and dictionary files, respectively. By default it is set to a single space for the affix file and set to <TAB> in the dictionary.
NB: A tab might allow better postprocessing in the affix file and would even allow spaces in the tags, which might be useful. At the time of writing, the huntools reader only allowed a TAB, not a space, as the delimiter in the dictionary file, so change this with caution.
the major output mode regulates what information gets output in the affix and dictionary files and how affix entries are conflated.
Warning: This option is not effective at the moment due to the lack of a clear functional specification, and it is also unclear how this option should interact with the option STEMINFO (below).

Todo: Clarify this. See warning.

The possible values at the moment are:
all without effect (see warning).
- Spellchecker
- Stemmer
- Analyzer
- NoMode
regulates what info the analyzer should output about a word.
This option can take the following values:
- Tag only output the tag of a lexical stem (to output the pos tag of the stem)
- Lemma only output the lemma of a lexical stem (for stemmers doing lexical indexing)
- Stem output the stem allomorph of the stem (e.g., for counting stem variant occurrences?)
- LemmaWithTag output lemma with the tag (default, for morphological analysis)
- StemWithTag output stem (allomorph) with the tag (?)
- NoSteminfo no output for the dictionary (for spell-checker resources).
regulates whether feature structure annotations (see Feature Structures) should be output along with the normal (string type) tags (see Tags). This is extremely useful for debugging purposes. If the manually supplied tag chunks are supposed to yield well-formed feature structures in the output annotation of the analyzer, it is a good idea to check whether this is the case. If this option is set to -fs_info (the corresponding command-line option), the feature structures resulting from unification are output along with the tags in the dictionary and the affix file. Typically, this option is used with the generate target (see Resource Compilation Targets) and the second and the third columns of the dictionary file are compared (they are supposed to be identical).
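A quick way to perform that comparison, assuming the default dictionary file name and the default <TAB> field delimiter described above (both are assumptions about your settings), is to list the lines where the two columns disagree:

```
$ make generate
$ awk -F'\t' '$2 != $3' dictionary.dic
```

Any line printed indicates a mismatch between the string tag and the feature structure.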
Todo: This process should be added to the set of toplevel test targets.
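A spot-check in the spirit of the comparison just described might look as follows. The file name and the tab-separated three-column layout (word, tag, feature structure) are assumptions for illustration only:

```shell
# Hypothetical check: with fs_info set, the tag (field 2) and the feature
# structure (field 3) of each dictionary line are supposed to be identical;
# lines where they differ are suspects.
printf 'go\t[VERB]\t[VERB]\nwent\t[VERB]\t[PAST]\n' > /tmp/dictionary.dic
awk -F'\t' '$2 != $3 { print NR ": " $0 }' /tmp/dictionary.dic
```

Only the second line, where the two columns disagree, is reported.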
The affix file specifies a lot of variables to be read by the morphbase routines. Some of these are metadata but some are crucial for suggestions and accent replacement for automatic error correction, see below.
Warning: This part is a disastrously underdeveloped part of hunlex and an outrageously ad-hoc part of morphbase as well.
Preambles can be generated into the hunlex output files; these are meant to be 'official' comment headers containing copyright information, etc.
are the files to be included as preambles in the affix and dictionary output resources, respectively. By default, they are unset, i.e., no preambles will be included in the output resources.
NB: This feature is only available on toplevel control and will never be integral part of the hunlex executable.
These are the character-conversion table and replacement table to be included into morphbase resources if alternatives (e.g., for spellchecking) or robust error correction is required (see Huntools). These features are documented in the huntools documentation (hopefully, but certainly not here, see Output Resources, see Huntools).
NB: This feature of including these extra files into the affix file is only available through toplevel control and will never be integral part of the hunlex executable.
Identifies the character-set for the analyzer reading the affix file. By default, this is set to ISO8859-2, i.e., Eastern European. Maybe this is the 'hun' in hunlex...
is the file from which settings for some affix variables are read. If it doesn't exist, no affix variables other than the directly managed ones are dumped into the affix file.
Todo: Need to sort these things out.
Some affix file variables are managed by hunlex internally but dumped to the affix file by the toplevel routines.
Todo: This is done at the moment by the toplevel Makefile, but should be integrated into the hunlex executable itself.
These two flags will be attached to (i) bound stems and (ii) affix entries which can not be stripped first (i.e., suffixes which cannot end a word, see Flags).
If this flag is present, it indicates to the stemmer/analyzer whether the stem string is to be output as part of the annotation. For instance (if the STEM_GIVEN flag is 'x'), the following dic file
go/ [VERB]
went/x go[VERB]

will result in the following stemming:
> go
go[VERB]
> went
go[VERB]

This makes more compact dictionaries. What information one wants the stemmer and analyzer to output can be configured through hunlex options (see below).
Todo: This flag is not implemented yet (it is not implemented in morphbase either), and probably never will be, since the treatment of special flags shouldn't be user-customizable beyond the choice of flaggable characters.
Warning: Make sure the flags given here are consistent with the double flags option and the custom flags file (see the FLAGS variable above, and see Flags). These options are superfluous and should be automatically managed by hunlex, which would write them into the affix file. Very likely to be deprecated soon.
Todo: This needs to be implemented.
Warning: There are additional settings that are to be included in the affix file and are a crucial part of the resources (some should be set by hunlex itself), such as compound flags. I have no idea what to do with these at the moment. The ones I know of are listed here just for the record.
Todo: This needs to be sorted out.
Some of these data are actually global and could even go to the settings preamble (AFF_SETTINGS):
These ones should be dynamic metadata
The ones below should clearly be controlled and output by hunlex itself (also ONLYROOT and FORBIDDENWORD, but those are at least handled by the toplevel).
Ones relating to compounding (compounding is handled very differently by myspell, morphbase and jmorph):
Warning: Compounding is as yet unsupported by hunlex and should be worked on with high priority.
I have really no idea about the following ones:
This chapter is about the framework that allows you to describe the morphology and lexicon of a language. Below we specify the syntax and semantics of this description language. The files written in this language (the lexicon and grammar) are the primary resources of hunlex (see Input Resources) and the basis for all compiled output (how this works is described in another chapter, see Toplevel Control).
There are three kinds of statement in this language:
Only the grammar file can contain macro definitions (see Macros) and metadata definitions (see Metadata), while both the lexicon and the grammar file can contain morph definitions, which describe morphological units (affix morphemes, lexemes and their paradigms). In this respect, the syntax of the lexicon and grammar files is identical and is therefore discussed together (see Morphs) rather than separately, although the usefulness (and sometimes even the semantics) of certain expressions might differ between the lexicon and the grammar.
Morphs are the central entities in the description language. They stand for morphological units of any size and abstractness including affix morphemes, lexemes, paradigms, etc. and are not what linguists call morphs (i.e., a particular occurrence of one morpheme). Morphs are meant to describe an affix morpheme or a lexeme, but in fact, it is up to you what level of abstractness you find useful in your grammar, so you can have individual morphs describing each allomorph of a morpheme or each stem variant of a lexeme. But the point is that morphs support description of variants or allomorphs. Anyway, a morph is basically a collection of rules, variants, etc. that somehow belong together. Ideally, a variant of an affix morpheme is actually an affix allomorph, a concrete affixation rule, while a variant of a lexeme is a stem variant or an exceptional form of the lexeme's paradigm.
A morph statement is introduced by an optional MORPH: keyword. It is a good idea to drop it and start the statement directly with the preamble (in fact, the name of the morph), which is compulsory.
A morph description has a preamble, i.e., a header describing the global properties of the morph, the properties which characterize all of its variants/allomorphs.
After the preamble, one finds the variants one after the other. The preamble and the variants are delimited by a comma.
Finally, the morph definition like all other statements is closed by a semicolon.
The preamble starts with the name of the morph. The name of the morph can be any arbitrary id, a mnemonic string that ideally uniquely identifies the morph. Referring to other morphs is an important part of describing how morphemes can be combined: in order for these references to be reliable, the names in the grammar are supposed to be unique. This is not important in the lexicon, where homophonous lemmas can have identical names (however, this is not recommended, since, in such a case, for instance, morphological synthesis would be unable to distinguish two senses, especially if they are of the same morphosyntactic category).
The rest of the preamble as well as each individual variant is composed of blocks. Blocks are the ingredients of the description, they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.
In sum, then, morphs have the following structure:
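A sketch of this overall shape, reconstructed from the description above (BLOCK stands for any keyword-led block described below; the MORPH: keyword is optional):

```
[MORPH:] name BLOCK ... BLOCK ,
    BLOCK ... BLOCK ,
    BLOCK ... BLOCK
;
```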
Blocks are explained in detail in the next subsection.
Blocks are the ingredients of the description, they specify information such as conditions of rule application, output of a rule, the tag associated with the rule, etc.
Blocks all have a leading keyword followed by some expressions (arguments) and last till the next keyword or the end of the variant:
Blocks can come in any order within a variant and can be repeated any number of times. So writing
KEYWORD: argument0 argument1 argument2 ...
has the same effect as when it is written like
KEYWORD: argument0 KEYWORD: argument1 KEYWORD: argument2 ...
or even
KEYWORD: argument0 SOME-OTHER-BLOCKS KEYWORD: argument1 SOME-OTHER-BLOCKS KEYWORD: argument2 ...
or when it is 'included' with a macro (see Macros).
Certain blocks specify information in a cumulative way, so every time they are specified the information is added to the info specified so far. For instance an IF block is cumulative, all the arguments of all the IF blocks of a variant cumulate to give the conditions of rule application, i.e., the rule applies only if all conditions on features are satisfied by the input (see IF block below).
However, other blocks do not specify information that can be interpreted cumulatively, so it does not make sense to have more than one argument with them or specify them more than once for a variant. (They, however, may still be specified in the preamble and overriden in a variant, for instance).
In every case, out of contradictory information, the one given last "has the last word", overriding previous ones.
So if you write
CLIP: 1 CLIP: 2
it is the same as
CLIP: 2
In what follows, blocks are listed and explained one by one.
default morphs are used to assign features to inputs unspecified for some features. A morph with a default block just adds extra rules that leave alone inputs which are specified for any of the features to be defaulted. The variants of a morph having a default block in their preamble will assume that neither of the features to be defaulted is present in the input.
So

morph DEFAULT: feature0 feature1 , MATCH: x OUT: feature0 ;
is equivalent to
morph ,
IF: !feature0 feature1 OUT: feature1 ,
IF: feature0 !feature1 OUT: feature0 ,
IF: feature0 feature1 OUT: feature0 feature1 ,
IF: !feature0 !feature1 MATCH: x OUT: feature0
;
Filters typically want to pass on their whole input by default.
This block defines the actual affix or lexis.
The exact shape of the variant determines what type of affix or lexis the variant describes:
- +aff describes a suffix: when the rule applies, aff is appended to the end of the input (after possibly clipping some characters)
- aff+ describes a prefix: when the rule applies, aff is prepended to the beginning of the input (after possibly clipping some characters)
- pref+suff describes a circumfix: when the rule applies, pref is prepended to the beginning of the input and suff is appended to the end of the input (after possibly clipping some characters at either end)
- lexis defines a lexis. This is typically used in the lexicon and serves as input to the rules. If the VARIANT keyword is left out, the lexis has to come as the first block of the rule (after the comma closing the preamble or the preceding rule).
If a lexis is used in the grammar, it is meant to stand for a suppletive form. Since it may well be a typo, a warning is given. We encourage the policy of putting suppletive paradigmatic exceptions as variants of the lexeme in the lexicon file, especially since matches are ineffective for lexis rules; conditions on the suppletion should therefore be expressed with features, which is much safer anyway.
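For instance, a hypothetical English-style plural suffix morph with two allomorphic variants might be written along these lines (the morph name and the feature names are invented for illustration):

```
PLUR OUT: plur ,
    +s  IF: !sibilant ,
    +es IF: sibilant
;
```

Here the OUT block in the preamble characterizes both variants, while each variant's IF block restricts where its allomorph applies.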
All the lexis and affix strings can contain any character except the hash mark, whitespace, semicolon, comma, exclamation mark, slash, plus sign and tilde: [^'#' ' ' '\t' '\n' ';' ',' '\r' '!' '/' '+' '~']
Todo: There should be a way to allow escapes.
Substitutions (which are special kind of rules) are specified by REPLACE/WITH blocks.
This block specifies the number of characters that need to be clipped from one end of the input.
It has no effect if the variant is a lexis or a substitution, so you don't use this block in the lexicon.
If no CLIP block is given, no characters are clipped (the integer defaults to zero).
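For instance, a hypothetical suffix variant that strips a word-final y before appending ies could be written as (feature name invented):

```
, +ies CLIP: 1 MATCH: y$ OUT: plur
```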
These blocks specify a substitution.
pattern is a hunlex regular expression.
template is a replacement string which can contain the special symbols '\1', '\2', etc., which reference the bracketed subpatterns in pattern.
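A hypothetical substitution rewriting a word-final ay as ai might look like this (the grouping-bracket syntax of hunlex regular expressions is assumed here for illustration):

```
, REPLACE: (a)y$ WITH: \1i
```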
specifies a match condition on rule application. The rule only applies if the input matches pattern, which is a hunlex regular expression. So you don't use this in the lexicon.
The matched expression defines a match at the edge of the word, the beginning for prefixes and the end for suffixes. You may include special symbols like ^ and $, to make this more explicit.
Match blocks are non-cumulative, but circumfixes allow two matches (one beginning with a ^ and one ending in a $).
If blocks specify the conditions of rule application. Conditions are either positive conditions (feature name) or negative conditions (NOT feature-name).
The rule only applies if the input has the positive features specified in the IF blocks and doesn't have the negative features specified in the IF block.
IF blocks are therefore cumulative and the conditions are understood conjunctively.
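For instance (feature names hypothetical), a variant restricted to back-harmony, non-sibilant inputs, with the cumulative conditions spread over two IF blocks:

```
, +ban IF: back IF: !sibilant OUT: iness
```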
specify the output of the variant (affix rule or lexis). An output can be a feature or a morph.
Features can be restricted to particular morphs.
specifies usage qualifiers describing the variant.
Cumulative (conjunctive)
tells that the morph in question is a filter which defines fallback rules for lexical features.
This means that the variants are meant to apply only if the input has none of the filtered features.
Has no effect within individual variants or in the lexicon. Only relevant in a morph preamble in the grammar.
Cumulative (conjunctive on the rule conditions)
Defines 'inheritance' of features: a feature mentioned in the KEEP block is an output feature of the result of rule application if and only if the input has the feature (provided, of course, that the particular variant applies to the input).
If output features and keep features overlap, output features are meant to override inheritance.
Features which are restricted by the input condition (IF block) are inherited normally, but since they are known, can also be mentioned in the OUT block for clarity.
NB: The thingies following KEEP in a KEEP block are features. They cannot be macro names. Don't trick yourself by 'abbreviating' a sequence of phono-features with a macro and then referring to that in a KEEP block. Don't forget that macros abbreviate (a series of) blocks, so clearly they can't be nested within a KEEP block.
Cumulative
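For instance (feature names hypothetical), a derivational suffix that passes the input's back-harmony feature on to its output while adding its own feature:

```
, +sag KEEP: back OUT: nominal
```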
specifies if the rule application gives a full form. For bound stems or non-closing affixes, it has to be set to false.
By default, variants in the lexicon are not free, while variants in the grammar are free. Todo: verify that this is correct.
specifies the feature structure graph to be merged when the rule applies. feature-structure is a KR-style feature structure description string.
defines a macro named macro-name. Later (any time after this definition), whenever macro-name is encountered, it is understood as if it read blocks. blocks is a sequence of any blocks, including (other) macro-names. A macro-name appearing anywhere other than its definition has to be already defined.
Todo: What happens if a macro-name coincides with a declared morph name? Or with a declared feature?
binds regexp-name to a hunlex regular expression, i.e., a regular expression that can contain regular expression macro-names in angle-brackets. regexp-name can be referenced within any regular expression later. An expression is resolved by replacing the substring <regexp-name> with the resolved regexp.
This means that you have to escape your <-s and >-s if they do not delimit regexp names.
As said, you can define regexp macros using other macros; the only restriction is that at the time of use a regexp-name has to be defined already (its definition should come earlier in the file), so that it can be resolved at the time the definition is read.
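For instance, once a regexp macro named vowel has been bound to a pattern like [aeiou], it can be reused inside any later pattern via angle brackets (example hypothetical):

```
MATCH: <vowel>t$
```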
There are various files that hunlex processes. Input as well as output files are described in this chapter. The file names used in this section are just nicknames (which happen to be the default filenames assumed) and can be changed at will by setting toplevel options (see Options).
There are several types of files hunlex considers and they will all be discussed in turn.
Lexicon and grammar are the two files which are considered the primary resources. These files contain the description of the language's morphology with all the rules for affixation, lexical entries, specifying morphological output annotation (tags), etc., see Primary Resources. Secondly, there are configuration files, which declare the morphemes and features that are considered active by hunlex for a particular compilation. By choosing and adjusting parameters of these features, one can manipulate under- and over-generation of the analyzer (see Resource Compilation Options) and, most importantly, regulate which affixes are merged together to yield the affix-cluster rules dumped into the affix file. The way affixes are merged is crucial for the efficiency of real-time analyzers (see Levels). These files are also described below (see Configuration Files).
Primary resources are the files that you are supposed to develop, maintain, extend and that describe your morphology (see Motivation). There are two primary resources: the grammar and the lexicon. These files are described below.
The lexicon file (the file name of which is lexicon by default, but can be set through options, see Input File Options) is the repository of lexical entries, containing information about:
The syntax of the lexicon file is basically the same as that of the grammar, except that it cannot contain macro definitions (see Macros). This syntax of describing morphology is explained in detail in another chapter (see Description Language).
For examples of lexicons, have a look at the zillion examples in the Examples directory that comes with the distribution (see Installed Files).
The grammar file is the other primary resource and also absolutely necessary to describe the morphology of your language. Its name is grammar by default but can be changed by setting toplevel options (see Input File Options). The grammar file specifies:
The syntax of the grammar file is the same as the one used for the lexicon except that the grammar file can contain macro definitions (see Macros). The syntax and semantics of this description language is explained in detail in another chapter, see Description Language.
For examples of grammar files, have a look at the zillion examples in the Examples directory that comes with the distribution (see Installed Files).
Configuration files are the files which mediate between primary resources describing a language and a particular resource created for a particular method, routine, application.
There are three configuration files which tell hunlex which units and features should be included into the output resource from among the ones mentioned in the primary resources. The units (morphemes, features) not declared in these configuration files are considered ineffective by hunlex while reading the primary resources.
The format of these three definition files is the same: each declares one unit per line (with some parameters) and accepts comments starting with '#' and lasting till the end of the line.
They are discussed in turn below.
The morph.conf file is one of the compilation configuration files that determine how hunlex compiles its output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
It declares the affix morphemes and the filters that are to be used from among the ones that are in the grammar.
Warning: the affix morphemes not listed (or commented out) in this file are ineffective for the compilation (as if they were not in the grammar).
Each line in this file contains the affix morpheme's name and optionally a second field, which gives the level of the morpheme. If no level is given, the affix is assumed to be of level maximum_level (the value of the option MAX_LEVEL, see Resource Compilation Options). Very briefly, levels regulate which affixes will be merged with which other affixes to yield the affix clusters that are dumped as affix rules into the affix file. The odds and ends of levels are described in detail in another chapter (see Levels).
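A morph.conf along these lines might look as follows (the morph names and levels are invented for illustration):

```
# affix morphemes and their levels
PLUR    1
INESS   1
DIMIN   0     # below MIN_LEVEL: precompiled into the dictionary
SUPERL        # no level given: defaults to MAX_LEVEL
```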
For examples of the rather dull morph.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
If you have a grammar and you want to declare all the (undeclared) morphs defined in it by including them in morph.conf, all you have to do is type
make DEBUG_LEVEL=1 new resources 2>&1 | grep '(morph skipped)' | cut -d' ' -f1 >> in/morph.conf
in the directory where your local Makefile resides. This will append all the undeclared morphs (one per line) to the morph.conf file. Note, the morphs so declared will be of level maximum_level (see above).
The phono.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
The phono.conf file is the file simply listing all the features that we want used from among the ones used in the grammar and the lexicon. Very briefly, features are attributes of affixes and lexical entries the presence or absence of which can be a condition on applying an affix rule.
Warning: Features used in the grammar but not mentioned (or commented out) in the phono.conf file will be ignored (as if they were never there) for the present compilation by hunlex when reading the primary resources.
Warning: Features mentioned in phono.conf but never used in the grammar or the lexicon are allowed; they maybe should generate a warning, but they don't. This may cause a lot of trouble.
So, phono.conf simply declares the features, one on each line, and allows the usual comments (with a '#').
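A minimal phono.conf in this format might look like this (feature names invented):

```
# active features, one per line
sibilant
back       # back vowel harmony
```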
For examples of phono.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
If you have a grammar and you want to declare all the (undeclared) features referred to in conditions in the grammar by including them in phono.conf, all you have to do is type
make DEBUG_LEVEL=1 new resources 2>&1 | grep '(feature skipped)' | cut -d' ' -f1 | sort -u >> in/phono.conf
The usage.conf file is one of the compilation configuration files that determine how hunlex compiles the output resources (aff and dic, see Output Resources) from the primary resources (lexicon and grammar, see Primary Resources).
usage.conf in particular determines which usage qualifiers are allowed for the input units (lexical entries, affixes, filters and the variants thereof) that are included into the resource to be compiled. Units having a usage qualifier that is not listed in this file are ignored for the compilation (as if they were not there).
NB: Usage qualifiers are not first class features. They can not be negated or used as conditions on rule application. They are simply used to categorize rules (affixes and stems) in certain dimensions such as etymology, register, usage domain, normative status, formality, etc.
In addition to declaring allowed usage qualifiers, this file has another function as well. Each line containing the usage qualifier may contain a second field which is a tag associated with that usage feature. If this field is missing, the name of the usage qualifier string is assumed to be its tag. Usage qualifier tags can be output by the analyzer if they are compiled into the resources by hunlex.
This can be configured with the output info option (see Resource Compilation Options).
Warning: This option is not implemented yet.
Todo: This is not implemented yet. I don't even know if this is fine like this. The problem is that they cannot really be just intermixed with the ordinary morphological tags.
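A hypothetical usage.conf illustrating both the one-field and the two-field form described above:

```
# qualifier [tag]
archaic  arch    # qualifier 'archaic', tagged as 'arch'
slang            # tag defaults to 'slang'
```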
Various dimensions of usage information can be made effective by introducing expressions with arbitrary leading keywords (see Description Language). Redefining each of the wanted usage dimensions in the parsing_common.ml file will result in making any one or more of them effective as usage qualifiers. The point is that you can keep a lot of information in the same lexical database. When the keywords it contains are hunlex-ineffective, the expressions they lead are simply ignored.
Caveat: At the moment, for these alternatives, you have to recompile hunlex, with the new keyword associations, see Description Language.
Todo: This could be done online but has very low priority.
For examples of usage.conf files, browse the examples in the Examples directory that comes with the distribution (see Installed Files).
The output of a hunlex resource compilation is an affix file and a dictionary file. In brief, the affix file contains the description of the affix (cluster) rules of the language we analyze, while the dictionary contains the stems the affix rules can apply to. They have more or less the same role as the grammar and lexicon files, the primary resources of hunlex (see Primary Resources). But the affix and dictionary files are resources that are used by real-time word-analysis routines (such as morphbase, myspell or jmorph, see Related Software and Resources). They share commonalities of format, with minor idiosyncrasies, some of which are still changing.
Hunlex reads a transparent human-maintainable non-redundant morphological grammar description with the lexicon of a language and creates affix and dictionary files tailored to your needs (see Introduction). The ultimate purpose of hunlex is that these output resource files could at last be considered a binary-like secondary (automatically compiled) format, not a primary (maintained) lexical resource.
Therefore the technical specification of these output formats should only concern you here if you want to compile affix and dictionary files for your own (or modified versions of our own) word-analysis software which also reads the aff/dic files. In such a case, however, you know that format better than I do. All I can say is that the parameters along which the format can be manipulated are supposed to conform with the formats of the software listed in Software that can use the output of Hunlex as input. If you develop some such software as well and would like your format to be supported, take a deep breath and consider requesting a feature from the authors (see Requesting a New Feature).
In sum, the format of these output resource files are not detailed. Anyway, they are (probably) well documented elsewhere (e.g., myspell manual page). See especially the documentation of huntools and the morphbase library (see Huntools).
This chapter is a verbatim include of the hunlex manpage. Command-line control is not the recommended interface to use hunlex, see toplevel control (see Toplevel Control).
HUNLEX(1)                        User Commands                        HUNLEX(1)

NAME
       hunlex - manual page for hunlex 0.3

SYNOPSIS
       hunlex <options>

DESCRIPTION
       Options: (for more see manpage)

       option           description (default settings)
       ------------------------------------------
       -synthesis       morphological synthesis (no)
       -synth_in        synthesize from file (stdin)
       -synth_out       synthesize to file (stdout)
       -lexicon         lexicon (lexicon)
       -grammar         morphological grammar file (grammar)
       -phono           phono features file (phono.include)
       -morph           morph declarations and levels file (morph.include)
       -usage           usage qualifiers and their tags (usage.include)
       -signature       feature structure signature (None)
       -aff             output affix file (affix.aff)
       -dic             output dictionary file (dictionary.dic)
       -mode            output mode [Spellchecker|Stemmer|Analyzer|NoMode] (NoMode)
       -steminfo        info to output about a word's stem
                        [Tag|Lemma|Stem|LemmaWithTag|StemWithTag|NoSteminfo] (NoSteminfo)
       -fs_info         output fs in the dictionary for testing purposes (no)
       -tag_delim       tag delimiter ('_')
       -out_delim       output delimiter (<space>)
       -out_delim_dic   output delimiter for dic file (<tab>)
       -double_flags    [0-9][^0-9] type double-char flags (no, single char)
       -flags           legitimate flag characters file (none, use predefined flags)
       -min_level       minimum morph level (0)
       -max_level       maximum morph level (1000)
       -debug_level     debug level (0)
       -test            testable output (0)
       --version        Display version info and exit
       -help            Display this list of options
       --help           Display this list of options

SEE ALSO
       The full documentation for hunlex is maintained as a Texinfo manual. If
       the info and hunlex programs are properly installed at your site, the
       command

              info hunlex

       should give you access to the complete manual.

hunlex 0.3                        May 2005                            HUNLEX(1)
Levels index morphemes and are assigned to morphemes in the morph.conf file (see Morpheme Configuration File).
Levels govern which affixes will be merged together into complex affixes (or affix clusters) and will constitute an affix rule (linguistically correctly, and affix-cluster rule) in the output affix file (see Output Resources). Affix rules in the affix file will be stripped from the analyzed words by the analysis routines in one step (i.e., by one rule-application).
Levels, then, regulate the output resources of hunlex and have no role to play in how you design your grammars. There are no levels in the hunlex grammar and lexicon, the files which describe the morphology of the language (see Primary Resources). Levels make sense only in relation to the compilation process.
This chapter describes why you would want levels, how you manipulate them and what consequences it has on analysis.
Imagine a word has several affixes like dalokban (= dal 'song' + ok 'plural' + ban 'inessive'). Assume that your hunlex grammar correctly describes the plural and inessive morphemes and their combination rules. If you assign these morphemes to different levels, the output resource will contain affix rules expressing the morphemes separately. This means that these affixes are not stripped in one go by the analysis routines using the affix file as their resource.
Some affixes, however, may need to be stripped as a cluster in one go, because some analysis algorithms do not allow an arbitrary number of consecutive affix-stripping operations, or because stripping them in one go is just more optimal for your purposes (see Levels and Optimizing Performance). Therefore the separate affix rules in the input grammar should be merged when they are dumped by hunlex as rules into the affix file. Well, levels regulate which morphemes should be merged with which other morphemes. (To be more precise, they regulate which affix rules expressing which morphemes should be merged with which other affix rules expressing which other morphemes.)
Since merged affix rules are highly redundant and tedious to maintain, one of the main purposes of hunlex is actually to allow for high flexibility in your choice of merging affixes to create resources optimized for your needs, while at the same time also allow for transparent and non-redundant description for easy maintenance and scalability (see Introduction).
Levels do not only regulate which affixes are compiled into one affix cluster (an affix rule in the output affix file, see Levels and Affix Rules). They also determine which stems are precompiled into the dictionary (see Output Resources). In particular, all affixes below a so called minimal lexical level (see Levels and Ordering) are precompiled with the stems of the lexicon into the output dictionary.
For instance, taking the example of the previous section, if both the plural and the inessive morpheme are below the minimal level (of on-line-ness), the whole morphologically complex word dalokban will be included in the dictionary file. To learn why you would want to do such a thing, see also Manipulating Levels with Options.
The word 'level' is actually rather misleading, since the notion of level we have here has only a very restricted sense of ordering. There is no sense in which (rules expressing) a morpheme of level i can not be applied after (rules expressing) another morpheme of level j where i > j.
There is a sense in which levels do have ordering, however. There is always a minimal level (that is, the value of the MIN_LEVEL option, see Resource Compilation Options) below which all morphemes are compiled into the dictionary (i.e., they are merged with the absolute stems in the lexicon and dumped as stems into the dictionary). The default lexical level is 1, meaning that (affix rules expressing) morphemes of level 0 or less are merged with the appropriate stems and the resulting (morphologically complex) words will be entries in the dictionary file (see Levels and Stems).
Since the dictionary file entries are the 'practical' stems of the analysis routines, configuring the level of morphemes gives you the option to adjust the depth of stemming. For instance, if you do not want your stemmer to analyze some derivational affix (which you otherwise describe productively with a rule in the grammar), all you have to do is assign a lexical level to this morpheme in morph.conf. Recompiling with this configuration will result in resources with the corresponding entries precompiled into the dictionary file.
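For illustration, here is a hypothetical morph.conf fragment (each line is a morpheme name followed by its level, separated by a single space; the morpheme names are invented):

```
PLUR 2
INE 2
DIMIN 0
```

With the default minimal level of 1, the rules expressing DIMIN would be merged with the appropriate stems and precompiled into the dictionary, while PLUR and INE would remain affix rules in the affix file.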
See also the MAX_LEVEL option, see Resource Compilation Options.
You don't always have to fiddle manually with assigning alternative levels to each morpheme. For some special cases, hunlex provides an option. For example, it is very common to want to generate all the words your grammar accepts. All you have to do is set the minimal level, via the MIN_LEVEL option (see Resource Compilation Options), to a value higher than any of the levels you have assigned to morphemes in the morph.conf file (see Morpheme Configuration File). This tells hunlex that the rules expressing all the morphemes are to be compiled into the dictionary, which amounts to deriving all the words of the language. This functionality is also provided as the generate toplevel target (see Targets); in fact
make generate
is just a shorthand for
make MIN_LEVEL=100000 new resources
In order to create an output in which no two affix rules are merged, it is enough to assign every morpheme to a different level, for instance by using the following unix shell commands:
$ cp morph.conf morph.conf.orig
$ cut -d' ' -f1 morph.conf.orig | nl -nln -s' ' | sed 's/\(.*\) \(.*\)$/\2 \1/g' > morph.conf
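As a sketch of what this pipeline does, here it is run on a made-up three-line morph.conf (the morpheme names are invented): every morpheme's level is replaced by its line number, so all levels come out distinct.

```shell
# create a toy morph.conf (format: morpheme name, space, level)
printf 'PLUR 2\nINE 2\nDIMIN 2\n' > morph.conf
cp morph.conf morph.conf.orig
# number the morpheme names and swap the columns, so each morpheme
# gets its line number as a unique level
cut -d' ' -f1 morph.conf.orig | nl -nln -s' ' \
  | sed 's/\(.*\) \(.*\)$/\2 \1/g' > morph.conf
# first field is the morpheme, second its new, unique level
awk '{print $1, $2}' morph.conf
```

The result assigns PLUR level 1, INE level 2 and DIMIN level 3, so no two affix rules end up merged.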
Todo: I should provide an option that does this.
With a routine that supports any number of affix-stripping operations, such a resource will allow correct analysis; with routines that allow only a bounded number of rule applications, it will not.
Todo: write on recursion
Geeky note: If rule application monotonically increases the size of the input, potential recursion is never unbounded recursion, since all analysis routines have a fixed buffer size anyway. If, however, empty strings or clippings make rule application non-monotonic in size, potential recursion may cause actual infinite loops in some incautious implementations. Boundedness of recursion due to buffer-size restrictions is only one sense in which the full intended (implied) generative power of an arbitrary hunlex grammar is not reflected in the analyzer's actual analysis potential.
For myspell style resources where you want only one stage of affix stripping, you should use one lexical and one non-lexical level. Without having to create your alternative morph.conf file, this can easily be done with the combination of the MIN_LEVEL and the MAX_LEVEL options (see Options).
You just set these two options to the same value l; all morphemes with level equal to or smaller than l will then be compiled into the dictionary, and all the other morphemes (i.e., affix morphemes with level greater than l) will be merged into clusters (and these affixes will be dumped to the affix file as rules). Implementations like myspell (see Myspell), which allow only one step of suffix stripping, can only run correctly with such resources.
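For instance, assuming a lexical cutoff level of 1 and the same Makefile-variable mechanism as in the generate shorthand above, such a one-stage resource could be compiled with something like:

```
make MIN_LEVEL=1 MAX_LEVEL=1 new resources
```

(The value 1 and the target name here merely mirror the earlier make example; substitute whatever cutoff the levels in your morph.conf call for.)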
Todo: This is slightly more complicated because prefixes and suffixes are stripped separately. We should clarify this. And this whole myspell business is actually not tested.
Myspell supports only one stage of affix stripping, the morphbase routines support two, and jmorph supports any number (it is truly recursive).
With an affix file in which there are separate affix rules for these affixes, the analyzer would have to perform two suffix-stripping operations to recognize the word dalokban. Using such a resource, myspell will therefore not recognize this word at all. The morphbase routines will be able to analyze it, since they allow two stages of suffix stripping, which is just enough; so will jmorph, since it allows any number of suffix-stripping steps.
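The difference can be sketched with plain shell suffix-stripping (a toy simulation, not the real analyzers): with separate rules for -ok and -ban, reaching the stem dal from dalokban takes two stripping steps, while a merged -okban cluster rule takes one.

```shell
word=dalokban
# separate rules: two consecutive stripping operations
step1=${word%ban}            # dalokban -> dalok
step2=${step1%ok}            # dalok    -> dal
echo "two steps: $step2"
# merged cluster rule: a single stripping operation
echo "one step: ${word%okban}"
```

A one-stage analyzer like myspell only ever performs the second, single-step variant, which is why it needs the merged rule to recognize the word.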
So, when you configure which affixes are merged, make sure you have considered the generative capacity of the target analysis routine (how many suffix strippings it can make).
What to merge and precompile into the dictionary?
Which affix rules you want to merge and precompile into clusters is entirely up to you and usually a question of optimization. If you choose not to precompile anything, then your affix file will be small, but your analysis may not be optimal for runtime (if it generates the correct analyses at all, see Levels and Steps of Affix Stripping).
If, on the other hand, you precompile all affixes into affix clusters, you might end up with an affix file of hundreds of megabytes, which will compromise the memory load of your runtime analysis (though it may be faster for the analysis algorithm than recursive calls). This last realization led the author of hunspell (see Huntools) to introduce a second step of suffix stripping into the algorithm, whose single level of affix stripping was a legacy of the original myspell code.
Finally, compiling everything into the lexicon is not a very good idea for complex morphologies and big lexicons, although it may be indispensable for testing on smaller fragments of lexicons/grammars or for creating wordlists (see Test Targets).
Some special affix rules should always be precompiled into the dictionary and never output as affix rules. These are the rules that cannot be interpreted as affix rules at all, for instance, rules of substitution or suppletion, which are beyond the descriptive capacity of affix files. Therefore all substitutions and suppletions are precompiled (merged with the rules or stems they can be applied to) irrespective of their level. Find more about this.
We call the information that a morphological analyzer is expected to output for an analyzed word a piece of morphological annotation. In more general terms, however, when we talk about any kind of word-analysis routine, such as a spell-checker or stemmer, we call the output information these routines associate with words tags. We want to emphasize here that this piece of output information 'tags' the whole word that is analyzed. The tag is used to annotate words in a corpus by decorating a raw text with useful extra information.
NB: Tagging as we use the term in no way implies a constituent structure, segmentation, etc. of the input word form.
This document describes the ways in which you can associate tags with your morphemes (or with individual stem variants and affix rules). These tags should be thought of as ingredients of the output tag that an analyzed word containing that morpheme receives. Certainly, not all analysis software can, or is supposed to, output any useful information about the morphological makeup of the word. For instance, a spell-checker is typically required only to recognize whether a word is correct (usually in a strict normative sense), but a morphological analyzer or a stemmer is supposed to output some information. Since the huntools routines are able to perform full morphological analysis, not just recognition (REFERENCE), adding morphological tags to your rules is worth your while. Nevertheless, if you never want to output any useful information (because you only care about spell-checking), you don't really need to read on.
The output tag associated with a successful analysis of a word is defined, rather primitively, as the concatenation of the tags assigned to the rules and the stem which constituted the parse of the word.
NB: It is not clear whether the order of prefixes, stems and suffixes should matter in some cases. In the usual case we assume that the analyzer concatenates the tags in the order of affix-rule stripping.
Todo: What the analyzers do with the tags should be clarified. In fact, both huntools and jmorph do something that is smarter for particular purposes but not reasonably generalizable or even incorrect for the general case.
For this to work, you can assign tags (chunks of output annotation) to any affix variant and stem variant in your grammar and lexicon. This is done with TAG expressions (see Description Language, TAG keyword).
As hunlex merges affixes, it merges their tags accordingly, as expected. There are a number of formatting options with which you can influence the way tags are put together. One is the TAG_DELIM option (see Resource Compilation Options), which sets the delimiter between any two tags. If multiple tags are given by TAG expressions, they are also concatenated with this delimiter in the order of their appearance within the rule block.
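As a toy illustration of this concatenation (the tag names and the delimiter character are invented; the real tags come from your TAG expressions):

```shell
TAG_DELIM='+'
stem_tag='NOUN'
plur_tag='PLUR'
ine_tag='INE'
# tags are glued together with the delimiter, in stripping order
echo "${stem_tag}${TAG_DELIM}${plur_tag}${TAG_DELIM}${ine_tag}"
```

This prints NOUN+PLUR+INE; when hunlex merges the plural and inessive rules into one cluster, the cluster's tag is the corresponding delimited concatenation.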
Depending on the main output mode of hunlex, various pieces of information can be chosen to be output as tags. This is important if you want to configure your resources so that they give you a stemmer, a tagger or an analyzer; various other options are also available, see Resource Compilation Options.
Hunlex also supports feature structures as a kind of annotation scheme. This is extremely useful for cross-checking the correctness of your tags. Tags can be quite messy, and since they are just pieces of strings, they are difficult to check.
Feature structures are structured objects which are checked against a signature (given in the signature file, which is the value of the option SIGNATURE, see Input File Options) and are merged with graph-unification. As annotations to give complex morphological information, they are more expressive and adequate than pieces of tags that are concatenated. Also, feature structures, unlike just arbitrary strings in the tags, are interpretable data structures which one can directly calculate with, say, in a syntactic analyzer using the output of the morphological analyzer.
That said, it has to be added that the analyzers themselves do not support these feature structures. This means that they still manipulate pieces of feature-structure descriptions as strings and glue them together. If you use the feature structures within your hunlex description, however, you can be certain that, even if the analyzer just concatenates them, the resulting analyses describe valid FS-s according to your signature (see also the SIGNATURE option under Input File Options).
Todo: Include a proper description of the extended KR framework of FS-s.
Todo: No support for derivations is implemented yet (it is on the way).
Flags are used in the output resources (see Output Resources) to index affix rules. Each entry in the dictionary file has a set of flags indicating which affix rules can be applied to it.
So, flags are given by hunlex and written in the affix and dictionary files. There is no such thing as a flag in the hunlex input grammar or lexicon, the files which describe your morphology.
You can specify some aspects of what flags hunlex will assign to affix classes and how. This is what the present chapter is about.
Flags can be one or two characters long.
Myspell (and legacy xspell implementations) can only handle single-character flags. For the general case this is fine, and it is the default. If you are dealing with languages of sensible complexity, this default is enough and you don't need to read this chapter any further.
Double flags are composed of a number as first character and a flaggable character (see Flaggable Characters) as the second, such as '3f' or '9t'. In order to use double flags, use the DOUBLE_FLAGS option (see Resource Compilation Options). Read on to learn why you would use double flags (see Limit on the Number of Flags).
Flaggable characters are characters that hunlex can use as flags (in the case of single-character flags) or as the second character of a double flag (see Two Forms of Flags). All non-whitespace characters are in principle flaggable.
The actual choice of flaggable characters is by default a set of 132 characters which are hard-wired in hunlex. I am not in a position to list them here (I bet my hundred forints that some of the characters would be displayed completely differently for you than for me, in any format and on any display, ranging from your terminal through your browser to acroread). They are, however, included as a comment in the original texinfo version of this document (see file doc/texinfo/flags.texinfo, or, to be sure, the source code, src/hunlex_wrapper.ml).
Flaggable characters can, however, be customized through hunlex's FLAGS option (see Input File Options). This option takes a filename. The contents of the file is the sequence of characters to be used as flags, without any delimiters.
Warning: Make sure you do not include any whitespace in this file (other than a trailing newline), and do not include any character twice. Since the characters are not checked for sanity, doing otherwise may result in ill-formed affix files or conflated affix classes. If you use double flags, do not include digits among the flaggables.
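A minimal sketch of creating such a custom flags file (the character choice is arbitrary; note there is no whitespace apart from the trailing newline, no digits, and no character occurs twice):

```shell
# write the flaggable characters as one unbroken run
printf 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\n' > flags.conf
# sanity check: split into one character per line and list duplicates
# (prints nothing if every character is unique)
fold -w1 flags.conf | sort | uniq -d
```

Since hunlex does not check the file itself, a quick duplicate check like the one above can save you from silently conflated affix classes.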
Todo: Why don't we bloody check this? Checking of flaggable characters should be amended in a future version.
The association of flags with affix classes takes flags from left to right. This means that if the output requires 35 flags, the first 35 flaggable characters will be used. This is, however, all that can be said: which actual flag is assigned to which affix class cannot be further specified.
Warning: This last sentence is a warning in itself. For people who are used to fiddling with affix files that were manually created (in fact almost all ispell resources), it has to be stressed: hunlex-generated affix files are not to be read by humans and should be considered binary. Associations of flags with particular affix rules/classes are not permanent across various configurations/resource compilations. If you want to post-process affix files, never assume particular flags are meaningful. This is rather obvious once you realize that the affix rules/classes themselves are not consistent across different parametrizations, either (see e.g., levels). This policy is called dynamic flagging. An exception to dynamic flagging might be special flags, which are fairly consistent, since their expression can be customized to particular strings (see Resource Compilation Options, Affix file variables). But this feature will soon cease to exist, so just wipe the tears off your face, be happy that you have a hunlex resource, and forget your old flags.
If you use single-character flags (see Two Forms of Flags), the number of flags equals the number of flaggable characters, i.e., the length of the custom flag file (see Resource Compilation Options, flags.conf) or 132, by default (see Flaggable Characters). This is also the maximum number of affix classes you can have in your output resources.
(Since some flaggable characters are reserved to express the special flags (see Special Flags), the number of possible affix classes is the number of flaggables minus the number of special flags needed.)
Sometimes this is not enough: for languages with hugely complex and lexically idiosyncratic morphology, one has to use double flags (see Two Forms of Flags). You can tell that you really need this if hunlex resource compilation stops, complaining that there are not enough flags (exception Not_enough_flags).
You tell hunlex to use double flags with the appropriately named DOUBLE_FLAGS option (see Resource Compilation Options). With double flags you can have 10 times more affix classes than flaggable characters, i.e., 1320 (from '0a' to '9z' or whatever) with the default flaggables (see Flaggable Characters).
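The arithmetic behind the figure: a double flag pairs one of the ten digits with one flaggable character, so with the default 132 flaggables the number of available flags is:

```shell
digits=10
flaggables=132
# each (digit, flaggable) pair is a distinct double flag
echo $((digits * flaggables))   # 1320
```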
Warning: Double flags are only understood by the morphbase implementations (see Huntools), but not by ispell, myspell or (yet) jmorph. This is a reason why one might use the huntools package. Please tell us if this is the sole reason you are using huntools.
Caveat: The use of double flags for the morphbase routines is a compile-time option at the moment.
Caveat: When customizing your flaggables (see Flaggable Characters) and using double flags, you can have up to about two thousand affix classes. If this is not enough for you (you get the exception Not_enough_flags), you are likely to have a problem in your grammar (see Troubleshooting). If you are sure it is not a grammar problem, you had better choose another language. At any rate, please notify us (see Contact) about this extraordinary case and we might extend the support for flags even in morphbase on one of our free afternoons (see Requesting a New Feature).
There are special flags in the affix file (the full documentation of special flags is (hopefully) found in the morphbase and jmorph docs). Special flags are special because they do not index affixes, but encode other sorts of information needed by analysis routines. There are a number of flags one can configure through the options (see Options). These are
Todo: Include the ones below in the implementation and uncomment them from the texinfo document. These are:
WARNING, CAVEAT: if you use two-character flags (the -double_flags option), you have to make sure that the special flags are also two characters long; otherwise this will lead to ill-formed affix files. If you set flags through the toplevel Makefile's variables, make sure your flags are quoted (otherwise make will resolve the flag '~' to, say, '/home/tron' and you won't understand what went wrong...).
TODO: Special flags should NOT be user-configurable at all. They should be assigned the first possible flags.
If hunlex won't install, check Prerequisites carefully, with special attention to the versions.
There are some hints hidden among the lines of Install which you may have missed.
If you upgraded from an earlier version, make sure you uninstall the earlier version first (see Uninstall and Reinstall).
If you use hunlex through the toplevel control with Makefile (see Toplevel Control), the hunlex executable is by default assumed to be found in the path under the name hunlex.
By default, installation installs hunlex into /usr/local/bin (see Installed Files) unless you set another install prefix.
Find out whether the hunlex executable is found in the path by typing
$ which hunlex
If it is not found, check again where you installed it by looking into the file install_prefix in the toplevel directory of your source distribution. If this file is not there, your installation was not successful.
If you have found your install prefix, see if install-prefix/bin/hunlex exists. If it does, you can do one of the following:
PATH=install-prefix/bin:${PATH}
or set the HUNLEX option to install-prefix/bin/hunlex (see Executable Path Options).
If your grammar seems to overgenerate, the first thing to do is to check whether you have declared, in the phono.conf file, the features that your grammar relies on.
You may have misspelled some phono feature; this can be traced by peeping into the debug messages. Ideally you do this by redirecting the output into a log file (with the debug level set sufficiently high) and searching the file for the term 'skipped'. This is the warning hunlex gives you to let you know that an entity has been skipped.
The HunLex framework is being used in the development of an open-source morphological database (lexicon and grammar) for the Hungarian language in a collaboration between the Hungarian Academy of Sciences, Research Institute for Linguistics and the Budapest Institute of Technology, Media Education and Research Center Natural Language Processing Lab. This database aspires to be the most complete and accurate account of Hungarian morphology published so far, and is the result of merging several well-respected electronic resources http://lab.mokk.bme.hu.
http://www.xrce.xerox.com/competencies/content-analysis/fst/home.en.html
or
http://www.stanford.edu/~laurik/fsmbook/home.html
AFF (outputdir/affix.aff): Output File Options
AFF_FORBIDDENWORD (!): Resource Compilation Options
AFF_ONLYROOT (~): Resource Compilation Options
AFF_PREAMBLE (): Resource Compilation Options
AFF_SET (ISO8859-2): Resource Compilation Options
AFF_SETTINGS (confdir/affix_vars.conf): Resource Compilation Options
CHAR_CONVERSION_TABLE (): Resource Compilation Options
CONFDIR (inputdir): Input File Options
DEBUG_LEVEL: Verbosity and Debugging
DEBUG_LEVEL (0): Verbosity and Debug Options
DIC (outputdir/dictionary.dic): Output File Options
DIC_PREAMBLE (): Resource Compilation Options
DOUBLE_FLAGS (): Resource Compilation Options
FLAGS (''): Input File Options
FS_INFO (): Resource Compilation Options
GRAMMAR (grammardir/grammar): Input File Options
GRAMMARDIR (inputdir): Input File Options
HUNLEX (hunlex): Executable Path Options
HUNMORPH (hunmorph): Executable Path Options
INPUTDIR (. = current directory): Input File Options
LEXICON (grammardir/lexicon): Input File Options
MAX_LEVEL (10000): Resource Compilation Options
MIN_LEVEL (1): Resource Compilation Options
MODE (Analyzer): Resource Compilation Options
MORPH (confdir/morph.conf): Input File Options
OUT_DELIM (' '): Resource Compilation Options
OUT_DELIM_DIC (<TAB>): Resource Compilation Options
OUTPUTDIR (. = current directory): Output File Options
PHONO (confdir/phono.conf): Input File Options
QUIET: Verbosity and Debugging
QUIET (@ = quiet): Verbosity and Debug Options
REPLACEMENT_TABLE (): Resource Compilation Options
SIGNATURE (''): Input File Options
STEM_GIVEN (): Resource Compilation Options
STEMINFO (LemmaWithTag): Resource Compilation Options
TAG_DELIM (''): Resource Compilation Options
TEST (/dev/stdin): Input File Options
TIME: Verbosity and Debugging
TIME (time): Verbosity and Debug Options
USAGE (confdir/usage.conf): Input File Options
WORDLIST (outputdir/wordlist): Output File Options