hunnorm

hunhtml
-------

Compiling
---------

make

Install
-------

cp bin/* /usr/bin

Usage:
------

ls *html | hunhtml

hunhtml writes raw text files to the source directory
with a special diagnostic extension,

for example:

example.html.u--

(u means Unicode raw text)


Filters in hunhtml
===================

1. htmlcat - concatenates files into an XML-like file.

Usage: htmlcat reads the file list from standard input.

Example: ls *.html | htmlcat >concatenated_html.hunml

Example of a concatenated document:

/proba/valami

<[CDATA[
Text, text, text...
]]>13635289912101858631970626463217024222481

/proba/valami2

<[CDATA[
More text.
]]>13635289912101858631970626463217024222481

The SPLITCODE is needed by the flex filters (see splitcode.h).

2. htmlsplit - splits concatenated HUNML files

Usage: htmlsplit [extension] 
element.

The information is a three-character code.
The first two characters mean the following:

1- ISO-8859-1
2- ISO-8859-2
15 Windows-1252
25 Windows-1250
1w ISO-8859-1, but really Windows-1252
2w ISO-8859-2, but really Windows-1250
u- UTF-8

The third character means:
- does not contain Hungarian accented characters
m contains Hungarian accented characters
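
The code table above can be read mechanically. The following is a minimal
Python sketch, not part of hunhtml: the names ENCODINGS and decode_info are
hypothetical, and it assumes the code is the last three characters of the
file name, as in example.html.u--.

```python
# Hypothetical helper, not part of hunhtml: decode the three-character
# information code described above.
ENCODINGS = {
    "1-": "ISO-8859-1",
    "2-": "ISO-8859-2",
    "15": "Windows-1252",
    "25": "Windows-1250",
    "1w": "ISO-8859-1, but really Windows-1252",
    "2w": "ISO-8859-2, but really Windows-1250",
    "u-": "UTF-8",
}


def decode_info(code):
    """Return (encoding description, has_hungarian_accents) for a code."""
    if len(code) != 3:
        raise ValueError("expected a three-character code: %r" % code)
    encoding = ENCODINGS.get(code[:2])
    if encoding is None:
        raise ValueError("unknown encoding code: %r" % code[:2])
    if code[2] not in ("-", "m"):
        raise ValueError("unknown accent flag: %r" % code[2])
    return encoding, code[2] == "m"


# Assumption: the code is the last three characters of the file name.
print(decode_info("example.html.u--"[-3:]))  # ('UTF-8', False)
```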

4. htmldetag - removes HTML tags and special elements
(STYLE, CODE, etc.) from a HUNML file.

5. htmlformat - removes extra whitespace,
empty lines and empty paragraphs from a HUNML file.

6. htmlfreq - counts word frequencies (absolute and document frequencies).

(In our project we used a modified htmlfreq for processing
Huntoken-sentence-tokenized and concatenated files. See
htmlfreq.cxx.old.)

Capitalized words in sentence-initial position are counted separately
(they get a * sign; for example, Hawaii* in the output means the
sentence-initial form of Hawaii).
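
The two kinds of frequency and the * convention can be illustrated with a
short Python sketch. It is not the htmlfreq implementation: the function name
word_frequencies and the input shape (documents as lists of sentences,
sentences as lists of tokens) are assumptions made for the example.

```python
from collections import Counter


def word_frequencies(documents):
    """Count absolute and document frequencies of word forms.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of word tokens (assumed input shape).
    """
    absolute = Counter()  # total occurrences over all documents
    document = Counter()  # number of documents containing the form
    for sentences in documents:
        seen = set()
        for sentence in sentences:
            for position, word in enumerate(sentence):
                # Sentence-initial capitalized words are counted
                # separately and marked with a trailing *.
                if position == 0 and word[:1].isupper():
                    word += "*"
                absolute[word] += 1
                seen.add(word)
        document.update(seen)
    return absolute, document


docs = [
    [["Hawaii", "is", "far"], ["I", "like", "Hawaii"]],
    [["Hawaii", "is", "warm"]],
]
absolute, document = word_frequencies(docs)
print(absolute["Hawaii*"], document["Hawaii*"])  # 2 2
print(absolute["Hawaii"], document["Hawaii"])    # 1 1
```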

7. htmlcount - counts the ratio of correctly spelled to misspelled words,
using the Hunspell spell checker, in Huntoken-tokenized and concatenated
files. In our project we classified web pages by spelling quality.
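
A rough Python sketch of such a count follows. It is not the htmlcount
implementation and does not call Hunspell: the function spelling_ratio, the
is_correct callback and the toy word list KNOWN are all hypothetical, and
reporting the share of correct words among all words is an assumption about
how the ratio is expressed.

```python
def spelling_ratio(words, is_correct):
    """Return (right, bad, share of right words) for a list of tokens.

    is_correct is any callable that says whether a word is spelled
    correctly; here it stands in for a real spell checker.
    """
    right = sum(1 for word in words if is_correct(word))
    bad = len(words) - right
    share = right / len(words) if words else 0.0
    return right, bad, share


# Toy word list standing in for a spell checker dictionary.
KNOWN = {"the", "cat", "sat"}
print(spelling_ratio(["the", "cat", "satt", "sat"], KNOWN.__contains__))
# (3, 1, 0.75)
```

A page could then be classified by comparing the share against a threshold
chosen for the corpus.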

Németh László 

2005-05-11