hunhtml
-------
Compiling
---------
make
Install
-------
cp bin/* /usr/bin
Usage:
------
ls *html | hunhtml
Hunhtml writes raw text files into the source directory,
with a special diagnostic extension,
for example:
example.html.u--
(u means Unicode raw text)
Filters in hunhtml
==================
1. htmlcat - concatenates files into an XML-like file.
Usage: htmlcat reads the file list from standard input.
Example: ls *.html | htmlcat >concatenated_html.hunml
Example of a concatenated document:
/proba/valami
<[CDATA[
Text, text, text...
]]>13635289912101858631970626463217024222481
/proba/valami2
<[CDATA[
More text.
]]>13635289912101858631970626463217024222481
The SPLITCODE is needed by the flex filters (see splitcode.h).
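
The layout shown in the example can be read back with a few lines of
Python. This is only a sketch, not the code of htmlsplit: it assumes that
the long number after "]]>" in the example is the SPLITCODE, and the
function name parse_hunml is made up for illustration.

```python
# Assumption: the number printed after "]]>" in the example is the SPLITCODE.
SPLITCODE = "13635289912101858631970626463217024222481"
OPEN_MARK = "\n<[CDATA[\n"          # opening marker as written in the example
CLOSE_MARK = "]]>" + SPLITCODE      # closing marker followed by the SPLITCODE


def parse_hunml(text):
    """Return a list of (name, body) pairs from a concatenated document."""
    docs = []
    for chunk in text.split(CLOSE_MARK):
        chunk = chunk.strip("\n")
        if not chunk:
            continue  # nothing left after the last document
        # The first line is the file name, the rest is the document body.
        name, _, body = chunk.partition(OPEN_MARK)
        docs.append((name.strip(), body))
    return docs


sample = (
    "/proba/valami\n<[CDATA[\nText, text, text...\n" + CLOSE_MARK + "\n"
    "/proba/valami2\n<[CDATA[\nMore text.\n" + CLOSE_MARK + "\n"
)

for name, body in parse_hunml(sample):
    print(name, repr(body))
```

Because the SPLITCODE is a long, unlikely number, splitting on it is safe
even when a document body itself contains "]]>".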
2. htmlsplit - splits concatenated HUNML files
Usage: htmlsplit [extension]
element.
The information is a three-character code.
The first two characters mean the following:
1- ISO-8859-1
2- ISO-8859-2
15 Windows-1252
25 Windows-1250
1w ISO-8859-1, but really Windows-1252
2w ISO-8859-2, but really Windows-1250
u- UTF-8
The third character means:
- does not contain Hungarian accented characters
m contains Hungarian accented characters
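
The code table above can be turned into a small lookup. The following
Python sketch is an illustration only; the names ENCODINGS and decode_info
are hypothetical and do not come from the hunhtml sources. For the "1w"
and "2w" codes it returns the real encoding rather than the declared one.

```python
# First two characters of the code mapped to the encoding to decode with.
ENCODINGS = {
    "1-": "ISO-8859-1",
    "2-": "ISO-8859-2",
    "15": "Windows-1252",
    "25": "Windows-1250",
    "1w": "Windows-1252",  # declared ISO-8859-1, but really Windows-1252
    "2w": "Windows-1250",  # declared ISO-8859-2, but really Windows-1250
    "u-": "UTF-8",
}


def decode_info(code):
    """Split a three-character code into (encoding, has_hungarian_accents)."""
    if len(code) != 3:
        raise ValueError("expected a three-character code: %r" % code)
    prefix, flag = code[:2], code[2]
    if prefix not in ENCODINGS:
        raise ValueError("unknown encoding code: %r" % prefix)
    if flag not in ("-", "m"):
        raise ValueError("unknown third character: %r" % flag)
    return ENCODINGS[prefix], flag == "m"


print(decode_info("u--"))
print(decode_info("2wm"))
```

The extension in example.html.u-- has the same three-character shape, so
it would decode as UTF-8 without Hungarian accented characters.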
4. htmldetag - removes HTML tags and special elements
(STYLE, CODE etc.) from a HUNML file.
5. htmlformat - removes superfluous whitespace,
empty lines and paragraphs from a HUNML file.
6. htmlfreq - counts word frequencies (absolute and document frequencies).
(In our project we used a modified htmlfreq for processing
Huntoken-sentence-tokenized and concatenated files. See
htmlfreq.cxx.old.)
Capitalized words in the first position of a sentence are counted separately
and get a * sign; for example, Hawaii* in the output means the
sentence-initial form of Hawaii.
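
The two kinds of counts and the * marking can be sketched in Python as
follows. This is not the htmlfreq implementation (that is C++, see the
sources). It assumes the input is already split into documents, sentences
and words, and the function name word_frequencies is hypothetical.

```python
from collections import Counter


def word_frequencies(documents):
    """Count absolute and document frequencies.

    documents: a list of documents, each a list of sentences,
    each sentence a list of word tokens.
    """
    absolute = Counter()  # total number of occurrences
    document = Counter()  # number of documents containing the form
    for sentences in documents:
        seen = set()
        for words in sentences:
            for i, word in enumerate(words):
                # Capitalized sentence-initial forms are counted separately.
                if i == 0 and word[:1].isupper():
                    key = word + "*"
                else:
                    key = word
                absolute[key] += 1
                seen.add(key)
        document.update(seen)  # each form counts once per document
    return absolute, document


docs = [
    [["Hawaii", "is", "warm"], ["We", "like", "Hawaii"]],
    [["Hawaii", "is", "far"]],
]
absolute, document = word_frequencies(docs)
print(absolute["Hawaii*"], absolute["Hawaii"], document["is"])
```

Keeping Hawaii* apart from Hawaii makes it possible to tell a proper name
from a word that is capitalized only because it starts a sentence.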
7. htmlcount - counts the ratio of correct to misspelled words with the
Hunspell spell checker in Huntoken-tokenized and concatenated files.
In our project we classified web pages by spelling quality.
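
The idea of rating a page by its spelling can be sketched in Python. This
is an illustration under loud assumptions: a plain word set stands in for
the Hunspell check, the function name spelling_quality is hypothetical,
and reporting the share of correct words is only one possible reading of
the ratio.

```python
def spelling_quality(words, is_correct):
    """Return (correct, misspelled, share of correct words)."""
    correct = sum(1 for word in words if is_correct(word))
    misspelled = len(words) - correct
    share = correct / len(words) if words else 0.0
    return correct, misspelled, share


# Stand-in lexicon; a real run would ask a spell checker instead.
lexicon = {"a", "kutya", "ugat"}
tokens = ["a", "kutya", "ugta", "ugat"]

print(spelling_quality(tokens, lexicon.__contains__))
```

Pages could then be classified by comparing the share with a threshold
chosen for the project.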
Németh László
2005-05-11