hunhtml
-------

Compiling
---------

make

Install
-------

cp bin/* /usr/bin

Usage:
------

ls *html | hunhtml

Hunhtml makes raw text files in the source directory with a special
diagnostic extension, for example: example.html.u--
(u means unicode raw text)

Filters in hunhtml
==================

1. htmlcat - concatenates files into an XML-like file.

Usage: htmlcat waits for a file list on standard input

Example: ls *.html | htmlcat >concatenated_html.hunml

Example for concatenated document:

/proba/valami
<[CDATA[
Text, text, text...
]]>13635289912101858631970626463217024222481
/proba/valami2
<[CDATA[
More text.
]]>13635289912101858631970626463217024222481

The SPLITCODE is needed by the flex filters (see splitcode.h).

2. htmlsplit - splits concatenated HUNML files

Usage: htmlsplit [extension]

3. [...] element. The information is a three-character code.

The first two characters mean the following:

1-  ISO-8859-1
2-  ISO-8859-2
15  Windows-1252
25  Windows-1250
1w  ISO-8859-1, but really Windows-1252
2w  ISO-8859-2, but really Windows-1250
u-  UTF-8

The 3rd character means:

-   does not contain Hungarian accented characters
m   contains Hungarian accented characters

4. htmldetag - removes HTML tags and special elements (STYLE, CODE etc.)
in a HUNML file.

5. htmlformat - removes superfluous white space, empty lines and
paragraphs in a HUNML file.

6. htmlfreq - counts word frequencies (absolute and document frequencies).
(In our project we used a modified htmlfreq for processing
Huntoken-sentence-tokenized and concatenated files. See htmlfreq.cxx.old.)
Capitalized words in sentence-initial position are counted separately
(they get a * sign; for example, Hawaii* in the output means the
sentence-initial form of Hawaii).

7. htmlcount - counts the ratio of right to bad (correctly spelled to
misspelled) words with the Hunspell spell checker in Huntoken-tokenized
and concatenated files. In our project we classified web pages by
spelling quality.

Németh László 2005-05-11
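
The three-character encoding code described above (also seen in the diagnostic extension of example.html.u--) can be decoded mechanically. The following Python sketch is only an illustration and is not part of hunhtml: the function name decode_info, the dictionary keys, and the reading of "but really" as "labelled as one encoding, actually another" are assumptions made for this example.

```python
# Illustrative sketch only; decode_info and ENCODINGS are hypothetical
# names, not part of the hunhtml sources.

# First two characters -> (nominal encoding, actual encoding).
# For "1w" and "2w" the README says "X, but really Y"; here that is
# read (an assumption) as nominal X, actual Y.
ENCODINGS = {
    "1-": ("ISO-8859-1", "ISO-8859-1"),
    "2-": ("ISO-8859-2", "ISO-8859-2"),
    "15": ("Windows-1252", "Windows-1252"),
    "25": ("Windows-1250", "Windows-1250"),
    "1w": ("ISO-8859-1", "Windows-1252"),
    "2w": ("ISO-8859-2", "Windows-1250"),
    "u-": ("UTF-8", "UTF-8"),
}


def decode_info(code):
    """Split a three-character code into encoding and accent flag."""
    if len(code) != 3:
        raise ValueError("expected a three-character code: %r" % code)
    prefix, flag = code[:2], code[2]
    if prefix not in ENCODINGS:
        raise ValueError("unknown encoding prefix: %r" % prefix)
    if flag not in ("-", "m"):
        raise ValueError("unknown third character: %r" % flag)
    nominal, actual = ENCODINGS[prefix]
    return {
        "nominal": nominal,
        "actual": actual,
        # "m" means the text contains Hungarian accented characters.
        "hungarian_accents": flag == "m",
    }


if __name__ == "__main__":
    # Decode the extension from the README's example file name.
    print(decode_info("u--"))
    print(decode_info("2wm"))
```

A script like this could be used to group the generated raw text files by their diagnostic extension, for instance to pick out only the files that contain Hungarian accented characters.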