[thelist] LAN search tool

Tony Crockford tonyc at boldfish.co.uk
Wed Dec 12 17:05:11 CST 2001


> Are you saying it has support for MS Word documents and PDFs?
> 
> spinhead


Yes, but it's another step in the indexing process.

See:

http://www.htdig.org/files/contrib/parsers/


Sample external converter script for ht://Dig 3.1.4 and above, that
converts MS-Word, PDF or PostScript files to text (in HTML form) so
they can be indexed.  Uses the "catdoc" program to extract text from
Word documents, "pdftotext" to extract text from PDFs, and "ps2ascii"
to extract text from PostScript.

Written by Gilles Detillieux, based on the parse_word_doc.pl script
by Jesse op den Brouw <MSQL_User at st.hhs.nl>.

External converters have two advantages over external parsers.  They
are easier to write, and the parsing is done in a more consistent way
for all document types.
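
To give a rough idea of what such a converter does, here is a minimal sketch in Python. It is not the script from the contrib directory (that one is written in Perl). The function names, the MIME type table and the argument order (input file, content type, URL) are my assumptions for illustration. It assumes catdoc, pdftotext and ps2ascii are installed and on the PATH.

```python
import html
import subprocess
import sys

# Extraction tools named in the description above; each writes text to stdout.
# The trailing "-" tells pdftotext to write to stdout instead of a file.
# The MIME type keys are illustrative assumptions.
EXTRACTORS = {
    "application/msword": lambda path: ["catdoc", path],
    "application/pdf": lambda path: ["pdftotext", path, "-"],
    "application/postscript": lambda path: ["ps2ascii", path],
}


def build_command(mime_type, path):
    """Return the argv list for the extractor that handles mime_type."""
    try:
        return EXTRACTORS[mime_type](path)
    except KeyError:
        raise ValueError(f"no converter for {mime_type}") from None


def to_html(text, title):
    """Wrap extracted text in a minimal HTML page, escaping markup characters."""
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>\n"
        "<body><pre>\n" + html.escape(text) + "\n</pre></body></html>\n"
    )


def main(argv):
    # Assumed argument order: input file, content type, then optionally the URL.
    path, mime_type = argv[1], argv[2]
    title = argv[3] if len(argv) > 3 else path
    result = subprocess.run(
        build_command(mime_type, path),
        capture_output=True, text=True, check=True,
    )
    # The indexer reads the converted HTML from stdout.
    sys.stdout.write(to_html(result.stdout, title))
    return 0

# To use this as a script, finish the file with: sys.exit(main(sys.argv))
```

The converter only has to turn a file into HTML on stdout. The indexer then parses that HTML the same way it parses any other page, which is why the results come out consistent across document types. As far as I recall, the script is hooked in through the external_parsers attribute in the ht://Dig config file, but check the ht://Dig docs for the exact syntax.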




