[thelist] Local search engine for a website

Diane Soini dianesoini at earthlink.net
Thu Jul 1 21:59:19 CDT 2004


On Thursday, July 1, 2004, at 02:05 AM, thelist-request at lists.evolt.org 
wrote:
>
> Well, with some additional searching I have found the following search
> engines that I could use for searching within a website:
>
> Swish-e
> mnoGoSearch
> htDig
> Fluid Dynamics Search Engine
> ASPSeek

There is also phpdig.

I've worked pretty extensively with htdig. It works well and seems to 
have a lot of community involvement (mostly geeks). I found it a bit 
difficult to configure, mostly because there are a million options and 
it takes some trial and error to get a configuration that'll work for 
you.

If you report to anyone who insists on deciding which pages show up 
first, regardless of the actual merit of the content, you may end up 
disappointing them a lot. htdig indexes the site based on the content 
of each page plus any metadata, and stores the info in a Berkeley 
database. If the content isn't in the page, or if the page cannot be 
reached through a normal link, it won't be indexed. It can do phrase 
searches, stemming and boolean expression searching, and it can search 
in many languages. But it won't stem a word that isn't in its 
dictionary, so be careful about jargon and acronyms in your content.
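
To illustrate the stemming caveat, here is a minimal Python sketch of 
dictionary-gated suffix stripping. It is not htdig's actual algorithm; 
the word list, the suffix rules and the function name are all made up 
for illustration.

```python
# Toy dictionary-gated stemmer (illustrative only, not htdig's code).
KNOWN_ROOTS = {"search", "index", "configure"}  # hypothetical dictionary
SUFFIXES = ("ing", "es", "ed", "s")             # hypothetical ending rules


def stem(word: str) -> str:
    """Return the dictionary root of `word`, or the lowercased word unchanged."""
    lower = word.lower()
    if lower in KNOWN_ROOTS:
        return lower
    for suffix in SUFFIXES:
        if lower.endswith(suffix):
            root = lower[: -len(suffix)]
            # Only accept a root that the dictionary knows about.
            if root in KNOWN_ROOTS:
                return root
            # Handle a dropped final "e", e.g. "configured" -> "configure".
            if root + "e" in KNOWN_ROOTS:
                return root + "e"
    # Unknown words (jargon, acronyms) fall through unstemmed.
    return lower
```

Under this scheme "searching" and "indexes" reduce to known roots, but 
an acronym such as "RPMs" stays as it is, so a query for "RPM" would 
not match a page that only says "RPMs".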

At the company I work for we use it for English, French, German, Dutch, 
Norwegian, Polish, Spanish, and Portuguese (br). It does not do Chinese 
or Japanese. It tends to mangle HTML entities in the search results. I 
was able to fix that with JavaScript for every language except Polish, 
which uses the Latin-2 character set, so Polish results can still come 
out a bit mangled. Indexing and searching Polish work just fine, 
though.
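
The kind of cleanup involved can be sketched in a few lines. This is a 
Python illustration of the idea, not the JavaScript used on the site; 
the function name is hypothetical, and it assumes you know which 
charset the result bytes are in.

```python
import html


def decode_excerpt(raw: bytes, charset: str) -> str:
    """Decode result bytes with the page's charset, then expand HTML entities."""
    text = raw.decode(charset)
    return html.unescape(text)


# An entity in a Latin-1 excerpt is expanded to the real character.
french = decode_excerpt(b"caf&eacute;", "iso-8859-1")

# The same Polish bytes read correctly as Latin-2 but not as Latin-1.
polish_bytes = b"\xb3\xf3d\xbc"
polish_ok = decode_excerpt(polish_bytes, "iso-8859-2")
polish_mangled = decode_excerpt(polish_bytes, "iso-8859-1")
```

Decoding the Polish bytes as Latin-1 instead of Latin-2 is one way 
results end up mangled even though the index itself is fine.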

If you use htdig, use the PDFtoHTML parser, not convdoc or any other. 
You won't get meaningful results on PDFs unless you use PDFtoHTML and 
also make sure all PDFs have metadata, particularly Title, Subject and, 
to a lesser extent, Keywords. I edited the parser script to produce 
different meta tags for PDFs than the ones used in HTML pages, and then 
configured htdig to ignore the meta tags in the HTML docs but count the 
ones in the PDFs. That way I can display the search word in context for 
HTML docs and use the Subject metadata for PDFs on the results page. 
This is important if the PDFs have long tables of contents.
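
As a rough sketch of the meta tag trick, the converted PDF can be given 
its own tag names so the indexer can treat them differently from the 
ones in ordinary HTML pages. This is Python, not the actual parser 
script, and the tag names "pdf-subject" and "pdf-keywords" as well as 
the function name are assumptions for illustration.

```python
import html


def pdf_head(title: str, subject: str, keywords: str = "") -> str:
    """Build an HTML <head> carrying PDF metadata under PDF-specific tag names."""
    lines = ["<head>", f"<title>{html.escape(title)}</title>"]
    # Hypothetical tag names, chosen so they never clash with the
    # "description" and "keywords" tags used in normal HTML pages.
    lines.append(f'<meta name="pdf-subject" content="{html.escape(subject)}">')
    if keywords:
        lines.append(f'<meta name="pdf-keywords" content="{html.escape(keywords)}">')
    lines.append("</head>")
    return "\n".join(lines)


head = pdf_head("Widget Manual", "Install & setup guide", "widget, install")
```

The indexer is then told to use only the PDF-specific names, which is 
why PDFs without Title and Subject metadata give poor results.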

phpdig is similar to htdig, but lighter weight. It uses a MySQL 
database and has a web-based admin feature, while htdig is run from the 
command line. I don't think phpdig does as good a job of reindexing as 
htdig does. phpdig is easy to install. I have not installed htdig, but 
it'll help if you understand things like the "configure" and "make" 
commands.

Probably way more than you wanted to know.

> Any recommendations for or against any of these?
>
> Thanks again,
> Amy
***
Don't be afraid to try something new. An amateur built the ark. 
Professionals built the Titanic. -unknown


