[thelist] Local search engine for a website
Diane Soini
dianesoini at earthlink.net
Thu Jul 1 21:59:19 CDT 2004
On Thursday, July 1, 2004, at 02:05 AM, thelist-request at lists.evolt.org
wrote:
>
> Well, with some additional searching I have found the following search
> engines that I could use for searching within a website:
>
> Swish-e
> mnoGoSearch
> htDig
> Fluid Dynamics Search Engine
> ASPSeek
There is also phpdig.
I've worked pretty extensively with htdig. It works well and seems to
have a lot of community involvement (mostly geeks). I found it a bit
difficult to configure, mostly because there are a million options and
it takes some trial and error to get a configuration that'll work for
you.
If you report to anyone who insists on deciding which pages show up
first, regardless of the actual merit of the content, you may end up
disappointing them a lot. htdig indexes the site based on the content
of each page and any metadata, and stores the info in a Berkeley DB
database. If the content isn't in the page, or if the page cannot be
reached through a normal link, htdig won't index it. It can do phrase
searches, stemming, and boolean expression searching, and it can search
in many languages. But it won't stem a word that isn't in a dictionary,
so be careful about jargon and acronyms in your content.
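As a rough illustration of the "reached through a normal link" point
above, here is a minimal Python sketch that finds pages a link-following
indexer would never see. It is not htdig code. The site map, the page
names, and the function name reachable_pages are all made up for the
example.

```python
from collections import deque

def reachable_pages(links, start):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical site map: page -> pages it links to.
site = {
    "/index.html": ["/products.html", "/about.html"],
    "/products.html": ["/index.html"],
    "/about.html": [],
    "/orphan.html": ["/index.html"],  # nothing links to this page
}

found = reachable_pages(site, "/index.html")
orphans = sorted(set(site) - found)
print(orphans)  # ['/orphan.html']
```

Any page that turns up in the orphan list needs a link from somewhere
before a crawler starting at the home page can index it.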
At the company I work for, we use it for English, French, German,
Dutch, Norwegian, Polish, Spanish, and Portuguese (br). It does not do
Chinese or Japanese. It tends to mangle HTML entities in the search
results. I could fix that OK with JavaScript, except for Polish, which
uses the Latin-2 character set. Results for Polish can get a bit
mangled, but it indexes and searches Polish just fine.
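The post does not say exactly how the entities get mangled. Assuming the
problem is entities showing up literally in the result excerpts, a
minimal sketch of the cleanup might look like this. It uses Python
rather than the JavaScript mentioned above, and clean_excerpt is a
hypothetical helper name.

```python
import html

def clean_excerpt(excerpt):
    """Decode HTML entities left undecoded in a search-result excerpt."""
    return html.unescape(excerpt)

# Named entities (Latin-1 range) and a numeric reference for Polish "ł".
print(clean_excerpt("caf&eacute; &amp; cr&egrave;me"))
print(clean_excerpt("Wroc&#322;aw"))
```

If the cleaned text goes back into an HTML page, characters such as <,
> and & would need escaping again before output.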
If you use htdig, use the PDFtoHTML parser, not convdoc or any other.
You won't get meaningful results on PDFs unless you use PDFtoHTML in
combination with ensuring all PDFs have metadata, particularly Title,
Subject and, to a lesser extent, Keywords. I edited the parser script
to produce different meta tags from the ones in HTML pages, and then
configured htdig to ignore the meta tags in the HTML docs but count the
ones in the PDFs. That way I can display the search word in context for
HTML docs and use the Subject metadata for PDFs on the results page.
That matters if the PDFs have long tables of contents.
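Since the results depend on every PDF carrying that metadata, a small
check before indexing can help. This is only a sketch: it assumes the
metadata has already been read into a plain dict by whatever PDF tool
you use, and the names check_pdf_metadata, REQUIRED_FIELDS and
OPTIONAL_FIELDS are invented for the example.

```python
# Fields the post singles out; Keywords matters less, so it is optional.
REQUIRED_FIELDS = ("Title", "Subject")
OPTIONAL_FIELDS = ("Keywords",)

def check_pdf_metadata(meta):
    """Return (missing_required, missing_optional) for one PDF's metadata."""
    def is_blank(field):
        # Treat absent, None, and whitespace-only values as missing.
        return not str(meta.get(field) or "").strip()

    missing_required = [f for f in REQUIRED_FIELDS if is_blank(f)]
    missing_optional = [f for f in OPTIONAL_FIELDS if is_blank(f)]
    return missing_required, missing_optional

# Hypothetical metadata for one document.
example = {"Title": "Installation Guide", "Subject": "", "Keywords": None}
print(check_pdf_metadata(example))  # (['Subject'], ['Keywords'])
```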
phpdig is similar to htdig, but lighter weight. It uses a MySQL
database and has a web-based admin feature, while htdig is run from the
command line. I don't think phpdig does as good a job of reindexing as
htdig does. phpdig is easy to install. I have not installed htdig
myself, but it'll help if you understand things like the "configure"
and "make" commands.
Probably way more than you wanted to know.
> Any recommendations for or against any of these?
>
> Thanks again,
> Amy
***
Don't be afraid to try something new. An amateur built the ark.
Professionals built the Titanic. -unknown