[thelist] XML based site search

Jonathon Thomas jthomas at firstnet.net.uk
Wed May 8 09:31:01 CDT 2002


Nik

You're really going to hammer your server if you go through tens of XML
files every time somebody requests a search.

In the same way that search engines build indexes of words and where
they occur, you can use XML to periodically (automatically) build one or
more index files for your site. Just imagine how slow Google would be if
it had to run through every page on the web when responding to a query!

This isn't the place to go into great detail about the data structures
and algorithms to achieve this, but a simple and effective way would be:

1) Have one XML index file for every letter in the alphabet.
2) In each of these index files, add one entry for each word on your
site that starts with that letter, listing the XML content files in
which the word appears. For example:

<word value="television">
	<page>modern-entertainment.xml</page>
	<page>cathode-ray-technology.xml</page>
</word>

You can then examine the words entered by the user in a search to
determine the first letters of each, load the appropriate index files
("z.xml" for the word "zebra", for example), find out if the requested
word exists in the index, and point to the appropriate XML file.
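
A minimal sketch of that lookup, in Python for brevity rather than the
C# you're using. The names lookup and search and the <index> root
element wrapping the <word> entries are my assumptions, not anything
fixed; the index files just need some single root to be well-formed.

```python
import os
import xml.etree.ElementTree as ET

# Assumed layout of one index file ("t.xml"): an <index> root element
# wrapping <word> entries like the one shown above.
SAMPLE_T_XML = """<index>
  <word value="television">
    <page>modern-entertainment.xml</page>
    <page>cathode-ray-technology.xml</page>
  </word>
</index>
"""


def lookup(word, index_dir):
    """Return the content files listed for `word`, or [] if it is not indexed."""
    word = word.lower()
    if not word or not ("a" <= word[0] <= "z"):
        return []
    # The first letter selects the index file, e.g. "z.xml" for "zebra".
    index_path = os.path.join(index_dir, word[0] + ".xml")
    if not os.path.exists(index_path):
        return []
    root = ET.parse(index_path).getroot()
    for entry in root.iter("word"):
        if entry.get("value") == word:
            return [page.text for page in entry.findall("page")]
    return []


def search(query, index_dir):
    """Look up each word of the query on its own (no phrase matching)."""
    return {word: lookup(word, index_dir) for word in query.split()}
```

Each search then only opens as many files as there are distinct first
letters in the query, no matter how many content files the site has.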

The major drawback of this approach: it can't handle phrase searches,
because the index only records which files a word appears in, not where.

I've not gone into much detail on how to generate the index, but I'm
sure there are some DOM methods that would allow you to extract words
from XML reasonably easily.
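
For what it's worth, a rough sketch of the index build, again in Python
rather than C#. It assumes the content files sit in one directory, that
the index files go to a separate directory, that only text nodes (not
attribute values) should be indexed, and that a word is simply a run of
the letters a-z. The names extract_words and build_index and the
<index> root element are made up for the example.

```python
import os
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

WORD_RE = re.compile(r"[a-z]+")  # assumption: a "word" is a run of a-z


def extract_words(xml_path):
    """Return the set of lowercase words found in one file's text nodes."""
    root = ET.parse(xml_path).getroot()
    text = " ".join(root.itertext())
    return set(WORD_RE.findall(text.lower()))


def build_index(content_dir, index_dir):
    """Write one index file per first letter, e.g. index_dir/t.xml."""
    # first letter -> word -> set of content file names
    by_letter = defaultdict(lambda: defaultdict(set))
    for name in sorted(os.listdir(content_dir)):
        if name.endswith(".xml"):
            for word in extract_words(os.path.join(content_dir, name)):
                by_letter[word[0]][word].add(name)

    os.makedirs(index_dir, exist_ok=True)
    for letter, words in by_letter.items():
        root = ET.Element("index")  # assumed wrapper element
        for word in sorted(words):
            entry = ET.SubElement(root, "word", value=word)
            for page in sorted(words[word]):
                ET.SubElement(entry, "page").text = page
        ET.ElementTree(root).write(
            os.path.join(index_dir, letter + ".xml"),
            encoding="utf-8",
            xml_declaration=True,
        )
```

Run something like this on a schedule or whenever content changes, not
per request. A real version would probably also skip very common words
("the", "and") to keep the index files small.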

Cheers

Jon

-----Original Message-----
From: thelist-admin at lists.evolt.org
[mailto:thelist-admin at lists.evolt.org] On Behalf Of Nik Schramm
Sent: 08 May 2002 12:16
To: thelist at lists.evolt.org
Subject: [thelist] XML based site search

Hi everyone,

I've been lurking for a good while and now I have a question for you
all:

I'm building a text-heavy site using ASP.NET (C#) and XML. All content
on the site is stored in xml files, HTML output is achieved via dynamic
XSL transformations. Now I want to add a "search this site" function and
although in theory this doesn't seem too difficult to achieve, I was
wondering if anyone here had experience in doing this and could pass on
some tips in order to optimize performance. The way I *think* I would do
it is:

1. On the search page provide options for keyword entry and a way to
limit the search to sections of the site.
2. Open the xml files for the target section sequentially and search for
occurrences of said keyword.
3. If one is found, output the url to the file via a buffered response,
then continue on to the next XML file
4. etc.

Does that make sense, or is there a better way I'm overlooking?

/nik

www.industriality.com - candy for the inner eye

--
For unsubscribe and other options, including
the Tip Harvester and archive of thelist go to:
http://lists.evolt.org Workers of the Web, evolt !



