[thelist] Relevancy Algorithm

Kelly Hallman khallman at wrack.org
Thu Mar 6 03:43:01 CST 2003


On Wed, 5 Mar 2003, Rob Smith wrote:

> You know, if all fails read the instructions:
> ...clause should be specified this way: CONTAINS (column, '"text*"')
> The asterisk matches zero, one, or more characters (of the root word or
> words in the word or phrase)...

I got all excited when I saw the subject "Relevancy Algorithm"... then it
seems to have boiled down to a discussion of how to do LIKE on large text
fields in MS SQL. I haven't followed the specifics closely, so forgive me
if that is an oversimplification of CONTAINS/CONTAINSTABLE/etc.

I don't know what data you are searching, but a simple search like this
doesn't really address relevance, at least not when you have a
sufficiently large number of items to search.

I'd be quite interested to know if anyone has any thoughts on actual
relevancy/ranking algorithms.  My research into the topic suggested that
some fairly advanced math is involved, and it wasn't clear to me how that
math would translate into code.

Currently my approach is a keyword algorithm I developed, which suffices
for my needs but is still rudimentary.  It takes a list of user-defined
keywords and a document, and gives each word a weight based on both its
position in the list and its frequency in the document.
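
To make that concrete, here is a minimal sketch of that kind of weighting
in Python.  The formula is only an assumption for illustration (position
weight times relative frequency), and the name keyword_ranks is
hypothetical; it is not the exact calculation I use.

```python
import re
from collections import Counter


def keyword_ranks(keywords, document):
    """Return {keyword: rank} for one document.

    Assumed formula (illustrative only):
        rank = position_weight * relative_frequency
    where earlier keywords in the list get a larger position_weight.
    Only single-word keywords are handled; phrases never match.
    """
    # Split the document into lowercase word tokens.
    words = re.findall(r"[a-z0-9']+", document.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty documents

    n = len(keywords)
    ranks = {}
    for index, keyword in enumerate(keywords):
        key = keyword.lower()
        # First keyword weighs 1.0, last keyword weighs 1/n.
        position_weight = (n - index) / n
        frequency = counts[key] / total
        if frequency > 0:
            # Keywords that never occur in the document are left out.
            ranks[key] = position_weight * frequency
    return ranks
```

Calling keyword_ranks(["wine", "local"], text) gives one rank per keyword
that actually occurs in the text, which is the shape of data the next step
needs.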

The resulting information is stored in a table of keywords and ranks, one
row for each document/keyword combination.  Then a fairly complex
aggregate query is built from the search terms that yields results in
ranked order.  On the number of documents I'm testing with, it seems to
work as intended, and it seems efficient enough (though I have begun to
run into some snafus with that query when I try to optimize/improve it).
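
A stripped-down sketch of the table and the aggregate query, using
Python's sqlite3 module so it is easy to try.  The table name
keyword_rank, the column names, and scoring by a plain SUM of ranks are
all assumptions for illustration; the real query is more involved.

```python
import sqlite3


def build_index(rows):
    """Create an in-memory table with one row per document/keyword pair.

    rows is an iterable of (doc_id, keyword, kw_rank) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE keyword_rank ("
        " doc_id INTEGER NOT NULL,"
        " keyword TEXT NOT NULL,"
        " kw_rank REAL NOT NULL,"
        " PRIMARY KEY (doc_id, keyword))"
    )
    conn.executemany("INSERT INTO keyword_rank VALUES (?, ?, ?)", rows)
    return conn


def search(conn, terms):
    """Return (doc_id, score) rows, best score first.

    The score is assumed to be the sum of the stored ranks of every
    search term that matches the document.
    """
    terms = [t.lower() for t in terms]
    if not terms:
        return []
    # One bound parameter per search term; never interpolate user input.
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT doc_id, SUM(kw_rank) AS score "
        "FROM keyword_rank "
        f"WHERE keyword IN ({placeholders}) "
        "GROUP BY doc_id "
        "ORDER BY score DESC, doc_id ASC"
    )
    return conn.execute(sql, terms).fetchall()
```

The query is built from the search terms at run time, so the number of
placeholders grows with the number of terms while the values themselves
stay bound parameters.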

Anyway, if anyone has any comments on that subject, I am all ears.
--Kelly

> When <prefix_term> is a phrase, each word contained in the phrase is
> considered to be a separate prefix. Therefore, a query specifying a prefix
> term of "local wine *" matches any rows with the text of "local winery",
> "locally wined and dined", and so on.
> ...
>
> So I try
>
> SELECT     ProductAuto, ProductID, ProductDesc
> FROM         dbo.Product FT_TBL
> WHERE     CONTAINS(ProductID, '"*15140*"')

--
Kelly Hallman
http://wrack.org/



