[thelist] Scanning for many strings in many texts

Hassan Schroeder hassan at webtuitive.com
Thu Oct 13 10:40:28 CDT 2005


manuel.gonzalez.noriega at gmail.com wrote:

> Say you've got lots of texts and you want to scan those texts to find
> occurrences of some strings. To clarify, you have lots of sports news
> stories on one hand and lots of players' names on the other, and you
> want to classify the news stories by the players that are mentioned in
> them.
> 
> What would be the best strategy in this case? I can think of some very
> brute-force solutions, like looping through the names and, for every
> one of them, doing a full-text search on the stories, but this is
> obviously very primitive and just won't scale.

Exactly. Real search engines create *indexes* of the content in a
document repository. That happens only when content is added to the
repository, *not* each time someone wants an article referencing
"Joe Montana".
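
To make the index idea concrete, here is a minimal sketch in Python of a
tiny inverted index. It is only an illustration, not how any particular
search engine does it: the PLAYERS set, the add_story() and
stories_mentioning() helpers, and the sample stories are all made up, and
it assumes names match on whole lowercase words.

```python
import re
from collections import defaultdict

# Hypothetical sample data: a small set of known player names (lowercase).
PLAYERS = {"joe montana", "jerry rice"}
MAX_NAME_WORDS = max(len(name.split()) for name in PLAYERS)

# The index: player name -> ids of the stories that mention that player.
index = defaultdict(set)

def add_story(story_id, text):
    """Scan a story once, at the moment it is added to the repository."""
    words = re.findall(r"[a-z']+", text.lower())
    for start in range(len(words)):
        # Try every run of 1..MAX_NAME_WORDS words starting at this position.
        for length in range(1, MAX_NAME_WORDS + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in PLAYERS:
                index[candidate].add(story_id)

def stories_mentioning(name):
    """Answer a query from the index alone, without rereading any story."""
    return sorted(index.get(name.lower(), set()))

add_story(1, "Joe Montana threw two touchdowns to Jerry Rice.")
add_story(2, "Jerry Rice set a new receiving record.")
add_story(3, "The weather delayed kickoff by an hour.")

print(stories_mentioning("Joe Montana"))
print(stories_mentioning("Jerry Rice"))
```

Each story is read exactly once, when it arrives; after that a lookup is
a single dictionary access no matter how many stories are stored.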

There are dedicated enterprise-level (and -price!) solutions from
various vendors (Verity, Autonomy, etc.). MySQL offers a full-text
search capability at a more popular price point :-)

HTH,
-- 
Hassan Schroeder ----------------------------- hassan at webtuitive.com
Webtuitive Design ===  (+1) 408-938-0567   === http://webtuitive.com

                          dream.  code.



