[thelist] Scanning for many strings in many texts

Matthias Willerich matthias at die-legendaeren.de
Thu Oct 13 16:34:41 CDT 2005


Manuel,
that problem really interests me, though sadly I'm not an expert.

I'm thinking about it from two sides:

1) Simple keyword search and beyond: if it were a classic OR search, i.e.
search for "this" OR "that", you could use in_array:
bool in_array ( mixed needle, array haystack [, bool strict])
But I assume that this is not at all helpful... or is it? You would take each
search keyword as the needle and check it against your stories, one by one,
as the haystack. Note that in_array only matches whole array elements, so
each story would first have to be split into an array of words. That seems
crazy in terms of server power, although I've never tried it. And the outcome
is only that you know that some keywords are in identifiable stories... OK,
I'm not sure if that's enough, and it doesn't help reduce the amount of
searching.
2) It's kind of a reverse tagging. If your searches are more or less
repetitive, you could incrementally store them in some kind of optimized
index. But is that better than simply indexing a table? With the fulltext
search (you're talking 'MATCH (columns) AGAINST (keywords)', right?), you'd
get a relevance score for how good the match is, but... that again leaves you
with your M searches.
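
To make point 1 a bit more concrete, here is a rough, untested-at-scale
sketch of the brute-force OR search. I've swapped in stripos (a
case-insensitive substring search) for in_array, since it can look inside
the story text directly. The function name stories_matching_any and the
sample stories are made up for illustration, and the syntax assumes a
reasonably modern PHP (7 or later):

```php
<?php
// Hypothetical helper: for every story, collect the keywords found in it.
// Uses a case-insensitive substring match, so "this" also matches "This".
function stories_matching_any(array $keywords, array $stories): array
{
    $hits = [];
    foreach ($stories as $id => $text) {
        foreach ($keywords as $keyword) {
            if (stripos($text, $keyword) !== false) {
                $hits[$id][] = $keyword;
            }
        }
    }
    return $hits;
}

// Made-up sample data, keyed by story id.
$stories = [
    1 => 'This is the first story.',
    2 => 'Nothing to see here.',
    3 => 'Tell me about that one.',
];

$hits = stories_matching_any(['this', 'that'], $stories);
print_r(array_keys($hits)); // story ids 1 and 3
```

Every keyword is checked against every story, so the work grows with
(number of keywords) x (number of stories), which is exactly the part that
worries me about server power.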

Just read that you mentioned inverted indexes. While I can imagine what they
are, I have no clue how to use one... an example? Please keep me updated on
what you decide to do.
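
As far as I can imagine it, an inverted index is just a map from each word
to the ids of the stories that contain it, so a keyword lookup becomes an
array access instead of a scan. A minimal sketch under that assumption (the
names build_inverted_index and search_any and the sample stories are my own
invention, and it assumes PHP 7 or later):

```php
<?php
// Hypothetical helper: map each lowercased word to the ids of the
// stories that contain it. Each story id is listed once per word.
function build_inverted_index(array $stories): array
{
    $index = [];
    foreach ($stories as $id => $text) {
        $words = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        foreach (array_unique($words) as $word) {
            $index[$word][] = $id;
        }
    }
    return $index;
}

// Hypothetical helper: OR search, one index lookup per keyword.
// Matches whole words only, unlike a substring search.
function search_any(array $index, array $keywords): array
{
    $ids = [];
    foreach ($keywords as $keyword) {
        foreach ($index[strtolower($keyword)] ?? [] as $id) {
            $ids[$id] = true; // use keys to avoid duplicate ids
        }
    }
    $result = array_keys($ids);
    sort($result);
    return $result;
}

// Made-up sample data, keyed by story id.
$stories = [
    1 => 'This is the first story.',
    2 => 'Nothing to see here.',
    3 => 'Tell me about that one.',
];

$index = build_inverted_index($stories);
$found = search_any($index, ['this', 'that']);
print_r($found); // story ids 1 and 3
```

The index is built once over all stories, and after that each of the M
searches only touches the entries for its own keywords.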

Cheers,
Matthias



