[thelist] Scanning for many strings in many texts

Thu Oct 13 12:41:15 CDT 2005

On 13/10/05, Hassan Schroeder <hassan at webtuitive.com> wrote:
> manuel.gonzalez.noriega at gmail.com wrote:
>
> Exactly. Real search engines create *indexes* of the content in a
> document repository. That happens only when content is added to the
> repository, *not* each time someone wants an article referencing
> "Joe Montana".
>
> There are dedicated enterprise-level (and -price!) solutions from
> various vendors (Verity, Autonomy, etc.). MySQL offers a full-text
> search capability at a more popular pricepoint :-)
>

Hey Hassan,

thanks. I'm aware of inverted indexes, MySQL full-text search and
general search strategies. I was just wondering whether scanning for
predefined strings is somewhat different from handling the usual
user-submitted queries.

In fact, it has just occurred to me that if I have a schema like

'documents' (id, document)
'terms'  (id, term)
'terms_documents' (term_id, document_id)
'relevant_strings' (id, string)

a simple SQL statement joining the tables would return the desired
set of documents. Perfect, except that it wouldn't work for multi-term
strings ('Foo Bar' vs. 'Bar').
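
For illustration, the join could look roughly like the sketch below. It
uses Python's built-in sqlite3 module with an in-memory database purely
as an assumption for the demo (the thread itself is about MySQL); the
sample rows and the helper name matching_documents are made up. Only
'Bar' produces matches; 'Foo Bar' never equals a single row in 'terms'.

```python
import sqlite3

def matching_documents(conn):
    """Return ids of documents that contain a term equal to a relevant string."""
    rows = conn.execute(
        """
        SELECT DISTINCT d.id
        FROM relevant_strings AS rs
        JOIN terms AS t ON t.term = rs.string
        JOIN terms_documents AS td ON td.term_id = t.id
        JOIN documents AS d ON d.id = td.document_id
        ORDER BY d.id
        """
    ).fetchall()
    return [row[0] for row in rows]

# Hypothetical sample data following the schema sketched above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (id INTEGER PRIMARY KEY, document TEXT);
    CREATE TABLE terms (id INTEGER PRIMARY KEY, term TEXT);
    CREATE TABLE terms_documents (term_id INTEGER, document_id INTEGER);
    CREATE TABLE relevant_strings (id INTEGER PRIMARY KEY, string TEXT);

    INSERT INTO documents VALUES (1, 'Foo Bar'), (2, 'Bar'), (3, 'Baz');
    INSERT INTO terms VALUES (1, 'Foo'), (2, 'Bar'), (3, 'Baz');
    INSERT INTO terms_documents VALUES (1, 1), (2, 1), (2, 2), (3, 3);
    INSERT INTO relevant_strings VALUES (1, 'Bar'), (2, 'Foo Bar');
    """
)

result = matching_documents(conn)
print(result)  # documents 1 and 2 match, but only through the term 'Bar'
```

Document 1 does contain 'Foo Bar', yet it is returned only because it
also contains 'Bar' on its own; the two-word string contributes nothing.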

So, if I have 3 million different strings to look for in 100 documents,
do I have to run 300 million full-text searches? I really hope I'm just
being dull :)
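
To make the worry concrete, the naive approach amounts to one search per
(string, document) pair, roughly as in this sketch. The function name
naive_scan and the sample data are hypothetical, and Python's plain
substring test stands in for a real full-text query.

```python
def naive_scan(strings, documents):
    """Check every string against every document.

    Returns (matches, searches): matches maps a document id to the list
    of strings found in it; searches counts the lookups performed.
    """
    matches = {}
    searches = 0
    for doc_id, text in documents.items():
        for s in strings:
            searches += 1  # one search per (string, document) pair
            if s in text:  # substring test as a stand-in for full-text search
                matches.setdefault(doc_id, []).append(s)
    return matches, searches

# Hypothetical sample data.
documents = {1: "Joe Montana threw Foo Bar", 2: "Only Bar here"}
strings = ["Foo Bar", "Bar", "Joe Montana"]

matches, searches = naive_scan(strings, documents)
print(matches)
print(searches)  # len(strings) * len(documents)

# The same product for the numbers in question:
total = 3_000_000 * 100
print(total)
```

The search count grows as the product of the two sizes, which is where
the 300 million figure comes from.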

--
Manuel
sometimes :) sometimes :(
but always working hard for Simplelógica: appearance,
experience and communication on the web.
http://simplelogica.net # (+34) 985 22 12 65

Oh! And writing at Logicola: http://logicola.simplelogica.net