[thelist] text keys in MySQL

Anthony Baratta Anthony at Baratta.com
Mon Jul 1 22:28:01 CDT 2002

At 06:03 PM 7/1/2002, Christopher Joseph wrote:

>am trying to build a boolean search over mysql using php. Think the php is
>OK but the SQL doesn't seem to be returning any records.
>this is the query -->
>SELECT artid,faqid,title,content,
>match (title,content) against ('XYZ') as relevance
>FROM faqcont
>match (title,content) against ('XYZ')>0
>HAVING relevance>0
>ORDER BY relevance DESC

First off, according to the documentation your Match statement is a bit
"wrong". There is no need to have > 0 with the Match() Against() nor the
Having relevance > 0.


SELECT * FROM articles
WHERE MATCH (title,body) AGAINST ('database');


Also the docs say this:


"MySQL uses a very simple parser to split text into words. A ``word'' is
any sequence of characters consisting of letters, numbers, `'', and `_'.
Any ``word'' that is present in the stopword list or is just too short (3
characters or less) is ignored.

Every correct word in the collection and in the query is weighted according
to its significance in the query or collection. This way, a word that is
present in many documents will have lower weight (and may even have a zero
weight), because it has lower semantic value in this particular collection.
Otherwise, if the word is rare, it will receive a higher weight. The
weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was
carefully tuned this way). For very small tables, word distribution does
not reflect adequately their semantic value, and this model may sometimes
produce bizarre results.

mysql> SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

The search for the word MySQL produces no results in the above example,
because that word is present in more than half the rows. As such, it is
effectively treated as a stopword (that is, a word with zero semantic
value). This is the most desirable behaviour -- a natural language query
should not return every second row from a 1GB table.

A word that matches half of rows in a table is less likely to locate
relevant documents. In fact, it will most likely find plenty of irrelevant
documents. We all know this happens far too often when we are trying to
find something on the Internet with a search engine. It is with this
reasoning that such rows have been assigned a low semantic value in this
particular dataset."


Therefore only have two records will equal any match at 50% and therefore
no data will be returned.
Anthony Baratta
Keyboard Jockeys

"Conformity is the refuge of the unimaginative."

More information about the thelist mailing list