[thelist] [TIP] - Use UTF-8 whenever possible, or get used to extra doses of caffeine.

VOLKAN ÖZÇELİK volkan.ozcelik at gmail.com
Thu May 11 02:32:58 CDT 2006


> VOLKAN ÖZÇELIK offered up a nice tip extolling the use of UTF-8:


Thanks.


> What happens on the search if they get the diacritical marks incorrect
> (They type the "O" with a circumflex instead of a diaeresis or trema)
> or leave the marks off entirely?  No match?


Well, I guess it depends on the type of application. I generally expect
users to type the extended characters correctly.

Just out of curiosity, I ran several searches on Google to test it.

Note that you may not see what I write correctly due to encoding issues :)
That's why I use "ozcelik" in my signature (and not "ÖZÇELİK"). You should
get the idea even so.

Anyway, for the keyword "bahçecilik" (Tr: gardening):
Googling "bahceçilik" and "bahçecilik" (and even "bahceçilİk") gave relevant
results.
Googling "bahcecilik" also gave mostly similar and relevant results.

To double-check, I searched for several other words and key phrases as well,
and the results were all relevant.
...

It is easy to write a conversion function that takes
"bahçeçİlİk" and converts it to "bahcecilik"

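A minimal sketch of such a conversion function, in Python. The name fold is
made up for illustration, and the approach (Unicode decomposition, then
dropping the combining marks) is just one assumed way to do it. The Turkish
dotless "ı" has no decomposition, so it is mapped by hand.

```python
import unicodedata

def fold(text: str) -> str:
    """Strip diacritics and lowercase, e.g. for accent-insensitive search."""
    # NFKD splits "ç" into "c" + combining cedilla, "İ" into "I" + dot above.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase last, then handle the dotless i, which does not decompose.
    return stripped.lower().replace("ı", "i")
```

Calling fold("bahçeçİlİk") should give "bahcecilik". Lowercasing is done
after stripping so that "İ" ends up as a plain "i".
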
I guess the result can be achieved with something like:

SELECT (WHATEVER) FROM ADozenOfJoinedTables
WHERE field LIKE '%bahçeçİlİk%' OR field LIKE '%bahcecilik%'

Or maybe you can duplicate the data and search the duplicate instead (which
you possibly encode with iso-8859-1; the C# Encoding class can convert
streams from one encoding to another, and I'm sure other languages have
similar facilities).
After searching the duplicated data, you get the indexes, and you retrieve
the actual data by looking up those indexes.
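
A rough sketch of this second approach, using Python's sqlite3 module. The
table and column names (articles, title, title_folded) and the helpers fold,
build and search are all hypothetical, and the duplicate column here holds an
accent-stripped copy rather than a re-encoded one, which is an assumption on
my part about what the duplicate should contain.

```python
import sqlite3
import unicodedata

def fold(text):
    # Strip diacritics and lowercase; the dotless "ı" is mapped by hand.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().replace("ı", "i")

def build(conn, titles):
    # Store the original text plus a folded duplicate, and index the duplicate.
    conn.execute(
        "CREATE TABLE articles ("
        "id INTEGER PRIMARY KEY, title TEXT, title_folded TEXT)"
    )
    conn.execute("CREATE INDEX idx_title_folded ON articles (title_folded)")
    conn.executemany(
        "INSERT INTO articles (title, title_folded) VALUES (?, ?)",
        [(t, fold(t)) for t in titles],
    )

def search(conn, query):
    # Step 1: search the duplicated data only, collecting the row ids.
    rows = conn.execute(
        "SELECT id FROM articles WHERE title_folded = ? ORDER BY id",
        (fold(query),),
    ).fetchall()
    ids = [row[0] for row in rows]
    if not ids:
        return []
    # Step 2: fetch the actual (unfolded) data for those ids.
    marks = ",".join("?" * len(ids))
    found = conn.execute(
        f"SELECT title FROM articles WHERE id IN ({marks}) ORDER BY id", ids
    )
    return [row[0] for row in found]
```

With an in-memory database (sqlite3.connect(":memory:")) and a row titled
"Bahçecilik", searching for either "bahceçilİk" or "bahcecilik" should return
that row, since both fold to the same key. The equality test on the indexed
column is what avoids the LIKE scan.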

imho, the second approach will be faster (using the LIKE operator
excessively can kill performance), at the cost of extra storage space.

I am not sure, though, and I'd be glad to hear of any other relevant
solutions to this.

anyway, encoding is, and will always be, a pain in the rear, whether it is
UTF or ISO.


HTH,
-- 
Volkan Ozcelik
+>Yep! I'm blogging! : http://www.volkanozcelik.com/volkanozcelik/blog/
+> My projects/studies/trials/errors : http://www.sarmal.com/

