[thelist] [TIP] - Use UTF-8 whenever possible, or get used to extra doses of caffeine.

VOLKAN ÖZÇELİK volkan.ozcelik at gmail.com
Thu May 11 10:32:13 CDT 2006


Hi Ron,


> Well yeah, getting users to type things correctly would certainly be
> helpful.   ;-P


Actually, getting users to do anything correctly is a huge problem.
End users are excellent at detecting bugs and security holes.
If there is a task to be done, a thousand users will do it in a thousand
different ways.

The best test is the one done in production, isn't it? :) (Note that the
thread title contains "extra caffeine".)


> Thanks for taking the question seriously ... But I suspect Google does some
> pretty fancy stuff and doesn't represent
> the kind of results you would get building a simple "SELECT <stuff> from
> UTF-8 Back-End WHERE [user input] = <field>".


I agree. The way Google handles things is far more complicated than
an SQL query.
Google works with thousands of mysterious pigeons to search the web
(http://www.google.com/technology/pigeonrank.html)


> However, that is still an interesting idea about maintaining a 'shadow'
> search field with the characters adjusted ... I'll
> have to think about that some more.


Well, using things like "covering indexes", "full-text indexes", etc. will
ease the pain, again at the cost of extra storage.
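
As for the 'shadow' search field itself, here is a minimal sketch of how the
adjusted value could be computed before it is stored next to the original
text. The function name fold_for_search is made up for illustration, and the
approach (Unicode NFKD decomposition, then dropping the combining marks) is
only one possible way to "adjust" the characters, not something any
particular database does for you.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Return a lowercase, accent-free key for a hypothetical shadow column."""
    # NFKD splits an accented letter into a base letter plus combining marks,
    # e.g. "Ö" becomes "O" followed by U+0308 (combining diaeresis).
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() meant for caseless matching.
    # Caveat: letters without a decomposition (such as the Turkish dotless
    # "ı") pass through unchanged and would need an explicit mapping.
    return stripped.casefold()


if __name__ == "__main__":
    print(fold_for_search("VOLKAN ÖZÇELİK"))
```

The same function would be applied to the user input before comparing it
with the shadow field, so that "ozcelik" and "Özçelik" end up as the same key.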

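When it comes to storage, ISO-8859-x has an advantage over the Unicode
encodings. ISO-8859-x always takes 1 byte per character, whereas UTF-16
takes 2 bytes for most characters, and UTF-8 takes 1 byte for ASCII but
2 bytes for accented letters such as ö or ç (and up to 4 bytes for
characters of other scripts).

For single-byte code pages (e.g. ISO 8859-1, Windows 1252), if the field we
store data in has 20 characters on average, and we take the worst case of
2 bytes per character for the Unicode version, the gain is 20 bytes per
record.
If the database had 1 million records, the savings would be 20 million
bytes! (With UTF-8 and mostly ASCII text, the real figure is much smaller,
since only the non-ASCII characters cost an extra byte.)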
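
To put numbers on it, here is a small sketch that measures how many bytes
the same string takes in different encodings. The helper name encoded_sizes,
the sample string and the choice of ISO-8859-9 (the Turkish code page) are
just assumptions for the example; real data will give different ratios.

```python
def encoded_sizes(text, encodings=("iso-8859-9", "utf-8", "utf-16-le")):
    """Map each encoding name to the number of bytes text occupies in it."""
    return {enc: len(text.encode(enc)) for enc in encodings}


if __name__ == "__main__":
    sample = "Volkan Özçelik"  # 14 characters, 2 of them non-ASCII
    sizes = encoded_sizes(sample)
    for enc, size in sizes.items():
        print(f"{enc:12} {size} bytes")

    # Extra bytes UTF-8 needs over the single-byte code page,
    # extrapolated to a hypothetical table of 1 million such records.
    records = 1_000_000
    extra = (sizes["utf-8"] - sizes["iso-8859-9"]) * records
    print(f"extra bytes for {records} records: {extra}")
```

For this sample the single-byte code page needs 14 bytes, UTF-8 needs 16 and
UTF-16 needs 28, so the UTF-8 overhead is only the two accented letters.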

Anyway, storage is cheap nowadays, so you may say who cares :)


Cheers,
-- 
Volkan Ozcelik
+>Yep! I'm blogging! : http://www.volkanozcelik.com/volkanozcelik/blog/
+> My projects/studies/trials/errors : http://www.sarmal.com/


