[thelist] [TIP] - Use UTF-8 whenever possible, or get used to extra doses of caffeine.

Info@internetvraagbaak.nl info at internetvraagbaak.nl
Thu May 11 03:08:06 CDT 2006


Maybe from a different angle: I am not really a programmer, but we once had
to do a project with a great many records in a MySQL database. It was not
really well organised or sanitized, but OK. The search they had was one that
would run through ALL the records, and yes, the records contained huge
amounts of information: plain text, text pasted from Word documents, etc.,
all in one table field.
Yes, it looked messy ;-)

So we decided to build a search index, with a cron job run once in a while
that would scan all new records and add words and references to the search
index.
If you do create an index like that, it is, as far as I can see, easier to
do some character replacement.
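
To make the idea concrete, here is a rough sketch in Python (not what we
actually ran, just an illustration). The names fold, add_to_index and search
are made up for this example, and a plain in-memory dictionary stands in for
the index table. It assumes that stripping the marks and lowercasing is good
enough for matching; the Turkish dotless "ı" has no decomposition, so it is
mapped by hand.

```python
import re
import unicodedata
from collections import defaultdict


def fold(text):
    """Strip diacritical marks and lowercase, e.g. for index keys."""
    # NFKD splits a letter such as "ç" into "c" plus a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keep the base letters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Dotless "ı" (U+0131) has no decomposition, so replace it explicitly.
    return stripped.lower().replace("\u0131", "i")


def add_to_index(index, record_id, text):
    """Add every folded word of a record to the index (word -> record ids)."""
    for word in re.findall(r"\w+", fold(text)):
        index[word].add(record_id)


def search(index, query):
    """Return the ids of records that contain all words of the query."""
    words = re.findall(r"\w+", fold(query))
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical records; a cron job would feed in the new rows instead.
index = defaultdict(set)
add_to_index(index, 1, "Bahçecilik kitabı")
add_to_index(index, 2, "Yemek kitabı")
```

Because both the stored words and the query go through the same fold
function, search(index, "bahcecilik") and search(index, "BAHÇECİLİK") find
the same record, whichever way the marks were typed.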

Still, in global search engines we can hardly find particular information
when we cannot write Cyrillic, for instance.
That is also one of the reasons why I am not in favour of domain names with
special characters in them.

Correct me if I am wrong.

jeroen



>> VOLKAN ÖZÇELIK offered up a nice tip extolling the use of UTF-8:
>
>
> Thanks.
>
>
>> What happens on the search if they get the diacritical marks incorrect
>> (They type the "O" with a circumflex instead of a diaeresis or trema)
>> or leave the marks off entirely?  No match?
>
>
> Well, I guess it depends on the type of application. I generally expect
> the users to type the extended characters correctly.
>
> Just out of curiosity I did several searches on Google to test it.
>
> Note that you may not see what I write due to encoding issues :) That's
> why I use ozcelik in my signatures (and not ÖZÇELİK). You can get the idea
> even so.
>
> Anyway, for the keyword "bahçecilik" (Tr: gardening):
> googling "bahceçilik" and "bahçecilik" (and even "bahceçilİk") gave
> relevant results.
> googling "bahcecilik" also gave mostly similar and relevant results.
>
> To make sure, I searched for several other words and key phrases as well,
> and the outcomes were all relevant.
> ...
>
> It is easy to write a conversion function that takes
> "bahçeçİlİk" and converts it to "bahcecilik"
>
> I guess the result can be achieved with something like:
>
> SELECT (WHATEVER) FROM ADozenOfJoinedTables
> WHERE field LIKE "%bahçeçİlİk%" OR field LIKE "%bahcecilik%"
>
> Or maybe you can duplicate the data and search the duplicated data (which
> you possibly encode with ISO-8859-1; the C# Encoding class can convert
> streams from one encoding to another, and I'm sure other languages have
> similar things).
> After searching the duplicated data, you get indexes and retrieve the
> actual data by looking at those indexes.
>
> imho, the second approach will be faster (using LIKE operator excessively
> can kill performance) at a cost of extra storage space.
>
> I am not sure, but I'll be glad to hear, if there are other relevant
> solutions to this.
>
> anyway, encoding is, and will always be, a pain in the rear, whether it is
> UTF or ISO.
>
>
> HTH,
> -- 
> Volkan Ozcelik
> +>Yep! I'm blogging! : http://www.volkanozcelik.com/volkanozcelik/blog/
> +> My projects/studies/trials/errors : http://www.sarmal.com/
> -- 



