[thelist] [TIP] - Use UTF-8 whenever possible, or get used to extra doses of caffeine.

Judah McAuley judah at wiredotter.com
Thu May 11 12:23:46 CDT 2006


Luther, Ron wrote:
> I was thinking more of a case where the data contained several
> spellings for the same item ... Let's say we're looking at purchase
> order data from around the world and that for POs entered in the US,
> buyers type in "battery", but that for POs entered in Brazil, buyers
> enter "bẫttery".  [I know, probably a retarded example - this is for
> illustrative purposes only.]
> 
> Now, unless you write your own extremely fancy search engine, I
> suspect that a single standard search initiated by an end user will
> only return either the US entered 'battery' POs -or- the Brazilian.
> My guess is that the user will expect the data to be complete,
> ["Hell, I entered 'battery' didn't I?"]

If I were running into this situation, I'd look at enterprise-level
search solutions like Verity. It looks like Verity was bought by
Autonomy, and their K2 server is now called IDOL K2. I know that K2 is
multilingual and stores all its internal information in UTF-8 format.

I haven't done any multilingual searching with it, but based on your
example, I could do a similarity search for "battery", and "bẫttery"
would come up as a high-percentage match even without taking
translation into account. I would hope that the beefier applications out
there would also compile dictionaries together to do synonym searches
across languages.
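
To make the similarity-search idea concrete, here is a rough sketch in
Python. It is not how K2 works internally (I don't know its scoring); it
just uses the standard library's difflib to score strings, and the
function names, the sample list, and the 0.8 threshold are all made up
for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings, ignoring case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def similar_matches(query, candidates, threshold=0.8):
    """Return (candidate, score) pairs at or above threshold, best first.

    The 0.8 default is an arbitrary assumption, not a recommended value.
    """
    scored = [(c, similarity(query, c)) for c in candidates]
    hits = [(c, s) for c, s in scored if s >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical purchase-order item names; "b\u1eabttery" is "bẫttery".
po_items = ["battery", "b\u1eabttery", "butter", "cable"]
hits = similar_matches("battery", po_items)

for item, score in hits:
    print(f"{item}: {score:.2f}")
```

With that sample list, both spellings of "battery" clear the threshold
while "butter" and "cable" do not. A real engine would do this against
an index rather than scoring every row.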

For pure SQL hacks, I know that with full-text searches in MS SQL
Server you can specify an accent-insensitive search, so "cafe" and
"café" would match if you searched for either one. When you get into
completely different words across languages, though ("rojo" and "red",
for instance), I think you'll need something more than SQL.
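
If the database can't do the accent-insensitive part for you, the same
effect can be approximated in application code. This is a sketch under
that assumption, using Python's unicodedata module; it is not what SQL
Server does internally, and the helper names are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings, ignoring accents and case."""
    return strip_accents(a).casefold() == strip_accents(b).casefold()


# "caf\u00e9" is "café"; "b\u1eabttery" is "bẫttery".
print(accent_insensitive_equal("cafe", "caf\u00e9"))
print(accent_insensitive_equal("battery", "b\u1eabttery"))
print(accent_insensitive_equal("rojo", "red"))
```

The first two comparisons come out equal and the third does not, which
is the point above: stripping accents handles spelling variants, but
"rojo" versus "red" needs a dictionary or synonym list. Note that this
only works on proper Unicode strings, so it assumes the data was decoded
correctly (UTF-8 or otherwise) before comparison.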

Judah



