[thelist] [TIP] - Use UTF-8 whenever possible, or get used to extra doses of caffeine.
Judah McAuley
judah at wiredotter.com
Thu May 11 12:23:46 CDT 2006
Luther, Ron wrote:
> I was thinking more of a case where the data contained several
> spellings for the same item ... Let's say we're looking at purchase
> order data from around the world and that for POs entered in the US,
> buyers type in "battery", but that for POs entered in Brazil, buyers
> enter "bẫttery". [I know, probably a retarded example - this is for
> illustrative purposes only.]
>
> Now, unless you write your own extremely fancy search engine, I
> suspect that a single standard search initiated by an end user will
> only return either the US entered 'battery' POs -or- the Brazilian.
> My guess is that the user will expect the data to be complete,
> ["Hell, I entered 'battery' didn't I?"]
If I were running into this situation, I'd look at enterprise-level
search solutions like Verity. It looks like Verity got bought out by
Autonomy, and their K2 server is now called IDOL K2. I know that K2 is
multilingual and stores all its internal information in UTF-8.
I haven't done any multilingual searching with it, but based on your
example, I could do a similarity search for "battery", and "bẫttery"
would come up as a high-percentage match even without taking
translation into account. I would hope that the beefier applications out
there would also compile dictionaries together to do
synonym searches across languages.
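
As a rough illustration of the similarity-search idea (not of how K2 works
internally, which I haven't looked into), here is a minimal Python sketch
that scores candidate strings against a query with the standard library's
difflib. The function names and the 0.8 threshold are made up for this
example.

```python
import difflib
import unicodedata


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two strings."""
    # Normalize to NFC so an accented letter is one code point on both
    # sides, and casefold so the comparison ignores case.
    a = unicodedata.normalize("NFC", a).casefold()
    b = unicodedata.normalize("NFC", b).casefold()
    return difflib.SequenceMatcher(None, a, b).ratio()


def similar_terms(query, candidates, threshold=0.8):
    """Return (score, term) pairs at or above threshold, best match first.

    The 0.8 default is an arbitrary value chosen for illustration.
    """
    scored = [(similarity(query, term), term) for term in candidates]
    return sorted(
        [(score, term) for score, term in scored if score >= threshold],
        reverse=True,
    )


if __name__ == "__main__":
    for score, term in similar_terms("battery", ["battery", "bẫttery", "bottle"]):
        print(f"{score:.2f}  {term}")
```

With these inputs, "bẫttery" scores roughly 0.86 against "battery", while
"bottle" falls below the threshold and is dropped.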

For pure SQL hacks, I know that with full-text searches in MS SQL
Server you can specify an accent-insensitive search, so "cafe" and
"café" would match if you searched for either one. When you get
into completely different languages, though ("rojo" and "red", for
instance), I think you'll need something more than SQL.
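
To show what accent-insensitive matching amounts to outside the database,
here is a small Python sketch that strips combining marks before comparing.
It is an approximation of the idea, not SQL Server's actual implementation,
and the helper names are hypothetical.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text lowercased with accents and other combining marks removed."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def accent_insensitive_find(query, rows):
    """Return the rows that contain query, ignoring accents and case."""
    key = fold_accents(query)
    return [row for row in rows if key in fold_accents(row)]


if __name__ == "__main__":
    rows = ["Café Noir", "cafe latte", "bẫttery pack", "tea"]
    print(accent_insensitive_find("cafe", rows))
    print(accent_insensitive_find("battery", rows))
```

Folding both the query and the stored text the same way means "cafe" finds
"café" and vice versa. It does nothing for "rojo" versus "red", which still
needs a translation or synonym dictionary.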
Judah