[thelist] Very Large MySQL Table

Luther, Ron Ron.Luther at hp.com
Tue Feb 10 08:33:22 CST 2009


Fred Jones asked:

>>I have now gotten a .com zone file from Verisign, which
>>is basically a list of all registered .com domain names. 
>>There are ~ 185,000,000 domain names 

>>We want to run searches basically to find domain names 
>>with certain words or phrases.


Hi Fred,

Whoa!  Interesting problem.  You're gonna want a little more horsepower for that.  ;-)  If you have a business model that would support the investment you might want to take a look at setting up a dedicate server running something geared to bigger datasets, like Redbrick or Terradata.

[I'm also trying to think ahead a bit here too ... if you build a lot of pre-chewed-Charlie custom index side tables (concentrating on your key markets of course) ... then it's gonna be quite the PITA to replicate that processing next month or next week when you try to load the change file updates to keep your db current.]

But a couple suggestions or things to maybe think about for a moment anyway:


(1) I'm not sure whether you have the Unicode or the ASCII IDNs (or both) in your file. [A]  But, if we can assume you are only interested in 'English' words and phrases you might be able to write a purge script (or 75) to delete entries in other character sets like Cyrillic [B] or entries like R23.com that don't contain any word or phrases.

To do that, you're probably going to need some kind of 'dictionary' attack that allows you to kill entries that are words in Czech but not in English.  Or entries using diacritical characters (if you have the Unicode IDNs).  Parsing by entry length might make that go faster.

It's also going to be interesting to see how you cope with identifying and not killing entries with 'text-speak' words or phrases (like the ever popular *bizba6.com* site!)


(2) Smaller files.  Pre-processing might help you out here.  If your customers were automobile dealerships (for example) you might pre-process a script to generate a side file of all entries containing "Ford", "FOMOCO", "Lincoln", or "Mercury" and only use that smaller file when processing requests from your Ford dealership customers.


(3) Research specialized 'ranking' algorithms for use in indexing.  You may want to think about creating your own 'relevance index' ... adding more weight to key portions of the search phrase appearing in sequence [rating 'Fred' 'Jones' higher than 'Jones' 'Fred'] ... or closer to the beginning of the entry, etc.


Lastly -- I would play devil's advocate and question the value of the effort here.  Domain names are not meta keywords.  Do domains names like evolt.org or 37signals.com tell you anything about what those sites do?  What they provide?  Who they compete with?

But, OTOH, if it's just for a single customer ... 'find out how many people have registered domain names containing the phrase "metal halide coatings" so we can see if that would be a market differentiator for us' ... 'what about "copper hydride coatings"?' ... then screw it:  If those queries take 16 hours to run - then that's what they take to run.


Good Luck,
RonL.


[A]  http://www.dns.pl/english/registration.html#8

[B] Although ... before doing so ... Since you have this cool resource already at hand and all ... maybe you could shoot off some stats to evolt's "i18n" list on the counts and percentages of registered domain names by character set?  _IF_ it could be done, that might be a really cool thing for us to have sitting around.




More information about the thelist mailing list