[thelist] Very Large MySQL Table

Matt Warden mwarden at gmail.com
Sat Mar 7 17:29:43 CST 2009


> On Mon, Feb 9, 2009 at 5:38 PM, Fred Jones <fredthejonester at gmail.com> wrote:
>
>> You may recall that I recently asked about running searches against
>> domain names. I have now gotten a .com zone file from Verisign, which
>> is basically a list of all registered .com domain names. I imported it
>> into my local MySQL. There are ~ 185,000,000 domain names and it's too
>> much for my workstation to run searches on. Just running count(*)
>> takes 10 minutes. It's 6G of data.
>>
>> We want to run searches basically to find domain names with certain
>> words or phrases.
>>
>> We don't have a dedicated server for this but his web account does
>> have disk space for 6Gs, but I am wondering if we will be able to run
>> searches. His idea was to break it up into 8 pieces, and then to run
>> each search 8 times, against just that one eighth of the full list.
>>
>> Anyone have any better idea?

I have one: don't split it up into 8 pieces.

The search you are describing will require a full table scan. Doing 1
scan of N rows will be faster than 8 scans of N/8 rows. The only thing
that would save you is if you only cared about domains that started
with a particular word, in which case you could use an index. However,
that does not sound like your requirement.
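
For example (just a sketch; I'm assuming a table called domains with a
domain_name column, so adjust the names to your schema):

  -- substring search: no index helps, every row gets scanned
  SELECT domain_name FROM domains WHERE domain_name LIKE '%poker%';

  -- prefix search: an ordinary index on domain_name can be used
  SELECT domain_name FROM domains WHERE domain_name LIKE 'poker%';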

The only thing I can think of off the top of my head is adding a
num_chars column and populating it with update yourtable set
num_chars = length(domainname). That update will take a long time, but
you only have to do it once. Then index the num_chars column. You can
then limit the search space to rows long enough to possibly contain the
word you are looking for. This will be less effective for short words
(the db may not even use the index) and more effective for longer
words, obviously. You'd just have to test it out.
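
Roughly, using the same hypothetical domains/domain_name names
(untested, so treat it as a sketch):

  ALTER TABLE domains ADD COLUMN num_chars INT;
  UPDATE domains SET num_chars = LENGTH(domain_name);  -- slow, but one-time
  CREATE INDEX idx_num_chars ON domains (num_chars);

  -- only rows long enough to contain the word need to be examined
  SELECT domain_name
    FROM domains
   WHERE num_chars >= LENGTH('cincinnati')
     AND domain_name LIKE '%cincinnati%';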

Another wild idea that might squeeze a little more performance out
(but you'd have to look at your data, which I can't): split each
domain name on any digits it contains and store the resulting parts in
a separate table, one row for each part, with a foreign key back to
the domain name table. You could then apply the num_chars strategy to
the parts, and you might get a better result that way. Obviously this
doesn't help if the domain names don't tend to have numbers in them,
which I'm guessing the majority don't.
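
Something like this, again only a sketch and assuming the domains
table has an integer id primary key:

  CREATE TABLE domain_parts (
    domain_id INT NOT NULL,
    part      VARCHAR(255) NOT NULL,
    num_chars INT NOT NULL,
    FOREIGN KEY (domain_id) REFERENCES domains (id)
  );
  CREATE INDEX idx_parts_len ON domain_parts (num_chars);

  -- e.g. 'best4poker.com' would get two part rows: 'best' and 'poker.com'
  SELECT DISTINCT d.domain_name
    FROM domain_parts p
    JOIN domains d ON d.id = p.domain_id
   WHERE p.num_chars >= LENGTH('poker')
     AND p.part LIKE '%poker%';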

If you had a lot of space, you could try storing every trailing
substring of each name: domain, omain, main, ain, in. You would then
link those back to the original domain table in a M-M relationship.
This would allow you to index the substrings and turn your search into
an index range scan (a LIKE with no leading wildcard).
However, this is going to balloon your row count, and you may very
well be worse off with an index range that has more rows than your
original table. *IF* your search words are always English words, then
you could significantly reduce the count of your substrings by
removing any that are not found in a set of English dictionary words.
This would probably knock out 75% of your rows and leave you with
essentially an index of all English words mapped to your domain names,
which is your ideal situation.
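
A sketch of that idea, with the same hypothetical names, and with the
M-M link collapsed into a single mapping table for simplicity:

  CREATE TABLE domain_suffixes (
    suffix    VARCHAR(255) NOT NULL,
    domain_id INT NOT NULL,
    KEY idx_suffix (suffix)
  );

  -- 'pokernews' would contribute rows: pokernews, okernews, kernews, ...
  -- a LIKE without a leading wildcard can then use the suffix index:
  SELECT DISTINCT d.domain_name
    FROM domain_suffixes s
    JOIN domains d ON d.id = s.domain_id
   WHERE s.suffix LIKE 'poker%';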

Quite a bit of work, and depending on your requirements, may not even
be adequate. But just some ideas to mull over...

-- 
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com


This email proudly and graciously contributes to entropy.
