[thelist] Very Large MySQL Table

Mon Feb 9 17:24:35 CST 2009

> You may recall that I recently asked about running searches against
> domain names. I have now gotten a .com zone file from Verisign, which
> is basically a list of all registered .com domain names. I imported it
> into my local MySQL. There are ~ 185,000,000 domain names and it's too
> much for my workstation to run searches on. Just running count(*)
> takes 10 minutes. It's 6G of data.
> 
> We want to run searches basically to find domain names with certain
> words or phrases.
> 
> We don't have a dedicated server for this but his web account does
> have disk space for 6Gs, but I am wondering if we will be able to run
> searches. His idea was to break it up into 8 pieces, and then to run
> each search 8 times, against just that one eighth of the full list.
> 
> Anyone have any better idea?

Unless you have 8 processors and the partitions are not on separate
platters, running 8 queries instead of 1 will not help - they will just
each take 8 times longer because they now have 1/8th of the processor
time.

I don't know MySQL very well, but the time taken to do the count(*)
sounds like MySQL doesn't have any statistics (or at least that they
have not built yet) - it should be a quick lookup of the statistics data
rather than a full table scan!
If the former, I would look at using a database that had statistics
built-in.

How often will you be updating this data? If rarely (which I assume is
the case), you might also want to consider using a database engine that
is optimised for read-only databases such as Monet. These index their
data to the hilt by default, so your 6Gb database will grow perhaps even
by 4-5x in size, but the query time is very much reduced as a result.

If you want to use a general purpose db engine, then I would recommend
analysing exactly what your search parameters will be, and design
indexes on the data that will allow searches to be done on the indexes
rather than the actual data. Basically try to ensure that the left-most
part of the index is used in the search (i.e. if searching for domain
names that started with "the", then having an index based on the domain
name would allow the database engine to jump directly to the index pages
that hold domains starting with "the" and would only traverse the pages
with the required domain names in them.)

You might also want to look into the recent movement toward cloud
computing - effectively giving you those 8 (or more) parallel processors
:)

I hope this helps a little
Chris