> On Mon, Feb 9, 2009 at 5:38 PM, Fred Jones <fredthejonester at gmail.com> wrote:
>
>> You may recall that I recently asked about running searches against
>> domain names. I have now gotten a .com zone file from Verisign, which
>> is basically a list of all registered .com domain names. I imported it
>> into my local MySQL. There are ~185,000,000 domain names and it's too
>> much for my workstation to run searches on. Just running count(*)
>> takes 10 minutes. It's 6G of data.
>>
>> We want to run searches basically to find domain names with certain
>> words or phrases.
>>
>> We don't have a dedicated server for this but his web account does
>> have disk space for 6Gs, but I am wondering if we will be able to run
>> searches. His idea was to break it up into 8 pieces, and then to run
>> each search 8 times, against just that one eighth of the full list.
>>
>> Anyone have any better idea?

I have one: don't split it up into 8 pieces. The search you are
describing will require a full table scan, and doing 1 scan of N rows
will be faster than 8 scans of N/8 rows. The only thing that would save
you is if you only cared about domains that *started* with a particular
word, in which case you could use an index on the name itself. However,
that does not sound like your requirement.

The only thing I can think of off the top of my head is adding a
num_chars column and populating it with

  update yourtable set num_chars = length(domainname);

This will take a long time, but you only have to do it once. Then index
the num_chars column. You can then limit the search space based on what
length a domain must have to possibly contain the word you are looking
for. This will be less effective for short words (the db may not even
use the index) and more effective for long words, obviously. You'd just
have to test it out.
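To make that concrete, here's a toy sketch of the num_chars idea using
Python's sqlite3 in place of MySQL (table and column names are made up,
so treat it as a shape rather than a drop-in):

```python
# Sketch of the num_chars pruning idea; sqlite3 stands in for MySQL.
# The table name "domains" and columns "name"/"num_chars" are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE domains (name TEXT, num_chars INTEGER)")
cur.executemany(
    "INSERT INTO domains (name) VALUES (?)",
    [("examplepoker.com",), ("ab.com",), ("pokerstars.com",), ("news.com",)],
)

# One-time pass to fill in the length column, then index it.
cur.execute("UPDATE domains SET num_chars = LENGTH(name)")
cur.execute("CREATE INDEX idx_num_chars ON domains (num_chars)")

# Searching for 'poker': no match can be shorter than the word plus the
# '.com' suffix, so the index range on num_chars prunes short rows
# before the LIKE scan ever touches them.
word = "poker"
cur.execute(
    "SELECT name FROM domains WHERE num_chars >= ? AND name LIKE ?",
    (len(word) + len(".com"), f"%{word}%"),
)
result = sorted(n for (n,) in cur.fetchall())
print(result)  # ['examplepoker.com', 'pokerstars.com']
```

The LIKE '%word%' part is still a scan over the surviving rows; the
index only shrinks how many rows survive.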
Another wild idea that might squeeze out a little more performance (but
you'd have to look at your data, which I can't): split each domain name
up at any numbers and store this index of parts in a separate table, one
row per part, with a foreign key back to the domain name table. You
could then apply the num_chars strategy to the parts, and you might get
a better result that way. Obviously this doesn't help if the domain
names don't tend to have numbers in them, which I'm guessing the
majority don't.

If you had a lot of space, you could try storing every suffix: domain,
omain, main, ain, in. You would then link those back to the original
domain table in a many-to-many relationship. This would allow you to
index the suffixes and perform an index range scan on your search
string. However, this is going to balloon your row count, and you may
very well be worse off with an index range that has more rows than your
original table.

*IF* your search words are always English words, then you could
significantly reduce the count of your suffixes by removing any that
don't begin with a word found in an English dictionary (a range scan on
'word%' can only ever match those anyway). That would probably knock
out 75% of your rows and leave you with essentially an index of all
English words mapped to your domain names, which is your ideal
situation.

Quite a bit of work, and depending on your requirements, it may not
even be adequate. But just some ideas to mull over...

--
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.
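P.S. A rough sqlite3 sketch of the suffix-table idea, in case it helps
(again, sqlite stands in for MySQL, the table/column names and the tiny
dictionary are invented, and the dictionary filter is the optional
pruning step):

```python
# Sketch of the suffix table: one row per suffix of each domain label,
# keyed back to the domain row. An index on the suffix column turns a
# "contains word" search into an index range scan (suffix LIKE 'word%').
import sqlite3

ENGLISH = {"poker", "star", "stars", "news", "example"}  # stand-in dictionary

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE domains (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE suffixes (suffix TEXT, domain_id INTEGER)")

for name in ["pokerstars.com", "ab.com", "newsexample.com"]:
    cur.execute("INSERT INTO domains (name) VALUES (?)", (name,))
    did = cur.lastrowid
    label = name[: -len(".com")]
    for i in range(len(label) - 1):  # domain, omain, ..., down to 2 chars
        s = label[i:]
        # Optional pruning: keep only suffixes that *begin* with a
        # dictionary word, since those are all a prefix scan can match.
        if any(s.startswith(w) for w in ENGLISH):
            cur.execute("INSERT INTO suffixes VALUES (?, ?)", (s, did))
cur.execute("CREATE INDEX idx_suffix ON suffixes (suffix)")

# Index range scan: every domain containing 'star' anywhere in its label.
cur.execute(
    """SELECT DISTINCT d.name FROM domains d
       JOIN suffixes s ON s.domain_id = d.id
       WHERE s.suffix LIKE 'star%'"""
)
matches = sorted(n for (n,) in cur.fetchall())
print(matches)  # ['pokerstars.com']
```

At 185M domains the suffix table will be huge even after pruning, so
you'd want to measure before committing to it.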