On Mon, May 5, 2008 at 12:37 PM, Joshua Olson <joshua at waetech.com> wrote:
> Hi guys,
>
> A client of mine has a database of businesses, numbering 20k+. In this
> database, there are some definite duplicates, but they're not all exactly
> the same. For example, the zip code may be wrong, the city name may be
> wrong, the business name may be spelled differently (missing words like
> "the", for example), and of course the addresses may be wrong -- some have
> the street names spelled out, others have the street names abbreviated...
> Some include the suite number, others do not. What a mess.
>
> Any ideas on techniques I can use to produce a list of possible duplicates
> that a human would then discern?

- Soundex/NYSIIS
  (http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)
- Substring comparison
- Weighted identifier comparison (FEIN gets high weight, name somewhat less,
  street address less still, zip low weight, zip extension very low weight,
  etc.)
- A combination of the above

Basically, you have to decide how many man-hours you want to spend on manual
review. The more hours, the more relaxed your duplicate identification can
be, and the more duplicates you will catch -- to an extent: if the candidate
set grows too large, the quality of the manual review will suffer. If you
want very few false positives, then your identification needs to be tight,
and many true duplicates will never make it into the set to be reviewed.

--
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.
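P.S. Here is a minimal sketch of the Soundex-plus-weighted-comparison idea in Python. The field names, the weight values, and the crude street normalization are all placeholders I made up for illustration -- your schema and tuning will differ:

```python
# Sketch: Soundex-normalized names plus weighted field comparison.
# Field names and weights are hypothetical -- adjust to your schema.

def soundex(name):
    """Classic 4-character Soundex code (e.g. 'Robert' -> 'R163')."""
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
              "l": "4", "mn": "5", "r": "6"}

    def digit(ch):
        for letters, d in groups.items():
            if ch in letters:
                return d
        return ""          # vowels, y, h, w carry no digit

    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), digit(letters[0])
    for ch in letters[1:]:
        d = digit(ch)
        if d and d != prev:
            out += d
        if ch not in "hw":  # h/w do not break a run of equal codes
            prev = d
    return (out + "000")[:4]

# Hypothetical weights: a FEIN match nearly settles it; zip+4 barely matters.
WEIGHTS = {"fein": 0.50, "name": 0.25, "street": 0.15,
           "zip": 0.08, "zip4": 0.02}

def dup_score(a, b):
    """Weighted similarity between two business records, 0.0 to 1.0."""
    score = 0.0
    if a.get("fein") and a.get("fein") == b.get("fein"):
        score += WEIGHTS["fein"]
    if soundex(a.get("name", "")) == soundex(b.get("name", "")):
        score += WEIGHTS["name"]
    # Crude street normalization: compare only the first two tokens, so
    # "123 Main St" and "123 Main Street" still collide.
    sa = a.get("street", "").lower().split()[:2]
    sb = b.get("street", "").lower().split()[:2]
    if sa and sa == sb:
        score += WEIGHTS["street"]
    if a.get("zip") and a.get("zip") == b.get("zip"):
        score += WEIGHTS["zip"]
    if a.get("zip4") and a.get("zip4") == b.get("zip4"):
        score += WEIGHTS["zip4"]
    return score
```

Score every candidate pair (blocking on zip or Soundex code first keeps it well under the 20k^2 comparisons), then hand everything above a chosen threshold to the human reviewers -- a higher threshold means fewer false positives but more missed duplicates, per the tradeoff above.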