[thelist] Duplicate listings detection

Mon May 5 13:31:55 CDT 2008

On Mon, May 5, 2008 at 12:37 PM, Joshua Olson <joshua at waetech.com> wrote:
> Hi guys,
>
>  A client of mine has a database of businesses, numbering 20k+.  In this
>  database, there are some definite duplicates, but they're not all exactly
>  the same.  For example, the zip code may be wrong, the city name may be
>  wrong, the business name may be spelled differently (missing words like
>  "the", for example), and of course the addresses may be wrong--some have the
>  street names spelled out, others have the street names abbreviated... Some
>  include the suite number, others do not.  What a mess.
>
>  Any ideas on techniques I can use to produce a list of possible duplicates
>  that a human would then discern?

- Soundex/NYSIIS
(http://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System)
- Substring comparison
- Weighted identifier comparison (FEIN has high weight, name somewhat
less, street address less again, zip has low weight, zip extension
has very low weight, etc.)
- Combination of the above
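
As a rough illustration of the weighted comparison, here is a minimal
Python sketch. The field names, the weight values, and the use of
difflib's SequenceMatcher as the string similarity measure are all
assumptions of mine; nothing here is tuned against real data, and a
phonetic key such as Soundex or NYSIIS could be swapped in for the
similarity function.

```python
from difflib import SequenceMatcher

# Hypothetical weights; only their ordering follows the list above.
WEIGHTS = {"fein": 5.0, "name": 3.0, "street": 2.0, "zip": 1.0, "zip4": 0.5}


def normalize(value):
    """Lowercase, turn punctuation into spaces, and drop a leading 'the'."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in value.lower())
    words = cleaned.split()
    if words and words[0] == "the":
        words = words[1:]
    return " ".join(words)


def similarity(a, b):
    """Similarity of two strings after normalization, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average similarity over the fields present in both records."""
    total = 0.0
    used = 0.0
    for field, weight in weights.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if not a or not b:
            continue  # a missing value counts neither for nor against
        total += weight * similarity(a, b)
        used += weight
    return total / used if used else 0.0


a = {"name": "The Acme Widget Co.", "street": "12 Main Street", "zip": "45202"}
b = {"name": "Acme Widget Co", "street": "12 Main St Suite 4", "zip": "45203"}
c = {"name": "Bob's Bakery", "street": "900 Elm Avenue", "zip": "10001"}

print(match_score(a, b))  # likely duplicate, should score high
print(match_score(a, c))  # unrelated business, should score low
```

Skipping fields that are missing from either record keeps a blank FEIN
from dragging the score down, but that is a design choice you may want
to reverse.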

Basically, you have to decide how many person-hours you want to spend
on manual review. The more hours you have, the looser your duplicate
matching can be, and the more true duplicates you will catch (up to a
point: if the candidate set gets too large, the quality of the manual
review will suffer). If you want very few false positives, your
matching needs to be tight, and many real duplicates will never make
it into the set to be reviewed.
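
To make the trade-off concrete, here is a small Python sketch that
builds the review list at two different cutoffs. The sample names, the
0.8 and 0.6 thresholds, and the name-only scoring are made up for
illustration; the cutoff you actually use depends on how much review
time you have.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_score(a, b):
    """Case-insensitive string similarity from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(names, threshold):
    """Return (a, b, score) for every pair scoring at or above threshold."""
    pairs = []
    for a, b in combinations(names, 2):
        score = name_score(a, b)
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    # Highest scores first, so the reviewer sees the likeliest duplicates early.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical sample data.
NAMES = [
    "Acme Widget Company",
    "Acme Widget Co",
    "Acme Widgets",
    "Bob's Bakery",
    "Bobs Bakery Inc",
]

tight = candidate_pairs(NAMES, 0.8)  # short list, misses some duplicates
loose = candidate_pairs(NAMES, 0.6)  # longer list, more for a human to review

print(len(tight), "pairs at 0.8")
print(len(loose), "pairs at 0.6")
```

Note that comparing every pair this way grows quadratically: 20k
records is roughly 200 million pairs, so a full run would need some
way of narrowing down which records get compared at all.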

-- 
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.