[thelist] Duplicate listings detection

Joshua Olson joshua at waetech.com
Mon May 5 11:37:26 CDT 2008


Hi guys,

A client of mine has a database of more than 20,000 businesses.  It
contains some definite duplicates, but the duplicate records are not always
identical.  For example, the zip code may be wrong, the city name may be
wrong, the business name may be spelled differently (missing words like
"the", for example), and of course the addresses may differ: some have the
street names spelled out, others have them abbreviated, and some include the
suite number while others do not.  What a mess.

Any ideas on techniques I can use to produce a list of possible duplicates
that a human could then review and confirm?
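
To make the question concrete, here is a rough sketch of the naive approach
I can picture, in Python.  Everything in it is an assumption for
illustration: the field names ("id", "name", "address"), the abbreviation
and filler-word tables, the 0.85 threshold, and grouping records by the
first three characters of the cleaned-up name so that not every record is
compared with every other one.  It normalizes names and addresses, scores
each pair with difflib's SequenceMatcher, and returns the likely pairs for a
person to check.  It ignores city and zip entirely, and that grouping step
would miss a duplicate whose name is misspelled in its first few letters.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical lookup tables; the real data would need longer ones.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "road": "rd",
                 "boulevard": "blvd", "drive": "dr", "suite": "ste"}
FILLER_WORDS = {"the", "inc", "llc", "co"}


def normalize(text):
    """Lowercase, strip punctuation, drop filler words, abbreviate street words."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words
                    if w not in FILLER_WORDS)


def normalize_address(address):
    """Normalize an address and cut it off at the suite/unit marker, if any."""
    words = normalize(address).split()
    for marker in ("ste", "unit"):
        if marker in words:
            words = words[:words.index(marker)]
    return " ".join(words)


def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def candidate_pairs(records, threshold=0.85):
    """Return (score, id_a, id_b) tuples for likely duplicates, best first."""
    # Group records by the first three characters of the cleaned-up name
    # so only records within the same group are compared pairwise.
    blocks = defaultdict(list)
    for rec in records:
        blocks[normalize(rec["name"])[:3]].append(rec)

    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            name_score = similarity(normalize(a["name"]),
                                    normalize(b["name"]))
            addr_score = similarity(normalize_address(a["address"]),
                                    normalize_address(b["address"]))
            score = (name_score + addr_score) / 2
            if score >= threshold:
                pairs.append((score, a["id"], b["id"]))
    return sorted(pairs, reverse=True)
```

With made-up records such as "The Pizza Place" at "123 Main Street, Suite
200" and "Pizza Place" at "123 Main St.", both the name and the address
normalize to the same strings, so the pair would be flagged for review.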

Thanks.

<><><><><><><><><><>
Joshua L. Olson
WAE Technologies, Inc.
Augusta, Georgia Web Design
http://www.waetech.com/
Phone: 706.210.0168
Fax: 707.988.0168
Private Enterprise Number: 28752

Portfolio:
http://www.waetech.com/design/portfolio/

Monitor bandwidth usage on IIS6 in real-time:
http://www.waetech.com/services/iisbm/ 


