[thelist] Duplicate listings detection

David Kaufman david at gigawatt.com
Mon May 5 21:29:58 CDT 2008


Hi Joshua,

"Joshua Olson" <joshua at waetech.com> wrote
> Hi guys,
>
> A client of mine has a database of businesses, numbering in the 20k+.  In
> this database, there are some definite duplicates, but they're not all
> exactly the same.  For example, the zip code may be wrong, the city name
> may be wrong, the business name may be spelled differently (missing words
> like "the", for example), and of course the addresses may be wrong--some
> have the street names spelled out, others have the street names
> abbreviated... Some include the suite number, others do not.  What a 
> mess.
>
> Any ideas on techniques I can use to produce a list of possible
> duplicates that a human would then discern?
>
> Thanks.


Back in a previous millennium... :-) I used to work for a direct mail
company.  We helped advertisers rent mailing lists from various
sources (magazine subscriptions, warranty card returners, utility company
customers, different mail order companies'
customers-who-bought-stuff-like-you're-selling lists).  For 18 months,
"de-duping" mailing addresses was my life.

We (claimed we) did a good job of finding lots of long lists of people
who'd be likely to buy whatever our clients might happen to be selling, but
as you might guess, when a client paid to send tens of thousands of pieces
of snail mail (we just called it "mail", back then -- go figure <smirk>)
through the post office, they really *hated* it when even just a
slightly measurable fraction of those recipients received two (or seven)
copies of the exact same offer at the same time.

So we also (claimed we) did a good job of deduping those
oh-so-slightly-and-maddeningly-different lists of prospective buyers (you
never got more than one AOL disk in the mail at the same time, did you?).
Even though about 75% of the organization's CPU cycles were dedicated
solely to the task of efficiently (and accurately) de-duping these enormous
lists of postal mailing addresses, it was a very tricky and error-prone
task!

The problem was that we had a *huge* likelihood of duplication -- some
lists even contained *mostly* dupes.  You see, if we did (part one of) our
job well and found, say, 10 completely different mailing lists, all from
different and far-flung corners of the marketing world, and they all really
*were* people with the same interests (and spending habits) as the client's
target market demographic, then the chances were excellent that many of the
*same* people would be on several of those very different companies'
mailing lists.  (People who subscribe to hunting magazines also tend to
send in the warranty cards for their night-vision scopes, and buy expensive
name-brand hiking boots from mail order catalogs -- d'oh!)  And of course
those different companies stored their data in different types of
databases, with different field lengths and validation, almost all of which
(back then) entered the database via manual data entry, transcribed from
hand-written order forms and such.  All fuzzy, messy, wet, human, analog
issues which made (part two of) our job, the task of de-duplification, all
that much harder.

When you go into such a task, with the foreknowledge that there are many
redundant names/addresses on and among all these different databases, then
you can get very aggressive at finding them.  And aggressive we got,
because interestingly, in the (ethical, above board, not the shady
bottom-feeder end of the) list-rental business, you pay a one-time fee for
the right to mail your offer to their list just one time.  Then you have to
discard it.  (As another off-topic aside, another (part 3) of the business
is that everyone in the chain-of-list-custody typically enforces that rule
by "seeding" all the lists that go through their hands, which means
sprinkling in the names and addresses of some spies who send you *all*
their junk mail, so you can test whether a mailer secretly tries to mail
the list a second time, or sell it, or otherwise disclose it to an
unauthorized third party.)  But the point was that lists are priced per
thousand names, and the advertiser gets a credit for any that were
unusable due to being dupes, because they didn't even mail the name once
(from that list).  So if we eliminated 75% of the names on a list as dupes,
they loved us that much more, not just because we saved them 75% of the
list rental fee, but because that meant that the other 25% of the names had
a much higher "affinity" or likelihood of being interested in the offer,
because so many people on that list were on other lists of people with
(presumably) similar interests.

Anyway, in answer to your question -- what was your question again?  Oh
right, techniques.  Well, by far one of the simplest, easiest to implement
and most effective techniques is to use U.S. postal address standardization
software (or, more likely, a postal address standardization *service*).

This is a system that turns:

  2514 Ridgecrest Drive
  Augusta, GA 30907

into:

  2514 RIDGECREST DR
  AUGUSTA,GA 30907-4810

Maybe not rocket science, but as you know from your database, addresses
get thorny sometimes.  For instance, when filling out web forms (or
warranty cards, or magazine subscriptions) people use all sorts of
variations and abbreviations, and (as you also noted) make typos.  Good
postal address standardization can reduce most of the common variations
to a single canonical form:

For instance, whether you enter "Ridgecrest Drive", "Ridgecrest Dr.", just
"Ridgecrest", or misspell it "Rigdecrest", it does the "fuzzy math"
necessary to standardize it, as above, to a canonical form that only a
postal worker could love.

It's cool to watch it reduce variations like

  1101 Northwest 14th Street, Suite 803
  Washington, DC 20050

(even with the typo in the zip code!) correctly to:

  1101 14TH ST NW STE 803
  WASHINGTON,DC 20005-5601

It parses floor numbers and suite numbers out of an office building
address, knows every street type (from Avenue, Boulevard, Circle and Drive
to Plaza, Road, Street, and Way) and all their abbreviations (and
misspellings) commonly found in the wild.

As you noticed, just correcting the zip codes can make a big difference.
Of course nothing is perfect, but it can catch a lot of the dupes, and flag
many more as likely dupes, and it's not all *that* expensive.  I've used
TeleAtlas's EZ-Locate service http://geocode.com/ on and off over the years
for this sort of thing.  It also gives you back latitude and longitude
coordinates (in many cases, accurate enough to target a cruise missile to
the front porch) so it's useful for "Find Your Nearest..." and visual
googley push-pinning and thought-bubbling map applications, too!
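Once every record has been run through standardization, flagging candidate
dupes for your human reviewer is mostly a grouping problem.  A sketch (the
records and field names here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical records *after* postal address standardization
records = [
    {"id": 1, "name": "The Waffle Hut", "addr": "2514 RIDGECREST DR",
     "zip": "30907-4810"},
    {"id": 2, "name": "Waffle Hut",     "addr": "2514 RIDGECREST DR",
     "zip": "30907-4810"},
    {"id": 3, "name": "Acme Plumbing",  "addr": "1101 14TH ST NW STE 803",
     "zip": "20005-5601"},
]

def dedup_key(rec):
    """Standardized street line + ZIP; business name deliberately ignored
    so 'The Waffle Hut' and 'Waffle Hut' collide."""
    return (rec["addr"], rec["zip"])

groups = defaultdict(list)
for rec in records:
    groups[dedup_key(rec)].append(rec["id"])

# Any key shared by more than one record is a candidate for human review
candidates = [ids for ids in groups.values() if len(ids) > 1]
print(candidates)  # -> [[1, 2]]
```

Leaving the business name out of the key is what catches the
missing-"the" variations; the human reviewer makes the final call.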

TeleAtlas charges for each API hit.  You can sign up and get a free account
with 100 hits to test drive their web-form-based (and/or the downloadable
API-based) service.  After that you prepay your account with a credit card.
1000 hits is $50, and the per-hit price gets up to about 3 times cheaper
as you buy larger quantities.  For instance, you can get 31,000 geocodes
for $500, pre-process every address in your ~20K database, and have
credits for roughly 10K hits left over to test new addresses before they
are inserted into the database (to make sure they're not dupes, too).

There may be free options around these days, too!  I've heard that you can
correct and geocode a generous number of mailing addresses per day using
the Google Maps API now, for free dollars per address :-)

MapQuest had a free offering too for a while, but I don't know if that's
still around.  And there's a Perl module (powered by the year 2000 U.S.
Census data, IIRC) available from http://geocoder.us/ (and CPAN, of course)
but I don't know how accurate these anticapitalistic options could possibly
be, heheh.

Anyway, happy de-duplicating your db!

-dave





