[thelist] DB design question

Joshua Olson joshua at waetech.com
Fri Dec 6 17:44:02 CST 2002


Greetings.

I've recently done some interfacing with MLS (Multiple listing services) and
my experience has been that Agents tend to submit bad data all the time and
tend to forget what they've submitted.  Furthermore, they sometimes submit
the same property multiple times in different property categories, counties
(if it's on county boundaries), or even as different cities if the property
falls within an unincorporated city.

In the project we needed to identify duplicate listings and remove the
duplicates.  The kicker was that any of the deleted listings had to show up
when someone search for a particular listing number but only the undeleted
one should show up if they searched by criteria (size, price, number of
baths, etc).  The published data for any listing was supposed to be the
_newest_ data for any of the duplicate listings.  This added complexity
because sometimes the agent would originally submit 3 listings for a
property but would only submit the price changes for one of the set.  Talk
about a nightmare.

We ended up creating the concept of listing groups.  Each group had an
autonumber id field.  Each listing within a group was assumed to be the same
property, regardless of the individual data fields.  Before we published to
live, we simply deleted all but the "latest" listing for each group from the
set of all listings.  Then, we would push the cropped dataset to the live
table.  The live table drove the website.

The system started by identifying duplicates automatically based on street
address and agent.  After that process ran, all singleton listings go
through a manual process where they are either marked as correct (indeed
being single) or were merged into an existing group.  This system worked
fairly well.  And, since we recorded all manual merging decisions, the same
listing was never manually considered twice.  On the subsequent runs it
would automatically be assigned to the correct group.

I've also found that the amenities very rarely changed with listing
services.  The best bet is to try to reproduce their data on your server as
the nightly updates get incorporated.  This will most likely be a flat
table.  Then, extract those you wish to publish (maybe the non-rental,
non-pending sale, non-offmarket, etc) and run your processes on that subset
while leaving the original "mirror" of the listing services data intact.

Sorry if I went tangent for a while.

-joshua




More information about the thelist mailing list