[thelist] SQL: finding "new" records in a table

Tue Sep 4 12:43:34 CDT 2007

On 9/4/07, Judah McAuley <judah at wiredotter.com> wrote:
> > I would consider connecting directly to the main DB yourself every hour
> > and loop through an SQL result set, instead of the CSV.
>
> For a variety of fascinating reasons, this isn't really an option. It is
> a great suggestion, however.

Even if it were technically possible for you to do this, I would not
suggest it. It is much easier to handle exceptions (network outages,
failed records, etc.) if you bulk export, transfer file, and bulk
import.

> > I am no expert over here, but I would suggest that the only alternative
> > is of course to compare the records, that is the CSV record versus your
> > DB record, to see if an update is necessary or not.
>
> Agreed. My question then, is what is the most efficient way to do this?
> I could load the csv file into a temp table and then find all the rows
> that need inserts pretty easily via a join.
>
> What of the other rows though? What is the most efficient way to see if
> two rows are identical for 12 columns?

What you are discussing is actually a pretty common scenario. Here's
what I would suggest, which (not that this makes it legitimate) is how
this is usually handled:

1) bulk export all records where there is the potential for necessary
updates/inserts
2) transfer the export file by whatever means appropriate (secure FTP,
middleware, other)
3) bulk import all records from the file into a staging table (this is
probably *not* a temp table)
4) use a sproc or load tool to update changed records (update strategy)
5) use a sproc or load tool to add new records (insert strategy)

Note that there is a slight performance benefit by doing #4 before #5,
so might as well do it in that order.

Depending on the scale of what you're doing, you may consider turning
off indexes during the load. This will also force you to turn off
referential integrity during the load, which means you will need to
handle these checks yourself or use a feature of your load tool. After
the load is complete, you rebuild the indexes. This will cause the
load process to run much, much faster. You just need to be careful
about the lack of RI.

Hope this helps,

-- 
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.