[thelist] pre-project planning

Matt Warden mwarden at gmail.com
Thu Mar 20 12:27:50 CDT 2008

On 3/20/08, r937 <rudy at r937.com> wrote:
> > I'm about to embark on a project that's going to require me
>  > to cycle through 5 to 10 text files, massage the data a bit,
>  > and then import them into a database.
>  that's it?  the project involves just loading data into a mysql database?
>  after massaging your data, create a CSV
>  use the LOAD DATA INFILE command
>  vwalah!
>  take the rest of the week off

Pssha! Typical database guy.

I'm currently on an 18 month ETL project, and "just load this data"
can hide a whole crapload of complexities.

For example, today, we found out that about 75% of records being sent
to us ended up with dollar amounts off by a factor of 10. We looked
into it, and it is due to data being sent in IBM number format for a
signed integer (the dollar amount is a payment, so why it is a signed
integer is beyond me), and conversion to character results in a value
like "1805{", which is intended to mean +$180.50. Our ETL tool
attempts to convert this to an integer, gets only to the 5 (so we have
"1805") then divides by 100 and ends up with +$18.05.

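To make the failure concrete, here is a minimal Python sketch of
decoding that kind of field. It assumes the common text rendering of
IBM zoned decimal (often called "overpunch"), where the last character
carries both the final digit and the sign: "{" and A-I for positive
0-9, "}" and J-R for negative 0-9. The function name decode_overpunch
and the scale parameter are made up for illustration; this is not what
our ETL tool does, and the real feed may use a different mapping.

```python
from decimal import Decimal

# Assumed mapping for the trailing sign character (hypothetical feed):
# '{' and 'A'-'I' -> positive, final digit 0-9
# '}' and 'J'-'R' -> negative, final digit 0-9
_POSITIVE = {ch: digit for digit, ch in enumerate("{ABCDEFGHI")}
_NEGATIVE = {ch: digit for digit, ch in enumerate("}JKLMNOPQR")}


def decode_overpunch(field: str, scale: int = 2) -> Decimal:
    """Decode a signed zoned-decimal text field into a Decimal.

    `scale` is the number of implied decimal places (2 for cents).
    """
    field = field.strip()
    if not field:
        raise ValueError("empty field")

    body, last = field[:-1], field[-1]
    if last in _POSITIVE:
        sign, digit = 1, _POSITIVE[last]
    elif last in _NEGATIVE:
        sign, digit = -1, _NEGATIVE[last]
    elif last.isascii() and last.isdigit():
        # Plain unsigned digit: treat as positive.
        sign, digit = 1, int(last)
    else:
        raise ValueError(f"unrecognized sign character: {last!r}")

    if body and not (body.isascii() and body.isdigit()):
        raise ValueError(f"non-digit characters in field: {field!r}")

    unscaled = int(body + str(digit))
    # Shift the decimal point left by `scale` places without using floats.
    return Decimal(sign * unscaled).scaleb(-scale)


def naive_parse(field: str, scale: int = 2) -> Decimal:
    """Mimic the buggy behavior: stop at the first non-digit character."""
    digits = ""
    for ch in field.strip():
        if not (ch.isascii() and ch.isdigit()):
            break
        digits += ch
    return Decimal(int(digits or "0")).scaleb(-scale)
```

With the example value, decode_overpunch("1805{") gives
Decimal('180.50'), while naive_parse("1805{") gives Decimal('18.05'),
which is the factor-of-10 error described above. Raising on anything
unrecognized, rather than silently truncating, is the design choice
that would have surfaced the problem on the first bad record.
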
There is stuff like this on a daily basis. Granted, we are working
with much higher volumes, but the point is that you ought to be asking
a lot of questions about the format/quality of the data, because you
can end up blowing your time and cost estimate out of the water if you

Matt Warden
Cincinnati, OH, USA

This email proudly and graciously contributes to entropy.
