[thelist] pre-project planning

Matt Warden mwarden at gmail.com
Thu Mar 20 12:27:50 CDT 2008


On 3/20/08, r937 <rudy at r937.com> wrote:
> > I'm about embark on a project that's going to require me
>  > to cycle through 5 to 10 text files, massage the data a bit,
>  > and then import them into a database.
>
>  that's it?  the project involves just loading data into a mysql database?
>
>  after massaging your data, create a CSV
>
>  use the LOAD DATA INFILE command
>
>  vwalah!
>
>  take the rest of the week off

Pssha! Typical database guy.

I'm currently on an 18-month ETL project, and "just load this data"
can hide a whole crapload of complexity.

For example, today we found out that about 75% of the records being
sent to us ended up with dollar amounts off by a factor of 10. We
looked into it, and it is due to the data being sent in IBM's signed
number format (the dollar amount is a payment, so why it is signed is
beyond me). Conversion to character results in a value like "1805{",
where the trailing "{" stands for a final digit of 0 with a positive
sign, so the value is intended to mean +$180.50. Our ETL tool attempts
to convert this to an integer, gets only as far as the 5 (so we have
"1805"), then divides by 100 and ends up with +$18.05.
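
To make that concrete, here is a rough Python sketch of how such a
field could be decoded. It is not what our ETL tool does. The function
name parse_overpunch and the scale parameter are made up for
illustration, and it assumes the usual EBCDIC-to-character mapping
where "{" and "A" through "I" mean a positive 0 through 9 and "}" and
"J" through "R" mean a negative 0 through 9.

```python
from decimal import Decimal

# Assumed mapping: the last character carries both the final digit and the sign.
POSITIVE = "{ABCDEFGHI"  # +0 .. +9
NEGATIVE = "}JKLMNOPQR"  # -0 .. -9


def parse_overpunch(field: str, scale: int = 2) -> Decimal:
    """Decode a sign-overpunched numeric field (hypothetical helper).

    scale is the number of implied decimal places (2 for cents).
    """
    field = field.strip()
    if not field:
        raise ValueError("empty field")

    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last.isdecimal():
        sign, digit = 1, int(last)  # plain unsigned digit
    else:
        raise ValueError(f"unexpected sign character: {last!r}")

    if body and not body.isdecimal():
        raise ValueError(f"non-numeric characters in: {field!r}")

    # Rebuild the full digit string, then shift by the implied decimal places.
    return Decimal(sign * int(body + str(digit))).scaleb(-scale)


def naive_parse(field: str, scale: int = 2) -> Decimal:
    """Mimic the bug: keep only the leading digits, drop the rest."""
    digits = ""
    for ch in field:
        if not ch.isdecimal():
            break
        digits += ch
    return Decimal(int(digits)).scaleb(-scale)
```

With these two, parse_overpunch("1805{") gives 180.50, while
naive_parse("1805{") gives 18.05, which is the factor-of-10 error we
saw. Raising on anything unexpected is deliberate: a bad field should
stop the load rather than be silently truncated.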

There is stuff like this on a daily basis. Granted, we are working
with much higher volumes, but the point is that you ought to be asking
a lot of questions about the format/quality of the data, because you
can end up blowing your time and cost estimate out of the water if you
don't.

-- 
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com


This email proudly and graciously contributes to entropy.


