[thelist] pre-project planning
Matt Warden
mwarden at gmail.com
Thu Mar 20 12:27:50 CDT 2008
On 3/20/08, r937 <rudy at r937.com> wrote:
> > I'm about to embark on a project that's going to require me
> > to cycle through 5 to 10 text files, massage the data a bit,
> > and then import them into a database.
>
> that's it? the project involves just loading data into a mysql database?
>
> after massaging your data, create a CSV
>
> use the LOAD DATA INFILE command
>
> vwalah!
>
> take the rest of the week off
Pssha! Typical database guy.
I'm currently on an 18-month ETL project, and "just load this data"
can hide a whole crapload of complexities.
For example, today we found out that about 75% of the records being
sent to us ended up with dollar amounts off by a factor of 10. We
looked into it, and it is due to the data being sent in IBM zoned
decimal format for a signed number (the dollar amount is a payment, so
why it is signed is beyond me). In that format the last character
carries both the final digit and the sign, so conversion to character
results in a value like "1805{", where "{" means "last digit 0,
positive". The value is +18050 with two implied decimal places, which
is intended to mean +$180.50. Our ETL tool attempts to convert this to
an integer, gets only as far as the 5 (so we have "1805"), then divides
by 100 and ends up with +$18.05.
There is stuff like this on a daily basis. Granted, we are working
with much higher volumes, but the point is that you ought to be asking
a lot of questions about the format/quality of the data, because you
can end up blowing your time and cost estimate out of the water if you
don't.
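
To make the "1805{" example above concrete, here is a minimal Python
sketch of decoding that kind of field. It assumes the common
EBCDIC-to-ASCII overpunch mapping ("{" and A-I for a positive last
digit 0-9, "}" and J-R for a negative one) and two implied decimal
places. The function name parse_overpunch and the scale parameter are
made up for illustration; this is not taken from any particular ETL
tool, and your source system's mapping should be confirmed before you
rely on it.

```python
from decimal import Decimal

# Assumed overpunch mapping: the last character encodes the final digit
# and the sign together. The position in each string is the digit value.
POSITIVE = "{ABCDEFGHI"   # +0 .. +9
NEGATIVE = "}JKLMNOPQR"   # -0 .. -9
DIGITS = "0123456789"


def parse_overpunch(field: str, scale: int = 2) -> Decimal:
    """Decode a signed zoned-decimal string such as '1805{'.

    `scale` is the number of implied decimal places (hypothetical
    parameter; 2 matches the dollar amounts in the example).
    """
    field = field.strip()
    if not field:
        raise ValueError("empty field")

    body, last = field[:-1], field[-1]
    if last in POSITIVE:
        sign, digit = 1, POSITIVE.index(last)
    elif last in NEGATIVE:
        sign, digit = -1, NEGATIVE.index(last)
    elif last in DIGITS:
        sign, digit = 1, int(last)   # plain unsigned field, no overpunch
    else:
        raise ValueError(f"unexpected sign character {last!r} in {field!r}")

    # Fail loudly on anything odd instead of silently truncating,
    # which is exactly the bug described above.
    if any(ch not in DIGITS for ch in body):
        raise ValueError(f"non-digit characters in {field!r}")

    value = sign * int(body + str(digit))
    return Decimal(value).scaleb(-scale)


print(parse_overpunch("1805{"))   # the payment from the example
print(parse_overpunch("1805"))    # what a truncating parser would see
```

The point of raising on unexpected characters rather than stopping at
the first non-digit is that a bad record gets rejected instead of
loading with a plausible-looking but wrong amount.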
--
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com
This email proudly and graciously contributes to entropy.