[thelist] Compare Content in Multidimensional Array [PHP/SQL]

Luther, Ron Ron.Luther at hp.com
Thu Sep 27 08:23:12 CDT 2007

Stef further explained what he was looking for:

>>the aim is to plot a line graph for a single country but for two or
three variables. 

>>One should rather normalize them to a base year to enhance

Hi Stef,

The additional information helps.  But it was pretty clear from the
original post that you were looking for a statistical comparison of
'growth rates' from normalized data.

I was going to take a stab at trying to answer the question because I
have played with some ugly SQL statements nesting multiple functions
before.  But after thinking about it a while, I changed my mind.  ;-)

If you keep going down this road, then find the min values in a PHP

- Footnote [1] mostly says we can't do the exact syntax for you since we
don't know what the missing values look like.  Once you find the minimum
year for observation 1 and the minimum year for observation 2 a simple
MAX(Obs_1_Yr, Obs_2_Yr) give you the minimum year where they both have

- Footnote [2] mostly says that (unless you are very comfortable with
(and MySQL supports) getting metadata about the number and names of
columns in a table or you get clever with some side tables or phony data
or algorithms to impute column names) doing this on the SQL side has the
potential to become a serious pain to maintain.  Why?  Because your data
table will be adding a column every year.  Additional records are easy
to deal with.  Additional columns generally cause additional work.

I really think the issue is with your table design.  

[Yeah, yeah, I know - just answer the damn question he asked!  Leave
that other stuff alone!  Sorry, no can do.]

>>The table design is like this for each variable:
>>     name             2001    2002   2003   2004   2005
>>Afghanistan    ....
>>Albania            ....

Heh.  That kind of design looks great in a spreadsheet.  Piece of cake
to graph in Excel.  Very very nice for SQL-Loading *from* someone's
spreadsheet too!  But I'll bet you a nickel that eventually you will
wish that your records were at the 'data point' level instead of the
'data series' level.  

Consider your data restructured like this:

Country, Attribute, Data_Yr, Data_Value
Albania, Gross National Product, 1972, $250000000
Peru, Number of Left Handed Astrophysicists, 2004, 22.63

[or better yet, have some 3NF folks pull the attribute names out into a
separate table and just use Attribute_ID numbers in the raw data. Maybe
the country names too.]

That makes each data row a single observation.  I think most people find
it MUCH easier to work with data at the observation level than the
series level.  It gets rid of that need to parse into the series to find
an observation.

This kind of design change makes the current question kind of simple.  

Select Min(Data_Year) 
>From table 
Where Country="Albania" 
and Attribute="Gross National Product";

... gives you the earliest year you have data for a given country /
attribute combination.  Taking the maximum of any two of these,
MAX(Obs_Yr_1, Obs_Yr_2), would give you the minimum observation year the
two data attributes have in common.

Normalization would be a separate routine (for each series).

Then you pull the data you want to analyze; 

Select Country, Attribute, Data_Yr, Data_Value 
>From Big_Honking_Data_Table
Where Country="Chile" 
And Data_Attribute in ("Population", "Fish_Catch")
And Data_Yr > That_Start_Year_We_Just_Calculated;

Now graph it.  Done.  QED.  Ooops, sorry, that just slipped out.   ;-)


[1] Okay, I *was* going to start by pointing out that you didn't tell us
how you were handling non-existant data points.  I think knowing whether
you are using 'nulls', 'blanks', 'zeroes', or trigger values (e.g.
'999999' means "does not exist") for the data points you don't have
makes a difference to the technique (or at least the syntax) used to
solve the problem.  It is a more important issue to consider when
dealing with statistical data.  Particularly the world-wide kind of
statistics you seem to be playing with because you will run into
situations where you have data from 1960 to 1987 and then 1990 to
present - but no data for 1988 or 1989 because of the civil war that was
raging at that time.

[2] One reason is that fancy nested SQL functions might work okay for
those 5 years in your illustration.  But it can get pretty unwieldly
when you try to go back and add another 25 years.  Another reason is
that, if you don't get tricky (like adding a row of constants you can
count - or a separate manually maintained table with a 'year count' in
it), you can end up needing to revise and maintain these functions every
year as you add yet another column to your data table. Ick!

Random junk:

-- I'm not sure of the current state of the art, but a long time ago a
lot of that kind of statistical work was done pretty much manually by
actually _looking_ at the data and then either hardcoding the value or
using some kind of "Enter Year to Begin Comparison" prompt and entering
what we knew to be the correct year.

-- Again, current practices may have changed, but a lot time ago the
stat folks used to calculate and save the normalized data ahead of time.
So in our programs you would have a choice to graph the 'Fish_Catch'
variable or the 'Normalized_Fish_Catch' variable.  Normalization was not
done 'on the fly'.  Heck, a lot of the time it was done by hand ... On
stone tablets ... In caves ... By fire light ... In the snow ...


More information about the thelist mailing list