[thelist] Search closest match across multiple columns (mysql)

Luther, Ron Ron.Luther at hp.com
Thu Apr 17 09:08:10 CDT 2008

Antonio Angelo suggested ...

... modifying the optimization to more of a 'sum of squares' error formula:

>>Programs that performs optimization, iterates to minimize an error function.
>>In my opinion, to find the closest set of data you should work in a similar
>>way, by defining your error function (e.g. by summing weighted squared
>>differences respect the searched values), the sorting by that error function.

Hi Antonio,

It certainly is an interesting question.  (Especially in today's environment of ID cards, face-recognition software trials, and transnational biometric database projects!)

I think I'd add on to your suggestion with a few other ideas to investigate.

* If you're going to go 'sum of squares', then you might as well add the square root of the entire calculation to transform it into a true "distance" measure, right?  The upside of that is that you should be able to logically extend the formula to encompass any number of additional quantitative measures ... like age, 40 yard dash time in tenths of a second, wingspan in tenths of an inch, vertical leap, etc. ... without running into any 'out of range' calculation issues.  [If you are going to take this further and 'normalize' results to a preset standard (for grading college seniors into potential sports draft pick rounds perhaps) then you will need division by zero error catch traps as well, right?]

* You're also going to want to limit results to errors/distances within a certain range.  Very similar to the 'show me all stores within 5 miles' problem.  If you have a database of 600,000 records you probably don't want to sort and print all of those records every time you make a query.

* You may want to also consider how you are going to handle partial information queries.  You might not want to use the same weighting if you only have 'height' and 'weight' - but no 'date of birth' info.

* However, the real problem here is that, eventually, you will probably want to add parameters of a qualitative rather than quantitative nature. E.g. gender, race, hair color, eye color, mustache (Y/N), fingerprints on file (Y/N), religious preference, prior convictions for file sharing (Y/N), hamstah owner (Y/N), number of days since last known to be in physical proximity to the event of interest (in ranges rather than values --> '0', '<10', '<30', etc.) ...

Heck, just adding a binary variable like gender cuts your record search in half! ... and avoids all kinds of potentially embarrassing false positive issues!  Ewwww!  ;-)

Unfortunately, properly incorporating those kinds of elements in your modeling takes you away from Pythagorean distance equations and into some more challenging landscapes that use non-linear and/or non-parametric techniques.

Good Luck!

More information about the thelist mailing list