[thelist] Search closest match across multiple columns (mysql)

Thu Apr 17 09:44:29 CDT 2008

On Thu, Apr 17, 2008 at 10:08 AM, Luther, Ron <Ron.Luther at hp.com> wrote:
>  * If you're going to go 'sum of squares', then you might as well add the square root of
>  the entire calculation to transform it into a true "distance" measure, right?  The upside

If I understand your suggestion here, then I think you can accomplish
the same simply by taking the absolute value. If you are talking about
taking the sum of squares, then he needs to be careful in how he
interprets the resulting value.

5 - 8 = -3 --> 9
4 - 3 = +1 --> 1
6 - 1 = +5 --> 25

Total distance: abs(-3) + abs(1) + abs(5) = 9
Avg distance: 3
---
Sum of squares (SS): 35
Sq rt of SS: around 6

SS is a measure of variability within a data set. Taking the square
root is just going to affect the value along this curve:
http://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Square_root.png/400px-Square_root.png

So, yes, in a sense it is a distance. But it is really a measure of
how variable the entire data set is.

I guess what you are considering the data set is the "horizontal"
values in a given row? If so, that would work. But taking the sqrt of
the variability won't change anything. I guess I am just saying  be
careful how that value is interpreted. It is just a measure of
variability and doesn't mean much beyond that.

>  * You're also going to want to limit results to errors/distances within a certain range.  Very similar to the 'show me all stores within 5 miles' problem.  If you have a database of 600,000 records you probably don't want to sort and print all of those records every time you make a query.
>

Unfortunately that is unavaoidable. The entire dataset will have to be
sorted every time, because the sort criteria is dynamic (based on
input parameters that differ with each query). You can limit the
number of rows the database sends as a result of the query, but the
database is still going to have to sort the entire dataset.

You could perhaps venture down a path of designing a single quantity
like the sum (or some other operation) of height, weight, and dob (in
days), then index that column. The benefit is that you can search with
an index range scan, but the downside is that you lose information.
This may or may not work for the OP depending on what he's trying to
accomplish.

>  * You may want to also consider how you are going to handle partial information queries.  You might not want to use the same weighting if you only have 'height' and 'weight' - but no 'date of birth' info.
>

This is a great point, and a tough problem. If you consider NULL
values as zero distance from your search criteria, then rows with no
information would be perfect matches. If you consider NULL values to
be infinite distances (or some large distance), then perfect matches
in 2 of 3 criteria will be outweighed by fact that the 3rd criterion
is missing. The OP may want to consider setting some "missing
information penalty" in the calculation, but deciding that value is
non-trivial.

>  Heck, just adding a binary variable like gender cuts your record search in half! ... and avoids all kinds of potentially embarrassing false positive issues!  Ewwww!  ;-)
>
>  Unfortunately, properly incorporating those kinds of elements in your modeling takes you away from Pythagorean distance equations and into some more challenging landscapes that use non-linear and/or non-parametric techniques.
>

Can you elaborate? Why could he not simply use 0 and 1 for matches on
race, gender, etc. and multiplying that 0 or 1 by a decided weight?

Thanks to the OP for a very interesting topic!

-- 
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.