[thelist] Search closest match across multiple columns (mysql)
Ron.Luther at hp.com
Thu Apr 17 14:31:26 CDT 2008
Matt Warden calls me out for imprecision:
Rats! Busted. ;-)
You're absolutely right, Matt, 'sum of squares' is a statistical concept related to variation and variability and would be properly applied over an entire data set. Although the equation does look very similar, that IS an imprecise (Aww heck, it's just downright sloppy) term to apply to a single value. My bad!
[To save face, I will pretend I was merely getting ahead of myself and thinking ahead to the techniques you will need to use to evaluate the effectiveness of your modeling. You *were* planning to test the effectiveness of different sets of 'weights', right? After all, you need to update and re-evaluate those weights on a regular basis. That effort will give you a 'set' of predicted versus actual values to check variance on. Multiple runs? Running partial data sets to measure predictability? It's all good!]
>>If I understand your suggestion here, then I think you can
>>accomplish the same simply by taking the absolute value.
Maybe so. It might well be personal preference. I'm just not a huge fan of ABS based distance equations. I'm more comfortable with the square root. <shrug />
>>SS is a measure of variability within a data set. Taking the
>>square root is just going to affect the value along this curve:
>>So, yes, in a sense it is a distance.
Yup, in a geometric sense. ;-PPPP
(Halfway down the page.)
And it easily generalizes to three or more dimensions. [Of course, ABS also generalizes well...]
I think the reason I prefer this method is because the (2D version anyway) distance formula above is nothing more than the equation for a circle (of radius 'distance') about a fixed point ((x1, y1) for the purplemath example). Every point on that circle is equidistant from the center, so it has a nice intuitive feel of "distance" to it.
> * You're also going to want to limit results to errors/distances within a certain range. Very similar to the 'show me all stores within 5 miles' problem. If you have a database of 600,000 records you probably don't want to sort and print all of those records every time you make a query.
>>Unfortunately that is unavaoidable. The entire dataset will have to be sorted every time,
Ah! D'Oh! Yeah. True. I was mostly thinking about print/display.
... Although ... That *is* a very interesting point you bring up about potential pre-calculation point(s)! I'm thinking multiple (and not all necessarily sequential) in order to effect some kind of a 'stepwise' matching process ... Match a high-level profile [old man vs teenage girl] first, then cut data (is this where you use them fancy vector index thangs?) to just those records of primary interest before continuing with the processing of that entire subset of data. In a transnational setting you might want to run the local Egyptian perp data first before expanding processing to encompass _all_ of the UC Berkeley journalism students. <sheesh /> Kind of a neat idea - if needed because of some high processing cost.}
>>Agreed. Null data and non-response is always a problem. (This is why statisticians bayonet the wounded!)
>>Can you elaborate? Why could he not simply use 0 and 1 for matches on
>>race, gender, etc. and multiplying that 0 or 1 by a decided weight?
You can try that as a first approximation. I think the problem that you run into is that on a binary item like gender the weight pretty much needs to be infinite. So if you are looking for a thin 5'5" tall guy who flashed kids in a park, then a fat 6'4" guy is a closer match than a medium build 5'5" tall gal even though she is the right height and weight.
Unfortunately, once you set the binary variable weight to a very high value like "500000000" to eliminate all of the non-guy records you have pretty much destroyed the value of any of the other weights. Then the thin 5'6" guy in your database might come back with a 500000001 point value match and a fat 6'4" guy record might come back with a 500000005 point value match. You've lost discrimination power because you had to set that crazy high weight to eliminate non-matches on the binary variable.
This might be another argument for a step-wise approach. Filter by all known binary variables (or best known or most confidently known) then run a generalized weighting on the rest. I think that's probably the easiest approach to start off with.
I mention 'best known' or 'most confident' because if it were, say, a police reporting application then you might want a 'level of confidence' weighting for each input in order to better utilize a witness saying "They wore a hat so I don't know what color their hair was, but they were definitely thin and very very very tall!" So, essentially, you might want weights for the weights!
I will, however, agree that turning "Tall/Medium/Short" or "Heavy/Medium/Thin" or various shades of eye, skin, or hair color into numerical values for matching algorithm purposes might be a necessary evil to get your algorithm to work ... but it's also definitely an art form! And a place where a lot of testing and experimentation might be necessary to 'optimize' results.
I think the main reason I mentioned non-linear techniques was because I was thinking that econometric modeling type techniques might be applicable in a person-matching setting. Mostly because of the binary variables. Actually I was thinking about a model a friend of mine once built that did one of the best jobs of predicting monthly long distance telephone call volumes seen up to that time. That was the most successful use of binary variables that I've seen. She had a binary variable for "Mother's Day" and another binary variable(s) for "severe weather". Big storms (like holidays) cause people to call and check on friends and relatives. We had a very interesting discussion on how you differentiate between a month with one big storm and a month with two big storms or between a big storm and a ginormous-OMG-the-end-of-the-world-is-upon-us storm ... in a mathematical model kind of setting. Big Fun!
Now, intuitively, I'm thinking that some very similar math applies in people matching algorithms. Definitely some Bayesian techniques come into play when you have positive ID's on two people out of a group of twenty, right?
Yeah, this is MUCH neater than the stuff I am actually working on today!!
More information about the thelist