On Thu, Apr 17, 2008 at 10:08 AM, Luther, Ron <Ron.Luther at hp.com> wrote:
> * If you're going to go 'sum of squares', then you might as well add the
> square root of the entire calculation to transform it into a true
> "distance" measure, right? The upside

If I understand your suggestion here, then I think you can accomplish the
same thing simply by taking the absolute value. If you are talking about
taking the sum of squares, then he needs to be careful in how he interprets
the resulting value.

5 - 8 = -3 --> 9
4 - 3 = +1 --> 1
6 - 1 = +5 --> 25

Total distance: abs(-3) + abs(1) + abs(5) = 9
Avg distance: 3

---

Sum of squares (SS): 35
Sq rt of SS: around 6

SS is a measure of variability within a data set. Taking the square root is
just going to move the value along this curve:

http://upload.wikimedia.org/wikipedia/commons/thumb/7/72/Square_root.png/400px-Square_root.png

So, yes, in a sense it is a distance. But it is really a measure of how
variable the entire data set is. I guess what you are considering the data
set is the "horizontal" values in a given row? If so, that would work. But
taking the sqrt of the variability won't change anything. I guess I am just
saying to be careful about how that value is interpreted. It is just a
measure of variability and doesn't mean much beyond that.

> * You're also going to want to limit results to errors/distances within a
> certain range. Very similar to the 'show me all stores within 5 miles'
> problem. If you have a database of 600,000 records you probably don't want
> to sort and print all of those records every time you make a query.

Unfortunately, that is unavoidable. The entire dataset will have to be
sorted every time, because the sort criteria are dynamic (based on input
parameters that differ with each query). You can limit the number of rows
the database sends as a result of the query, but the database is still
going to have to sort the entire dataset.
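To make the two measures concrete, here is a small sketch in plain Python
using the three value pairs from the example above (a query of 5, 4, 6
against a record of 8, 3, 1 -- arbitrary numbers, just for illustration):

```python
import math

# Hypothetical query values vs. one record's values (from the example above)
query = [5, 4, 6]
record = [8, 3, 1]

diffs = [q - r for q, r in zip(query, record)]  # [-3, 1, 5]

# Sum of absolute differences ("city block" distance)
abs_distance = sum(abs(d) for d in diffs)       # 9

# Sum of squares, and its square root (Euclidean distance)
sum_sq = sum(d * d for d in diffs)              # 35
euclid = math.sqrt(sum_sq)                      # ~5.92, "around 6"
```

Both orderings agree that bigger differences mean worse matches; the
squared version just weights large single-field differences more heavily.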
You could perhaps venture down the path of designing a single quantity like
the sum (or some other operation) of height, weight, and dob (in days), and
then indexing that column. The benefit is that you can search with an index
range scan, but the downside is that you lose information. This may or may
not work for the OP, depending on what he is trying to accomplish.

> * You may want to also consider how you are going to handle partial
> information queries. You might not want to use the same weighting if you
> only have 'height' and 'weight' - but no 'date of birth' info.

This is a great point, and a tough problem. If you consider NULL values as
zero distance from your search criteria, then rows with no information
would be perfect matches. If you consider NULL values to be infinite
distances (or some large distance), then perfect matches on 2 of 3 criteria
will be outweighed by the fact that the 3rd criterion is missing. The OP
may want to consider setting some "missing information penalty" in the
calculation, but deciding on that value is non-trivial.

> Heck, just adding a binary variable like gender cuts your record search
> in half! ... and avoids all kinds of potentially embarrassing false
> positive issues! Ewwww! ;-)
>
> Unfortunately, properly incorporating those kinds of elements in your
> modeling takes you away from Pythagorean distance equations and into some
> more challenging landscapes that use non-linear and/or non-parametric
> techniques.

Can you elaborate? Why could he not simply use 0 and 1 for matches on race,
gender, etc. and multiply that 0 or 1 by a chosen weight?

Thanks to the OP for a very interesting topic!

--
Matt Warden
Cincinnati, OH, USA
http://mattwarden.com

This email proudly and graciously contributes to entropy.
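P.S. Here is a rough sketch of what I mean by the weighted 0/1 approach,
combined with the "missing information penalty" idea from earlier. The
field names, weights, and penalty value are all made-up examples, not
recommendations -- tuning them is exactly the non-trivial part:

```python
import math

# Illustrative weights and penalty -- purely hypothetical values
WEIGHTS = {"height": 1.0, "weight": 0.5, "gender": 10.0}
MISSING_PENALTY = 4.0  # distance charged when a criterion is NULL/None

def distance(query, record):
    """Weighted distance: numeric fields contribute their squared
    difference, categorical fields contribute 0 (match) or 1 (mismatch),
    and a NULL contributes a fixed penalty instead of counting as a
    perfect match."""
    total = 0.0
    for field, w in WEIGHTS.items():
        q, r = query.get(field), record.get(field)
        if q is None or r is None:
            total += w * MISSING_PENALTY
        elif isinstance(q, (int, float)):
            total += w * (q - r) ** 2
        else:
            total += w * (0 if q == r else 1)
    return math.sqrt(total)

# A record missing 'gender' is penalized rather than treated as a
# perfect match on that criterion.
q  = {"height": 70, "weight": 180, "gender": "M"}
r1 = {"height": 70, "weight": 180, "gender": "M"}
r2 = {"height": 70, "weight": 180, "gender": None}
```

Here distance(q, r1) is 0, while distance(q, r2) is sqrt(10 * 4), so the
record with the missing field sorts strictly worse than the full match.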