I have a PHP application that allows the user to specify a list of countries and a list of products. It tells them which retailer is the closest match. It does this using a formula similar to this:
(
(number of countries matched / number of countries selected) * (importance of country match)
+
(number of products matched / number of products selected) * (importance of product match)
)
*
(significance of both country and product matching * (coinciding matches / number of possible coinciding matches))
Where [importance of country match] is 30%, [importance of product match] is 10%, and [significance of both country and product matching] is 2.5.
So to simplify it: (country match + product match) * multiplier.
Think of it as [do they operate in that country? + do they sell that product?] * [do they sell that product in that country?]
This gives us a match percentage for each retailer which I use to rank the search results.
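To make that concrete, here's a minimal PHP sketch of the formula. The variable names and the example retailer (matching 2 of 3 countries, 1 of 2 products, and 1 of 2 possible coinciding matches) are purely illustrative:

<?php
$countryWeight     = 0.30; // importance of country match
$productWeight     = 0.10; // importance of product match
$coincidenceWeight = 2.5;  // significance of both country and product matching

// Hypothetical retailer: 2 of 3 countries, 1 of 2 products,
// 1 of 2 possible coinciding (country + product) matches.
$score = (
    (2 / 3) * $countryWeight
    +
    (1 / 2) * $productWeight
) * ($coincidenceWeight * (1 / 2));

echo $score; // 0.3125, i.e. a 31.25% match

A perfect match works out to (1 * 0.30 + 1 * 0.10) * (2.5 * 1) = 1.0, which is why the result can be read as a percentage.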
My data table looks something like this:
id | country | retailer_id | product_id
========================================
1  | FR      | 1           | 1
2  | FR      | 2           | 1
3  | FR      | 3           | 1
4  | FR      | 4           | 1
5  | FR      | 5           | 1
Until now it's been fairly simple, as each criterion is a binary decision: the retailer either operates in a country or they don't, and they either sell a product or they don't.
However, I've now been asked to add some complexity to the system. I've been given revenue data showing how much of that product each retailer sells in each country. The data table now looks something like this:
id | country | retailer_id | product_id | revenue
==================================================
1  | FR      | 1           | 1          | 1000
2  | FR      | 2           | 1          | 5000
3  | FR      | 3           | 1          | 10000
4  | FR      | 4           | 1          | 400000
5  | FR      | 5           | 1          | 9000000
My problem is that I don't want retailer 3, who sells ten times as much as retailer 1, to be ten times better as a search result. Similarly, retailer 5 shouldn't be nine thousand times better a match than retailer 1. I've looked into using the mean, the mode, and the median. I've tried using the deviation from the mean. I'm stumped as to how to make the big jumps less significant. My ignorance of the field of statistics is showing.
Help!
Consider using the log10() function. This reduces the direct scaling of results that you were describing. If you take log10() of the revenue, a retailer with a revenue 1000 times larger receives a score only 3 points higher, rather than 1000 times higher.
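As a quick PHP sketch, using the revenue figures from your sample table:

<?php
// Revenues keyed by retailer_id, from the sample table.
$revenues = [1 => 1000, 2 => 5000, 3 => 10000, 4 => 400000, 5 => 9000000];

foreach ($revenues as $retailerId => $revenue) {
    // log10() compresses the range: 1000 becomes 3.0, 9000000 becomes ~6.95.
    printf("Retailer %d: revenue %d, log score %.2f\n", $retailerId, $revenue, log10($revenue));
}

Retailer 5 now scores roughly 2.3 times retailer 1 (6.95 vs. 3.0) instead of nine thousand times.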
A classic way of "dampening" huge increases in value is the logarithm. If you look at the Wikipedia article on the logarithm, you see that the function value initially grows fairly quickly but then much less so. As mentioned in another answer, a logarithm with base 10 means that each time you multiply the input value by ten, the output value increases by one. Similarly, a logarithm with base two grows by one each time you multiply the input value by two.
If you want to weaken the effect of the logarithm, you could look into combining it with, say, a linear function, e.g. f(x) = log2 x + 0.0001 x
... but that multiplier there would need to be tuned very carefully so that the linear part doesn't quickly overshadow the logarithmic part.
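Here's a quick PHP sketch of that combined function, using the example coefficient above (which, again, would need tuning):

<?php
// Log-plus-linear dampening; 0.0001 is just the example coefficient.
function dampened(float $x): float {
    return log($x, 2) + 0.0001 * $x;
}

echo dampened(1000), "\n";    // ~10.07 -- dominated by the log part
echo dampened(9000000), "\n"; // ~923.10 -- dominated by the linear part

The second value shows the overshadowing effect: at large inputs the linear term takes over, so the coefficient has to be chosen with the realistic revenue range in mind.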
Coming up with this kind of weighting is inherently tricky, especially if you don't know exactly what the function should look like. However, there are programs that do curve fitting: you give them pairs of inputs and desired outputs plus a template function, and they find parameters that make the template approximate the desired curve. So, in theory, you could draw your curve and have a program figure out a good formula. That can be a bit tricky too, but I thought you might be interested. One such program is the open-source tool QtiPlot.