Recently I was looking at Reddit's algorithm for determining what makes a post a "hot" topic and which content is suitable for the Reddit homepage.
The article I was reading is here: http://amix.dk/blog/post/19588
I've noticed they use logarithms and have created some kind of mathematical function to determine the hotness/relevance of a post.
In the formulas used, where does each of the mathematical components come from, and how did they know to use them?
thank you!
-- Bakz
EDIT: just to clarify, I just graduated high school and apologize if the answer to this question seems pretty obvious. thanks again!
I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly. I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:
1. Scores shouldn't need to be recomputed unless the number of votes changes. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good.)
2. If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
3. The more upvotes a story gets, the longer it should remain near the top of the ranking.
4. Old stories shouldn't stay at the top of the rankings forever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
5. Stories with more downvotes than upvotes should not appear in the rankings at all.
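Now let's look at the formula, log z + yt / 45000 (with log taken to base 10), and see how it satisfies these requirements. In the linked article, t is the story's submission time in seconds past a fixed reference date, y is the sign of (upvotes − downvotes), and z is the absolute value of that difference, or 1 if the difference is zero.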
If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
Logarithm is a function that grows more slowly as it gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
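If the story has more downvotes than upvotes, then y = −1, so the time term becomes −t / 45000. Since t is large (it counts seconds from a fixed date in the past), this term outweighs log z and the score is negative. This satisfies requirement (5).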
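To make the checks above concrete, here is a minimal Python sketch of the formula. It is not Reddit's actual code: the function name `hot` and its parameters are my own, and the definitions of y and z are taken from the linked article as I understand them (y is the sign of upvotes minus downvotes, z is the absolute difference with a floor of 1).

```python
from math import log10


def hot(ups: int, downs: int, t: float) -> float:
    """Sketch of the score log10(z) + y * t / 45000.

    Assumptions (from the linked article, not verified against Reddit's code):
    x = ups - downs, y = sign of x, z = max(|x|, 1), and t is the
    submission time in seconds past some fixed reference date.
    """
    x = ups - downs
    y = (x > 0) - (x < 0)  # sign of x: 1, 0 or -1
    z = max(abs(x), 1)     # floor of 1 so the logarithm is always defined
    return log10(z) + y * t / 45000


# A hypothetical submission time, roughly three years past the reference date.
t0 = 100_000_000
same_age_more_votes_wins = hot(100, 0, t0) > hot(10, 0, t0)
newer_beats_older = hot(10, 0, t0 + 2 * 86_400) > hot(500, 0, t0)
downvoted_is_negative = hot(0, 5, t0) < 0
```

Note that nothing in `hot` depends on the current time, only on the submission time, which is why the score only needs recomputing when the votes change.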
The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t gets larger by this amount each day. Dividing 86,400 by 45,000 gives 1.92, which means that one day's relative newness is worth a factor of 10^1.92 ≈ 83 in votes, and two days' relative newness is worth a factor of roughly 7,000.
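The arithmetic behind those numbers is easy to check. This is just a sketch; the constant and function names are mine, chosen for illustration.

```python
SECONDS_PER_DAY = 86_400
SCALE = 45_000  # the constant from the formula


def vote_factor(days: float) -> float:
    """How many times more votes an older story needs to match a story
    that is `days` newer, given the log10(z) + y * t / SCALE formula."""
    return 10 ** (days * SECONDS_PER_DAY / SCALE)


one_day = vote_factor(1)   # about 83
two_days = vote_factor(2)  # about 6,900
```

So a story that is a day older has to have about 83 times the net upvotes to tie, and at two days the gap is already close to 7,000 times.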
They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in whatever way seemed most sensible to the development team.
You would use log when you want something to be a factor although a less important one (since large values indeed grow, although very slowly). But by the same token, they could have chosen cube root.
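To see how the two choices differ, here is a small comparison. The helper names are hypothetical, and this only illustrates the shapes of the two functions, not anything Reddit actually tested.

```python
from math import log10


def damp_log(votes: int) -> float:
    """Base-10 logarithm of the vote count."""
    return log10(votes)


def damp_cbrt(votes: int) -> float:
    """Cube root of the vote count."""
    return votes ** (1 / 3)


# Compare the two for small, medium and very large vote counts.
comparison = {v: (damp_log(v), damp_cbrt(v)) for v in (10, 1_000, 1_000_000)}
```

Both keep growing while flattening out, but the cube root flattens far less: at a million votes the log contributes about 6, while the cube root contributes about 100. So a cube root would let vote counts matter more relative to age.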
The formulae are simply a representation of the factors we can presume characterize something "hot", composed so that each is taken into account in an appropriate proportion (for example, we'd square the values that have huge importance, and take the log of those that matter less).
Once they came up with the formula, they probably came up with 10 or 15 different types of posts, plugged the numbers in, saw that the results made sense all round, and stuck with it. In fact, their first few attempts probably didn't come out so well, and after a little fiddling with the numbers they arrived at that formula.