Where do mathematical algorithms for Reddit's ranking, as an example, come from?

I'll tackle the first formula, for "hotness" of posts. Formulas like this come from requirements. The designers of Reddit have thought about what they want to achieve, and designed formulas accordingly.

I can't tell you exactly what requirements they had in mind, but I can look at the implementation and guess that they wanted a system along these lines:

1. Scores shouldn't need to be recomputed unless the number of votes changes. This reduces the number of changes to the database, and makes it easier to achieve consistency if data is replicated. (So any scoring system based on scores getting lower as the article ages will be no good.)
2. If two stories are equally old, the one with more upvotes should be higher. (So there needs to be a contribution from the votes.)
3. The more upvotes a story gets, the longer it should remain near the top of the ranking.
4. Old stories shouldn't stay at the top of the rankings for ever, even if they had lots of upvotes. Fairly soon (after a day or two), new stories need to outrank them. (So there needs to be a contribution from the date, and this must outweigh the score due to votes fairly soon, no matter how many votes something gets.)
5. Stories with more downvotes than upvotes should not appear in the rankings at all.

Now let's look at the formula:

log z + yt / 45000

and see how it satisfies these requirements.

1. If the number of votes does not change, then z, y and t are all unchanged. So the score is unchanged. This satisfies requirement (1).
2. If two stories have the same age, then they have the same value for t. But the one with more upvotes has a higher value of z, and since log is monotonic, it has a higher score. This satisfies requirement (2).
3. The more upvotes a story has, the higher its z, so the longer it will be until another story with higher t can outrank it. This satisfies requirement (3).
4. Logarithm is a function that grows more slowly as it gets larger (take a look at its graph). So a story needs more and more upvotes over time to keep up with newer stories. This satisfies requirement (4).
5. If the story has more downvotes than upvotes, then y = -1, so the yt / 45000 term is negative and, being far larger in magnitude than log z, makes the score negative. This satisfies requirement (5).

The constant 45,000 is a scale factor that brings the upvotes and the age into balance. There are 86,400 seconds in a day, so t gets larger by this amount each day. Dividing t by 45,000 means that one day's relative newness is worth about 83 votes (log to the base 10 of 83 is about 1.92), and two days' relative newness is worth roughly 6,900 votes.
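As a rough check on the reasoning above, here is a minimal Python sketch of the formula. The definitions of z, y and t are assumptions taken from the commonly published form of the formula, not from anything stated in this answer: x is upvotes minus downvotes, z = max(|x|, 1), y is the sign of x, and t is the submission time in seconds since some fixed reference date. The names hot and votes_equivalent are made up for illustration, and this is not claimed to be Reddit's actual code.

```python
import math

SCALE = 45000   # the scale factor from the formula, in seconds
DAY = 86400     # seconds in a day


def hot(ups, downs, t):
    """Score = log10(z) + y * t / 45000.

    Assumed definitions (hypothetical, see the note above):
      x = ups - downs
      z = max(|x|, 1)
      y = sign of x (1, 0 or -1)
      t = submission time in seconds since a fixed reference date
    """
    x = ups - downs
    z = max(abs(x), 1)
    y = (x > 0) - (x < 0)
    return math.log10(z) + y * t / SCALE


def votes_equivalent(seconds):
    """Factor by which an older story's net votes must exceed a newer
    story's to cancel an age difference of `seconds`."""
    return 10 ** (seconds / SCALE)


if __name__ == "__main__":
    t = 150_000_000  # arbitrary example timestamp
    print(hot(100, 0, t) > hot(10, 0, t))   # same age, more upvotes wins
    print(hot(1, 5, t) < 0)                 # net downvoted story scores below zero
    print(round(votes_equivalent(DAY)))     # one day of newness, in votes
```

With 86,400 seconds in a day, votes_equivalent(DAY) comes out at about 83, so under these assumptions a day-old story needs roughly 83 times the net votes of a fresh one to rank level with it.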

I really appreciate your detailed explanation of the algorithm. It has given me insight into the constraints they built their formula around and where each of the elements came from. Thank you so much!

– bakz Jul 3 at 23:35.

They don't come from anywhere. There is no absolute truth to them, nor anything to prove. It's simply a way to quantify an attribute in whatever way seemed most sensible to the development team.

You would use log when you want something to be a factor, but a less important one (large values do still grow, just very slowly). But by the same token, they could have chosen cube root. The formulae simply represent the factors that we can presume characterise something "hot", combined so that each is taken into account in an appropriate proportion (for example, we'll square those values that have huge importance, and take log of those which are less important).

Once they came up with the formula, they probably came up with 10 or 15 different types of posts, plugged the numbers in, and saw that the results made sense all round, so they stuck with it. In fact, their first few attempts probably didn't come out so well, and only after a little fiddling with the numbers did they arrive at that formula.
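To get a feel for how much the choice of damping function matters, here is a small Python sketch comparing log base 10 with the cube root mentioned above. The function name damped and the sample vote counts are invented for illustration.

```python
import math


def damped(votes):
    """Return (log10, cube root) of a positive vote count."""
    return math.log10(votes), votes ** (1 / 3)


if __name__ == "__main__":
    for votes in (10, 100, 1_000, 10_000, 100_000):
        log_part, cbrt_part = damped(votes)
        print(f"{votes:>7} votes: log10 = {log_part:.2f}, cube root = {cbrt_part:.2f}")
```

Both flatten out as votes grow, but over this range the cube root still climbs from about 2.15 to about 46.4, while log10 only goes from 1 to 5. So log gives large vote counts much less weight against whatever else is in the formula.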

Thanks so much for your detailed response! I am rereading your words to fully take in all the different details you have included. Thanks again!

– bakz Jul 3 at 23:36.

@bakz, a pleasure.

– davin Jul 3 at 23:43.
