reddit discovery

NOTE: Checkout couple more similar projects: https://anvaka.github.io/sayit/ and https://anvaka.github.io/map-of-reddit/. Description below is slightly outdated and should be updated.

Your comments on reddit are not only what makes reddit fun. They can also be used to x-ray the friendly alien and reveal its hidden structure.

Redditors who commented to this subreddit also commented to...

This simple idea is the core of the current recommendation website. Despite the simplicity it yields amazing results.

Before you read any further, please go ahead and check for yourself:

It works really well for subreddits with under 1 million subscribers. But how exactly does it work?

How exactly does it work?

Recently /u/Stuck_In_the_Matrix publicly released reddit's ~1.7 billion comments dataset. Each record contains information about author's name and target subreddit.

If you post to subreddit A and C very often - it doesn't necessary mean that A and C are related. But if there are thousands of people posting to both A and C we could suspect that maybe subreddits are related.

Of course sometimes A is way more popular than C, and we need to take that into account. Let's consider three subreddits:

A - has 1,000 subscribers
B - also has 1,000 subscribers; and
C - has only 100 subscribers

Imagine A and B share 100 reddittors who posted to both A and B. Also imagine A and C share other 100 redditors who posted to both A and C.

Which subreddit is more related to A? Is it B or C?

Only 10% of B has posted to A. While 100% of C has posted to A. This means C has very high "relationship index" with A.

Turns out this "relationship index" has many names and forms. One of the simplest forms is called Jaccard index (similarity).

Jaccard similarity

To find how much subreddits A and B are similar with each other, all we need to do is:

Find how many subscribers who posted to A has also posted to B (intersection of A and B).
Find how many subscribers has posted to A or B (union of A and B).
Divide 1 by 2 and we'll get Jaccard similarity.

In the example above. Jaccard similarity of A and C is: J(A, C) = 100/1000 = 0.1, while Jaccard similarity of A and B is: J(A, B) = 100/(900 + 100 + 900) = 0.053

This makes C two times more similar to A than B.

Drawbacks

This approach works extremely well for subreddits with less than 1,000,000 subscribers. For more popular subreddits results are getting saturated by popularity of those subreddits. If you have an idea how to fix this please let me know :).

Technical details

Note: The details below outline my old procedure. I didn't use it to build the latest snapshot, which is based on 150 million unique comments. Still keeping it here for reference.

To compute similarity between subreddits I downloaded only one month worth of public comments. This gives more than 50,000,000 user ⇄ subreddit records. Which translates to almost 50,000 unique subreddits.

Each record is stored into redis database in these 50 lines of code. And then I'm using SINTERSTORE and SUNIONSTORE to compute intersection and union of subreddits (code).

This is the most straightforward brute-force approach to compute similarities. It took almost 70 CPU hours of my old MacBookPro friend to compare all subreddits with other.

What's next?

I truly hope you enjoyed the simplicity of the formula and the power of results. If you have any feedback please let me know!

license

MIT

Krokette / redsim