Description

In this little project, we look at some basic information about how subreddits are founded, and how they grow.

All of the data comes from the public dumps of reddit data at https://bigquery.cloud.google.com. I queried these to create CSV files, which are analyzed here.

In [136]:
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
In [137]:
import pandas as pd 

new_subs_df = pd.read_csv('/home/jeremy/Projects/FoundingMotivation/code/new_subreddits_per_day.csv')
total_unique_df = pd.read_csv('/home/jeremy/Projects/FoundingMotivation/code/total_unique_posters.csv')
unique_per_day_df = pd.read_csv('/home/jeremy/Projects/FoundingMotivation/code/unique_posters_per_day.csv')
top_unique_df = pd.read_csv('/home/jeremy/Projects/FoundingMotivation/code/SR_posters_per_day.csv')
new_subs_df.created_date = pd.to_datetime(new_subs_df.created_date)
unique_per_day_df.post_date = pd.to_datetime(unique_per_day_df.post_date)
In [ ]:
# Just look at more recent data, and filter out some crazy days (almost certainly automated spam)
new_subs_df = new_subs_df[(new_subs_df.created_date >= '2014-01-01') & 
                          (new_subs_df.num_new_wikis < 4000)]

First, we plot the number of subreddits per day.

In [38]:
# Plotting recipe stolen from http://matplotlib.org/users/recipes.html
plt.close('all')
fig, ax = plt.subplots(1)
ax.plot(new_subs_df.created_date, new_subs_df.num_new_wikis)

# rotate and align the tick labels so they look better
fig.autofmt_xdate()

# use a more precise date string for the x axis locations in the
# toolbar
import matplotlib.dates as mdates
ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')
plt.title('Number of subreddits created each day');

With some exceptions, this holds at a fairly steady rate of ~600 new subreddits per day.

Growth measures

Here, we look at all of the subreddits created on a single day (my birthday) in 2014.

First, we look at how many total unique users had posted to each subreddit as of September 2015.
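The per-subreddit totals were computed upstream in BigQuery, but the equivalent pandas computation is a groupby with `nunique`. A sketch with illustrative raw post data (column names are assumptions, not the actual dump schema):

```python
import pandas as pd

# Illustrative raw post data: one row per post
posts = pd.DataFrame({
    'subreddit': ['a', 'a', 'a', 'b', 'b'],
    'author':    ['u1', 'u1', 'u2', 'u3', 'u3'],
})

# Total unique posters per subreddit, analogous to the TotalPosters column
total_posters = posts.groupby('subreddit').author.nunique()
print(total_posters)
```

Note that `nunique` counts distinct authors, so a user posting many times still counts once.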

In [41]:
plt.close('all')
plt.hist(total_unique_df.TotalPosters,bins=20);

And when we log it?
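`np.log1p` computes log(1 + x), which is convenient for count data: a subreddit with zero additional posters maps to exactly 0 rather than blowing up at log(0). A quick illustration:

```python
import numpy as np

counts = np.array([0, 1, 9, 99])
logged = np.log1p(counts)  # log(1 + x): 0 stays 0, no -inf for zero counts
print(logged)
```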

In [45]:
plt.close('all')
plt.hist(np.log1p(total_unique_df.TotalPosters),bins=20);

The distribution is heavily right-skewed, but there are plenty of subreddits that are at least marginally successful.

In [55]:
# Here is the number of unique posters at each decile
total_unique_df.TotalPosters.quantile(np.arange(0.0,1.0,.1))
Out[55]:
0.0     1
0.1     1
0.2     1
0.3     1
0.4     1
0.5     2
0.6     2
0.7     3
0.8     5
0.9    10
dtype: float64

The 80th percentile is 5, i.e., the top 20% of subreddits reach at least 5 unique posters, which isn't bad.
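Reading shares off the quantile table can also be done directly: the fraction of subreddits at or above a threshold is just the mean of a boolean mask. A toy example (the series here is illustrative, not the real `TotalPosters` data):

```python
import pandas as pd

# Toy stand-in for total_unique_df.TotalPosters
posters = pd.Series([1, 1, 1, 1, 2, 2, 3, 5, 10, 50])

# Fraction of subreddits with at least 5 unique posters
share_5_plus = (posters >= 5).mean()
print(share_5_plus)
```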

But how quickly does this happen? To answer that, we look at the number of unique posters per day for these subreddits.
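The per-day counts presumably come from grouping posts by subreddit and date. An equivalent pandas sketch, again with assumed column names standing in for the real dump schema:

```python
import pandas as pd

# Illustrative raw post data: one row per post
posts = pd.DataFrame({
    'subreddit': ['a', 'a', 'a', 'b'],
    'author':    ['u1', 'u2', 'u1', 'u3'],
    'post_date': pd.to_datetime(['2014-05-14', '2014-05-14',
                                 '2014-05-15', '2014-05-14']),
})

# Unique posters per subreddit per day, analogous to unique_per_day_df
per_day = (posts.groupby(['subreddit', 'post_date'])
                .author.nunique()
                .reset_index(name='TotalPosters'))
print(per_day)
```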

In [114]:
from matplotlib.pyplot import cm
import matplotlib.dates as mdates

# Make a plotting function, so we can easily change some parameters
def plot_subs(df, min_posters=5, max_posters=float('inf')):
    '''Takes a dataframe with columns ['subreddit', 'post_date', 'TotalPosters'], plus min/max
    poster counts. Plots the daily unique posters, by date, for every subreddit whose total
    poster count falls within [min_posters, max_posters]. Assumes that total_unique_df exists
    as a global variable.'''
    # Filter
    top_subs = total_unique_df[(total_unique_df['TotalPosters'] >= min_posters) &
                              (total_unique_df['TotalPosters'] <= max_posters)]['subreddit']
    df = df[df['subreddit'].isin(top_subs)]

    # Plot
    plt.close('all')
    fig, ax = plt.subplots(1)
    
    # Get a bunch of colors to use
    color=iter(cm.rainbow(np.linspace(0,1,len(top_subs))))
    for sr in top_subs:
        curr_sr = df[df.subreddit == sr].sort_values('post_date')
        c=next(color)
        ax.plot(curr_sr.post_date, curr_sr.TotalPosters,c=c)

    # rotate and align the tick labels so they look better
    fig.autofmt_xdate()

    # use a more precise date string for the x axis locations in the
    # toolbar
    ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')
    plt.title('Number of unique posters each day');
In [176]:
# What does the trajectory of the top subs of all time look like?

top_subs = top_unique_df.subreddit.unique()
# Only look at the beginning
df = top_unique_df[(top_unique_df.days_since_founding < 200) & (top_unique_df.unique_posters < 200)]

# Plot
plt.close('all')
fig, ax = plt.subplots(1)

# Get a bunch of colors to use
color=iter(cm.rainbow(np.linspace(0,1,len(top_subs))))
for sr in top_subs:
    curr_sr = df[df.subreddit == sr].sort_values('days_since_founding')
    c=next(color)
    ax.plot(curr_sr.days_since_founding, curr_sr.unique_posters,c=c)

# Plot the median
median_posters = df.groupby('days_since_founding')
med = median_posters.unique_posters.median()
ax.plot(med.index, med.values, 'ro')
# rotate and align the tick labels so they look better
fig.autofmt_xdate()

# use a more precise date string for the x axis locations in the
# toolbar
#ax.fmt_xdata = mdates.DateFormatter('%Y-%m-%d')
plt.title('Number of unique posters each day');
In [178]:
# Looking just at the median posters
plt.close('all')
med = median_posters.unique_posters.median()
plt.plot(med.index, med.values)
plt.title('Median unique posters per day')
Out[178]:
<matplotlib.text.Text at 0x7f7ef8491198>
In [121]:
# First, we look at the top performers
plot_subs(unique_per_day_df,5)
In [120]:
# How quickly do they look different from the worst performers?

plot_subs(unique_per_day_df, 0,4)
In [124]:
# Seems to be very quickly. What if we zoom in a bit more at the beginning?

plot_subs(unique_per_day_df[unique_per_day_df.post_date < '2015-06-01'],0,5)
In [128]:
plot_subs(unique_per_day_df[unique_per_day_df.post_date < '2015-06-01'],5)

It looks like most of the action for both groups happens at the beginning, with a few exceptions for the popular group. Is there any difference between them a month out?

In [134]:
plot_subs(unique_per_day_df[unique_per_day_df.post_date > '2015-05-14'],5)
In [132]:
plot_subs(unique_per_day_df[unique_per_day_df.post_date > '2015-05-14'],0,5)

There does seem to be some sort of sustained difference over time. Many of the popular ones continue to get at least some activity, and a few get lots of activity.