## Using APIs in Projects


When getting data from APIs, I strongly suggest following a three-step workflow:

1. Write a program that gets data from an API and saves all of the data (if possible) to a file.
2. Write a second program (usually a second file) that loads the raw data saved in step 1, extracts the data that will be useful for analysis, and saves it in a flat file (typically a CSV).
3. Write a third program that loads the CSV file and does the analysis.

This approach has a few important benefits.

The first and most important is that it is often difficult to get the same raw data again. For example, some APIs only let you get the last week of data. If you are doing analysis a month down the road and decide that you really wish you had saved different metadata, it is too late. By saving as much of the raw data as possible, you can change your measures or analysis strategy in the future (or even do additional studies).

The second benefit is that this gives you a nice pipeline, with intermediate files. Instead of loading the entire raw data file in the code that does the analysis, you only have to load the CSV, which is often much smaller and easier to work with.
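
Here is a minimal sketch of the three steps, using only the standard library. The `fetch_from_api` function and its records are made up for illustration (they stand in for a real API call), and the file names are arbitrary.

```python
import csv
import json
import statistics
import tempfile
from pathlib import Path


def fetch_from_api():
    """Hypothetical stand-in for a real API call; returns raw records."""
    return [
        {"id": "a1", "author": "alice", "body": "Hello world", "score": 3},
        {"id": "b2", "author": "bob", "body": "Hi", "score": 1},
        {"id": "c3", "author": "alice", "body": "Boiler up!", "score": 7},
    ]


def retrieve(raw_path):
    """Step 1: save everything the API returned, untouched."""
    raw_path.write_text(json.dumps(fetch_from_api()), encoding="utf-8")


def clean(raw_path, csv_path):
    """Step 2: keep only the fields the analysis needs, as a flat CSV."""
    records = json.loads(raw_path.read_text(encoding="utf-8"))
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        out = csv.writer(f)
        out.writerow(["author", "comment_length"])
        for record in records:
            out.writerow([record["author"], len(record["body"])])


def analyze(csv_path):
    """Step 3: load the small CSV and compute a summary."""
    with open(csv_path, encoding="utf-8", newline="") as f:
        lengths = [int(row["comment_length"]) for row in csv.DictReader(f)]
    return statistics.median(lengths)


workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.json"
clean_file = workdir / "cleaned.csv"

retrieve(raw_file)
clean(raw_file, clean_file)
median_length = analyze(clean_file)
print(median_length)
```

Because each step only reads the file written by the previous one, you can rerun the cleaning or analysis steps as often as you like without calling the API again.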

This brief lesson will show an example of this workflow, using `PRAW`.

Note that I'm going to put everything in one file for convenience, but my typical workflow is to put these in separate files and then run each file separately.

## Program 1 - Data Retrieval

The goal of our project is to characterize the way that people participate in the Purdue subreddit. In particular, we want to create a histogram of the number of posts per person, the number of comments per person, the median comment length per person, and a scatterplot of the relationship between the number of comments and the median comment length.

In order to do this, all we really need is to get as many comments as we can from the Purdue subreddit, so that's what our first program will do.

In [None]:
import praw
import reddit_authentication
import csv
import pandas as pd
import os
import seaborn as sns

# Create an instance called reddit. We'll use this to call the API.
reddit = praw.Reddit(client_id=reddit_authentication.client_id,
                     client_secret=reddit_authentication.client_secret,
                     user_agent=reddit_authentication.user_agent,
                     username=reddit_authentication.username,
                     password=reddit_authentication.password)

There may be a better approach, but we're going to grab all of the posts (called submissions), and then get all of the comments for each post.

We're also going to save the data as we go, so that if we need to stop, we can pick up where we left off.

This is a little bit complicated, but we're going to save two files: one that is a list of all of the submissions we've successfully retrieved, and one that actually contains all of the comments. I'm doing this because sometimes the amount of data you have is so large that you don't want to keep it all in memory; you just want to write it out as quickly as possible.

Ideally, we want to keep the data as close to raw as possible; PRAW gives us an object, which isn't easy to save. So we'll have to select the attributes we want to keep, and save these in a CSV file. But, I'm going to save everything I might possibly want.
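
As a rough sketch of what "select the attributes we want to keep" looks like, the helper below flattens an object into a list of plain values that `csv.writer` can handle. The `to_row` helper and the `SimpleNamespace` objects are made up for illustration; they only mimic the handful of attributes used here, and real PRAW objects have many more.

```python
from types import SimpleNamespace

FIELDS = ["id", "body", "created_utc", "score"]


def to_row(item, fields=FIELDS):
    """Flatten an API-style object into a list of plain values."""
    # The author can be missing (None), so guard that lookup.
    author = getattr(item.author, "name", None)
    return [getattr(item, field, None) for field in fields] + [author]


# Stand-ins for API objects (assumed shapes, for illustration only).
with_author = SimpleNamespace(id="c1", body="Boiler up!", created_utc=1700000000.0,
                              score=5, author=SimpleNamespace(name="alice"))
no_author = SimpleNamespace(id="c2", body="[deleted]", created_utc=1700000100.0,
                            score=1, author=None)

rows = [to_row(with_author), to_row(no_author)]
print(rows)
```

Using `getattr` with a default does the same job as the `try`/`except AttributeError` pattern used in the cells below: a missing author becomes `None` instead of crashing the loop.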

Unfortunately, I learned that we can only get up to 1,000 submissions, so we'll get the top 1,000 over the last year.

In [None]:

# newline='' is what the csv module recommends; it avoids blank rows on Windows.
with open('./submissions.csv', 'w', encoding='utf-8', newline='') as f:
    out = csv.writer(f)
    out.writerow(['id', 'title', 'author', 'created_utc', 'comments_retrieved'])
    for submission in reddit.subreddit('Purdue').top(limit=None, time_filter = 'year'):
        try:
            name = submission.author.name
        except AttributeError:
            name = None
            print(submission)
        out.writerow([submission.id, submission.title, name, submission.created_utc, False])

Now, we can just load that submissions file, so we don't need to run that code again.

The cool thing about this code is that it's written so that you can stop it and start running it again. It will pick up where it left off.

Sometimes, you will be running code that runs for hours or days (or longer), and having checkpointing like this can be really important.
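
To show the checkpointing idea on its own, here is a small sketch that is independent of Reddit. The `run`, `load_done`, and `flaky_process` names are hypothetical, and the "work" is just appending to a list. The point is that progress is written to disk after every item, so a second run skips whatever already finished.

```python
import json
import tempfile
from pathlib import Path


def load_done(checkpoint_path):
    """Return the set of item ids already processed (empty on first run)."""
    if checkpoint_path.exists():
        return set(json.loads(checkpoint_path.read_text(encoding="utf-8")))
    return set()


def run(item_ids, process, checkpoint_path):
    """Process each unfinished item, recording progress after every item."""
    done = load_done(checkpoint_path)
    for item_id in item_ids:
        if item_id in done:
            continue  # finished in an earlier run
        process(item_id)
        done.add(item_id)
        checkpoint_path.write_text(json.dumps(sorted(done)), encoding="utf-8")


checkpoint = Path(tempfile.mkdtemp()) / "done.json"
processed = []
failed_once = set()


def flaky_process(item_id):
    """Hypothetical worker that fails the first time it sees item 'c'."""
    if item_id == "c" and item_id not in failed_once:
        failed_once.add(item_id)
        raise ConnectionError("simulated network error")
    processed.append(item_id)


ids = ["a", "b", "c", "d"]
try:
    run(ids, flaky_process, checkpoint)
except ConnectionError:
    pass  # the first run stops partway through

run(ids, flaky_process, checkpoint)  # the second run resumes at 'c'
print(processed)
```

Each item is processed exactly once even though the first run crashed, which is the same behavior the `comments_retrieved` column gives us below.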

Indeed, I received a network error while running this code, and it's likely that you will as well.
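
If you would rather not restart by hand after every hiccup, one option is to retry the failing call a few times with an increasing delay. This is only a sketch: `with_retries` and `flaky_fetch` are hypothetical, and the built-in `ConnectionError` is used purely for illustration, since the exception you actually need to catch depends on the library you are using.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on ConnectionError wait and retry, doubling the delay."""
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller see the error
            sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


def flaky_fetch():
    """Hypothetical request that fails twice and then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


delays = []
# Record the delays instead of actually sleeping, so the demo runs instantly.
result = with_retries(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Retrying and checkpointing complement each other: retries handle brief outages, and the checkpoint covers the cases where the program stops anyway.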

In [None]:
df = pd.read_csv('./submissions.csv')

# Check if the output file exists. If not, create it and write the header.

if not os.path.exists('./comments.csv'):
    with open('./comments.csv', 'w', encoding='utf-8', newline='') as f:
        out = csv.writer(f)
        out.writerow(['id',
                      'body',
                      'author', 
                      'created_utc', 
                      'parent_id', 
                      'submission_id', 
                      'tot_awards_received', 
                      'ups', 
                      'downs', 
                      'score'])

for submission_id in df.loc[df.comments_retrieved == False, 'id']:
    print(f'Retrieving comments for {submission_id}')
    submission = reddit.submission(id=submission_id)
    # This sets the limit to None, which means that it will retrieve all comments.
    submission.comments.replace_more(limit=None)
    # Because we only track whether a whole submission was retrieved, we collect all of its comments and write them in one batch.
    curr_comments = []
    for comment in submission.comments.list():
        try:
            name = comment.author.name
        except AttributeError:
            name = None
        curr_comments.append([comment.id, 
                        comment.body, 
                        name, 
                        comment.created_utc, 
                        comment.parent_id,
                        submission.id,
                        comment.total_awards_received,
                        comment.ups,
                        comment.downs,
                        comment.score
                        ])
    with open('./comments.csv', 'a', encoding='utf-8', newline='') as f:
        out = csv.writer(f)
        out.writerows(curr_comments)
    df.loc[df.id == submission_id, 'comments_retrieved'] = True
    df.to_csv('./submissions.csv', index=False)

## Program 2 - Data Cleaning

This program loads the saved raw data. Here, we grab what we want, create new measures, and save it to a new CSV.

We need to get posts per person, comments per person, and median comment length per person.

Pandas is really good at this, so we'll use it.

In [16]:
comments_df = pd.read_csv('./comments.csv')

comments_df['comment_length'] = comments_df.body.str.len()

commenter_stats = comments_df.groupby('author').agg(
    # Number of comments
    num_comments = ('id', 'count'),
    # Median comment length
    median_comment_length = ('comment_length', 'median'),
    # Median score
    median_score = ('score', 'median'),
).reset_index()

# Now, we need to grab the number of posts from the other CSV file, and merge the two together.

submissions_df = pd.read_csv('./submissions.csv')

submitter_stats = submissions_df.groupby('author').agg(
    num_posts = ('id', 'count')
).reset_index()

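# Now, we can merge the two together. A left merge keeps every commenter:
# num_posts is NaN for commenters who never posted, and authors who posted
# but never commented are dropped.
merged_df = pd.merge(commenter_stats, submitter_stats, on='author', how='left')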

In [None]:
merged_df.sort_values('num_posts', ascending=False)

In [18]:
# Save our cleaned data to a CSV file.

merged_df.to_csv('./cleaned_data.csv', index=False)

## Program 3 - Data Analysis

Here we use pandas to load the data and analyze it. This could include statistical tests. Here, I'm just visualizing the distribution of posts, comments, and comment length.

In [19]:
df = pd.read_csv('./cleaned_data.csv')

In [None]:
# Just make sure it looks OK.
df.sort_values('num_comments')

### Distribution of posts

In [None]:
sns.histplot(x='num_posts', data = df, binwidth=1);

### Distribution of comments

In [None]:
sns.histplot(x='num_comments', data = df, binwidth=4);

As expected, these are both super skewed, with most people only commenting or posting once, while a few commented a ton.

Let's see if it changes if we get rid of people who only commented once (maybe we have a principled reason to believe they are different than other users).

In [None]:
sns.histplot(df.loc[df.num_comments > 1, 'num_comments'], binwidth=4);

As I thought, this is a somewhat "scale-free" distribution, meaning wherever you zoom in, you see the same pattern. Try changing the `1` up above to any (small) number.

### Comment length and number of comments

In [None]:
sns.jointplot(y='num_comments', x='median_comment_length', data = df);

In [13]:
import numpy as np

In [None]:
# Both of these are heavily skewed, so let's log them
p = sns.jointplot(y=np.log(df.num_comments), x=np.log(df.median_comment_length), data = df, kind = 'reg')
p.set_axis_labels(xlabel= 'Median comment length (logged)', ylabel='Number of comments (logged)');

There does appear to be a correlation between the number of comments and the median comment length. This is interesting, and suggests that people who comment a lot tend to write longer comments.

For fun, let's also look at the relationship between the number of comments and the median score. This might help explain our findings: people whose comments tend to get more upvotes might be encouraged to comment more.

In [None]:
sns.histplot(x='median_score', data = df, binwidth=1);

In [None]:
# Create a logged median score (hard because it can be negative)

df['logged_median_score'] = np.sign(df.median_score) * np.log1p(np.abs(df.median_score))

p = sns.jointplot(y=np.log(df.num_comments), x='logged_median_score', data = df, kind = 'reg')
p.set_axis_labels(xlabel= 'Median score (logged)', ylabel='Number of comments (logged)');
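
To see what the signed log transform above is doing, here is a small standard-library sketch of the same formula, sign(x) * log(1 + |x|). The `signed_log1p` name and the sample scores are made up for illustration.

```python
import math


def signed_log1p(value):
    """Return sign(value) * log(1 + |value|), defined for any real value."""
    return math.copysign(math.log1p(abs(value)), value)


scores = [-10, -1, 0, 1, 10]
transformed = [signed_log1p(s) for s in scores]
print(transformed)
```

A plain log is undefined for zero and negative scores. This version maps zero to zero, keeps the sign, and preserves the ordering of the scores, while still compressing large values in both directions.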