# Using APIs to Get Data From the Internet


**API** stands for Application Programming Interface.

An API is a set of instructions that describe how computers can interact with each other to request and receive information.

Some important questions that help us explore an API are listed below.

|Question | In technical terms |
|:---------|:--------------------|
|Where is my data? | What is the domain? |
|How do I learn what data is available?| Where is the documentation? |
|How do I request specific data?| How do I formulate a URL for a specific purpose? |
|How do I interpret the data?| What is the structure and format of the output?|



**Let's walk through an example in the browser**

PlaceCats!

In a browser, go to http://www.placecats.com

|In technical terms | PlaceCats |
|:---------|:--------------------|
|What is the domain? | http://www.placecats.com |
|Where is the documentation?| The documentation is on the home page. |
|How do I formulate a URL for a specific purpose? | Put the width and height in the URL, like http://www.placecats.com/width/height |
|What is the structure and format of the output?| It's an image! |

# Accessing PlaceCats in Python

We're going to use a library called <code>requests</code>, which lets Python make web requests.

In [58]:
from IPython.display import display, Image  # This line lets you display images. We'll use that in a bit.

# This line lets you use python to download data from the web.
import requests
import pandas as pd

In [59]:
# Get a 200 by 300 image from placecats.
r = requests.get('http://www.placecats.com/200/300')

In [None]:
# Look at the status code
r.status_code
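
The status code is a three-digit number the server sends back to say how the request went: 200 means success, and 404 means the page was not found. As a minimal sketch, the helper below turns a code into a readable phrase using Python's standard `http.HTTPStatus`. The name `describe_status` is made up for this example and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code followed by its standard reason phrase."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # HTTPStatus raises ValueError for codes it does not know about
        return f"{code} (unrecognized status code)"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

In the notebook, you could call `describe_status(r.status_code)` on any response.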

In [None]:
# print the content
r.content

In [None]:
# Use the Image function to display the image
display(Image(r.content))

### Exercise 1

Write a function that takes in the width and height and displays an image.

### Exercise 2

Can you write a loop to show several images?


In [None]:
# Write a loop that shows multiple images


# Example 2: Getting World Times

This example introduces a slightly more complicated API. It also introduces **JSON**, which is a very common data format.
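
To make JSON concrete, here is a minimal sketch using Python's built-in `json` module. The text in `sample` is hand-written for illustration and is not real output from any API. The point is only that JSON text maps onto Python dictionaries, lists, strings, and booleans.

```python
import json

# Hand-written sample JSON text (an assumption, not real API output)
sample = '{"name": "West Lafayette", "zones": ["America/Indiana/Indianapolis"], "observes_dst": true}'

# json.loads turns JSON text into Python objects
data = json.loads(sample)

print(type(data))            # <class 'dict'>
print(data["name"])          # West Lafayette
print(data["zones"][0])      # America/Indiana/Indianapolis
print(data["observes_dst"])  # True (JSON true becomes Python True)
```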

The API (including some documentation) is at http://worldtimeapi.org/

In [None]:
# Download list of time zones
r = requests.get("http://worldtimeapi.org/api/timezone")
print(r.content)

### Exercise 3

Use the `.json()` method to convert the response to a dictionary or list.

In [11]:
# Use the .json() function to get the response converted to a dictionary or list
# What did it return?


### Exercise 4

Get the time for your time zone

In [None]:
# Your code here


### Exercise 5

Get the time for your IP address

In [None]:
# Get the time for your IP address


# Example 3: Getting Wikipedia pages

Wikipedia also has an open API, and I want to use it to show one other tip for using the `requests` library: many APIs take a set of parameters, which you can pass as a dictionary using the `params` argument.

The documentation for the very extensive API is [here](https://www.mediawiki.org/wiki/API:Main_page). Many of the operations require you to authenticate (which we will cover next), but some things, like getting the content of a page, do not.

For example, the following code gets the titles of the most recently changed Wikipedia articles.

In [12]:
import requests

endpt = 'https://en.wikipedia.org/w/api.php'


def get_last_pages_changed(n):
    """Return the titles of the n most recently changed articles."""
    params = {'action': 'query',
              'format': 'json',
              'list': 'recentchanges',
              'rcnamespace': '0',  # namespace 0 is the main article namespace
              'rclimit': n}
    r = requests.get(endpt, params=params)
    # Uncomment either line to inspect the raw response
    #print(r.json())
    #print(r.json()['query']['recentchanges'])
    result = []
    content = r.json()['query']['recentchanges']
    for page in content:
        result.append(page['title'])
    return result

In [None]:
get_last_pages_changed(n = 20)
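
The `params` dictionary ends up in the URL as a query string after a `?`. As a rough sketch of that step, the code below builds a similar URL by hand with the standard library's `urlencode`. This is a stand-in for illustration; `requests` does its own encoding, and the details may differ for unusual characters.

```python
from urllib.parse import urlencode

endpt = 'https://en.wikipedia.org/w/api.php'
params = {'action': 'query',
          'format': 'json',
          'list': 'recentchanges',
          'rcnamespace': '0',
          'rclimit': 20}

# Join the endpoint and the encoded parameters with a "?"
url = endpt + '?' + urlencode(params)
print(url)
# https://en.wikipedia.org/w/api.php?action=query&format=json&list=recentchanges&rcnamespace=0&rclimit=20
```

Pasting a URL like this into a browser is a handy way to look at what an API returns before writing any code.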

## Exercise 6

Review the documentation (and Google) to see if you can figure out how to get a list of the last users who edited the most recently edited Wikipedia page.

The function below will get you partway there. It takes in an article name and gives you the most recent edits.

You should:
* Use the get_last_pages_changed function and extract the last page changed
* Use the get_edits function to get the last edits of that page
* Extract the user names from the edits and make a list of them

### Bonus challenge

If you are feeling really courageous, figure out how to get all of the edits/editors for a page, not just the last 500.


In [None]:
## Your code here

def get_edits(title):
    params = {'action': 'query',
              'prop': 'revisions',
              'titles': title,
              'format': 'json',
              'rvlimit': 500,
              'rvprop': 'user|timestamp'}
    r = requests.get(endpt, params=params)
    print(r.json())

get_edits('Purdue University')

# Example 4: Intro to Reddit API

## Setup
In order to use the Reddit API, you need to do two things:

1. Install [PRAW](https://praw.readthedocs.io/en/stable/) (the Python Reddit API Wrapper). This is a python library designed to make it easier to use the API (rather than using `requests` directly).

You can install PRAW in the terminal using `conda install -c conda-forge praw` or `pip install praw`

2. To use the Reddit API, you need to be authenticated, and so you need a Reddit account. You also need to create an app. [This page](https://wiki.communitydata.science/Intro_to_Programming_and_Data_Science_(Fall_2023)/Reddit_authentication_setup) explains how to get a developer account, create an app, and get the `client_id` and `client_secret`.

Once you have your client keys, you should create a file called `reddit_authentication.py` in the same directory as this file. It should contain the following (replace the fake strings below with the corresponding info from your Reddit app):

```
client_id = "_anb-dsxipuqf7jA9wzeMqZ"
client_secret = "4kXxiBOFdPY1HBw4843sgm6oiTYbWkFgz"
user_agent = "python:COM 674 class project:v1.0 (by /u/yourusername)"
username = "yourusername"
password = "yourpassword"
```

In general, it is a good practice to keep your keys (which should be secret) separate from your code, which you can share. In this case, we put them in a different file and then import them.
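
Another common way to keep secrets out of your code is to store them in environment variables. The sketch below is one possible approach, not the one this notebook uses. The function name `load_credentials` and the variable names (`REDDIT_CLIENT_ID` and so on) are hypothetical, so pick whatever names suit you.

```python
import os


def load_credentials(env=os.environ):
    """Collect Reddit credentials from a mapping of environment variables.

    The variable names (REDDIT_CLIENT_ID, ...) are hypothetical.
    """
    keys = ["client_id", "client_secret", "user_agent", "username", "password"]
    creds = {}
    for key in keys:
        var_name = "REDDIT_" + key.upper()
        if var_name not in env:
            raise KeyError(f"Missing environment variable: {var_name}")
        creds[key] = env[var_name]
    return creds


# Demonstrate with a plain dictionary standing in for the real environment
fake_env = {
    "REDDIT_CLIENT_ID": "fake-id",
    "REDDIT_CLIENT_SECRET": "fake-secret",
    "REDDIT_USER_AGENT": "python:example:v1.0 (by /u/yourusername)",
    "REDDIT_USERNAME": "yourusername",
    "REDDIT_PASSWORD": "yourpassword",
}
creds = load_credentials(fake_env)
print(sorted(creds))
# ['client_id', 'client_secret', 'password', 'user_agent', 'username']
```

The keys match the keyword arguments used with `praw.Reddit` in this notebook, so the result could be passed along as `praw.Reddit(**creds)`.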

## Using PRAW

When using PRAW, we need to authenticate. For more complicated APIs, like the Reddit API, it's important for the server to know who is making the request, so it knows what information to send back. Authentication means proving you are who you say you are.

In PRAW, we create a `Reddit` object, which handles authentication and other things. It's basically creating an authenticated session, similar to when you log into a website.

Here, we'll import the PRAW library and our authentication info from the `reddit_authentication.py` file. Note that there are some things you can do with the API without authenticating, but the rate limit is much higher if you are authenticated (100 queries per minute vs. 10 queries per minute).

In [2]:
import praw
import reddit_authentication
from prawcore.exceptions import NotFound

# Create an instance called reddit. We'll use this to call the API.
reddit = praw.Reddit(client_id=reddit_authentication.client_id,
                     client_secret=reddit_authentication.client_secret,
                     user_agent=reddit_authentication.user_agent,
                     username=reddit_authentication.username,
                     password=reddit_authentication.password)



You can run the following to make sure you are logged in. It should show your username.

In [4]:
reddit.user.me()

Redditor(name='jdfoote')

The Reddit API is powerful and complicated. We'll just do a few simple things here.

The [full documentation is here](https://praw.readthedocs.io/en/stable/) if you want to explore more.

For now, we'll show how to get the top subreddits on a topic, and how to explore the comments from a given subreddit.

First, let's find the top 10 Purdue-related subreddits.

In [61]:
# The top 10 Purdue-related subreddits, according to reddit's search
top_purdue_subs = [x for x in reddit.subreddits.search('Purdue')][:10]

In [62]:
for s in top_purdue_subs:
    print(f"Name: {s.display_name}\t\tSubscribers: {s.subscribers}")

Name: Purdue		Subscribers: 74834
Name: PurdueR4R		Subscribers: 178
Name: PurdueHousing		Subscribers: 2064
Name: CollegeBasketball		Subscribers: 2734620
Name: Boilermakers		Subscribers: 5653
Name: PurdueVent		Subscribers: 156
Name: PurdueGlobal		Subscribers: 1746
Name: CFB		Subscribers: 3918660
Name: ApplyingToCollege		Subscribers: 1116281
Name: purduefootball		Subscribers: 508


I used the `.display_name` and `.subscribers` attributes of the subreddits. To see what is part of an object (like these subreddit objects), you can use the `dir` function.

In [63]:
dir(top_purdue_subs[0])

['MESSAGE_PREFIX',
 'STR_FIELD',
 'VALID_TIME_FILTERS',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattr__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_convert_to_fancypants',
 '_create_or_update',
 '_fetch',
 '_fetch_data',
 '_fetch_info',
 '_fetched',
 '_kind',
 '_parse_xml_response',
 '_path',
 '_prepare',
 '_read_and_post_media',
 '_reddit',
 '_reset_attributes',
 '_safely_add_arguments',
 '_submission_class',
 '_submit_media',
 '_subreddit_collections_class',
 '_subreddit_list',
 '_upload_inline_media',
 '_upload_media',
 '_url_parts',
 '_validate_gallery',
 '_validate_inline_media',
 '_validate_time_filter',
 'accept_followers',
 'accounts_active',
 'accounts_active_is_fuzzed',
 'active_user_
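
The output of `dir` is long, and many of the names start with an underscore, which by convention marks them as internal. A minimal sketch for trimming the list is below. The helper `public_names` and the stand-in class `Demo` are made up for this example, so it runs without a Reddit connection.

```python
def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]


# A tiny stand-in class so the example runs without a Reddit connection
class Demo:
    display_name = 'Purdue'
    subscribers = 74834

    def _private_helper(self):
        pass


print(public_names(Demo()))  # ['display_name', 'subscribers']
```

The same helper could be called on a subreddit object, for example `public_names(top_purdue_subs[0])`.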

## Our small project

Let's say that our goal is to identify the redditors who have been most active recently on the Purdue subreddit, and to see what other subreddits they are active on.

In [64]:
commenters = {}

for comment in reddit.subreddit('Purdue').comments(limit=3000):
    if comment.author in commenters:
        commenters[comment.author] += 1
    else:
        commenters[comment.author] = 1
        

Note that when we look at our dictionary, we actually saved all of the Redditor objects for the authors. This makes it a little bit simpler to get information about those users later, but we could have also saved their usernames instead.

In [65]:
commenters

{Redditor(name='ploomyoctopus'): 4,
 Redditor(name='iamayoutuberiswear'): 2,
 Redditor(name='AlmondManttv'): 1,
 Redditor(name='sam246821'): 2,
 Redditor(name='boilerbitch'): 13,
 Redditor(name='HistorianSure8402'): 1,
 Redditor(name='Layne1665'): 4,
 Redditor(name='AngyDino404'): 1,
 Redditor(name='boiler_classes'): 11,
 Redditor(name='AutoModerator'): 19,
 Redditor(name='Westporter'): 6,
 Redditor(name='Arshonb'): 1,
 Redditor(name='softskep'): 1,
 Redditor(name='PunMatster'): 6,
 Redditor(name='EnByChic'): 2,
 Redditor(name='Loveandgloom'): 2,
 Redditor(name='krorkle'): 1,
 Redditor(name='HorizonsReptile'): 7,
 Redditor(name='SomeAppleGuy'): 1,
 Redditor(name='Johnnycarroll'): 2,
 Redditor(name='homelaunder'): 1,
 Redditor(name='hahnarama'): 1,
 Redditor(name='Rambo_8641'): 1,
 Redditor(name='Ambitious_Dot_3141'): 4,
 Redditor(name='CentralSega'): 5,
 Redditor(name='CaptPotter47'): 4,
 Redditor(name='General-Pryde-2019'): 16,
 Redditor(name='ThatOnePilotDude'): 4,
 Redditor(name='Fa
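
The "add one, or start at one" pattern in the loop above is common enough that Python's standard library has a shortcut for it, `collections.Counter`. Here is a minimal sketch with made-up author names standing in for the `comment.author` values.

```python
from collections import Counter

# Made-up author names standing in for comment.author values
authors = ['boilerbitch', 'Westporter', 'boilerbitch', 'AutoModerator', 'boilerbitch']

# Counter does the "add one, or start at one" bookkeeping for us
counts = Counter(authors)

print(counts['boilerbitch'])  # 3
print(counts['Westporter'])   # 1
print(counts['nobody'])       # 0 (missing keys count as zero)
```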

## Exercises 7 and 8

7. Improve my code above so that it only gets comments if they have a positive score.

8. See if you can figure out how to get the "comment karma" for each of the users in our dictionary, and print out the top 10 users by comment karma.

### Getting the top users by number of comments

Ok, so let's look at the top 100 users by the number of comments posted. We can do this a few ways. One way is to call the `sorted` function on the dictionary's items, which are (key, value) pairs. The `key` parameter tells `sorted` to compare the pairs by their value (the comment count), and the `reverse` parameter sorts in descending order.

In [66]:
# Sort commenters dictionary by value
sorted_commenters = sorted(commenters.items(), key=lambda x: x[1], reverse=True)

In [67]:
sorted_commenters[:10]

[(Redditor(name='AutoModerator'), 19),
 (Redditor(name='TyrannoJoris_Rex'), 18),
 (Redditor(name='niksjman'), 17),
 (Redditor(name='General-Pryde-2019'), 16),
 (Redditor(name='Mental-Cupcake9750'), 15),
 (Redditor(name='boilerbitch'), 13),
 (Redditor(name='ryanstartedthefire_'), 13),
 (Redditor(name='Its-Mike-Jones'), 13),
 (Redditor(name='boiler_classes'), 11),
 (Redditor(name='Bread1992'), 8)]
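
To see what each piece of that `sorted` call does, here is a small sketch on a made-up dictionary of counts. It is small enough to check by hand.

```python
# Made-up counts standing in for the commenters dictionary
counts = {'Westporter': 6, 'AutoModerator': 19, 'boilerbitch': 13}

# .items() yields (key, value) pairs
pairs = list(counts.items())
print(pairs)  # [('Westporter', 6), ('AutoModerator', 19), ('boilerbitch', 13)]

# key=lambda x: x[1] tells sorted to compare the second element (the count)
by_count = sorted(pairs, key=lambda x: x[1], reverse=True)
print(by_count)  # [('AutoModerator', 19), ('boilerbitch', 13), ('Westporter', 6)]

# Slicing keeps only the top entries
print(by_count[:2])  # [('AutoModerator', 19), ('boilerbitch', 13)]
```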

In [68]:
top_commenters = sorted_commenters[:100]

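I happen to know that AutoModerator is a bot, so let's remove it from our list.

In [69]:
# Keep everyone except AutoModerator (str() of a Redditor gives the username)
top_commenters = [c for c in top_commenters if str(c[0]) != 'AutoModerator']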

### Getting the top subreddits that our users are active on

In [70]:
x = top_commenters[1][0]

In [71]:
subreddits = {}

for commenter in top_commenters:
    user = commenter[0]
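    # PRAW fetches user details lazily, so reading an attribute triggers the request.
    # Accounts that no longer exist raise NotFound, and we skip them.
    try:
        karma = user.comment_karma
    except NotFound:
        continue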
    # Get the user's 100 most recent comments
    for comment in user.comments.new(limit=100):
        subreddit = comment.subreddit.display_name
        if subreddit == 'Purdue':
            continue
        if subreddit in subreddits:
            subreddits[subreddit] += 1
        else:
            subreddits[subreddit] = 1

### A note on rate limits

You may have noticed that the last cell took a long time to run. That's because we are making a lot of requests to the Reddit API, which has a rate limit of 100 requests per minute. If you go over the limit, Reddit sends you a message asking you to wait. PRAW handles this without our intervention, but it does mean that it's hard to tell how long things will take. One approach is to print out each commenter's name as you go, so you can tell how quickly the queries are running. If they are going too slowly, you may want to change the limits. For example, we got 100 comments per user in this code, but that may take multiple queries, so reducing that number could speed things up.

When using other libraries, including `requests`, you will often have to write code to handle rate limits yourself.
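
One common pattern for this is to retry with a growing delay, often called exponential backoff. The sketch below shows the idea under some assumptions: `RateLimited` and `call_with_backoff` are made-up names, and the fake request stands in for a real call. With `requests`, you would typically detect the problem by checking for status code 429 (Too Many Requests).

```python
import time


class RateLimited(Exception):
    """Hypothetical error meaning 'the server asked us to slow down'."""


def call_with_backoff(func, max_tries=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), waiting longer after each RateLimited error."""
    for attempt in range(max_tries):
        try:
            return func()
        except RateLimited:
            if attempt == max_tries - 1:
                raise  # out of tries, so let the caller see the error
            sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...


# Demonstration with a fake request that fails twice and then succeeds
attempts = {'count': 0}


def fake_request():
    attempts['count'] += 1
    if attempts['count'] <= 2:
        raise RateLimited()
    return 'ok'


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(fake_request, sleep=delays.append)
print(result, delays)  # ok [1.0, 2.0]
```

Passing `sleep` as a parameter is only there so the demonstration runs instantly. In real use you would leave it at the default, `time.sleep`.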

### Sorting the subreddits

Let's just do the same thing we did before, but this time sort the subreddits by the number of comments.

In [56]:
sorted_subreddits = sorted(subreddits.items(), key=lambda x: x[1], reverse=True)

In [None]:
sorted_subreddits[:20]

## Exercise 9 (Challenge Exercise)

Instead of storing the number of total comments per subreddit, store the number of our top_commenters who contribute to each subreddit. In other words, if User A comments on Subreddit A twice, my code counts that twice. Instead, I want to count that only once.

Hint: This is tricky. One approach would be to make a list of subreddits that each commenter has commented in, and then change that into a set. 
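
As a small illustration of the list-to-set step in the hint (and only that step), here is a sketch with a made-up list of subreddit names for a single user.

```python
# Made-up list of subreddits one user commented in (with repeats)
user_subreddits = ['CFB', 'CollegeBasketball', 'CFB', 'CFB']

# A set keeps only one copy of each value
unique_subreddits = set(user_subreddits)

print(len(user_subreddits))       # 4
print(len(unique_subreddits))     # 2
print(sorted(unique_subreddits))  # ['CFB', 'CollegeBasketball']
```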

## Additional Exercises

10. Get the last comments across all subreddits. Figure out which subreddits were most actively commented in.
11. Get the last comments across all subreddits. Figure out which users were most active.
12. Find the top 5 posts on the Purdue subreddit over the last year ([HINT](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html#praw.models.Subreddit.top)). Get all of the comments for each of those posts.