{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas and Visualization\n",
"\n",
"pandas is a library that is indended to make data analysis easier.\n",
"\n",
"The central concept in pandas is a \"dataframe\". These dataframes are similar (in many respects) to Excel spreadsheets. They are typically 2-dimensional representations of data.\n",
"\n",
"The first thing that we need to do is to import pandas. By convention, we rename it to \"pd\".\n",
"\n",
"We're also importing another visualization library called seaborn. Again, by convention it's imported as \"sns\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting the data\n",
"\n",
"For now, we are assuming that you have data from somewhere else, in a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). For today's exercises, we're going to use data that I got from [BigQuery](https://console.cloud.google.com/bigquery), which is a tool produced by Google for querying huge datasets.\n",
"\n",
"The folks at [Pushshift](https://pushshift.io/) gather and publish a bunch of the data from reddit. I queried their data to get a random set of 100K comments from 2018. You can download that data [here](https://github.com/jdfoote/Intro-to-Programming-and-Data-Science/raw/master/resources/data/100k_random_reddit_2018.csv).\n",
"\n",
"First, we load it into a dataframe. Change the path to where you are storing your data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"reddit_df = pd.read_csv('/home/jeremy/Teaching/intro_to_programming_and_ds/resources/data/100k_random_reddit_2018.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"head\" method lets us look at the first few rows of the data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1514764841 | \n",
" Astronomy | \n",
" illuminatiisnowhere | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 1514764972 | \n",
" politics | \n",
" ExtracurricularSpork | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 1514765017 | \n",
" frugalmalefashion | \n",
" TheBarnhouseEffect | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 1514765217 | \n",
" KansasCityChiefs | \n",
" thenewguyreddit | \n",
" 48 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" 1514765360 | \n",
" Aeon | \n",
" ribnag | \n",
" 385 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"0 1514764841 Astronomy illuminatiisnowhere 50 3\n",
"1 1514764972 politics ExtracurricularSpork 31 10\n",
"2 1514765017 frugalmalefashion TheBarnhouseEffect 126 5\n",
"3 1514765217 KansasCityChiefs thenewguyreddit 48 5\n",
"4 1514765360 Aeon ribnag 385 1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this is organized a lot like a spreadsheet or a CSV file. Each row is an observation, and each column is a measure about that observation. In this case, a row represents a comment on reddit.\n",
"\n",
"We are going to be talking a lot more about how to manipulate and work with this data. This document is intended to introduce a few key ideas.\n",
"\n",
"## Selecting data\n",
"\n",
"The first is selecting data - getting a subset of the data.\n",
"\n",
"This is a really confusing thing about pandas and takes some getting used to. There is a detailed explanation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) but I'm going to try to give a gentler introduction here.\n",
"\n",
"The main thing that trips people up is remembering that dataframes have an \"index\". By default, this is a number that refers to each row in the dataframe. If you look at our dataframe, you will see that on the far left is an unnamed column of numbers starting at 0. This is the index.\n",
"\n",
"It is assigned in the order that the rows were in the CSV file. However, it can change if we reorder things.\n",
"\n",
"For example, this code will sort the dataframe by comment length. Notice how the index numbers are now out of order."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" 1522288057 | \n",
" LoveNikki | \n",
" queerestqueen | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" 1532344433 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" 1540314439 | \n",
" NewsBrasil | \n",
" newsbrasilbot | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" 1535240874 | \n",
" NFLstream4kHD | \n",
" Illustrious_Play | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" 1538405283 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"21207 1522288057 LoveNikki queerestqueen 10045 6\n",
"52329 1532344433 Music DJ_Spam 9726 1\n",
"78315 1540314439 NewsBrasil newsbrasilbot 9720 1\n",
"61600 1535240874 NFLstream4kHD Illustrious_Play 9577 1\n",
"71908 1538405283 Music DJ_Spam 9418 2"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length = reddit_df.sort_values('comment_length', ascending=False)\n",
"df_by_length.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this \"out of order\" version of the data fram to show the difference between the two main ways of selecting data: `iloc` and `loc`.\n",
"\n",
"`iloc` selects data like a list index, based on the location in the current dataframe. The syntax is just like selecting elements from two lists: the row list and the column list. If we use slice notation it looks like this:\n",
"\n",
"`dataframe_name.iloc[starting_row:ending_row, staring_column:ending_column]`\n",
"\n",
"You can also select a list of row numbers or column numbers. Below, I pass a list of columns to select."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" LoveNikki | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" Music | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" NewsBrasil | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" NFLstream4kHD | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" Music | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
" 33033 | \n",
" FortNiteBR | \n",
" 9345 | \n",
" 8 | \n",
"
\n",
" \n",
" 91701 | \n",
" conspiracy | \n",
" 9187 | \n",
" 1 | \n",
"
\n",
" \n",
" 13795 | \n",
" The_Mueller | \n",
" 9134 | \n",
" 18 | \n",
"
\n",
" \n",
" 98385 | \n",
" kol | \n",
" 9100 | \n",
" 5 | \n",
"
\n",
" \n",
" 44483 | \n",
" copypasta | \n",
" 8969 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" subreddit comment_length score\n",
"21207 LoveNikki 10045 6\n",
"52329 Music 9726 1\n",
"78315 NewsBrasil 9720 1\n",
"61600 NFLstream4kHD 9577 1\n",
"71908 Music 9418 2\n",
"33033 FortNiteBR 9345 8\n",
"91701 conspiracy 9187 1\n",
"13795 The_Mueller 9134 18\n",
"98385 kol 9100 5\n",
"44483 copypasta 8969 1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This code gets the first 10 rows and columns 1, 3, and 4\n",
"df_by_length.iloc[0:10, [1,3,4]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `loc` syntax, on the other hand, selects based on labels. It selects rows based on index labels and columns based on column names. When something is sorted, this can have unexpected consequences. \n",
"\n",
"For example, we might think that the code below would just select the first three rows, or maybe the three rows with indexes 0,1, and 2. Instead, it starts with the item with index `0`, and selects all of the rows from that row until the row with index `3`, as currently sorted! Notice at the bottom of the output, it says that the length is 1,494!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 50\n",
"4910 50\n",
"46574 50\n",
"46678 50\n",
"56224 50\n",
" ..\n",
"61453 48\n",
"62063 48\n",
"25059 48\n",
"955 48\n",
"3 48\n",
"Name: comment_length, Length: 1494, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[0:3, 'comment_length']"
]
},
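{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "To make the difference concrete, here is a minimal, self-contained sketch with a tiny toy dataframe (the column name and values are made up for illustration). After sorting, `iloc` still counts positions from the top of the current dataframe, while `loc` follows the index labels, which are now out of order."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# A toy dataframe to contrast iloc and loc (values are made up)\n",
   "toy_df = pd.DataFrame({'length': [5, 50, 20]})\n",
   "toy_sorted = toy_df.sort_values('length', ascending=False)\n",
   "\n",
   "# iloc[0] is the first row by position (length 50);\n",
   "# loc[0] is the row whose index label is 0 (length 5)\n",
   "toy_sorted.iloc[0], toy_sorted.loc[0]"
  ]
 },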
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boolean indexing\n",
"\n",
"Typically, if we want to get a certain number of rows, we use `.iloc`. However, if we want to filter data based on its value, then we typically use `.loc`\n",
"\n",
"For example, if we wanted to get the first 3 rows by their index, we'd need to do something like this, which uses \"boolean indexing\" to filter to only the rows where the index is less than or equal to 3."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 0 | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" comment_length score\n",
"2 126 5\n",
"0 50 3\n",
"1 31 10"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[df_by_length.index < 3,\"comment_length\":]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, if we wanted to just get the subreddit and the score whenever scores are above the median score, we could do this:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" LoveNikki | \n",
" 6 | \n",
"
\n",
" \n",
" 33033 | \n",
" FortNiteBR | \n",
" 8 | \n",
"
\n",
" \n",
" 13795 | \n",
" The_Mueller | \n",
" 18 | \n",
"
\n",
" \n",
" 98385 | \n",
" kol | \n",
" 5 | \n",
"
\n",
" \n",
" 95387 | \n",
" HPHogwartsMystery | \n",
" 5 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 87238 | \n",
" memes | \n",
" 3 | \n",
"
\n",
" \n",
" 65012 | \n",
" teenagers | \n",
" 6 | \n",
"
\n",
" \n",
" 54485 | \n",
" AskOuija | \n",
" 3 | \n",
"
\n",
" \n",
" 54527 | \n",
" AskOuija | \n",
" 3 | \n",
"
\n",
" \n",
" 84270 | \n",
" gaming | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
36258 rows × 2 columns
\n",
"
"
],
"text/plain": [
" subreddit score\n",
"21207 LoveNikki 6\n",
"33033 FortNiteBR 8\n",
"13795 The_Mueller 18\n",
"98385 kol 5\n",
"95387 HPHogwartsMystery 5\n",
"... ... ...\n",
"87238 memes 3\n",
"65012 teenagers 6\n",
"54485 AskOuija 3\n",
"54527 AskOuija 3\n",
"84270 gaming 5\n",
"\n",
"[36258 rows x 2 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[df_by_length.score > df_by_length.score.median(),['subreddit', 'score']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, it is possible to filter dataframes without using `iloc` or `loc`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One that's useful is dot notation - as long as your column names don't have spaces, you can do something like:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21207 6\n",
"52329 1\n",
"78315 1\n",
"61600 1\n",
"71908 2\n",
" ..\n",
"79848 2\n",
"54527 3\n",
"84270 5\n",
"99714 1\n",
"79494 1\n",
"Name: score, Length: 99910, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Just gives one column\n",
"df_by_length.score\n",
"\n",
"# This is equivalent, and can handle spaces\n",
"df_by_length['score']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" 1522288057 | \n",
" LoveNikki | \n",
" queerestqueen | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" 1532344433 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" 1540314439 | \n",
" NewsBrasil | \n",
" newsbrasilbot | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" 1535240874 | \n",
" NFLstream4kHD | \n",
" Illustrious_Play | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" 1538405283 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"21207 1522288057 LoveNikki queerestqueen 10045 6\n",
"52329 1532344433 Music DJ_Spam 9726 1\n",
"78315 1540314439 NewsBrasil newsbrasilbot 9720 1\n",
"61600 1535240874 NFLstream4kHD Illustrious_Play 9577 1\n",
"71908 1538405283 Music DJ_Spam 9418 2"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If you use slicing notation, this will give you rows (like iloc)\n",
"\n",
"df_by_length[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"\n",
"Sort the data frame by score and select the rows with the 10th to 20th highest scores.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"Now, how would you select just the `subreddit` and `comment_length` columns for the rows you selected above?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping and aggregating\n",
"\n",
"Pandas is also great for grouping and aggregating data.\n",
"\n",
"We're going to talk a lot more about analyzing data but here's a quick introduciton to some great built in functions like max, sum, and mean"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"comment_length 162.317035\n",
"score 8.736553\n",
"dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the mean score and comment length\n",
"\n",
"df_by_length.loc[:,\"comment_length\":\"score\"].mean()"
]
},
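{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "The other built-in aggregations are called the same way as `mean`. Here is a minimal, self-contained sketch on a toy dataframe (the values are made up) showing `max` and `sum`:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# max and sum are called just like mean (values are made up)\n",
   "toy = pd.DataFrame({'comment_length': [10, 20, 30], 'score': [1, 2, 3]})\n",
   "print(toy.max())\n",
   "print(toy.sum())"
  ]
 },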
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also might want to create subreddit-level or person-level measures. To do this, we'll want to \"group\" the data.\n",
"\n",
"The groupby function will create a \"grouped\" dataframe, where aggregations apply to the group rather than the whole dataframe. I think of this as temporarily sorting all of the rows into their own mini-spreadsheets based on whatever is in the `groupby` function.\n",
"\n",
"For example, this will tell use the average comment score by subreddit:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"subreddit\n",
"happyendings -56.0\n",
"u_washingtonpost -24.5\n",
"deepthroat -20.0\n",
"wwiipics -13.5\n",
"Bundesliga -13.0\n",
" ... \n",
"shittydarksouls 119.0\n",
"TalesFromTheCustomer 167.5\n",
"LearnUselessTalents 217.0\n",
"HobbyDrama 236.0\n",
"Dogberg 455.0\n",
"Name: score, Length: 10198, dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note that I'm using the original dataframe but either one would work\n",
"\n",
"# Groups by subreddit, then gets the score column\n",
"# Then gets the mean of that column, then sorts it by the value of the column\n",
"reddit_df.groupby('subreddit').score.median().sort_values() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another example, this will get the number of comments per person:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"author\n",
"---------_---------_ 1\n",
"dihahs1 1\n",
"digswonderlust 1\n",
"digiternity 1\n",
"digitalruse 1\n",
" ... \n",
"transcribersofreddit 41\n",
"transcribot 52\n",
"MemeInvestor_bot 76\n",
"AutoModerator 953\n",
"[deleted] 7913\n",
"Length: 84941, dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df.groupby('author').size().sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"\n",
"Find the average (mean) comment length for each subreddit and sort by comment length"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can talk more about this in class, but this creates a user-based dataframe with data for the number of subreddits, number of comments, and average score for each user."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1514764841 | \n",
" Astronomy | \n",
" illuminatiisnowhere | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 1514764972 | \n",
" politics | \n",
" ExtracurricularSpork | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 1514765017 | \n",
" frugalmalefashion | \n",
" TheBarnhouseEffect | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 1514765217 | \n",
" KansasCityChiefs | \n",
" thenewguyreddit | \n",
" 48 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" 1514765360 | \n",
" Aeon | \n",
" ribnag | \n",
" 385 | \n",
" 1 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 99905 | \n",
" 1546300240 | \n",
" Shamanism | \n",
" [deleted] | \n",
" 215 | \n",
" 3 | \n",
"
\n",
" \n",
" 99906 | \n",
" 1546300264 | \n",
" funny | \n",
" PCDuranet | \n",
" 40 | \n",
" 60 | \n",
"
\n",
" \n",
" 99907 | \n",
" 1546300377 | \n",
" feludasports | \n",
" ManyYoung | \n",
" 116 | \n",
" 1 | \n",
"
\n",
" \n",
" 99908 | \n",
" 1546300631 | \n",
" mydadwroteaporno | \n",
" Migrane | \n",
" 51 | \n",
" 24 | \n",
"
\n",
" \n",
" 99909 | \n",
" 1546300662 | \n",
" relationship_advice | \n",
" Xaveth- | \n",
" 86 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
99910 rows × 5 columns
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length \\\n",
"0 1514764841 Astronomy illuminatiisnowhere 50 \n",
"1 1514764972 politics ExtracurricularSpork 31 \n",
"2 1514765017 frugalmalefashion TheBarnhouseEffect 126 \n",
"3 1514765217 KansasCityChiefs thenewguyreddit 48 \n",
"4 1514765360 Aeon ribnag 385 \n",
"... ... ... ... ... \n",
"99905 1546300240 Shamanism [deleted] 215 \n",
"99906 1546300264 funny PCDuranet 40 \n",
"99907 1546300377 feludasports ManyYoung 116 \n",
"99908 1546300631 mydadwroteaporno Migrane 51 \n",
"99909 1546300662 relationship_advice Xaveth- 86 \n",
"\n",
" score \n",
"0 3 \n",
"1 10 \n",
"2 5 \n",
"3 5 \n",
"4 1 \n",
"... ... \n",
"99905 3 \n",
"99906 60 \n",
"99907 1 \n",
"99908 24 \n",
"99909 1 \n",
"\n",
"[99910 rows x 5 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"grouped_df = reddit_df.groupby('author')\n",
"\n",
"subreddit_count = grouped_df.subreddit.nunique()\n",
"comments_count = grouped_df.size()\n",
"score_mean = grouped_df.score.mean()\n",
"\n",
"\n",
"person_df = pd.DataFrame({'subreddit_count': subreddit_count,\n",
" 'comments_count': comments_count,\n",
" 'mean_score': score_mean})"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit_count | \n",
" comments_count | \n",
" mean_score | \n",
"
\n",
" \n",
" author | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ---------_---------_ | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" --------__--_--- | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" -----_------_--- | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" -----o____ | \n",
" 1 | \n",
" 1 | \n",
" 7.0 | \n",
"
\n",
" \n",
" ----Prometheus---- | \n",
" 1 | \n",
" 1 | \n",
" 2.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" subreddit_count comments_count mean_score\n",
"author \n",
"---------_---------_ 1 1 1.0\n",
"--------__--_--- 1 1 1.0\n",
"-----_------_--- 1 1 1.0\n",
"-----o____ 1 1 7.0\n",
"----Prometheus---- 1 1 2.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"person_df.head()"
]
},
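{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As an aside, pandas can also build this kind of summary in one step with `groupby(...).agg()` and \"named aggregation\". This is just a sketch of the same idea on a tiny made-up dataframe, not something we'll rely on later:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# The same person-level summary in one step (made-up data)\n",
   "toy = pd.DataFrame({'author': ['a', 'a', 'b'],\n",
   "                    'subreddit': ['x', 'y', 'x'],\n",
   "                    'score': [1, 3, 5]})\n",
   "toy.groupby('author').agg(\n",
   "    subreddit_count=('subreddit', 'nunique'),\n",
   "    comments_count=('subreddit', 'size'),\n",
   "    mean_score=('score', 'mean'))"
  ]
 },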
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizations\n",
"\n",
"Python has a powerful (and complicated) plotting library called matplotlib. If you want to do more complicated plots, I highly recommend learning how to use it. For now, I'm going to introduce the visualizations that are part of pandas, and later introduce some which are part of Seaborn. Both of these are based on matplotlib but make it easier and simpler to use.\n",
"\n",
"I'm introducing a few basic visualizations here. You should [read and reference this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) to learn about a bunch more options.\n",
"\n",
"### Scatterplots\n",
"\n",
"I think it would be interesting to know if there's a relationship between the length of comments and the score. This plot shows the scatterplot of that relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.plot.scatter(x = 'comment_length', y = 'score');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both of these measures are really skewed, so let's try plotting them on log scales instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Note that this is slightly different (improved) from the code in the video\n",
"reddit_df['comment_logged'] = np.log1p(reddit_df.comment_length)\n",
"# Scores can be negative, so we add the minimum (plus one) to make sure they are positive\n",
"# And then log transform\n",
"reddit_df['score_logged'] = np.log1p(reddit_df.score + abs(reddit_df.score.min()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.plot.scatter(x = 'comment_logged', y = 'score_logged');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not much of a relationship there.\n",
"\n",
"### Exercise 4\n",
"\n",
"Plot the number of comments and average score per person. Are frequent posters more likely to share comments that are well-received?\n",
"\n",
"*Hint*: You will want to use a dataframe that groups the data by person"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time Series\n",
"\n",
"The last thing I want to show is how to do time series (because they are confusing!)\n",
"\n",
"They key thing to remember is that the plots assume that time is the index of the data frame. By default, the index is just the row number of the row when it was first imported.\n",
"\n",
"So, the first thing we need to do is to convert the datetime column to the index.\n",
"\n",
"`reddit_df.index` is where the index is stored, so the following code takes the `created_utc` column--which is the [Unix time](https://en.wikipedia.org/wiki/Unix_time) that each comment was posted--and converts it to a \"DateTime\", which is a way of representing a date and time in Python. It then stores the datetime in the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.index = pd.to_datetime(reddit_df.created_utc, unit='s', utc=True)\n",
"reddit_df.head()"
]
},
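{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As a quick sanity check of what `to_datetime` is doing, here is the conversion applied to a single Unix timestamp (the first one in the sample data, which turns out to be just after midnight UTC on January 1, 2018):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# 1514764841 seconds after January 1, 1970 (the Unix epoch)\n",
   "pd.to_datetime(1514764841, unit='s', utc=True)"
  ]
 },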
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to make a time series, we have to combine the data. The `resample` function does this. Below, we combine it by hour to see daily trends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments = reddit_df.resample('H').size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments.plot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is too messy, so let's zoom in on just a few weeks, using boolean indexing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments.loc[hourly_comments.index < '2018-01-18'].plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 5\n",
"\n",
"Plot the number of comments per month over the year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, if we wanted to summarize by hour, we could group the posting time by hour. Below, we see that about 8:00 UTC (which is about 3:00 AM Eastern) corresponds to the lowest activity time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.groupby(reddit_df.index.hour).size().plot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 6\n",
"\n",
"See if you can figure out how to change the above plot so that instead of showing one overall line for comments per hour, it shows a line for each day of the week (e.g., a line for comments per hour on Mondays, another for Tuesdays, etc.).\n",
"\n",
"This code can help you to get started: it sums the number of comments by hour and day."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.groupby([reddit_df.index.hour, reddit_df.index.weekday]).size()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}