{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Pandas and Visualization\n",
"\n",
"pandas is a library that is indended to make data analysis easier.\n",
"\n",
"The central concept in pandas is a \"dataframe\". These dataframes are similar (in many respects) to Excel spreadsheets. They are typically 2-dimensional representations of data.\n",
"\n",
"The first thing that we need to do is to import pandas. By convention, we rename it to \"pd\".\n",
"\n",
"We're also importing another visualization library called seaborn. Again, by convention it's imported as \"sns\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting the data\n",
"\n",
"For now, we are assuming that you have data from somewhere else, in a [CSV file](https://en.wikipedia.org/wiki/Comma-separated_values). For today's exercises, we're going to use data that I got from [BigQuery](https://console.cloud.google.com/bigquery), which is a tool produced by Google for querying huge datasets.\n",
"\n",
"The folks at [Pushshift](https://pushshift.io/) gather and publish a bunch of the data from reddit. I queried their data to get a random set of 100K comments from 2018. You can download that data [here](https://github.com/jdfoote/Intro-to-Programming-and-Data-Science/raw/master/resources/data/100k_random_reddit_2018.csv).\n",
"\n",
"First, we load it into a dataframe. Change the path to where you are storing your data."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"reddit_df = pd.read_csv('/home/jeremy/Teaching/intro_to_programming_and_ds/resources/data/100k_random_reddit_2018.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"head\" method lets us look at the first few rows of the data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1514764841 | \n",
" Astronomy | \n",
" illuminatiisnowhere | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 1514764972 | \n",
" politics | \n",
" ExtracurricularSpork | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 1514765017 | \n",
" frugalmalefashion | \n",
" TheBarnhouseEffect | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 1514765217 | \n",
" KansasCityChiefs | \n",
" thenewguyreddit | \n",
" 48 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" 1514765360 | \n",
" Aeon | \n",
" ribnag | \n",
" 385 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"0 1514764841 Astronomy illuminatiisnowhere 50 3\n",
"1 1514764972 politics ExtracurricularSpork 31 10\n",
"2 1514765017 frugalmalefashion TheBarnhouseEffect 126 5\n",
"3 1514765217 KansasCityChiefs thenewguyreddit 48 5\n",
"4 1514765360 Aeon ribnag 385 1"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, this is organized a lot like a spreadsheet or a CSV file. Each row is an observation, and each column is a measure about that observation. In this case, a row represents a comment on reddit.\n",
"\n",
"We are going to be talking a lot more about how to manipulate and work with this data. This document is intended to introduce a few key ideas.\n",
"\n",
"## Selecting data\n",
"\n",
"The first is selecting data - getting a subset of the data.\n",
"\n",
"This is a really confusing thing about pandas and takes some getting used to. There is a detailed explanation [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) but I'm going to try to give a gentler introduction here.\n",
"\n",
"The main thing that trips people up is remembering that dataframes have an \"index\". By default, this is a number that refers to each row in the dataframe. If you look at our dataframe, you will see that on the far left is an unnamed column of numbers starting at 0. This is the index.\n",
"\n",
"It is assigned in the order that the rows were in the CSV file. However, it can change if we reorder things.\n",
"\n",
"For example, this code will sort the dataframe by comment length. Notice how the index numbers are now out of order."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" 1522288057 | \n",
" LoveNikki | \n",
" queerestqueen | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" 1532344433 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" 1540314439 | \n",
" NewsBrasil | \n",
" newsbrasilbot | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" 1535240874 | \n",
" NFLstream4kHD | \n",
" Illustrious_Play | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" 1538405283 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"21207 1522288057 LoveNikki queerestqueen 10045 6\n",
"52329 1532344433 Music DJ_Spam 9726 1\n",
"78315 1540314439 NewsBrasil newsbrasilbot 9720 1\n",
"61600 1535240874 NFLstream4kHD Illustrious_Play 9577 1\n",
"71908 1538405283 Music DJ_Spam 9418 2"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length = reddit_df.sort_values('comment_length', ascending=False)\n",
"df_by_length.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use this \"out of order\" version of the data fram to show the difference between the two main ways of selecting data: `iloc` and `loc`.\n",
"\n",
"`iloc` selects data like a list index, based on the location in the current dataframe. The syntax is just like selecting elements from two lists: the row list and the column list. If we use slice notation it looks like this:\n",
"\n",
"`dataframe_name.iloc[starting_row:ending_row, staring_column:ending_column]`\n",
"\n",
"You can also select a list of row numbers or column numbers. Below, I pass a list of columns to select."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" LoveNikki | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" Music | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" NewsBrasil | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" NFLstream4kHD | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" Music | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
" 33033 | \n",
" FortNiteBR | \n",
" 9345 | \n",
" 8 | \n",
"
\n",
" \n",
" 91701 | \n",
" conspiracy | \n",
" 9187 | \n",
" 1 | \n",
"
\n",
" \n",
" 13795 | \n",
" The_Mueller | \n",
" 9134 | \n",
" 18 | \n",
"
\n",
" \n",
" 98385 | \n",
" kol | \n",
" 9100 | \n",
" 5 | \n",
"
\n",
" \n",
" 44483 | \n",
" copypasta | \n",
" 8969 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" subreddit comment_length score\n",
"21207 LoveNikki 10045 6\n",
"52329 Music 9726 1\n",
"78315 NewsBrasil 9720 1\n",
"61600 NFLstream4kHD 9577 1\n",
"71908 Music 9418 2\n",
"33033 FortNiteBR 9345 8\n",
"91701 conspiracy 9187 1\n",
"13795 The_Mueller 9134 18\n",
"98385 kol 9100 5\n",
"44483 copypasta 8969 1"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This code gets the first 10 rows and columns 1, 3, and 4\n",
"df_by_length.iloc[0:10, [1,3,4]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `loc` syntax, on the other hand, selects based on labels. It selects rows based on index labels and columns based on column names. When something is sorted, this can have unexpected consequences. \n",
"\n",
"For example, we might think that the code below would just select the first three rows, or maybe the three rows with indexes 0,1, and 2. Instead, it starts with the item with index `0`, and selects all of the rows from that row until the row with index `3`, as currently sorted! Notice at the bottom of the output, it says that the length is 1,494!"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 50\n",
"4910 50\n",
"46574 50\n",
"46678 50\n",
"56224 50\n",
" ..\n",
"61453 48\n",
"62063 48\n",
"25059 48\n",
"955 48\n",
"3 48\n",
"Name: comment_length, Length: 1494, dtype: int64"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[0:3, 'comment_length']"
]
},
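{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "To make the difference concrete, here is a minimal, self-contained sketch with a tiny toy dataframe (the column name and values are made up for illustration). After sorting, `iloc` still counts positions from the top of the current dataframe, while `loc` follows the index labels, which are now out of order."
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# A toy dataframe to contrast iloc and loc (values are made up)\n",
   "toy_df = pd.DataFrame({'length': [5, 50, 20]})\n",
   "toy_sorted = toy_df.sort_values('length', ascending=False)\n",
   "\n",
   "# iloc[0] is the first row by position (length 50);\n",
   "# loc[0] is the row whose index label is 0 (length 5)\n",
   "toy_sorted.iloc[0], toy_sorted.loc[0]"
  ]
 },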
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boolean indexing\n",
"\n",
"Typically, if we want to get a certain number of rows, we use `.iloc`. However, if we want to filter data based on its value, then we typically use `.loc`\n",
"\n",
"For example, if we wanted to get the first 3 rows by their index, we'd need to do something like this, which uses \"boolean indexing\" to filter to only the rows where the index is less than or equal to 3."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 0 | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" comment_length score\n",
"2 126 5\n",
"0 50 3\n",
"1 31 10"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[df_by_length.index < 3,\"comment_length\":]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, if we wanted to just get the subreddit and the score whenever scores are above the median score, we could do this:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" LoveNikki | \n",
" 6 | \n",
"
\n",
" \n",
" 33033 | \n",
" FortNiteBR | \n",
" 8 | \n",
"
\n",
" \n",
" 13795 | \n",
" The_Mueller | \n",
" 18 | \n",
"
\n",
" \n",
" 98385 | \n",
" kol | \n",
" 5 | \n",
"
\n",
" \n",
" 95387 | \n",
" HPHogwartsMystery | \n",
" 5 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 87238 | \n",
" memes | \n",
" 3 | \n",
"
\n",
" \n",
" 65012 | \n",
" teenagers | \n",
" 6 | \n",
"
\n",
" \n",
" 54485 | \n",
" AskOuija | \n",
" 3 | \n",
"
\n",
" \n",
" 54527 | \n",
" AskOuija | \n",
" 3 | \n",
"
\n",
" \n",
" 84270 | \n",
" gaming | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
36258 rows × 2 columns
\n",
"
"
],
"text/plain": [
" subreddit score\n",
"21207 LoveNikki 6\n",
"33033 FortNiteBR 8\n",
"13795 The_Mueller 18\n",
"98385 kol 5\n",
"95387 HPHogwartsMystery 5\n",
"... ... ...\n",
"87238 memes 3\n",
"65012 teenagers 6\n",
"54485 AskOuija 3\n",
"54527 AskOuija 3\n",
"84270 gaming 5\n",
"\n",
"[36258 rows x 2 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_by_length.loc[df_by_length.score > df_by_length.score.median(),['subreddit', 'score']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, it is possible to filter dataframes without using `iloc` or `loc`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One that's useful is dot notation - as long as your column names don't have spaces, you can do something like:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"21207 6\n",
"52329 1\n",
"78315 1\n",
"61600 1\n",
"71908 2\n",
" ..\n",
"79848 2\n",
"54527 3\n",
"84270 5\n",
"99714 1\n",
"79494 1\n",
"Name: score, Length: 99910, dtype: int64"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Just gives one column\n",
"df_by_length.score\n",
"\n",
"# This is equivalent, and can handle spaces\n",
"df_by_length['score']"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 21207 | \n",
" 1522288057 | \n",
" LoveNikki | \n",
" queerestqueen | \n",
" 10045 | \n",
" 6 | \n",
"
\n",
" \n",
" 52329 | \n",
" 1532344433 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9726 | \n",
" 1 | \n",
"
\n",
" \n",
" 78315 | \n",
" 1540314439 | \n",
" NewsBrasil | \n",
" newsbrasilbot | \n",
" 9720 | \n",
" 1 | \n",
"
\n",
" \n",
" 61600 | \n",
" 1535240874 | \n",
" NFLstream4kHD | \n",
" Illustrious_Play | \n",
" 9577 | \n",
" 1 | \n",
"
\n",
" \n",
" 71908 | \n",
" 1538405283 | \n",
" Music | \n",
" DJ_Spam | \n",
" 9418 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length score\n",
"21207 1522288057 LoveNikki queerestqueen 10045 6\n",
"52329 1532344433 Music DJ_Spam 9726 1\n",
"78315 1540314439 NewsBrasil newsbrasilbot 9720 1\n",
"61600 1535240874 NFLstream4kHD Illustrious_Play 9577 1\n",
"71908 1538405283 Music DJ_Spam 9418 2"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# If you use slicing notation, this will give you rows (like iloc)\n",
"\n",
"df_by_length[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 1\n",
"\n",
"Sort the data frame by score and select the rows with the 10th to 20th highest scores.\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2\n",
"\n",
"Now, how would you select just the `subreddit` and `comment_length` columns for the rows you selected above?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grouping and aggregating\n",
"\n",
"Pandas is also great for grouping and aggregating data.\n",
"\n",
"We're going to talk a lot more about analyzing data but here's a quick introduciton to some great built in functions like max, sum, and mean"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"comment_length 162.317035\n",
"score 8.736553\n",
"dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the mean score and comment length\n",
"\n",
"df_by_length.loc[:,\"comment_length\":\"score\"].mean()"
]
},
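{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "The other built-in aggregations are called the same way as `mean`. Here is a minimal, self-contained sketch on a toy dataframe (the values are made up) showing `max` and `sum`:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# max and sum are called just like mean (values are made up)\n",
   "toy = pd.DataFrame({'comment_length': [10, 20, 30], 'score': [1, 2, 3]})\n",
   "print(toy.max())\n",
   "print(toy.sum())"
  ]
 },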
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We also might want to create subreddit-level or person-level measures. To do this, we'll want to \"group\" the data.\n",
"\n",
"The groupby function will create a \"grouped\" dataframe, where aggregations apply to the group rather than the whole dataframe. I think of this as temporarily sorting all of the rows into their own mini-spreadsheets based on whatever is in the `groupby` function.\n",
"\n",
"For example, this will tell use the average comment score by subreddit:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"subreddit\n",
"happyendings -56.0\n",
"u_washingtonpost -24.5\n",
"deepthroat -20.0\n",
"wwiipics -13.5\n",
"Bundesliga -13.0\n",
" ... \n",
"shittydarksouls 119.0\n",
"TalesFromTheCustomer 167.5\n",
"LearnUselessTalents 217.0\n",
"HobbyDrama 236.0\n",
"Dogberg 455.0\n",
"Name: score, Length: 10198, dtype: float64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note that I'm using the original dataframe but either one would work\n",
"\n",
"# Groups by subreddit, then gets the score column\n",
"# Then gets the mean of that column, then sorts it by the value of the column\n",
"reddit_df.groupby('subreddit').score.median().sort_values() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As another example, this will get the number of comments per person:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"author\n",
"---------_---------_ 1\n",
"dihahs1 1\n",
"digswonderlust 1\n",
"digiternity 1\n",
"digitalruse 1\n",
" ... \n",
"transcribersofreddit 41\n",
"transcribot 52\n",
"MemeInvestor_bot 76\n",
"AutoModerator 953\n",
"[deleted] 7913\n",
"Length: 84941, dtype: int64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df.groupby('author').size().sort_values()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 3\n",
"\n",
"Find the average (mean) comment length for each subreddit and sort by comment length"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can talk more about this in class, but this creates a user-based dataframe with data for the number of subreddits, number of comments, and average score for each user."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" created_utc | \n",
" subreddit | \n",
" author | \n",
" comment_length | \n",
" score | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1514764841 | \n",
" Astronomy | \n",
" illuminatiisnowhere | \n",
" 50 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 1514764972 | \n",
" politics | \n",
" ExtracurricularSpork | \n",
" 31 | \n",
" 10 | \n",
"
\n",
" \n",
" 2 | \n",
" 1514765017 | \n",
" frugalmalefashion | \n",
" TheBarnhouseEffect | \n",
" 126 | \n",
" 5 | \n",
"
\n",
" \n",
" 3 | \n",
" 1514765217 | \n",
" KansasCityChiefs | \n",
" thenewguyreddit | \n",
" 48 | \n",
" 5 | \n",
"
\n",
" \n",
" 4 | \n",
" 1514765360 | \n",
" Aeon | \n",
" ribnag | \n",
" 385 | \n",
" 1 | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 99905 | \n",
" 1546300240 | \n",
" Shamanism | \n",
" [deleted] | \n",
" 215 | \n",
" 3 | \n",
"
\n",
" \n",
" 99906 | \n",
" 1546300264 | \n",
" funny | \n",
" PCDuranet | \n",
" 40 | \n",
" 60 | \n",
"
\n",
" \n",
" 99907 | \n",
" 1546300377 | \n",
" feludasports | \n",
" ManyYoung | \n",
" 116 | \n",
" 1 | \n",
"
\n",
" \n",
" 99908 | \n",
" 1546300631 | \n",
" mydadwroteaporno | \n",
" Migrane | \n",
" 51 | \n",
" 24 | \n",
"
\n",
" \n",
" 99909 | \n",
" 1546300662 | \n",
" relationship_advice | \n",
" Xaveth- | \n",
" 86 | \n",
" 1 | \n",
"
\n",
" \n",
"
\n",
"
99910 rows × 5 columns
\n",
"
"
],
"text/plain": [
" created_utc subreddit author comment_length \\\n",
"0 1514764841 Astronomy illuminatiisnowhere 50 \n",
"1 1514764972 politics ExtracurricularSpork 31 \n",
"2 1514765017 frugalmalefashion TheBarnhouseEffect 126 \n",
"3 1514765217 KansasCityChiefs thenewguyreddit 48 \n",
"4 1514765360 Aeon ribnag 385 \n",
"... ... ... ... ... \n",
"99905 1546300240 Shamanism [deleted] 215 \n",
"99906 1546300264 funny PCDuranet 40 \n",
"99907 1546300377 feludasports ManyYoung 116 \n",
"99908 1546300631 mydadwroteaporno Migrane 51 \n",
"99909 1546300662 relationship_advice Xaveth- 86 \n",
"\n",
" score \n",
"0 3 \n",
"1 10 \n",
"2 5 \n",
"3 5 \n",
"4 1 \n",
"... ... \n",
"99905 3 \n",
"99906 60 \n",
"99907 1 \n",
"99908 24 \n",
"99909 1 \n",
"\n",
"[99910 rows x 5 columns]"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reddit_df"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"grouped_df = reddit_df.groupby('author')\n",
"\n",
"subreddit_count = grouped_df.subreddit.nunique()\n",
"comments_count = grouped_df.size()\n",
"score_mean = grouped_df.score.mean()\n",
"\n",
"\n",
"person_df = pd.DataFrame({'subreddit_count': subreddit_count,\n",
" 'comments_count': comments_count,\n",
" 'mean_score': score_mean})"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subreddit_count | \n",
" comments_count | \n",
" mean_score | \n",
"
\n",
" \n",
" author | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" ---------_---------_ | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" --------__--_--- | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" -----_------_--- | \n",
" 1 | \n",
" 1 | \n",
" 1.0 | \n",
"
\n",
" \n",
" -----o____ | \n",
" 1 | \n",
" 1 | \n",
" 7.0 | \n",
"
\n",
" \n",
" ----Prometheus---- | \n",
" 1 | \n",
" 1 | \n",
" 2.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" subreddit_count comments_count mean_score\n",
"author \n",
"---------_---------_ 1 1 1.0\n",
"--------__--_--- 1 1 1.0\n",
"-----_------_--- 1 1 1.0\n",
"-----o____ 1 1 7.0\n",
"----Prometheus---- 1 1 2.0"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"person_df.head()"
]
},
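{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As an aside, pandas can also build this kind of summary in one step with `groupby(...).agg()` and \"named aggregation\". This is just a sketch of the same idea on a tiny made-up dataframe, not something we'll rely on later:"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# The same person-level summary in one step (made-up data)\n",
   "toy = pd.DataFrame({'author': ['a', 'a', 'b'],\n",
   "                    'subreddit': ['x', 'y', 'x'],\n",
   "                    'score': [1, 3, 5]})\n",
   "toy.groupby('author').agg(\n",
   "    subreddit_count=('subreddit', 'nunique'),\n",
   "    comments_count=('subreddit', 'size'),\n",
   "    mean_score=('score', 'mean'))"
  ]
 },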
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Visualizations\n",
"\n",
"Python has a powerful (and complicated) plotting library called matplotlib. If you want to do more complicated plots, I highly recommend learning how to use it. For now, I'm going to introduce the visualizations that are part of pandas, and later introduce some which are part of Seaborn. Both of these are based on matplotlib but make it easier and simpler to use.\n",
"\n",
"I'm introducing a few basic visualizations here. You should [read and reference this page](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) to learn about a bunch more options.\n",
"\n",
"### Scatterplots\n",
"\n",
"I think it would be interesting to know if there's a relationship between the length of comments and the score. This plot shows the scatterplot of that relationship."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.plot.scatter(x = 'comment_length', y = 'score');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both of these measures are really skewed, so let's try plotting them on log scales instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Note that this is slightly different (improved) from the code in the video\n",
"reddit_df['comment_logged'] = np.log1p(reddit_df.comment_length)\n",
"# Scores can be negative, so we add the minimum (plus one) to make sure they are positive\n",
"# And then log transform\n",
"reddit_df['score_logged'] = np.log1p(reddit_df.score + abs(reddit_df.score.min()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.plot.scatter(x = 'comment_logged', y = 'score_logged');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Not much of a relationship there.\n",
"\n",
"### Exercise 4\n",
"\n",
"Plot the number of comments and average score per person. Are frequent posters more likely to share comments that are well-received?\n",
"\n",
"*Hint*: You will want to use a dataframe that groups the data by person"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"## Your code here\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Time Series\n",
"\n",
"The last thing I want to show is how to do time series (because they are confusing!)\n",
"\n",
"They key thing to remember is that the plots assume that time is the index of the data frame. By default, the index is just the row number of the row when it was first imported.\n",
"\n",
"So, the first thing we need to do is to convert the datetime column to the index.\n",
"\n",
"`reddit_df.index` is where the index is stored, so the following code takes the `created_utc` column--which is the [Unix time](https://en.wikipedia.org/wiki/Unix_time) that each comment was posted--and converts it to a \"DateTime\", which is a way of representing a date and time in Python. It then stores the datetime in the index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.index = pd.to_datetime(reddit_df.created_utc, unit='s', utc=True)\n",
"reddit_df.head()"
]
},
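{
  "cell_type": "markdown",
  "metadata": {},
  "source": [
   "As a quick sanity check of what `to_datetime` is doing, here is the conversion applied to a single Unix timestamp (the first one in the sample data, which turns out to be just after midnight UTC on January 1, 2018):"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
  "metadata": {},
  "outputs": [],
  "source": [
   "# 1514764841 seconds after January 1, 1970 (the Unix epoch)\n",
   "pd.to_datetime(1514764841, unit='s', utc=True)"
  ]
 },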
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to make a time series, we have to combine the data. The `resample` function does this. Below, we combine it by hour to see daily trends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments = reddit_df.resample('H').size()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments.plot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is too messy, so let's zoom in on just a few weeks, using boolean indexing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"hourly_comments.loc[hourly_comments.index < '2018-01-18'].plot()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 5\n",
"\n",
"Plot the number of comments per month over the year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Your code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, if we wanted to summarize by hour, we could group the posting time by hour. Below, we see that about 8:00 UTC (which is about 3:00 AM Eastern) corresponds to the lowest activity time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.groupby(reddit_df.index.hour).size().plot();"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 6\n",
"\n",
"See if you can figure out how to change the above plot so that instead of showing one overall line for comments per hour, it shows a line for each day of the week (e.g., a line for comments per hour on Mondays, another for Tuesdays, etc.).\n",
"\n",
"This code can help you to get started: it sums the number of comments by hour and day."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"reddit_df.groupby([reddit_df.index.hour, reddit_df.index.weekday]).size()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}