{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Text Analysis in Python\n", "\n", "I am not an NLP person and this is outside of my expertise, but I know enough to give a basic introduction to text analysis and text processing tools.\n", "\n", "For this tutorial + homework, I'm going to use data from Reddit. I retrieved it from Pushshift, using the following code. Unfortunately, Pushshift is not currently available to researchers, so this code won't work and you'll just have to grab the version of the data that I've linked to below." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import requests\n", "from datetime import datetime\n", "import time\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "from bertopic import BERTopic" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```python\n", "## Code used to create the dataset - no longer works\n", "\n", "endpt = 'https://api.pushshift.io/reddit/search/submission'\n", "\n", "subreddits = ['Coronavirus', 'politics', 'aww']\n", "\n", "# Start and end date (pushshift expects these in epoch time)\n", "start_date = int(datetime.strptime('2021-09-11', '%Y-%m-%d').timestamp())\n", "end_date = int(datetime.strptime('2021-09-25', '%Y-%m-%d').timestamp())\n", "\n", "\n", "def get_posts(subreddit, before = end_date, after = start_date, result = None, min_comments = 20):\n", "    params = {'subreddit': subreddit,\n", "              'num_comments': f'>{min_comments}',\n", "              'before': before,\n", "              'size': 500\n", "             }\n", "    if result is None:\n", "        result = []\n", "    r = requests.get(endpt, params=params)\n", "    print(r.url)\n", "    print(datetime.fromtimestamp(before))\n", "    for item in r.json()['data']:\n", "        created_time = item['created_utc']\n", "        if created_time < after: # If we've reached the earliest we want, then return\n", "            print(len(result))\n", "            return result\n", "        else:\n",
"            try:\n", "                result.append((item['title'],item['selftext'], created_time, subreddit))\n", "            except KeyError:\n", "                print(item)\n", "    time.sleep(.5)\n", "    # Page backwards in time: ask for the next batch of posts older than the last one we saw\n", "    return get_posts(subreddit, before = created_time, after = after, result = result, min_comments = min_comments)\n", "\n", "\n", "sr_data = []\n", "for subreddit in subreddits:\n", "    new_data = get_posts(subreddit)\n", "    sr_data = sr_data + new_data\n", "sr = pd.DataFrame(sr_data, columns = ['title', 'selftext', 'date', 'subreddit'])\n", "sr.date = pd.to_datetime(sr.date, unit='s')\n", "sr.to_csv('./sr_post_data.csv', index = False)\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "## Code to download and import the file (DO run this code)\n", "sr = pd.read_csv('https://raw.githubusercontent.com/jdfoote/Intro-to-Programming-and-Data-Science/refs/heads/master/resources/data/sr_post_data.csv')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Change the date to a datetime, and put it in the index\n", "sr.index = pd.to_datetime(sr.date)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "date \n", "2021-09-25 03:59:39 FALSE: US records 12,366 deaths due to COVID-1... \n", "2021-09-25 03:55:12 Researchers who developed the mRNA technology ... \n", "2021-09-25 02:51:33 The United States Completes Donation of 3.5 mi... \n", "2021-09-25 02:25:46 When will the pandemic end? Models project a d... \n", "2021-09-25 01:55:31 Thousands of teachers may be forced out of NYC... \n", "\n", " selftext date subreddit \n", "date \n", "2021-09-25 03:59:39 NaN 2021-09-25 03:59:39 Coronavirus \n", "2021-09-25 03:55:12 NaN 2021-09-25 03:55:12 Coronavirus \n", "2021-09-25 02:51:33 NaN 2021-09-25 02:51:33 Coronavirus \n", "2021-09-25 02:25:46 NaN 2021-09-25 02:25:46 Coronavirus \n", "2021-09-25 01:55:31 NaN 2021-09-25 01:55:31 Coronavirus " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarization\n", "\n", "There are some simple ways to summarize text data that can be useful, without using any special NLP tools.\n", "\n", "\n", "For example, it can be very interesting to see how the frequency of a term changes over time:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# This code plots the frequency of \"COVID-19\", \"Coronavirus\", and \"Trump\" each day\n", "\n", "for term in [\"COVID-19\", \"Coronavirus\", \"Trump\"]:\n", " curr_df = sr.loc[sr.title.str.contains(term) | sr.selftext.str.contains(term)]\n", " posts_per_day = curr_df.resample('D').size()\n", " posts_per_day.plot(label = term)\n", "\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EXERCISE 1\n", "\n", "Modify the code above to plot how often \"Coronavirus\" is used in each of the three subreddits over time" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "#### YOUR CODE HERE\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A similar approach is dictionary-based. The most well-known version of this is [LIWC](http://liwc.wpengine.com/), but the basic idea is that you create a set of words that are associated with a construct you are interested in, and you count how often they appear.\n", "\n", "This is a very simple example of how you might do this to look for gendered words among our subreddits" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# First we change NAs and removed/deleted to empty strings\n", "sr.loc[(pd.isna(sr.selftext)) | (sr.selftext.isin(['[removed]', '[deleted]'])), 'selftext'] = ''\n", "sr['all_text'] = sr.title + ' ' + sr.selftext" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "male_words = ['he', 'his']\n", "female_words = ['she', 'hers']\n", "\n", "# This puts all of the text of each subreddit into lists\n", "def string_to_list(x):\n", " return ' '.join(x).split()\n", "grouped_text = sr.groupby('subreddit').all_text.apply(string_to_list)\n", "\n", "# Then, we count how often each type of words appears in each subreddit\n", "agg = grouped_text.transform({'proportionMale': 
lambda x: sum([x.count(y) for y in male_words])/len(x),\n", " 'proportionFemale': lambda x: sum([x.count(y) for y in female_words])/len(x)}\n", " )" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "subreddit\n", "Coronavirus [FALSE:, US, records, 12,366, deaths, due, to,...\n", "aww [This, little, guy, (girl), snuck, into, my, o...\n", "politics [CNN, Expert, Claims, Black, Voters, Don’t, Ha...\n", "Name: all_text, dtype: object" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grouped_text" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " proportionMale proportionFemale\n", "subreddit \n", "Coronavirus 0.001874 0.000234\n", "aww 0.011129 0.003251\n", "politics 0.003950 0.000525" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EXERCISE 2\n", "\n", "One of the trickiest parts of analysis is getting the data in the form that you want it in order to analyze/visualize it. \n", "\n", "I think a good visualization for this would be a barplot showing how often male and female word types appear for each subreddit. I'll give you the final call to produce the plot:\n", "\n", "`sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df_long)`\n", "\n", "Now, see if you can get the data in shape so that this code actually works! :)\n", "\n", "*Hint: You'll want to use [wide to long](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html)*" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A(weekly)-2010 A(weekly)-2011 B(weekly)-2010 B(weekly)-2011 X id\n", "0 0.548814 0.544883 0.437587 0.383442 0 0\n", "1 0.715189 0.423655 0.891773 0.791725 1 1\n", "2 0.602763 0.645894 0.963663 0.528895 1 2" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "## Example of how wide_to_long works (from https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html)\n", "\n", "import numpy as np\n", "np.random.seed(0)\n", "\n", "df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3),\n", " 'A(weekly)-2011': np.random.rand(3),\n", " 'B(weekly)-2010': np.random.rand(3),\n", " 'B(weekly)-2011': np.random.rand(3),\n", " 'X' : np.random.randint(3, size=3)})\n", "df['id'] = df.index\n", "df " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " X A(weekly) B(weekly)\n", "id year \n", "0 2010 0 0.548814 0.437587\n", "1 2010 1 0.715189 0.891773\n", "2 2010 1 0.602763 0.963663\n", "0 2011 0 0.544883 0.383442\n", "1 2011 1 0.423655 0.791725\n", "2 2011 1 0.645894 0.528895" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.wide_to_long(df, # The data\n", "                # The prefixes of the data columns. These will become the names of the columns that hold the data values.\n", "                stubnames = ['A(weekly)', 'B(weekly)'], \n", "                # i is a column which uniquely identifies each row\n", "                i='id',\n", "                # j is the name of the new column that will hold the suffixes (here, the years)\n", "                j='year',\n", "                # sep is the string that sits between the stubnames and the suffix values which will go in j\n", "                sep='-')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "## Exercise 2 Code\n", "## This code will get the df ready for pd.wide_to_long (try printing agg_df after running these to see what it looks like)\n", "agg_df = agg.unstack(level=0)\n", "agg_df = agg_df.reset_index()\n", "\n", "### Your code here" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " level_0 subreddit 0\n", "0 proportionMale Coronavirus 0.001874\n", "1 proportionMale aww 0.011129\n", "2 proportionMale politics 0.003950\n", "3 proportionFemale Coronavirus 0.000234\n", "4 proportionFemale aww 0.003251\n", "5 proportionFemale politics 0.000525" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "agg_df" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "ename": "NameError", "evalue": "name 'agg_df_long' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn [15], line 2\u001b[0m\n\u001b[1;32m 1\u001b[0m \u001b[38;5;66;03m## Once you've created agg_df_long with the columns proportion and word_gender, you should be able to run this\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m sns\u001b[38;5;241m.\u001b[39mbarplot(x\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124msubreddit\u001b[39m\u001b[38;5;124m'\u001b[39m, y\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mproportion\u001b[39m\u001b[38;5;124m'\u001b[39m, hue \u001b[38;5;241m=\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mword_gender\u001b[39m\u001b[38;5;124m'\u001b[39m, data \u001b[38;5;241m=\u001b[39m \u001b[43magg_df_long\u001b[49m)\n", "\u001b[0;31mNameError\u001b[0m: name 'agg_df_long' is not defined" ] } ], "source": [ "## Once you've created agg_df_long with the columns proportion and word_gender, you should be able to run this\n", "sns.barplot(x='subreddit', y='proportion', hue = 'word_gender', data = agg_df_long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EXERCISE 3\n", "\n", "Make your own analysis, with a different set of terms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TF-IDF\n", "\n", "There are more complicated 
approaches to summarization in Python, including using LIWC (see [here](https://pypi.org/project/liwc/)).\n", "\n", "Almost all of these approaches treat text as a \"bag of words\", where the order of words is totally ignored. This is obviously a big simplification, but it can often work quite well.\n", "\n", "One thing we might want to do is to differentiate groups of texts based on how often words are used. The naive way is to just count how often words appear. However, the most common words (like \"the\" and \"of\") will top the list for every group, so raw counts don't tell the groups apart. So, computational linguists came up with \"term frequency-inverse document frequency\" (TF-IDF). TF-IDF weights each word by how often it appears in a given group of texts, and then down-weights words that appear in many of the groups. A detailed explanation with code is [here](https://towardsdatascience.com/natural-language-processing-feature-engineering-using-tf-idf-e8b9d00e7e76).\n", "\n", "There are a number of NLP / text analysis libraries in Python. NLTK, spaCy, and TextBlob are some of the most popular. The one I'm most familiar with is scikit-learn, which is a general machine learning library. Here is how to run TF-IDF in scikit-learn." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "## First, we prepare the data for the TF-IDF tool.\n", "# We want each subreddit to be represented by a single string.\n", "# So, we take our grouped_text (which is a Series holding one list of words per subreddit)\n", "# and change it into a list of three really long strings, where each\n", "# string is all the words that appeared for that subreddit.\n", "\n", "# This is called a 'list comprehension'\n", "as_text = [' '.join(x) for x in grouped_text]\n", "\n", "# It is equivalent to the following for loop\n", "as_text = []\n", "for x in grouped_text:\n", "    as_text.append(' '.join(x))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction.text import TfidfVectorizer\n", "\n", "# Keep only the 5000 most frequent words, after dropping common English stop words\n", "vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')\n", "\n", "tfidf_result = vectorizer.fit_transform(as_text)\n", "feature_names = vectorizer.get_feature_names_out()\n", "dense = tfidf_result.todense()\n", "denselist = dense.tolist()\n", "# One row per word, one column per subreddit\n", "df = pd.DataFrame(denselist, columns=feature_names).transpose()\n", "df.columns = list(grouped_text.index)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Coronavirus aww politics\n", "cat 0.000000 0.409908 0.000000\n", "dog 0.000000 0.286604 0.000858\n", "little 0.000000 0.266492 0.003431\n", "just 0.006542 0.206955 0.031306\n", "cute 0.000000 0.171897 0.000000\n", "girl 0.000000 0.171897 0.000000\n", "baby 0.001308 0.171811 0.000666\n", "new 0.056262 0.171811 0.106574\n", "today 0.000000 0.145816 0.001715\n", "puppy 0.000000 0.145451 0.000000\n", "adopted 0.000000 0.132228 0.000000\n", "old 0.005234 0.117144 0.003997\n", "got 0.011776 0.117144 0.009991\n", "good 0.000000 0.115647 0.003431\n", "think 0.000000 0.110619 0.015439\n", "like 0.009159 0.109335 0.018650\n", "kitten 0.000000 0.105783 0.000000\n", "suggestions 0.000000 0.105783 0.000000\n", "guy 0.000000 0.105591 0.000858\n", "help 0.006542 0.097620 0.015320" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This shows the words with the highest TF-IDF for r/aww\n", "df.sort_values('aww', ascending=False).head(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Relative frequency\n", "\n", "\n", "An even simpler approach that works pretty well when comparing just two \"documents\" is to rank words by how much more often they appear in one than in the other.\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "politics_str = ' '.join(sr.loc[sr.subreddit == 'politics', 'all_text']).lower()\n", "covid_str = ' '.join(sr.loc[sr.subreddit == 'Coronavirus', 'all_text']).lower()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def word_ratios(text):\n", "    # Returns a dictionary mapping each word to its share of all the words in the text\n", "    counts = {}\n", "    tot_words = 0\n", "    for word in text.split():\n", "        counts[word] = counts.get(word, 0) + 1\n", "        tot_words += 1\n", "    result = {}\n", "    for word, count in counts.items():\n", "        result[word] = count/tot_words\n", "    return result\n", "\n", "\n", "politics_ratio = word_ratios(politics_str)\n", "covid_ratio = 
word_ratios(covid_str)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "# Only words that appear in both subreddits are compared\n", "ratio_diff = []\n", "for word in politics_ratio:\n", "    if word in covid_ratio:\n", "        ratio_diff.append((word, politics_ratio[word] - covid_ratio[word]))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Sort from most negative (more common in r/Coronavirus) to most positive (more common in r/politics)\n", "ratio_diff = sorted(ratio_diff, key = lambda x: x[1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the words that appear more often in r/Coronavirus (the most negative differences)." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('covid-19', -0.020888477566596217),\n", " ('covid', -0.012544929169800238),\n", " ('vaccine', -0.00592151224450452),\n", " ('more', -0.005169286900006672),\n", " ('information', -0.004752876614738616),\n", " ('our', -0.004724119177071341),\n", " ('for', -0.004715598694414918),\n", " ('vaccination', -0.004380547049728374),\n", " ('are', -0.003921754318080493),\n", " ('in', -0.003903396786161282),\n", " ('health', -0.0037529976515266027),\n", " ('of', -0.0036447707007916177),\n", " ('there', -0.0033541814836345095),\n", " ('vaccines', -0.0031055833011634434),\n", " ('cases', -0.0030923399997225182)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratio_diff[:15]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And here are those that appear more often in r/politics." 
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('biden', 0.008418094533021217),\n", " ('to', 0.0066828403369883875),\n", " ('the', 0.004575618874372623),\n", " ('texas', 0.0032892912474584077),\n", " ('bill', 0.0030318005794814544),\n", " ('on', 0.0030221543838544786),\n", " ('a', 0.002709801382600024),\n", " ('that', 0.002206546223422705),\n", " ('court', 0.0021136522597004167),\n", " ('law', 0.001782188016405662),\n", " ('his', 0.0016662137019796605),\n", " ('was', 0.0016506995657533103),\n", " ('from', 0.00162875463244231),\n", " ('calls', 0.0016230775454787474),\n", " ('white', 0.001588833874483722)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ratio_diff[-15:][::-1] # The [::-1] just reverses the list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "\n", "Another commonly-used tool in NLP is classification. This is a \"supervised machine learning\" task, where you build a \"training set\" of items that have already been classified, and a machine learner uses that set to predict the classification of new items.\n", "\n", "One very common example is sentiment. In sentiment analysis, a sample of texts is manually classified as positive, neutral, or negative. This set is then used to train a classifier to predict the sentiment of unseen texts.\n", "\n", "It's beyond the scope of this class to learn how to do machine learning, but there are also ready-made sentiment tools. One I found is from [textblob](https://textblob.readthedocs.io/en/dev/).\n", "\n", "NLTK also includes a sentiment tool called VADER, which was built for social media text. Strictly speaking, VADER scores text using a word lexicon plus a set of rules rather than a trained classifier, but you use it the same way. Social media text is pretty similar to what we're looking at, so this example shows how to use it. The code below keeps VADER's \"compound\" score, which runs from -1 (most negative) to 1 (most positive).\n", "\n", "NLTK is interesting: the core package is installed with Anaconda, so you should have it. However, many of its resources have to be downloaded separately. So, we need to start by downloading the VADER lexicon." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package vader_lexicon to\n", "[nltk_data] /home/jeremy/nltk_data...\n", "[nltk_data] Package vader_lexicon is already up-to-date!\n" ] }, { "data": { "text/plain": [ "True" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import nltk\n", "nltk.downloader.download('vader_lexicon')" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n", "\n", "analyzer = SentimentIntensityAnalyzer()\n", "\n", "def get_sentiment(sentence):\n", " vs = analyzer.polarity_scores(sentence)\n", " return vs['compound']\n", "\n", "sr['sentiment'] = sr.all_text.apply(get_sentiment)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " title \\\n", "date \n", "2021-09-24 15:18:15 We’re the League of Women Voters! We launched ... \n", "2021-09-20 17:37:13 The \"What happened in your state last week?\" M... \n", "2021-09-13 16:08:17 The \"What happened in your state last week?\" M... \n", "2021-09-17 16:05:41 Free Chat Friday Thread \n", "2021-09-14 02:54:07 Hello I know I'm different but my beautiful he... \n", "\n", " selftext \\\n", "date \n", "2021-09-24 15:18:15 Hi! We’re activists and experts from the Leagu... \n", "2021-09-20 17:37:13 Welcome to the 'What happened in your state la... \n", "2021-09-13 16:08:17 Welcome to the 'What happened in your state la... \n", "2021-09-17 16:05:41 It's finally Friday! That means it's time to s... \n", "2021-09-14 02:54:07 \n", "\n", " date subreddit \\\n", "date \n", "2021-09-24 15:18:15 2021-09-24 15:18:15 politics \n", "2021-09-20 17:37:13 2021-09-20 17:37:13 politics \n", "2021-09-13 16:08:17 2021-09-13 16:08:17 politics \n", "2021-09-17 16:05:41 2021-09-17 16:05:41 politics \n", "2021-09-14 02:54:07 2021-09-14 02:54:07 aww \n", "\n", " all_text \\\n", "date \n", "2021-09-24 15:18:15 We’re the League of Women Voters! We launched ... \n", "2021-09-20 17:37:13 The \"What happened in your state last week?\" M... \n", "2021-09-13 16:08:17 The \"What happened in your state last week?\" M... \n", "2021-09-17 16:05:41 Free Chat Friday Thread It's finally Friday! T... \n", "2021-09-14 02:54:07 Hello I know I'm different but my beautiful he... 
\n", "\n", " sentiment \n", "date \n", "2021-09-24 15:18:15 0.9822 \n", "2021-09-20 17:37:13 0.9671 \n", "2021-09-13 16:08:17 0.9671 \n", "2021-09-17 16:05:41 0.9650 \n", "2021-09-14 02:54:07 0.9587 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sr.sort_values('sentiment', ascending=False).head()" ] } ], "metadata": { "kernelspec": { "display_name": "teaching", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.19" } }, "nbformat": 4, "nbformat_minor": 4 }