# Computational Text Analysis Part II: 
## This time it's personal

In this lecture, I'm going to introduce a few more advanced computational text analysis approaches: Topic modeling, word embeddings, and LLM-based classifications.

## Topic Modeling

The first example is topic modeling. The idea of topic modeling is to group documents (in this case, posts) which are similar to each other, and to characterize those groups somehow. In that sense, it is similar to qualitative inductive coding.

There are a number of approaches. I'm going to show you a very vanilla version of BERTopic, a newer approach that uses a pre-trained language model to embed each document, so that documents with similar meanings end up close together and can be clustered into topics.

To install it, run `conda install bertopic` in the terminal.

The BERTopic library is really great, and has [great documentation and a website here](https://maartengr.github.io/BERTopic/index.html).

The first step is to load a model. (Note that the `hdbscan_model = ...` line is optional; it sets some parameters that help to avoid ending up with lots of small topics. The `representation_model` is also optional, but it can help to identify more representative terms for each topic.)

In [6]:
# BERTopic example
from hdbscan import HDBSCAN
import pandas as pd
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired
from sklearn.feature_extraction.text import CountVectorizer


# Set up the clustering and representation models, then build the BERTopic model
hdbscan_model = HDBSCAN(min_cluster_size=25, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
representation_model = KeyBERTInspired()
topic_model = BERTopic(language="english", verbose=True, hdbscan_model=hdbscan_model, representation_model=representation_model)



Let's load the subreddit data from last week. Because there are more `r/politics` posts than posts from the other subreddits, we'll focus on those.

This next piece of code loads the data and trains the model. It may take a few minutes to run, but BERTopic has a nice progress bar so you know that it's working.

In [7]:
sr = pd.read_csv('https://raw.githubusercontent.com/jdfoote/Intro-to-Programming-and-Data-Science/refs/heads/master/resources/data/sr_post_data.csv')

sr = sr[sr.subreddit == 'politics']
# First we change NAs and removed/deleted to empty strings
sr.loc[(pd.isna(sr.selftext)) | (sr.selftext.isin(['[removed]', '[deleted]'])), 'selftext'] = ''
sr['all_text'] = sr.title + ' ' + sr.selftext

dataset = sr.all_text.to_list()

In [8]:
topics, probs = topic_model.fit_transform(dataset)

2024-11-05 11:24:44,739 - BERTopic - Embedding - Transforming documents to embeddings.


Batches:   0%|          | 0/79 [00:00<?, ?it/s]

2024-11-05 11:25:07,855 - BERTopic - Embedding - Completed ✓
2024-11-05 11:25:07,859 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-05 11:25:17,454 - BERTopic - Dimensionality - Completed ✓
2024-11-05 11:25:17,456 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-05 11:25:17,600 - BERTopic - Cluster - Completed ✓
2024-11-05 11:25:17,608 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-05 11:25:27,016 - BERTopic - Representation - Completed ✓


BERTopic decides how many topics are appropriate, and assigns each document to a topic. It also includes a "miscellaneous" topic (`Topic -1`) for documents that don't fit very well. This can include quite a few documents, as below.

In [9]:
topic_model.get_topic_freq()

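|    | Topic | Count |
|----|-------|-------|
| 5  | 0     | 419   |
| 0  | -1    | 397   |
| 2  | 1     | 286   |
| 11 | 2     | 162   |
| 7  | 3     | 128   |
| 1  | 4     | 119   |
| 10 | 5     | 119   |
| 8  | 6     | 100   |
| 9  | 7     | 100   |
| 3  | 8     | 74    |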
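
To get a feel for how much of the corpus lands in `Topic -1`, one option is to count the outliers directly in the `topics` list returned by `fit_transform`. The sketch below uses a small made-up list in place of the real `topics`, and `outlier_share` is a hypothetical helper name rather than part of BERTopic.

```python
from collections import Counter

# Toy stand-in for the `topics` list returned by fit_transform:
# one topic id per document, with -1 marking outliers.
topics = [0, -1, 1, 0, -1, 2, 0, 1, -1, 0]


def outlier_share(topic_ids):
    """Return the fraction of documents assigned to the -1 (outlier) topic."""
    counts = Counter(topic_ids)
    return counts[-1] / len(topic_ids)


print(f"{outlier_share(topics):.0%} of documents are outliers")
```

If the share is large, that is a signal to read some of the outlier documents before drawing conclusions from the remaining topics.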


There are a bunch of cool visualizations and tools for understanding the topics; many of them are shown [on the BERTopic website](https://maartengr.github.io/BERTopic/getting_started/visualization/visualize_topics.html). Here are a few.

This first one visualizes the topics. We can see that they are fairly clustered.

In [10]:
topic_model.visualize_topics()

This shows the top words and their scores for each of the top `n` topics.

In [11]:
topic_model.visualize_barchart(top_n_topics=15)

We can also visualize how much topics are used over time.

In [12]:
timestamps = sr[sr.subreddit=='politics'].date.to_list()
topics_over_time = topic_model.topics_over_time(dataset, timestamps, nr_bins=30)

30it [04:16,  8.55s/it]


In [13]:
topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=5)

### Qualitative Analyses

I think it's really vital to get back into the actual text data in order to make sure that the topics really represent what you think they do. One way to do that is to extract the documents most closely associated with each topic.

First, we get info about each document and how well it matches the topic.

In [266]:
doc_df = topic_model.get_document_info(dataset)

# Remove the -1 topic, which is the "garbage" topic
doc_df = doc_df.loc[doc_df.Topic != -1]

Then, we sort by "Probability", which is the likelihood that the document belongs to the assigned topic, group by topic, and take the top `num_docs` documents for each topic.

I print these, but in practice you will probably want to save the output so you can look at it in Excel or similar (e.g., `top_docs.to_csv('top_documents.csv')`).

In [267]:
num_docs = 20
top_docs = doc_df.sort_values('Probability', ascending=False).groupby('Topic').head(num_docs)
top_docs = top_docs.sort_values('Topic').loc[:, ['Topic', 'Probability', 'Document']]

In [268]:
for topic, group in top_docs.groupby('Topic'):
    print(f"Topic {topic}")
    print(group.Document.values)
    print("\n\n")

Topic 0
['Sorry, a Coronavirus Infection Might Not Be Enough to Protect You '
 'Red leaning areas do have higher infection rates per study '
 'L.A. County plans to require COVID vaccine proof at indoor bars, nightclubs and more '
 'A Florida chiropractor signed hundreds of mask exemption forms for students. Now, the district has tightened its mask policy '
 'Pandemic frustrations zero in on unvaccinated Americans '
 'Broward will reward vaccinated county workers and subject the unvaccinated to charges and testing '
 'Will the Biden Administration Mandate Vaccines for Flying? That sound you hear is the gang at Fox News screaming in a pitch only dogs can hear. '
 'Florida AG Ashley Moody suing Biden administration over COVID-19 vaccine mandate '
 'Missouri Is the Next Front in the COVID Culture War '
 'Joy Reid calls GOP a ‘Covid-loving death cult’| ‘You love Covid so much you want it to spread into schools, at the office, in the Walmart, on the cruise ships and in the club’ says MSNBC h

### EXERCISE 1

Where topic modeling really shines is in analyzing longer texts - for example, the subreddit [changemyview](https://www.reddit.com/r/changemyview/) has fairly long posts where people explain a controversial view that they hold.

Try to figure out how to get a few hundred posts from changemyview using PRAW, and run a topic model on them, where the selftext of each post is a document.

## Word Embeddings

The next method is word embeddings. Word embeddings create a multidimensional "space" and then place words in that space based on the words that they appear near in a corpus. There are a bunch of complex versions of word embeddings, and complex uses for them. Indeed, BERTopic uses embeddings, as do LLMs.

The embeddings themselves can also be interesting, as we can think of them as putting words into a contextualized semantic space. We can then compare how different groups or communities contextualize the same terms or concepts.

I'm going to teach you a simple version of word embeddings called Word2Vec. In this example, we'll build the model from scratch, but another option is to use something like BERT to build on a pre-trained model.

Much of what follows is borrowed from [Laura Nelson's wonderful example](https://github.com/lknelson/DH-Institute-2017/blob/d20246758d6da88dfedbad2e75933ad4ef370930/07-Word2Vec/Word2Vec.ipynb).

We will use Laura's code as a template to look at differences between some recent comments on `r/Purdue` and `r/IndianaUniversity`.

In [41]:
import numpy as np
#import pandas as pd
#from sklearn.metrics import pairwise
#from sklearn.manifold import MDS, TSNE

import gensim
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

from string import punctuation


In [42]:

def fast_tokenize(text):
    
    # Lowercase the text so that "Study" and "study" count as the same token
    lower_case = text.lower()
    
    # Iterate through text removing punctuation characters
    no_punct = "".join([char for char in lower_case if char not in punctuation])
    
    # Split text over whitespace into list of words
    tokens = no_punct.split()
    
    return tokens
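
One caveat with `fast_tokenize`: `string.punctuation` only contains ASCII punctuation, so curly quotes and apostrophes stay inside tokens (this is why a token like `shouldn’t` shows up in the similarity output later in this section). A minimal sketch of one alternative, assuming you want to drop every Unicode punctuation character, is below. `unicode_tokenize` is a hypothetical name, not part of Laura Nelson's original code.

```python
import unicodedata


def unicode_tokenize(text):
    """Lowercase, drop every Unicode punctuation character, split on whitespace."""
    lower_case = text.lower()
    # Unicode categories starting with "P" cover punctuation, including curly quotes
    no_punct = "".join(
        char for char in lower_case
        if not unicodedata.category(char).startswith("P")
    )
    return no_punct.split()


print(unicode_tokenize("You’d think “studying” wasn’t hard..."))
```

Note that this is not a drop-in replacement: symbols such as `$` or `+` belong to the Unicode symbol categories rather than the punctuation ones, so this version keeps them while `fast_tokenize` removes them.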


In [43]:
def tokenize_sr(df, sr_name):
    sr_data = df[df.subreddit==sr_name].body.to_list()
    sr_data = [fast_tokenize(text) for text in sr_data]
    sr_data = [text for text in sr_data if len(text) > 0]
    return sr_data

df = pd.read_csv('https://raw.githubusercontent.com/jdfoote/Intro-to-Programming-and-Data-Science/refs/heads/master/resources/data/purdue_iu_comments.csv')
purdue_data = tokenize_sr(df, 'Purdue')
iu_data = tokenize_sr(df, 'IndianaUniversity')


Word2Vec actually has two different algorithms to choose from: CBOW (Continuous Bag of Words) and Skip-Gram.

I won't focus on the details here. In general, CBOW is faster and does well with frequent words, while Skip-Gram can be better for rare words.
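
For intuition only: Skip-Gram learns by predicting the surrounding context words from each center word, while CBOW goes the other way and predicts the center word from its context. The sketch below lists the (center, context) training pairs that a Skip-Gram window would produce for one sentence. It is a simplified illustration with a hypothetical helper `skipgram_pairs`, and it leaves out the sampling tricks that real implementations use.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # A word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["purdue", "has", "a", "great", "engineering", "program"]
for center, context in skipgram_pairs(sentence, window=1):
    print(center, "->", context)
```

A larger `window` produces more pairs per word, which is why that parameter changes what counts as "appearing near" another word.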

Parameters for the `gensim` `Word2Vec` function that you might want to adjust:

* vector_size: Number of dimensions for embedding model
* window: Number of context words to observe in each direction
* min_count: Words must appear this many times to be included
* max_vocab_size: Maximum number of words to consider (will remove less frequent words)
* sg: 0 selects the CBOW model; 1 selects Skip-Gram
* alpha: Learning rate
* epochs: Number of passes (iterations) through dataset

Note: The code below uses the default for all values except for `sg`. In general, you probably don't need to change any of the parameters.

In [44]:
purdue_model, iu_model = (gensim.models.Word2Vec(x, vector_size=100, window=5,
                               min_count=5, max_vocab_size=None, sg=1, alpha=0.025, epochs=5) for x in [purdue_data, iu_data])

We should now have vectors for each common word that appears in the data. Each word is represented by 100 numbers (its location in the 100-dimensional meaning space).

In [272]:
purdue_model.wv['study']

array([ 0.42650676,  0.2423227 ,  0.4340216 ,  0.1232065 , -0.00371701,
       -0.45077795, -0.00652294,  0.22863336,  0.05336555, -0.38784248,
       -0.17059462, -0.30575195, -0.06730964, -0.0582826 ,  0.15157396,
       -0.05116662, -0.04489793,  0.08904201,  0.05273012, -0.4681878 ,
        0.386036  ,  0.07347596,  0.27227664, -0.1489299 , -0.28774157,
        0.13151063, -0.18973255,  0.05411329,  0.0546989 ,  0.13433374,
       -0.05010374,  0.08429157, -0.2429416 , -0.09779704, -0.12813367,
        0.35818008,  0.37954837, -0.12390593, -0.12781247, -0.23319471,
        0.06319663, -0.03309186, -0.10148015,  0.01792365,  0.5536579 ,
       -0.18210748, -0.13849355,  0.04882931, -0.02414883,  0.1733538 ,
        0.22831964, -0.17154573, -0.28409106, -0.01810449,  0.00270636,
        0.13850257,  0.00569986,  0.02664356, -0.03922816, -0.00855842,
       -0.14857441, -0.06613347, -0.08124437,  0.12198006, -0.10080033,
        0.06520334,  0.13152489, -0.07359037, -0.18478926,  0.47

We can now do things like look at which words are most similar to a given term in each community. For example, this shows the words most similar to "studying" and "sports".

In [255]:
print(f"Purdue similar words to study: {purdue_model.wv.most_similar('studying')}")
print(f"IU similar words to study: {iu_model.wv.most_similar('studying')}")

Purdue similar words to study: [('sleeping', 0.9779280424118042), ('hardest', 0.9761364459991455), ('aspects', 0.9753471612930298), ('101', 0.9752112627029419), ('friendship', 0.9741437435150146), ('raise', 0.9737470746040344), ('harrys', 0.9724946022033691), ('hc', 0.9722948670387268), ('18th', 0.9717273712158203), ('laptop', 0.9712340831756592)]
IU similar words to study: [('extracurriculars', 0.9869362115859985), ('stress', 0.9844495058059692), ('limit', 0.9833267331123352), ('miss', 0.9829302430152893), ('haha', 0.9819132089614868), ('typical', 0.9812778830528259), ('handle', 0.9812208414077759), ('da', 0.9806753396987915), ('sharing', 0.9803783297538757), ('shouldn’t', 0.9802150130271912)]


In [258]:
print(f"Purdue similar words to sports: {purdue_model.wv.most_similar('sports')}")
print(f"IU similar words to sports: {iu_model.wv.most_similar('sports')}")

Purdue similar words to sports: [('immigrants', 0.9894766211509705), ('dark', 0.9888549447059631), ('media', 0.9886048436164856), ('modern', 0.9876541495323181), ('discussion', 0.9866571426391602), ('minds', 0.9865592122077942), ('offense', 0.986454427242279), ('stations', 0.9858655333518982), ('pets', 0.9857378005981445), ('gold', 0.9857314229011536)]
IU similar words to sports: [('consulting', 0.9733580946922302), ('ib', 0.9721837043762207), ('you’d', 0.9718393683433533), ('mentor', 0.9717192053794861), ('coats', 0.9706007838249207), ('jump', 0.9705644249916077), ('annoying', 0.9703823328018188), ('unhappy', 0.9700692296028137), ('concepts', 0.9698516726493835), ('effectively', 0.9697093963623047)]
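
The scores returned by `most_similar` are cosine similarities between word vectors. To make that concrete, here is a small sketch that computes cosine similarity by hand and ranks neighbors the same way. The three-dimensional vectors are made up for illustration, and this `most_similar` function is a toy stand-in, not the gensim implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional word vectors (real ones here have 100 dimensions)
vectors = {
    "study": [0.9, 0.1, 0.0],
    "exam": [0.8, 0.2, 0.1],
    "football": [0.0, 0.2, 0.9],
}


def most_similar(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(most_similar("study", vectors))
```

In this toy space "exam" ranks above "football" for "study" because its vector points in nearly the same direction.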


### EXERCISE 2

Identify topics where you think IU and Purdue commenters might differ and figure out how to display those differences.

## Using LLMs for research

The last thing I want to show you is some example code for using LLMs (like ChatGPT or Claude) in your work.

They are incredible, flexible tools, which have a broad semantic understanding of texts, and can be used in a lot of the same ways as a trained undergraduate.

For example, let's say we wanted to identify the different hobbies that people do at each school.

In a sense, what we're doing is programming an LLM agent using natural language. So, we want to come up with a prompt. I'll show you "few-shot" prompting, which gives the agent a few worked examples. This can often be helpful, especially when a task might be ambiguous. Unlike most of the programs we've written so far, you may receive different results with even small changes to a prompt. It's a stochastic process.
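
As a rough sketch of what "few-shot" means in practice, the snippet below assembles a prompt from a task description, a list of worked examples, and the new comments to label. The function name `build_few_shot_prompt`, the example data, and the prompt layout are all assumptions made for illustration; this is not the prompt used later in this lecture.

```python
import json

# Hypothetical worked examples: input comments paired with the labels we want back
EXAMPLES = [
    (
        ["Can't wait for the basketball game", "Math 231 is super tough"],
        ["basketball", None],
    ),
]


def build_few_shot_prompt(task, examples, comments):
    """Combine a task description, worked examples, and new inputs into one prompt."""
    parts = [task, ""]
    for example_comments, example_labels in examples:
        parts.append("Example input:")
        parts.append(json.dumps(example_comments))
        parts.append("Example output:")
        parts.append(json.dumps(example_labels))  # None is written as JSON null
        parts.append("")
    parts.append("Now label these comments:")
    parts.append(json.dumps(comments))
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Extract the hobby mentioned in each comment, or null if there is none.",
    EXAMPLES,
    ["Anyone want to play pickleball this weekend?"],
)
print(prompt)
```

Keeping the examples in a list makes it easy to add or swap examples and see how the results change.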

Another thing is that you will need an API key. Using these models isn't free, but they are quite cheap.

I'll show you the Anthropic (Claude.ai) API, but OpenAI's is quite similar.

You can create an account and buy $5 worth of credits at https://console.anthropic.com/. This is also where you'll get an API key, which you should save in a file called `anthropic_credentials.py` as a variable named `api_key` (the code below reads it as `anthropic_credentials.api_key`).

On that page, you can also get Claude.ai's help in generating a prompt (wild, I know!). The below is generated with its help, plus some additional changes when things didn't work right.



In [249]:
import json
import time

import anthropic
import anthropic_credentials
from anthropic import InternalServerError, RateLimitError

client = anthropic.Anthropic(
    api_key=anthropic_credentials.api_key,
)

# You can use haiku for a cheaper model
model = "claude-3-5-sonnet-20241022"

def get_classifications(comments, num_comments):
    try:
        message = client.messages.create(
            model=model,
            max_tokens=2000,
            temperature=0,
            messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": f"""
                        You are an AI assistant tasked with analyzing Reddit comments from Purdue University and Indiana University (IU) subreddits. Your goal is to identify and extract the hobbies and things that people like to do at each school.
                        You will be given {num_comments} comments at a time. Here are the Reddit comments you need to analyze:

<reddit_comments>
{comments}
</reddit_comments>

Your task is to process each comment individually, determining if it mentions a hobby or activity that people do at either Purdue or IU. 
A hobby or activity is any pastime or interest that people engage in for pleasure or relaxation, other than their academic studies.

For each comment, follow these steps:
1. Read the comment carefully.
2. Determine if the comment mentions a hobby or activity that people do at Purdue or IU.
3. If a hobby or activity is mentioned, extract a description of the hobby or activity, and try to categorize it.
4. If no hobby is mentioned, move on to the next comment.


After analyzing all comments, compile your findings into a JSON array of the hobbies (the extracted hobby description or null if no hobby was found). This JSON array should be your only output.

Here's an example of the expected JSON output structure for the following input comments:

["There's not much to do here, but you can always join a club or organization", "Math 231 is super tough - don't take it", "Can't wait for the basketball game"]

```json
["clubs and organizations", null, "basketball"]
```

Remember:
- Focus only on hobbies that are explicitly stated in the comments, but you can be generous in your interpretation of what constitutes a hobby.
- If more than one hobby is mentioned in a given comment, only extract the most prominent one. Don't extract multiple hobbies from the same comment.
- Do not make inferences or assumptions about potential hobbies that are not clearly expressed.
- If no hobbies or activities are found in the comment, then return null for that comment.
- ONLY return the JSON, without any newlines or any additional commentary.
- Return one entry for each comment, even if no hobbies or activities are found. You should have {num_comments} entries in your JSON output.

Begin your analysis now, processing each comment individually before compiling the final JSON output.
"""
                    }
                ]
            },
        ]
    )
    except (InternalServerError, RateLimitError) as e:
        print(f"Error {e}: Retrying after 30 seconds")
        time.sleep(30)
        message = get_classifications(comments, num_comments)
    return message
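
The function above retries by calling itself, which keeps going for as long as the error persists. If you would rather give up after a few attempts, a bounded retry loop is one option. The sketch below is generic: `call_with_retries`, its parameters, and the `flaky` demo function are hypothetical names for illustration and are not part of the Anthropic library.

```python
import time


def call_with_retries(func, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a retryable error wait and try again, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on as error:
            if attempt == max_attempts:
                raise  # Give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)  # Double the wait each time
            print(f"Error {error!r}: retrying in {delay:.0f} seconds")
            sleep(delay)


# Toy demonstration with a function that fails twice before succeeding
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# The sleep is replaced with a no-op so the demo runs instantly
result = call_with_retries(flaky, retry_on=(ConnectionError,), sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

With the API call, the same idea would wrap `client.messages.create(...)` in a function and pass `retry_on=(InternalServerError, RateLimitError)`.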

The code below is one version of how you might do this, and is the result of running into some issues with other approaches.

I found out that if you have too many comments, then it doesn't always keep track of which is which, so I batched them into groups of 10.

Then, if it still returns the wrong number, I go one comment at a time.

Also, note that I write out comments directly to a file, and skip to where I left off. This is a good practice so you don't have to start over if there's a network error.

In [273]:
comments = df.body.to_list()

# Batch the comments into groups of 10
batch_size = 10
comment_batches = [[{'text': x} for x in comments[i:i+batch_size]] for i in range(0, len(comments), batch_size)]


In [None]:
hobbies_fn = 'hobbies.csv'
try:
    with open(hobbies_fn, 'r') as f:
        hobbies = []
        for line in f:
            hobbies.append(line)
except FileNotFoundError:
    hobbies = []
hobbies_count = len(hobbies)

i = 0
with open(hobbies_fn, 'a') as f:
    for batch in comment_batches:
        i += len(batch)
        if i <= hobbies_count:
            continue
        # If we're partway through a batch, we need to start where we left off
        if (i - hobbies_count) < batch_size:
            batch = batch[(hobbies_count % batch_size):]

        print(f"Processing batch {i//batch_size} of {len(comment_batches)}")
        # Pass the actual batch length, since the last (or a resumed) batch can be shorter
        response = get_classifications(batch, len(batch))
        curr_hobbies = json.loads(response.content[0].text)
        # Sometimes it returns the wrong number of hobbies, so we need to reprocess one at a time
        if len(curr_hobbies) != len(batch):
            curr_hobbies = []
            for comment in batch:
                response = get_classifications([comment], 1)
                result = json.loads(response.content[0].text)
                # Keep the hobby only if exactly one entry came back
                curr_hobby = result[0] if len(result) == 1 else ""
                curr_hobbies.append(curr_hobby)
            
        for hobby in curr_hobbies:
            if hobby is None:
                hobby = ""
            f.write(hobby + '\n')
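
The loop above assumes that the response text is always valid JSON with one string or null per comment. Models occasionally wrap the JSON in a Markdown code fence or return something else entirely, so a defensive parser can help. Below is a minimal sketch under those assumptions; `parse_hobbies` is a hypothetical helper, and returning `None` is meant as the signal to fall back to one comment at a time.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_hobbies(response_text, expected_count):
    """Return a list of hobby strings, or None if the response is unusable."""
    text = response_text.strip()
    # Drop a surrounding Markdown code fence if the model added one
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, list) or len(parsed) != expected_count:
        return None
    # Treat null and non-string entries as "no hobby"; keep each hobby on one line
    return [" ".join(item.split()) if isinstance(item, str) else "" for item in parsed]


print(parse_hobbies('["clubs", null, "basketball"]', 3))
print(parse_hobbies('["clubs"]', 3))
```

Collapsing whitespace inside each hobby matters here because the output file stores one hobby per line, so a stray newline would throw off the line count used for resuming.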

We can then put the hobbies back into the original dataframe, and do things like filter by them, compare them across campuses, etc.

In [252]:
with open(hobbies_fn, 'r') as f:
    hobbies = []
    for line in f:
        hobbies.append(line.strip())

df['hobby'] = hobbies

In [259]:
df.to_csv('purdue_iu_comments_hobbies.csv', index=False)

In [261]:
df.loc[df.hobby != '', ['body', 'hobby']].head(20)

|    | body | hobby |
|----|------|-------|
| 10 | Looking for club info? Consider checking [Boil... | clubs |
| 14 | Surprises me how many people don't look both w... | biking |
| 16 | great photos! | photography |
| 17 | Hi! \n\nI am a professional photographer and ... | photography |
| 20 | We are trying to catch it but animal control i... | animal rescue |
| 23 | https://preview.redd.it/nkrxy9qq75wd1.jpeg?wid... | animal watching |
| 32 | The veterinary school has a vet clinic. I'd im... | visiting the veterinary school clinic |
| 34 | Check to see if the Humane Society is open | visiting the Humane Society |
| 36 | Oh trust me I would've adopted it in a heartbe... | adopting pets |
| 44 | Counter-point: the 2021 and 2022 Football seas... | watching football and basketball games |


In [276]:
df[df.hobby != ''].groupby('subreddit').count()

| subreddit | body | author | score | created_utc | post_id | hobby |
|-----------|------|--------|-------|-------------|---------|-------|
| IndianaUniversity | 1345 | 1345 | 1345 | 1345 | 1345 | 1345 |
| Purdue | 991 | 991 | 991 | 991 | 991 | 991 |
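
To compare hobbies across campuses, a natural next step is to count the most common hobby labels within each subreddit. The sketch below shows the idea in plain Python, using made-up `(subreddit, hobby)` rows in place of the real dataframe; `top_hobbies` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical (subreddit, hobby) rows standing in for the real dataframe
rows = [
    ("Purdue", "basketball"),
    ("Purdue", "photography"),
    ("Purdue", "basketball"),
    ("Purdue", ""),
    ("IndianaUniversity", "clubs"),
    ("IndianaUniversity", "basketball"),
    ("IndianaUniversity", "clubs"),
]


def top_hobbies(rows, n=2):
    """Return the n most common non-empty hobbies for each subreddit."""
    counts = defaultdict(Counter)
    for subreddit, hobby in rows:
        if hobby:  # Skip comments with no hobby
            counts[subreddit][hobby.lower()] += 1
    return {subreddit: counter.most_common(n) for subreddit, counter in counts.items()}


for subreddit, top in top_hobbies(rows).items():
    print(subreddit, top)
```

With the dataframe itself, something like `df[df.hobby != ''].groupby(['subreddit', 'hobby']).size().sort_values(ascending=False)` gives a similar count. Because the labels are free text (for example "photography" versus "watching football and basketball games" in the output above), you will probably need to merge similar labels before comparing.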
