Using Social Media and Online Community Data for Social Science Research

Jeremy Foote
Brian Lamb School of Communication
Purdue University

Types of online data

  • Experimental
  • Design
  • Observational

Three main types of observational online data

  • Web pages
  • Social Media
  • Online Communities

Online data is an incredible new opportunity

  • As part of normal operations, servers track what people do
    • Posts, comments, likes, pages visited, etc.
    • Completely unobtrusive
    • Retrospective - the data already exists even if we only realize later that it was important
  • May include millions of people and interactions

Research questions for social media data

  • How do ideas/links/hashtags/norms spread?
  • How do people talk about a topic?
  • Who are the central people talking about a topic?
  • How do people’s beliefs change over time? And whose beliefs change?

Research questions for online communities

  • Who starts new communities?
  • Do communities have lifecycles?
  • What types of interaction networks predict group outcomes?
  • How do communities compete with each other?
  • How does leadership emerge?

Gathering online data

Three primary means of getting online data

Curated data is ideal

  • Curated datasets
    • Wikipedia dumps, COVID tweets, etc.
    • Often flat files
    • Often cleaned

APIs are the next best thing

  • Application Programming Interfaces
    • Programmatic access to a server’s database
    • Often built for external developers, not researchers
    • Outages, code changes, etc. are often invisible
    • Typically in JSON
    • Some wrappers to make this easier (e.g., rtweet in R)
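
As a rough illustration of what "programmatic access" and "typically in JSON" look like in practice, the sketch below builds a request URL for the MediaWiki Action API and flattens a JSON response into rows. The helper names and the sample payload are made up for illustration, and nothing is fetched over the network here. A real script would download the URL (for example with requests.get(url).json()) and should check the API's own documentation for current parameters and rate limits.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_revisions_url(title, limit=5):
    """Build a query URL asking for recent revisions of one page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "timestamp|user",
        "rvlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_revisions(payload):
    """Flatten the nested JSON into one dict per revision."""
    rows = []
    for page in payload.get("query", {}).get("pages", {}).values():
        for rev in page.get("revisions", []):
            rows.append(
                {
                    "title": page.get("title"),
                    "user": rev.get("user"),
                    "timestamp": rev.get("timestamp"),
                }
            )
    return rows


# Invented payload shaped like a typical response (assumption, not real data).
SAMPLE_RESPONSE = json.loads("""
{
  "query": {
    "pages": {
      "1": {
        "pageid": 1,
        "title": "Example page",
        "revisions": [
          {"user": "ExampleEditorA", "timestamp": "2021-01-02T03:04:05Z"},
          {"user": "ExampleEditorB", "timestamp": "2021-01-01T00:00:00Z"}
        ]
      }
    }
  }
}
""")

url = build_revisions_url("Purdue University")
rows = extract_revisions(SAMPLE_RESPONSE)
print(url)
for row in rows:
    print(row)
```

Using .get() with defaults throughout is a deliberate choice: API responses change without notice, and a missing field should produce an empty value rather than a crash halfway through a long collection run.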

Screen scraping is often the only option

  • Easy for static pages - just download them
  • Can also be used on complicated pages
    • Program pretends to be a browser and extracts data from the HTML page
  • Can be against ToS
  • Programs break often
  • The only way to get data not made available via APIs
  • requests + BeautifulSoup
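
A minimal sketch of the "extract data from the HTML page" step follows. To keep it runnable with no installation and no network access, it swaps in Python's standard-library html.parser for BeautifulSoup and parses an inline, invented HTML snippet. With the tools named above, requests would download the page and BeautifulSoup would do the parsing. The class name LinkExtractor and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

# Invented stand-in for a downloaded forum index page.
SAMPLE_HTML = """
<ul class="threads">
  <li><a href="/thread/1">Welcome to the community</a></li>
  <li><a href="/thread/2">Rules and <b>norms</b></a></li>
</ul>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


parser = LinkExtractor()
parser.feed(SAMPLE_HTML)
links = parser.links
for href, text in links:
    print(href, "->", text)
```

The scraper depends entirely on the page's markup, which is why such programs break often: if the site changes its tags or nesting, the extraction logic has to be rewritten.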

Where are the good data sources?

  • Places where the conversation is already public
    • Twitter
      • Very easy in Python and R
      • Incredibly generous Academic API for retrieving tweets
    • Reddit
      • API is OK but rate limited
      • Pushshift is amazing!
    • Wikipedia, Github
  • Kaggle

Benefits and drawbacks to online data

  • So much data!
  • Unobtrusively collected
  • Amenable to machine learning approaches
  • These are all benefits and drawbacks!

Dealing with lots of data

  • Can find even small effects
  • Can do analyses of often neglected subgroups
  • Requires IT + cluster computing for many questions

Unobtrusively collected data

  • Subjects don’t modify their behavior
  • But they also often don’t know that their data is accessible to researchers!
  • Informed consent is typically impossible
  • There are emerging best practices around the ethics of using online data
    • e.g., aggregating, not quoting directly or including usernames, etc.
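
One possible way to put "aggregating" and "not including usernames" into practice is sketched below: usernames are replaced with keyed hashes before analysis, and only community-level counts are reported. This is an illustrative assumption, not an established standard. The key, field names, and sample posts are invented, and pseudonymization alone does not guarantee that people cannot be re-identified.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret; in practice keep it private and out of shared code.
SECRET_KEY = b"replace-with-a-private-random-key"


def pseudonymize(username, key=SECRET_KEY):
    """Map a username to a stable ID that cannot be reversed without the key."""
    digest = hmac.new(key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


# Invented example records.
posts = [
    {"username": "alice", "community": "gardening", "text": "..."},
    {"username": "bob", "community": "gardening", "text": "..."},
    {"username": "alice", "community": "cycling", "text": "..."},
]

# Drop the raw username and the text; keep only what the analysis needs.
safe_rows = [
    {"author_id": pseudonymize(p["username"]), "community": p["community"]}
    for p in posts
]

# Report aggregates rather than individual-level rows.
posts_per_community = Counter(row["community"] for row in safe_rows)
authors_per_community = {
    community: len({r["author_id"] for r in safe_rows if r["community"] == community})
    for community in posts_per_community
}

print(posts_per_community)
print(authors_per_community)
```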

Amenable to machine learning

  • Great advances in NLP and ML
  • Machine-only approaches have biases and blind spots

Some other downsides

  • Users != individuals
    • Multiple accounts, bots, etc.
  • Very little demographic information
  • Ethically and practically difficult to tie people to offline behavior
    • Aggregate behavior can still be useful, though (e.g., Google Flu Trends)

Recent research using online data

Project 1 - Google Maps and OpenStreetMap

“Zooming in” on the individual-level causes

  • Big question: how does competition affect an open-source mapping community?
    • It hurts it!
  • Having digital data lets them ask why
    • New members stop contributing but established members increase their dedication

Figure 3 from Nagaraj and Piezunka, 2020

Project 2 - How do protests spread on Twitter?

How do people decide when to post a supportive hashtag?

  • People have different thresholds before joining in (Granovetter, 1978).
  • González-Bailón et al. measured this threshold empirically
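
The threshold idea can be made concrete with a toy simulation. The sketch below follows the logic of Granovetter's model: each person joins once the number of people already participating reaches their personal threshold. It is only an illustration of the model, not the empirical measurement used by González-Bailón et al., and the function name and threshold values are assumptions chosen for the example.

```python
def simulate_cascade(thresholds):
    """Return how many people have joined once nobody else wants to.

    Each threshold is the number of participants a person needs to see
    before joining in; a threshold of 0 means they act unprompted.
    """
    joined = 0
    while True:
        now_willing = sum(1 for t in thresholds if t <= joined)
        if now_willing == joined:
            return joined
        joined = now_willing


# 100 people with thresholds 0, 1, 2, ..., 99: each joiner tips the next.
uniform = list(range(100))

# Same crowd, except the person with threshold 1 now needs to see 2 others.
perturbed = [0, 2] + list(range(2, 100))

print(simulate_cascade(uniform))    # 100
print(simulate_cascade(perturbed))  # 1
```

Changing a single person's threshold turns a full cascade into one lone participant, which is why measuring the actual distribution of thresholds matters.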

Project 3 - Wikipedia’s declining userbase

  • Wikipedia study found a “rise and decline” pattern.
  • Caused by Wikipedia’s quality control

Comparing many communities

Looking at many communities helps to identify patterns

  • TeBlunthuis et al. found that this pattern was common across many wikis of varying topics and sizes

Project 4 - Combining computational and qualitative approaches

How do different communities characterize risk?

  • Project led by Tiwalade Adekunle
  • Get comment data from pro- and anti-mask subreddits
  • Use NLP to identify salient topics
  • Gain qualitative understanding of topics and identify other insights
  • Develop and test hypotheses
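
The slides do not say which NLP method the project used. As a much simpler stand-in for the "identify salient topics" step, the sketch below scores words by how much more often they appear in one group of comments than in another, using a smoothed log ratio of word frequencies. The function names and the example comments are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def log_ratio_scores(corpus_a, corpus_b, smoothing=1.0):
    """Score each word: positive = more typical of A, negative = of B."""
    counts_a = Counter(w for doc in corpus_a for w in tokenize(doc))
    counts_b = Counter(w for doc in corpus_b for w in tokenize(doc))
    vocab = set(counts_a) | set(counts_b)
    # Add-one style smoothing so unseen words do not cause division by zero.
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + smoothing) / total_a
        p_b = (counts_b[word] + smoothing) / total_b
        scores[word] = math.log(p_a / p_b)
    return scores


# Invented comments standing in for two subreddits.
group_a = [
    "masks protect my family and neighbors",
    "wearing masks keeps the hospital from filling up",
]
group_b = [
    "mandates take away my freedom",
    "freedom matters more than mandates",
]

scores = log_ratio_scores(group_a, group_b)
ranked = sorted(scores, key=scores.get, reverse=True)
print("Most typical of group A:", ranked[:3])
print("Most typical of group B:", ranked[-3:])
```

Word lists like these are only a starting point. The qualitative step that follows is what turns them into an understanding of how each community talks about risk.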

Further reading

