A new paper that our group has published tests whether the kinds of communication patterns associated with successful offline teams also predict success in online collaborative settings. Surprisingly, we find that they do not. In the rest of this blog post, we summarize that research and unpack that result.
Many of us have been part of a work team where everyone clicked. Everyone liked and respected each other; maybe you even hung out together outside of work. In a team like that, when someone asks you to cover a shift, or asks you to stay late to help them finish a project, you do it.
This anecdotal experience that many of us have is borne out by research. When members of work groups in corporate settings feel integrated into a group, and particularly when their identity is connected to their group membership, they are more willing to contribute to the group's goals. Integrative groups (where there isn't a strong hierarchy and where very few people are on the periphery) are also able to communicate and coordinate their work better.
One way to measure whether a group is "integrative" is to look at the group's conversation networks, as shown in the figure below. Groups where few people are on the periphery (like on the left) usually perform better along a number of dimensions, such as creativity and productivity.
In our new paper, we set out to look for evidence that early online wiki communities at Fandom.com work the same way as work groups. When communities are getting started, there are lots of reasons to think that they would also benefit from integrative networks. Their members typically don't know each other and communicate mostly via text—conditions that should make building a shared identity tough. In addition, they are volunteers who can easily leave at any time. The research on work groups made us think that integrative social structures would be especially important in making new wikis successful.
In order to measure the social structure of these communities, we created communication networks for almost 1,000 wikis for the talk that happened during their first 700 main page edits. Connections between people were based on who talked to whom on Talk pages. These are wiki pages connected to each page and each registered user on a wiki. We connected users who talked to each other at least a few times on the same talk pages, and looked at whether the integrativeness of a communication network predicted 1) how much people contributed and 2) how long a wiki remained active.
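To give a concrete (if toy) sense of what building such a network involves, here is a small sketch in Python using networkx. The users and talk pages are made up, and this is an illustration of the general idea rather than the paper's actual pipeline:

import networkx as nx
from itertools import combinations

# Hypothetical data: which registered users posted on which talk pages
talk_pages = {
    'Talk:Main_Page': ['Alice', 'Bob', 'Carol'],
    'User_talk:Bob': ['Alice', 'Bob'],
    'Talk:Episode_list': ['Bob', 'Carol'],
}

G = nx.Graph()
for page, users in talk_pages.items():
    for u, v in combinations(sorted(set(users)), 2):
        # Weight each tie by the number of talk pages the pair shares
        if G.has_edge(u, v):
            G[u][v]['weight'] += 1
        else:
            G.add_edge(u, v, weight=1)

# Simple descriptive measures of how "integrative" the structure is
print(nx.degree_centrality(G))
print(nx.density(G))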
Surprisingly, we found that no matter how we measured communication networks, and no matter how we measured success, integrative network measures were not good at predicting that a wiki would survive or be productive. While a few of our control variables helped to predict productivity and survival, none of the network measures (nor all of them taken together) helped much to predict either of our success measures, as shown in Figures 5 and 6 from the paper.
So, what is going on here?
We have a few possible explanations for why communication network structures don't seem to matter. One is that group identity for wiki members may not be influenced much by network structure. In a work group, it can be painfully obvious if you are on the periphery and not included in conversations or activities. Even though wiki conversations are technically all public and visible, in practice it's very easy for group members to be unaware of conversations happening in other parts of the site. This idea is supported by research led by Sohyeon Hwang, which showed that people can build identity in an online community even without personal relationships.
Another complementary explanation for how groups coordinate work without integrative communication networks is that wiki software helps to organize what needs to be done without explicit communication. Much of this happens just because the central artifact of the community—the wiki—is continuously updated, so it is (relatively) clear what has been done and what needs to be done. In addition, there are opportunities for stigmergy. Stigmergy occurs when actors modify the environment as a way of communicating; others then make decisions based on observing the environment. The canonical example is ants, which leave pheromone trails for other ants to find and follow.
In wikis, this can be accomplished in a few ways. For example, contributors can create a link to a page that doesn't yet exist. By default, these show up as red links, suggesting to others that a page needs to be created.
A final possible explanation for our results is based on how easy it is to join and leave online communities. It may be that integrative structures are so important because they help groups to overcome and navigate conflicts; in online communities contributors may be more likely to simply disengage instead of trying to resolve a conflict.
As we conclude in the paper:
Why do communication networks—important predictors of group performance outcomes across diverse domains—not predict productivity or survival in peer production? Our findings suggest that the relationship of communication structure to effective collaboration and organization is not universal but contingent. While all groups require coordination and undergo social influence, groups composed of different types of people or working in different technological contexts may have different communicative needs. Wikis provide a context where coordination via stigmergy may suffice and where the role of cheap exit as well as the difficulty of group-level conversation may lead to consensus-by-attrition.
We hope that others will help us to study some of these mechanisms more directly, and look forward to talking more with researchers and others interested in how and why online groups succeed.
The full citation for the paper is: Foote, Jeremy, Aaron Shaw, and Benjamin Mako Hill. 2023. “Communication Networks Do Not Predict Success in Attempts at Peer Production.” Journal of Computer-Mediated Communication 28 (3): zmad002. https://doi.org/10.1093/jcmc/zmad002.
We have also released replication materials for the paper, including all the data and code used to conduct the analyses.
This post was also posted on the CDSC blog at https://blog.communitydata.science/the-social-structure-of-new-wiki-communities/.
I use Google Calendar to organize my life, so I was disappointed–nay, horrified–when I learned that Purdue didn't provide any sort of reasonable format for the calendar, only a webpage or a PDF document (this year's official calendar is at https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html).
So, I decided to write a little script to grab the information from the website and put it into a CSV file that Google Calendar can understand (CSV format for their calendars taken from https://support.google.com/calendar/answer/37118?hl=en).
I'm writing this up and publishing it for two reasons: first, for other Purdue community members who want to import the calendar into their own calendar program (although the easier way is to just subscribe to my public Google calendar), and second, as an example of doing screen scraping with Beautiful Soup.
# First, we import the required libraries
import requests # This fetches the webpage
from bs4 import BeautifulSoup # This is the parsing library
from datetime import datetime, timedelta # Need to convert things to/from date objects
import re # Regular expressions
import pandas as pd # Pandas, just for exporting the final result
The first step is to download the webpage as HTML, and then to parse it into a soup object.
doc = requests.get('https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html')
soup = BeautifulSoup(doc.text)
At this point, it makes sense to explore the structure of the page. You could look at the HTML or at the soup object but it’s usually much easier to right click on the page itself and “Inspect” it (both Chrome and Firefox have this).
What we want to do is figure out a good path to the data we're after: the dates and the descriptions of the events. If you look at the page, it's structured something like this (with irrelevant portions omitted):
<html>
...
<div class = "maincontent col-lg-9 col-md-9 col-sm-9 right">
<h4> August </h4>
...
<table>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">
19
</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">
<strong>7:30 a.m.</strong><br>
FALL SEMESTER CLASSES BEGIN
</td>
...
</tr>
</table>
...
<h4> September </h4>
...
</div>
...
</html>
What we’re looking for are unique descriptors that will get us the data we’re looking for and nothing else.
All of the calendar data is in the div with the 'maincontent' class, so we'll use that to limit our search. Then, dates are in <h4> tags, followed by a table with a row for each entry in the calendar for that month.

At this point, it takes some playing around to find the right syntax. It turns out that each month header is followed directly by the table of the events, making the event table the 'sibling' of the month, which we can access with BeautifulSoup's find_next_sibling() function.
for item in soup.find('div', 'maincontent').find_all('h4')[:2]: # Only looking at the first few for testing
    print(item.text) # Should be the month
    table = item.find_next_sibling()
    print(table)
August
<table class="calendarTable" summary="blah">
<tbody>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">19</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong>7:30 a.m.</strong><br/>FALL SEMESTER CLASSES BEGIN</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">26</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong>5 p.m.</strong><br/>Last Day to Register Without a Late Fee</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
</tbody>
</table>
September
<table class="calendarTable" summary="blah">
<tbody>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">2</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">Last Day to Cancel a Course Assignment Without It Appearing On Record</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">2</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">LABOR DAY (No Classes)</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">16</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong><strong>5 p.m.</strong><br/></strong>Last Day to Withdraw a Course With a Grade of W or to Add/Modify a Course With Instructor and Advisor Signature</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">30</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong><strong><strong>5 p.m.</strong><br/></strong></strong>Last Day For Grade Correction for Spring Semester 2018-19 and 2019 Summer Session</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
</tbody>
</table>
This looks really good, so the next step is just to extract the content from the table.
Each event has its own row, so we parse the table to find the <td> tags with the information we're looking for.
for item in soup.find('div', 'maincontent').find_all('h4')[:3]:
    curr_month = item.text
    table = item.find_next_sibling()
    for event in table.find_all('tr'):
        day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
        description_string = event.find('td', 'description').text
        print(curr_month, day_string, description_string)
August 19 7:30 a.m.FALL SEMESTER CLASSES BEGIN
August 26 5 p.m.Last Day to Register Without a Late Fee
September 2 Last Day to Cancel a Course Assignment Without It Appearing On Record
September 2 LABOR DAY (No Classes)
September 16 5 p.m.Last Day to Withdraw a Course With a Grade of W or to Add/Modify a Course With Instructor and Advisor Signature
September 30 5 p.m.Last Day For Grade Correction for Spring Semester 2018-19 and 2019 Summer Session
October 7
Schedule of Classes published for Spring 2020 Term
October 7-8
OCTOBER BREAK
October 16
7:30 a.m.Second Eight-Week Courses Begin
October 22
5 p.m.Last Day To Withdraw From a Course With a W or WF Grade
October 22
5 p.m.Last Day To Add/Modify a Course With Instructor, Advisor, and Department Head Signatures
It looks like we have the data that we need, but we need to deal with two problems.
First, some events include a start date and an end date (e.g., October 7-8), so we need to figure out how to deal with that.
Second, the descriptions sometimes include times (which I don’t want), sometimes have linebreaks, etc. We want to clean them up to just show the event name itself.
Let's start by figuring out the start and end date problem. We basically want to turn the string '7-8' into the list ['7', '8']. We do that with the split() function, which takes as an argument the character to use to split on:
'7-8'.split('-')
['7', '8']
We clean up the event names using the strip() function, which removes whitespace, and using regular expressions to remove the time.

Regular expressions are a super powerful way of manipulating text and are worth learning. Unlike much of Python, they are very difficult to parse, so I'll take a minute to explain the following code:
re.sub('^\d.*?[ap]\.m\.','',description_string.strip())
re.sub is a function from the regular expression library that takes 3 arguments: the regular expression pattern you want to search for, what you want to replace that pattern with, and the string where you want to search.

The pattern ^\d.*?[ap]\.m\. finds a time if it appears at the beginning of some descriptions. They take various formats, such as '5:30p.m.' or '4 p.m.', and we need to capture all of them.

- ^ means to start searching at the beginning of the string.
- \d means that the first character is a digit (i.e., a number from 0-9).
- . means look for any character, and * means of any quantity, so \d.* means to look for a number followed by zero or more characters of any type.
- [ap] means to look for either the 'a' or 'p' characters, and [ap]\.m\. means to look for 'a' or 'p', followed by a '.', followed by 'm', followed by another '.' (the '\' characters mean to treat '.' like a normal '.' and not as a representation for any character).
- Finally, you may have noticed that I skipped the ?: this makes the previous expression 'non-greedy'; in other words, it matches as little as possible.
An example may make this more clear:
string = 'Lorem ipsum dolor'
re.search('^L.*m', string).group(0)
'Lorem ipsum'
re.search('^L.*?m', string).group(0)
'Lorem'
In the first case, the pattern matches until the last m that it finds, while in the second it matches until the first m.
So, taken together the original regular expression means, "Look for a digit at the beginning of the string. Then, get all of the text until you come to either 'a.m.' or 'p.m.'." The next argument to re.sub, the empty string (''), means to remove the text that matches.
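As a quick check, running the pattern on one of the descriptions we saw earlier strips the leading time and leaves just the event name:

re.sub('^\d.*?[ap]\.m\.', '', '5 p.m.Last Day to Register Without a Late Fee')
'Last Day to Register Without a Late Fee'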
Here’s some code that puts all of this together.
After the first version of this, I realized that the dates go over multiple years and the year is not listed anywhere on the page, so we need to keep track of when we move into the next year. We do this by assuming that the start dates are in chronological order. By converting each event date into a datetime we can identify when the current entry is earlier than the last entry, and assume that means we’ve actually moved to the next year.
events = []
curr_month = None
curr_year = '2019'
# Initialize to earliest possible date (used to change year)
last_date = datetime.strptime('2019-01-01', '%Y-%m-%d')

def get_start_end_dates(year, month, days):
    '''
    Function to convert strings of year, month, and days into a datetime object
    Input:
        year: 4-digit string
        month: Full name of month (e.g., 'January')
        days: String of one date or two dates separated by '-'
    Output:
        A tuple of datetime objects representing the start and end dates.
        Assumes that both days are within the same month.
    '''
    def get_date(day):
        '''Very simple helper function that converts strings to a datetime'''
        return datetime.strptime(month + day + year, '%B%d%Y')
    days = days.split('-')
    start_date = get_date(days[0])
    end_date = get_date(days[-1])
    return (start_date, end_date)

for item in soup.find('div', 'maincontent').find_all('h4'):
    curr_month = item.text
    table = item.find_next_sibling()
    for event in table.find_all('tr'):
        day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
        start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
        if start_date < last_date:
            curr_year = '2020' # update the year if the current event is out of order
            start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string) # And get new start and end dates
        last_date = start_date
        description_string = event.find('td', 'description').text
        description = re.sub('^\d.*?[ap]\.m\.','',description_string.strip())
        # CSV format is given at https://support.google.com/calendar/answer/37118?hl=en
        curr_event = {
            'Subject': description,
            'Start Date': start_date,
            'End Date': end_date,
            'All Day Event': True,
            'Private': False
        }
        events.append(curr_event)

# We then use pandas simply to write the output to a CSV file
pd.DataFrame(events).to_csv('/home/jeremy/Desktop/DeleteMe/purdue_cal.csv',
                            index=False)
We’ll take a quick look at the output to make sure it worked
pd.read_csv('/home/jeremy/Desktop/DeleteMe/purdue_cal.csv').sample(10)
  | Subject | Start Date | End Date | All Day Event | Private |
---|---|---|---|---|---|
52 | Third 4-Week Summer Module Begins | 2020-07-13 | 2020-07-13 | True | False |
17 | COMMENCEMENT (First Division) | 2019-12-15 | 2019-12-15 | True | False |
54 | 12-Week Full Summer Module Ends (Grades due by... | 2020-08-07 | 2020-08-07 | True | False |
34 | Deadline For Pending Spring 2019 Incomplete Gr... | 2020-05-09 | 2020-05-09 | True | False |
6 | Schedule of Classes published for Spring 2020 ... | 2019-10-07 | 2019-10-07 | True | False |
56 | Third 4-Week Summer Module Ends (Grades due by... | 2020-08-07 | 2020-08-07 | True | False |
43 | 12-Week Full Summer Module Begins | 2020-05-18 | 2020-05-18 | True | False |
18 | COMMENCEMENT (Second Division) | 2019-12-15 | 2019-12-15 | True | False |
1 | Last Day to Register Without a Late Fee | 2019-08-26 | 2019-08-26 | True | False |
23 | Last Day to Cancel a Course Assignment Without... | 2020-01-27 | 2020-01-27 | True | False |
It looks great, and works for events with different start and end dates. This is ready to upload to Google Calendar. But let’s level up.
If I knew I only wanted to get data this one time from this one page then what I have above would be just fine. However, there are calendars for multiple years and I’d like to get all of them. How could we extend what we’ve done?
What I want to do is put the code into a function that I can call with different pages.
So, what’s missing to let us do that?
First, we have the year hardcoded in - we need to figure out how to either extract it or take it as a parameter.
Second, we need to aggregate the content before writing it to a CSV file.
I noticed that the year is in the URL, so I’m thinking we can take the url as a parameter and just extract the year, like so:
re.search('(\d{4})-', 'https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html').group(1)
'2019'
\d{4} looks for a 4-digit number, and the parentheses mean to put it in its own 'group'. The - means that the number has to come right before a dash. Then, at the end of the expression, group(1) means to return the first group that was in parentheses.
To update the year, we’ll need to take out the hardcoding. In order to increment a year that’s a string we have to turn it into an integer, add 1 to it, and then turn it back into a string, like so:
str(int('2020') + 1)
'2021'
In order to aggregate the data, we’ll create a function that will get all of the events from one page and return them as a list of event dictionaries. We will have an outer loop that figures out which URLs to look at, passes them to the function, and then appends the events from each page together and saves them.
def get_events(url):
    # Notice that the URL includes the year in it, so we can extract that
    doc = requests.get(url)
    soup = BeautifulSoup(doc.text)
    events = []
    curr_month = None
    curr_year = re.search('(\d{4})-', url).group(1)
    # Initialize to earliest possible date (used to change year)
    last_date = datetime.strptime(curr_year + '-01-01', '%Y-%m-%d')

    def get_start_end_dates(year, month, days):
        '''
        Function to convert strings of year, month, and days into a datetime object
        Input:
            year: 4-digit string
            month: Full name of month (e.g., 'January')
            days: String of one date or two dates separated by '-'
        Output:
            A tuple of datetime objects representing the start and end dates.
            Assumes that both days are within the same month.
        '''
        def get_date(day, month = month, year = year):
            '''Very simple helper function that converts strings to a datetime'''
            if not re.match('\d', day): # If it doesn't start with a digit, it crosses months
                month, day = day.split() # Split on whitespace and grab the month and day
            try:
                return datetime.strptime(month + day + year, '%B%d%Y')
            except ValueError:
                month = month.strip('.')
                return datetime.strptime(month + day + year, '%b%d%Y')
        days = days.split('-')
        start_date = get_date(days[0])
        end_date = get_date(days[-1])
        return (start_date, end_date)

    for item in soup.find('div', 'maincontent').find_all('h4'):
        curr_month = item.text
        table = item.find_next_sibling()
        for event in table.find_all('tr'):
            day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
            start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
            if start_date < last_date:
                curr_year = str(int(curr_year) + 1) # update the year if the current event is out of order
                start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
            last_date = start_date
            description_string = event.find('td', 'description').text
            description = re.sub('^\d.*?[ap]\.m\.','',description_string.strip())
            # CSV format is given at https://support.google.com/calendar/answer/37118?hl=en
            curr_event = {
                'Subject': description,
                'Start Date': start_date,
                'End Date': end_date,
                'All Day Event': True,
                'Private': False
            }
            events.append(curr_event)
    return events # Return the list of all events
Now we can write our outside function, which by convention is called ‘main()’.
Currently, Purdue lists calendars until 2024-2025; let’s grab all of them.
We can build a simple loop to create the URLs based on the structure we know they have:
for i in range(19, 25):
    url = 'https://www.purdue.edu/registrar/calendars/20{}-{}-Academic-Calendar.html'.format(i, i+1)
    print(url)
https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2020-21-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2021-22-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2022-23-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2023-24-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2024-25-Academic-Calendar.html
def main(filename):
    events = []
    for i in range(19, 25):
        url = 'https://www.purdue.edu/registrar/calendars/20{}-{}-Academic-Calendar.html'.format(i, i+1)
        events = events + get_events(url)
    pd.DataFrame(events).to_csv(filename,
                                index=False)
As is often the case when enlarging the data you're retrieving, I found a weakness. In some of the future years, events actually go across multiple months. In these cases, the month was included in the day td, so the day list ended up being something like ['April 25', 'May 1']. This made it super easy to fix - I just wrote a little regex to check whether a day started with a number or a letter, and to parse the month out if it started with a letter. I put the code in the get_date helper function above.
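To see what that fix handles, here is what a cross-month day string looks like when split (the specific dates are just an example in that format):

'April 25-May 1'.split('-')
['April 25', 'May 1']

Because 'April 25' doesn't start with a digit, get_date splits it on whitespace to pull out the month name:

'April 25'.split()
['April', '25']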
Ok, now we’re finally ready to run it. We run the main function and then check the output to make sure it looks reasonable.
After it’s done, I’ll use it to update my Google Calendar.
main('/home/jeremy/Documents/purdue_cal.csv')
# Look at some random rows to see how it looks
pd.read_csv('/home/jeremy/Documents/purdue_cal.csv').sample(15)
  | Subject | Start Date | End Date | All Day Event | Private |
---|---|---|---|---|---|
37 | COMMENCEMENT (Second Division) | 2020-05-15 | 2020-05-15 | True | False |
131 | SEMESTER ENDS | 2021-12-18 | 2021-12-18 | True | False |
152 | COMMENCEMENT (First Division)** | 2022-05-13 | 2022-05-13 | True | False |
341 | Second 4-Week Summer Module Ends (Grades due b... | 2025-07-11 | 2025-07-11 | True | False |
282 | First 8-Week Summer Module Ends (Grades due by... | 2024-07-05 | 2024-07-05 | True | False |
26 | Second Eight-Week Courses Begin | 2020-03-09 | 2020-03-09 | True | False |
40 | COMMENCEMENT (Fifth Division) | 2020-05-17 | 2020-05-17 | True | False |
25 | Last Day For Grade Correction For Fall Semest... | 2020-02-24 | 2020-02-24 | True | False |
186 | CLASSES END | 2022-12-10 | 2022-12-10 | True | False |
234 | Last Day to Cancel a Course Assignment Without... | 2023-09-04 | 2023-09-04 | True | False |
326 | COMMENCEMENT (First Division)** | 2025-05-16 | 2025-05-16 | True | False |
88 | Spring Vacation | 2021-03-15 | 2021-03-20 | True | False |
246 | FINAL EXAMS | 2023-12-11 | 2023-12-16 | True | False |
160 | First 4-Week Summer Module Begins | 2022-05-16 | 2022-05-16 | True | False |
183 | Last Day To Withdraw From a Course With a W or... | 2022-10-25 | 2022-10-25 | True | False |
So, that is a basic introduction to BeautifulSoup and to using it to parse web pages.
As you can see, it's a really powerful tool. In this case, I probably could have done the work manually in less time than it took me to write this script, but imagine having hundreds or thousands of web pages to parse. This kind of scripting can save hundreds of hours of time. And, it's fun!
The Master's degree went well enough that I decided to pursue a PhD (the phrase is apt – some days I feel I'm running after a quickly escaping quarry). For the past nearly five years I have been at Northwestern University, in what I think is the only other Media, Technology, and Society program in the country. I became the second PhD student in (and helped to name) the Community Data Science Collective, a group that now includes six PhD students (and more on the way!). When I was choosing between PhD programs I tried to find an advisor I could work with and who could help me to learn to become a scholar. I chose incredibly well: Aaron Shaw has been a model of academic passion, integrity, mentorship, and kindness. He and Mako Hill (the other PI of our lab) have become academic role models as well as friends.
More broadly, I have come to really love my academic communities. The CSCW, Organizational Communication, and Computational Social Science communities are full of really great people doing cool and important work. In graduate school, I’ve learned that academics are “my people” and I love being able to be my nerdy self with all of you.
In a job market that is difficult and full of noisy outcomes, I am so excited that I have a path to continue being part of these communities. This week, I officially accepted an offer to go back to the Purdue University Brian Lamb School of Communication, this time as an Assistant Professor!!
I feel incredibly blessed to join such a wonderful department and university. In addition to the great faculty who were there when I was, the department has hired some stellar people in recent years (present company excluded). Many of my future colleagues have overlapping and complementary interests and I look forward to learning from and working with them. I am particularly excited to learn better how to do work that impacts and intersects with public policy, something that the department excels in.
This is the end of what has been a great chapter in my life. I have heard lots of grad school horror stories, but our program at Northwestern and my corner of academia have been incredibly friendly and supportive. I should probably save it for the dissertation acknowledgements, but I’m most thankful for my wife, Kedra. She has a degree in genetics and still has only a begrudging respect for the social sciences, but she has been endlessly supportive in too many ways to count.
Now, I look forward to proving that Purdue didn’t make a giant mistake in hiring me (starting by finishing my dissertation!). I have incredible collaborators and some exciting projects in the pipeline that I’m really excited to work on. Stay tuned! :)
However, I've noticed that even though the baby has been well-behaved and even sleeps pretty well, our home life has felt more stressful and chaotic. At first, I chalked it up to the effects of getting less sleep and the start of a new school year. Kids don't always deal well with change.
While I think that’s a reasonable explanation, I realized that there’s also a mathematical explanation. Being a parent of young kids has lots and lots of great moments, but it also has its stresses. One of the most stressful is when kids have a meltdown. Dealing with a child who can’t control themselves or their emotions takes the full attention of a parent and takes a lot of emotional and mental energy. When two children are melting down at the same time, even with two parents around, it’s incredibly stressful. When one kid is having problems, the other parent can pick up some of the slack (helping the other kids, making dinner, etc.). When both parents are occupied, nothing else gets done.
I realized that as the number of kids goes up, the probability of two or more having a meltdown goes up very quickly (at first). If we assume that meltdowns are independent [1] and children have a constant probability of meltdown, then whether any given child is melting down can be treated like a coin flip, and the number of meltdowns in a given time period can be modeled as a binomial distribution. A binomial distribution gives the probability of a given number of successes in a fixed number of trials (e.g., the number of heads in 10 coin flips). Formally, the chance of having more than x simultaneous meltdowns in a given time period is equal to
\[P(M > x \mid n, p) = 1 - \sum_{i=0}^{x} {n \choose i} p^i (1-p)^{n-i}\]

where n is the number of children, and p is each child's probability of melting down.
So, if we look at the chance of having at least 2 simultaneous meltdowns, and assume that each child has an individual meltdown probability of .1 in a given half hour:
At those meltdown rates, with 3 children, we had a 2.8% chance of experiencing a two-or-more-person meltdown in any given half hour. However, with our fourth child, we've nearly doubled that likelihood, to 5.2%.
Things look even worse if we consider it at the level of a day. In the four hours after school and before bed, with three kids we had a 20% chance of seeing a two-person meltdown each day, but with four we’ve moved to a 35% chance.
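These numbers are easy to check with a few lines of Python. Here is a minimal sketch (assuming, as above, a 0.1 chance of a meltdown per child per half hour and eight half-hour blocks in the evening):

from scipy.stats import binom

p = 0.1                                 # assumed per-child meltdown probability in a half hour
for n in [3, 4]:                        # number of children
    p_multi = 1 - binom.cdf(1, n, p)    # P(two or more simultaneous meltdowns) in one half hour
    p_evening = 1 - (1 - p_multi) ** 8  # chance of at least one such half hour over 4 hours
    print(n, round(p_multi, 3), round(p_evening, 2))
# 3 0.028 0.2
# 4 0.052 0.35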
In reality, meltdowns are correlated, and one child’s meltdown often causes another’s, exacerbating the problem!
The lesson of this post is not to avoid having lots of kids! There are lots of great things about kids that also scale super-linearly. The lesson for me is that probabilistic thinking does not come naturally but can be a powerful tool in understanding the world.
[1] Of course, meltdowns are likely to be correlated, as one child's meltdown can directly cause another's!
There are a few different ways to see Wikipedia editors. One way is as members of an organization. Through their contributions, editors gain power and opportunity to shape the organization in their image. From this perspective, inequalities in Wikipedia are deeply troubling. Even though Wikipedia is nominally "the encyclopedia anyone can edit", the same privileged groups that run traditional organizations have appropriated the power and influence in Wikipedia. Unsurprisingly, the interests of educated white men are overrepresented on Wikipedia while other topics are underrepresented. Even the way articles are written reinforces biases and stereotypes (Wikipedia's summary of the problem).
However, Wikipedia editors are also contributors to a public good. Public goods are non-rivalrous (my reading of Wikipedia doesn’t diminish your ability to read it) and non-excludable (everyone has access to Wikipedia, not just those who edit it). Public goods sound really great but there’s a catch (there’s always a catch!). According to standard economic and game theory models, public goods are related to prisoner’s dilemmas - everyone would prefer to let others contribute, since you get the benefits of the good whether or not you contribute.
In these sorts of games, an individual’s goal is to get the good without paying for it. When viewed through this lens, a situation where privileged groups are overrepresented as editors is preferred. Wikipedia acts as a means of transferring resources (knowledge) from the resource-rich to the resource-poor. Less privileged groups–who benefit from Wikipedia without contributing to it–are the beneficiaries.
So which of these perspectives is right? I think they both are. If we focus only on Wikipedia as a public good, we might turn a blind eye to structural problems and organizational solutions. On the other hand, if we treat Wikipedia as only a troubled organization, we might miss out on the fact that Wikipedia represents a massive investment by (mostly) privileged groups that we all benefit from.
Many thanks to my fellow CDSCers for discussing and debating with me about this. By acknowledging their input I am not claiming that they agree with my conclusions. :)
We were invited by Jean Burgess, Alice Marwick, and Thomas Poell to write a chapter about computational methods for the Sage Handbook of Social Media. Rather than simply listing what sorts of computational research has been done with social media data, we decided to use the chapter to both introduce a few computational methods and to use those methods in order to analyze the field of social media research.
In the chapter, we start by describing the process of obtaining data from web APIs and use as a case study our process for obtaining bibliographic data about social media publications from Elsevier’s Scopus API. We follow this same strategy in discussing social network analysis, topic modeling, and prediction. For each, we discuss some of the benefits and drawbacks of the approach and then provide an example analysis using the bibliographic data.
We think that our analyses provide some interesting insight into the emerging field of social media research. For example, we found that social network analysis and computer science drove much of the early research, while recently consumer analysis and health research have become more prominent.
More importantly though, we hope that the chapter provides an accessible introduction to computational social science and encourages more social scientists to incorporate computational methods in their work, either by gaining computational skills themselves or by partnering with more technical colleagues. While there are dangers and downsides (some of which we discuss in the chapter), we see the use of computational tools as one of the most important and exciting developments in the social sciences.
One of the great benefits of computational methods is their transparency and their reproducibility. The entire process—from data collection to data processing to data analysis—can often be made accessible to others. This has both scientific benefits and pedagogical benefits.
To aid in the training of new computational social scientists, and as an example of the benefits of transparency, we worked to make our chapter pedagogically reproducible. We have created a permanent website for the chapter at https://communitydata.cc/social-media-chapter/ and uploaded all the code, data, and material we used to produce the paper itself to an archive in the Harvard Dataverse.
Through our website, you can download all of the raw data that we used to create the paper, together with code and instructions for how to obtain, clean, process, and analyze the data. Our website walks through what we have found to be an efficient and useful workflow for doing computational research on large datasets. This workflow even includes the paper itself, which is written using LaTeX + knitr. These tools let changes to data or code propagate through the entire workflow and be reflected automatically in the paper itself.
If you use our chapter for teaching about computational methods—or if you find bugs or errors in our work—please let us know! We want this chapter to be a useful resource, will happily consider any changes, and have even created a git repository to help with managing these changes!
This blog post was written with Aaron Shaw and Mako Hill and was originally published on the Community Data Science Collective blog.
In my career, I have found myself at the intersection of intellectual fields. My undergraduate degree (Go Cougars!) is in English literature. My current work is at the intersection of sociology and computer science and organizational research and communications. My academic career is at the intersection of many different "structural holes".
Structural holes are areas in a social network where different groups have few connections between them. Ron Burt, a sociologist at the University of Chicago, has shown that people who are at the intersection of otherwise disconnected groups–“brokers”– receive higher compensation and have better ideas (Burt, 2004).
This all sounds great (and I certainly hope that those outcomes accrue to me!) but I wanted to talk autobiographically about one downside of being an “interdisciplinary scholar”. And that is that I often feel like an idiot.
I am often exposed to scholars who are more firmly within a discipline – computer scientists developing social network analysis algorithms or organizational scholars developing theories of group interactions. They have a much deeper understanding than I about topics that I care about and use in my research. I often come away from these interactions feeling like a lazy academic and thinking things like, “How can I do work on groups and not have read Weber?” or “How can I do social network analysis when I don’t understand eigenvectors?”
And then I try to do those things. I sign up for classes and read more books. But I know that in the end, it is a losing battle. As long as I choose to remain an intersectionist I will never have the time to become a true expert in any of my subfields and will often feel the fool.
Maybe I’m just a bad structural hole broker. Perhaps the successful brokers are able to become true experts in a few disciplines. On the other hand, perhaps brokers are people who are willing to be a fool and to feel lost and that foolishness is actually the source of creativity and insight. I sure hope so!!
Those are really big goals for only four hours. I decided to use the tidyverse as much as possible and not even teach base R syntax like '[,]', apply, etc. I used the first session to show and explain code using the nycflights13 dataset. For the second session we did a few more examples but mostly worked on exercises using a dataset from Wikia that I created (with help from Mako and Aaron Halfaker's code and data).
Overall, I think that the workshops went pretty well. I think that students definitely have a better understanding and a better set of tools than I did after I had used R for four hours!
That being said, there was plenty of room for improvement. I am scheduled to teach another set of workshops early next year and I’m planning to make a few changes:
I found some pretty good resources already in existence for introducing students to R, but none of them quite fit the scope of what I was looking for. All of the code that I used (as well as some slides for the beginning of class) are on github and GPL licensed. Please reuse my work and submit pull requests!
For a research paper that we recently published, we surveyed founders of new wikis on wikia.com. One of the surprising findings from that paper was that most founders were starting niche communities and had modest expectations about growth.
This came as a surprise to me, and has been surprising to other online community researchers that I have talked to about the findings. My assumption before doing this research was that most people were trying to start large communities, and simply failing.
There are a number of different possible explanations for this surprising finding. One is that the utility of founding a small community is actually larger than we assumed - perhaps even greater than founding a large community. A second is that perhaps people see the likelihood of success of small communities as larger than that of large communities. It’s this idea that I’d like to explore more. Why might people assume that they would have more success in starting a niche community? One possible reason is based on an idea from the economics literature called the Efficient Market Hypothesis.
The efficient market hypothesis basically says that prices will adjust based on all of the available information, and so investors can’t expect to beat the market unless they have insider information. In other words, you shouldn’t expect to excel at buying stocks. If that stock tip you heard really was a great bargain, then someone else with more money and more skill would have already bought it at that price, and would have pushed the price up to what it should be.
There’s an old economics joke that illustrates a particularly strong form of this theory:
Two economists were walking down the street together when one looks down and says, “Look, a $20 bill!”. The second economist says, “It can’t be - someone would have picked it up already.”
I’ve been thinking about whether there might be a similar Efficient Online Community hypothesis. The punchline of the theory is that all online community founders should expect to have limited success, whether they attempt to start general or niche communities.
The initial assumptions are that:
The corollary is that the costs of starting new communities should roughly equal the benefits from starting them. When costs are high – and costs might include not only money but also resources like time, skill, or social capital – then rational founders would only invest those resources into a community that they thought would be quite large and successful.
As many costs have lowered (e.g., with lower hosting costs, lower technical skills required, etc.), we should expect that founders have more modest expectations. The basic intuition is that if founding a particular community is such a good idea, then someone would have already done it. When the barriers to entry are so low, then all of the good ideas get taken.
This has an interesting consequence - all new founders should have modest expectations, even those whose communities eventually grow large. If my hypothesis is correct, then when communities do grow large, it will typically not be because founders poured resources into them. Rather, their growth and success will be something of a surprise to the founders.
There is some anecdotal evidence that this is true. We know, for example, that Wikipedia was started as a side project, designed to supplement the work of more professional encyclopedia editors. Larry Sanger introduced the idea to the community of editors like this:
No, this is not an indecent proposal. It's an idea to add a little feature to Nupedia. Jimmy Wales thinks that many people might find the idea objectionable, but I think not... As to Nupedia's use of a wiki, this is the ULTIMATE "open" and simple format for developing content. We have occasionally bandied about ideas for simpler, more open projects to either replace or supplement Nupedia. It seems to me wikis can be implemented practically instantly, need very little maintenance, and in general are very low-risk. They're also a potentially great source for content. So there's little downside, as far as I can determine.
Linus Torvalds announced the start of the Linux project with similarly modest expectations:
Hello everybody out there using minix - I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones. [...] Linus (torv...@kruuna.helsinki.fi) PS. Yes - it's free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that's all I have :-(.
I also looked at the first posts of some of the most popular subreddits on Reddit. We can assume that these were written by the founders of the communities. Looking at the text of the posts we certainly don’t get the sense that they knew that they were about to start huge, influential communities. Some look like spam posts, others are jokes.
Subreddit | First Post title | First post Text |
---|---|---|
AskReddit | test | |
pics | ctrl+c ctrl+v | |
funny | The Princess and Professor. Changing a processor. | |
worldnews | Scores killed in Pakistan clashes | |
gaming | Halo 3 tournament site | |
videos | Short film about love and robots. | |
politics | “Congressman Paul is just wrong, wrong, wrong” | |
news | Noble Resolve 08: National Security “experiment” | |
gifs | Looking for trouble. | |
todayilearned | 1. What did you learn today? | |
Showerthoughts | First thought in shower today was to make a subreddit about shower thoughts | Mission accomplished |
movies | 5 Hidden DVD Eggs You Shouldn’t Miss | Blockbuster Online Rentals |
[deleted] |
aww | Super cute baby bunny | |
Overwatch | First. | |
mildlyinteresting | So, hey. Kingdom is showing on Hulu. Pretty nice show. | |
IAmA | I am a 18 year old geek that just made a subreddit. | |
WTF | WTF | |
The_Donald | Dear GOP: Trump’s Fearless War with Univision Only Increases His Appeal | |
AdviceAnimals | DID SOMETHING ABOUT IT |
I think of online communities, and particularly peer production communities, as public goods, and I think that this idea has broader implications. There is a great literature on collective action and public goods production, and I’d like to think more formally about how and when it makes sense to create or contribute to small-scale versus large-scale public goods projects.
Here are my slides, an instruction sheet, and my proposal (pdf) (tex).
The whole class was built around a message passing activity which my advisor Aaron Shaw uses in his undergraduate class. The basic idea is that a few students are “users”, a few are servers (e.g., Facebook and Instagram), and the rest of the students in the middle are routers.
As students filed in, I made sure that they sat as close together as possible to make a dense network. I then assigned users and servers at the edges of the class, such that the most direct routes would cross each other.
For the initial activity, users cut up memes into small squares, which they then had to get to a server using envelopes. They addressed them, and routers passed them where they needed to go.
I then made a few routers malicious - one would pass things the wrong way, and the other would conspicuously throw packets onto the floor. This set us up to talk about how IP addresses are very similar to street addresses, and help routers know where to send packets. We also talked about decentralization and the ability to avoid problems without top-down control or knowledge.
I had students talk in groups about the weaknesses of our original simulation, and how a system of envelopes and messages might solve them. They came up with some good ideas, which were very similar to TCP. I explained that TCP is a layer that sits above IP, where packets are numbered, and servers send “I got #X” messages back after every packet is received. We ran the simulation again, but for this and future runs we just had one user and one server participate. Note: in the second session, I switched which user and server participated in each run, which worked better and kept students more engaged. At this point I also showed the students Wireshark. I started sniffing packets while I downloaded a cat meme, and showed how the image was broken into packets, and how ACKs were received.
Again, I had students think about vulnerabilities of the system. At this point, we talked about how anyone could send packets pretending to be you. We talked about how authentication works - you send a username and password, and the server sends back a cookie, which you then include in all future messages (the “cookie” was a picture of a cookie).
At this point, I introduced two new nefarious nodes: a hacker and a spy. The hacker replaced the image you were trying to upload with her own image, and the spy took notes about who was sending what to whom. This helped students to recognize a huge flaw in the system - rogue routers can very easily impersonate even authenticated users.
To explain SSL, I pulled out a box and two locks, each with two keys, and asked the students to discuss in groups how they might send messages securely with these tools, where “securely” means knowing 1) who sent the packet and 2) that no one else viewed or changed it, even though neither the user nor the server can control the path that a packet takes.
Think about it yourself for a moment, and try to figure it out. It’s trickier than it sounds.
This is the part of the lesson I was most proud of, and most nervous about. I was really impressed that in both sessions the students pretty much figured it out. As they explored their ideas, I would have them try them out in the simulation, and talk through it together. In the end, we talked about how SSL works:
While SSL stopped the hacker, we talked about a final technical vulnerability - the spy can no longer see message content, but can still see who is sending messages to whom. We talked about how proxies, VPN, and Tor each work in different ways to try to solve that problem by hiding the user’s identity. We did one final simulation of Tor + SSL + cookies, and showed that we pretty well solved the vulnerabilities that we had identified. I also showed them the Tor browser, and talked about the weaknesses (speed, many sites don’t trust Tor exit nodes, IP address is unpredictable).
Finally, we talked about how most of the Internet’s current security problems (spam, stolen credit card info, etc.) are at a higher level in the stack - they are human problems, caused by phishing or people choosing weak passwords.
Overall, it was a really fun lecture. The students that I had were super great and interested in the topic, and the activities were really fun. Please feel free to use any and all of my materials; I’d love to hear if you do!