Formation and Growth of Collaborative Online Organizations

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_knit$set(root.dir = './')
source("resources/preamble.R")

# Formatting helpers for inline numbers and LaTeX output
f <- function (x) {formatC(x, format="d", big.mark=',')}
bold <- function(x) {paste('{\\textbf{',x,'}}', sep ='')}
gray <- function(x) {paste('{\\textcolor{gray}{',x,'}}', sep ='')}
wrapify <- function (x) {paste("{", x, "}", sep="")}
p <- function (x) {formatC(x, format='f', digits=1, big.mark=',')}

library(tidyverse)
set.seed(2018)

# Reddit data: users per subreddit and subreddits per user
dir = '~/Projects/collective_action_abm/'
reddit_communities = read.csv(paste0(dir, '/data/reddit_users_per_subreddit_gt4_comments_201701.csv'), sep=',', header = T)
reddit_people = read.csv(paste0(dir, '/data/reddit_subreddits_per_user_gt4_comments_201701_sample.csv'), sep=',', header = T)
colnames(reddit_communities) <- c('subreddit', 'size')
colnames(reddit_people) <- c('user', 'size')

# Random samples used for comparison
reddit_comm_sample = sample_n(reddit_communities, 200, replace = F)
reddit_people_sample = sample_n(reddit_people, 10000, replace = F)
```

The Formation and Growth of Collaborative Online Organizations


Jeremy Foote
Northwestern University / Purdue University

September 26, 2019
## The Plan

- Overall theoretical approach
- Brief summary of three projects
- Detailed summary of ABM project
- Overall conclusions

# The Big Question

## Why do some collaborative online organizations succeed?

# Collaborative Online Organizations (COO)

## Examples

# Why do some COO succeed?

## Most COO do not succeed

- COO size is highly skewed, with top organizations getting most of the contributions

```{r wiki_hist_0, cache=T, message=F, warning=F, fig.height=5}
df <- read_tsv('./resources/editor_count')
colnames(df) <- c('users','wiki')
wiki_hist <- df %>%
  filter(!startsWith(wiki, 'techteam')) %>%
  mutate(big = users > 0) %>%
  ggplot() +
  geom_histogram(aes(x=users, y=stat(count), alpha = big), fill = 'darkblue', bins = 40) +
  scale_x_continuous(trans = 'log1p', breaks = 10 ^ (1:5), labels = round) +
  ylab('Number of wikis') +
  xlab('Contributors per wiki') +
  theme_minimal() +
  guides(alpha=FALSE)
wiki_hist
```
## There have been three main approaches

## Approach 1: Why are Wikipedia and Linux so successful?

>- Why people contribute (Nov, von Krogh, Lakhani, Lampe)
>- Who contributes (Antin, Shaw and Hargittai)
>- How work is organized (Arazy, Butler, Crowston, Keegan, Matei, Zhu)

## The weakness of Approach 1

> - Selecting on the dependent variable

## Approach 2: Predicting COO outcomes based on membership, structure, and design
## The weakness of Approach 2

> - Groups are not independent

## Approach 3: Online Organizational Ecology predicts outcomes based on community-level relationships
## The weakness of Approach 3

> - Organizational ecology treats organizations as agents and people as resources

## Studying organizational outcomes using individual decisions

> - How do people allocate their efforts in a complex environment with lots of choices?

## People are influenced by individual attributes, technology, and the state of the system

## What is the "system"?

>- An earlier generation of communication scholars suggested "open systems" (Katz and Kahn, 1966; Rogers and Agarwala-Rogers, 1976; Farace, Monge, and Russell, 1977)
>     - A system takes in inputs, processes them, and produces outputs
>     - Systems are composed of subsystems and compose suprasystems (e.g., a firm composed of departments composed of work groups)

## Digital trace data lets us study the relationship between people and systems

>- Earlier researchers had difficulty gathering the type of data needed for open systems approaches
>- COO data is:
>     - Fine-grained data about behavior and interactions
>     - Within and between groups
>     - Unobtrusive

## Four projects on individual decisions in new collaborative online organizations

# Project 1

## Early-stage communication networks and community outcomes

## Integrative work groups are more productive and successful

>- Many theories suggest that integrative groups with low hierarchy should be successful:
>     - Coordination
>         - Information flow (Katz, 2005)
>         - Transactive memory / Shared mental models (Wegner, 1985; Mathieu, 2000)
>     - Social Integration
>         - Legitimate peripheral participation (Lave and Wenger, 1991)
>         - Group identity (Scott, 2007)

## Integrative structures identified with social network analysis correlate with success

>- Low hierarchy, few people on the periphery (Cummings and Cross, 2003)
>- High density (Balkundi and Harrison, 2006)
>- Early-stage COO should benefit even more from integrative networks

## Edits taken from wiki talk pages on Wikia

## Relationships between communication structures and productivity

Bootstrapped 95% confidence interval for β coefficients
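As a rough illustration of how a bootstrapped interval like the one in the caption above can be computed, here is a minimal sketch. It assumes a percentile bootstrap over wikis, which the slides do not specify, and it runs on simulated stand-in data: the `wikis` data frame, its `density` and `productivity` columns, and the effect size are all hypothetical, not the Wikia measurements.

```r
# Hypothetical data: simulated stand-ins, NOT the Wikia measurements.
set.seed(1)
n <- 200
wikis <- data.frame(density = runif(n))
wikis$productivity <- 0.5 * wikis$density + rnorm(n)

# Resample wikis with replacement, refit the regression,
# and keep the slope on density each time.
boot_beta <- replicate(1000, {
  idx <- sample(n, replace = TRUE)
  coef(lm(productivity ~ density, data = wikis[idx, ]))[["density"]]
})

# Percentile interval: the middle 95% of the bootstrap slopes.
ci <- quantile(boot_beta, c(0.025, 0.975))
ci
```

An interval that spans zero would be read as no clear relationship between the structural measure and the outcome.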
## There is basically no relationship between communication structures and survival
Bootstrapped 95% confidence interval for β coefficients
# Project 2

## Why do people start new communities?
Foote, Gergle, and Shaw. (2017). Starting Online Communities: Motivations and Goals of Wiki Founders, CHI 2017
## Previous research typically treats small communities as failures

```{r wiki_hist_alpha, cache=T, message=F, warning=F, fig.height=5}
wiki_hist <- df %>%
  filter(!startsWith(wiki, 'techteam')) %>%
  mutate(big = users > 10) %>%
  ggplot() +
  geom_histogram(aes(x=users, y=stat(count), alpha = !big), fill = 'darkblue', bins = 40) +
  scale_x_continuous(trans = 'log1p', breaks = 10 ^ (1:5), labels = round) +
  ylab('Number of wikis') +
  xlab('Contributors per wiki') +
  theme_minimal() +
  guides(alpha=FALSE)
wiki_hist
```

## A puzzle

Why do people keep starting communities if they are so likely to fail?

## Learning from founders

>- 300+ founders responded about their:
>     - Motivations
>     - Goals
>     - Experience

## Top goals

>- High-quality information
>- Long-lasting community
>- High-growth community

## Most projects are on niche topics for small communities

Projected contributors after 30 days

# Project 3

## Who starts new communities?
Foote and Contractor. (2018). The behavior and network position of peer production founders. Lecture Notes in Computer Science.
## Starting new organizations

>- Entrepreneurs:
>     - Have more diverse experience than others (Backes-Gellner & Moog, 2013)
>     - Are more likely to have worked with entrepreneurs (Nanda & Sørensen, 2010)
>- Successful entrepreneurs:
>     - Have more experience (Cassar, 2014)
>     - Have large, diverse social networks (Stam et al., 2014)

## We examined the behavior and network position of ~61,000 wiki editors
Timeline of data collection
Network graph of the Spongebob wiki from Wikia
## Many founders are learning the system

>- Nearly 90% of wikis were founded by new users
>- ~1% of existing users founded a wiki

## Non-newbie founders are more active with more diverse experience, but at the periphery of social networks

## Overall, past behavior and networks have little relationship with community growth

# Project 4

## Social exposure and participation processes in online communities

> - How do people decide which groups to participate in?
> - Exposure processes + decision processes

## Social computing research theorizes that people decide to participate in a group based on expected utility

> - People estimate expected utility of joining based on future activity levels (Resnick et al.)
> - Join if expected benefits exceed expected costs

## People are exposed to groups via social ties

>- Two categories of exposure to COO (Kraut et al.)
>     - Impersonal exposure
>     - Interpersonal exposure

## These theories focus on individual-level outcomes

> - Decision rules should predict
>     - Group-level outcomes
>     - Population-level outcomes
> - These are rarely tested

## Online group sizes have heavy-tailed distributions

```{r, echo=F, message=F, cache = T, fig.height=5}
df = read.csv('../data/reddit_users_per_subreddit_gt4_comments_201701.csv')
df %>%
  ggplot() +
  geom_histogram(aes(x=n/sum(n), y=stat(count/sum(count))), fill = 'darkgreen', bins = 40) +
  scale_x_continuous(trans = 'log', breaks = 10 ^ (1:6), labels = round) +
  ylab('Proportion of subreddits') +
  xlab('Proportion of total contributors (at least 5 comments)') +
  theme_minimal()
```

Data from [Stuck_in_the_Matrix](https://www.reddit.com/u/Stuck_In_the_Matrix) on [BigQuery](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05)

## The number of groups each user belongs to is also heavy-tailed

```{r, echo=F, message=F, cache = T, fig.height=5}
df = read.csv('../data/reddit_subreddits_per_user_gt4_comments_201701_sample.csv')
df %>%
  filter(n < 70) %>%
  ggplot() +
  geom_histogram(aes(x=n, y = stat(count/sum(count))), fill = 'orange', bins = 30) +
  scale_x_continuous(trans = 'log', breaks = 10 ^ (1:4), labels = round) +
  xlab('Unique subreddits posted in per user (at least 5 comments)') +
  ylab('Proportion of users') +
  theme_minimal()
```

Data from [Stuck_in_the_Matrix](https://www.reddit.com/u/Stuck_In_the_Matrix) on [BigQuery](https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05)

## A simple model that produces heavy-tailed distributions

> - Cumulative advantage (Merton, 1968; Barabási, 1999)
> - Future activity levels are based probabilistically on current activity

## Do social computing theories of exposure and participation decisions explain heavy-tailed participation?

> - Possible cumulative advantage mechanisms
>     - Expected utility is based on current size
>     - Large COO have larger set of neighbors to share with
>     - People in many communities have access to more neighbors

## Agent-based modeling

> * Simplified, simulated system of interacting agents
> * Allows us to:
>     * Explicitly model micro processes
>     * Test the macro-level implications of theories

## Our agent-based model

> - Start with $N$ potential contributors and $X$ potential communities
> - Every month:
>     - Each user is presented with an "exposure set" of $x$ communities (exposure)
>     - The user decides which communities to participate in (decision)
> - Naive versions of each theory
> - Combined version

## Simulation Results

## Null model as baseline
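The monthly exposure-then-decision loop can be pictured with a short sketch. This is one possible reading of a null baseline, not the model used in the project: it assumes uniformly random exposure sets and a fixed join probability that ignores the state of the communities. All parameter values and names (`n_users`, `n_comms`, `exposure_size`, `p_join`) are hypothetical.

```r
set.seed(2018)
n_users <- 500        # N potential contributors (illustrative value)
n_comms <- 50         # X potential communities (illustrative value)
exposure_size <- 5    # x communities shown to each user per month
n_months <- 12
p_join <- 0.1         # assumed: fixed join probability, ignores community state

# member[u, c] is TRUE once user u participates in community c
member <- matrix(FALSE, nrow = n_users, ncol = n_comms)

for (month in seq_len(n_months)) {
  for (u in seq_len(n_users)) {
    # Exposure step: a uniformly random set of communities
    exposed <- sample(n_comms, exposure_size)
    # Decision step: join each exposed community with the same probability
    joins <- exposed[runif(exposure_size) < p_join]
    member[u, joins] <- TRUE
  }
}

community_sizes <- colSums(member)   # users per community
groups_per_user <- rowSums(member)   # communities per user
```

Plotting `community_sizes` and `groups_per_user` on log scales gives distributions that can be set beside the empirical histograms; swapping out the exposure or decision step is how alternative theories would be encoded.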
## Expected utility models are skewed but not heavy-tailed
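To make the expected-utility decision rule concrete, the sketch below lets each user join an exposed community when a noisy benefit estimate exceeds a constant cost. The functional form is an assumption chosen for illustration: benefit is taken to be `log1p` of current community size plus Gaussian noise, and the cost of 2 is arbitrary. None of these choices or parameter values come from the project itself.

```r
set.seed(2018)
n_users <- 500        # illustrative values throughout
n_comms <- 50
exposure_size <- 5
n_months <- 12
cost <- 2             # assumed: constant cost of participating

member <- matrix(FALSE, nrow = n_users, ncol = n_comms)

for (month in seq_len(n_months)) {
  for (u in seq_len(n_users)) {
    # Exposure step: uniformly random, as in a naive baseline
    exposed <- sample(n_comms, exposure_size)
    # Current size of each exposed community
    sizes <- colSums(member[, exposed, drop = FALSE])
    # Assumed benefit: grows with current size, plus individual noise
    benefit <- log1p(sizes) + rnorm(exposure_size)
    # Decision step: join when expected benefit exceeds expected cost
    member[u, exposed[benefit > cost]] <- TRUE
  }
}

community_sizes <- colSums(member)
```

Because benefit depends on current size, early joiners raise the odds of later joins, which is the cumulative-advantage channel this rule is meant to capture.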
## Naive versions of social exposure are not skewed
## A combined version is robust with community sizes roughly similar to reddit
## Conclusion

> - Word of mouth exposure plus expected utility participation is a partial explanation for heavy-tailed community size distributions
> - Two main weaknesses
>     - No model was as skewed as the Reddit data at both the head and tail
>     - No model explained the heavy tail of participation rates

## Conclusion

> - Framework for theories to be informed by higher-level behavior
> - Could test whether people actually share larger groups
> - Are people more likely to join when they already belong to many COO?
> - Simulation can enrich social computing theories

# Overall Implications

## Systems of COO

> - Communities are interdependent
> - Founders and joiners are influenced by state of other COO
> - Past experience and luck can influence future behavior
> - COO data can help us understand these recursive processes

## Small, temporary organizations

> - Most COO are intentionally small
> - In aggregate, these are valuable
> - Create narrow public goods without requiring oligarchy

## Social motivations aren't all that important

> - Founders cared more about the artifact than the community
> - New COO didn't require integrative networks to be productive or survive
> - Explanations
>     - Strong selection effects mean only the motivated join
>     - Ease of leaving means dissenters leave

## Affordances matter

> - Low costs to join, leave, and create COO
> - Porous boundaries of COO
> - Also influences individuals and populations

# The End

# Appendix

## Productivity Model

## Survival Model

## Robustness tests

Cutoff @ 500
## Robustness tests

Cutoff @ 900
## Robustness tests

Dichotomize edges @ 3
## Coreness description

## Degenerate graph example

## Density with size quartiles

## ABM Project

## Word of mouth results are fragile

## Combined models are much less fragile