Jeremy Foote's Qualifying Exam Project - Behavior and Networks of Wiki Founders

For this project, I extended some work done for the Computational Social Science course. The project now combines four data sources about founders:

  1. Network position in global communication and collaboration networks.

    • I wrote a script that creates an edgelist of all coediting behavior across all of the wikis, and another program that extracts individual network measures: degree centrality, betweenness centrality, and PageRank.
  2. Network position in local communication and collaboration networks.

    • I also wrote code to create network measures for each user in each wiki that they participated in. For this analysis, I focus on each user's position in the network of the wiki where they are most active.
  3. Overall editing behavior

    • I wrote code to generate user-level measures, such as the number of wikis contributed to, the number of edits, and the number of days in the period on which they edited.
  4. Wiki founding

    • A final dataset lists the first 3 editors of each wiki, as well as data about the growth of the wiki.
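
The coediting edgelist described in item 1 can be sketched roughly as follows. This is a minimal base R illustration, not the script used for the project: the `revisions` data frame, its column names, and the rule that two editors are linked whenever they edited the same page are all assumptions made for the example.

```r
# Hypothetical revision log: one row per (editor, page) edit
revisions <- data.frame(
    editor = c('A', 'B', 'A', 'C', 'B', 'C', 'D'),
    page   = c('p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p3'),
    stringsAsFactors = FALSE
)

# Keep one row per editor-page pair, then join the table to itself on page
pairs <- unique(revisions)
joined <- merge(pairs, pairs, by = 'page', suffixes = c('_1', '_2'))

# Keep each undirected pair once and drop self-pairs
joined <- joined[joined$editor_1 < joined$editor_2, ]

# Edge weight = number of pages that both editors edited
edgelist <- aggregate(page ~ editor_1 + editor_2, data = joined, FUN = length)
colnames(edgelist)[3] <- 'shared_pages'

# Degree = number of distinct coeditors for each editor
degree <- table(c(as.character(edgelist$editor_1),
                  as.character(edgelist$editor_2)))
print(edgelist)
print(degree)
```

Measures such as betweenness or PageRank would need a graph library on top of an edgelist like this one; only degree is shown because it can be computed directly.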

I use these datasets to explore 1) the differences between founders and non-founders and 2) the relationship between founder activity and community growth.

In [19]:
# First, load and prepare the data

library(dplyr)
library(tidyr)
library(ggplot2)

# Parameters for which wikis should be included as foundings
START_DATE <- '2009-05-01'
END_DATE <- '2009-07-01'

behavior_stats <- read.csv('~/Projects/predicting_founders/data/editor_stats_20090301_20090501.csv')

talk_stats <- read.csv('~/Projects/predicting_founders/data/talk_stats.csv')
talk_stats <- subset(talk_stats, select=-c(anon))
colnames(talk_stats) <- paste0('talk_', colnames(talk_stats))

coedit_stats <- read.csv('~/Projects/predicting_founders/data/coedit_stats.csv')
coedit_stats <- subset(coedit_stats, select=-c(anon))
colnames(coedit_stats) <- paste0('coedit_', colnames(coedit_stats))

wiki_coedit <- read.csv('~/Projects/predicting_founders/data/wiki_coedit_stats.csv')
wiki_coedit <- subset(wiki_coedit, select=-c(anon))
# Re-normalize pagerank based on network size
wiki_coedit$page_rank <- wiki_coedit$page_rank * wiki_coedit$network_size
colnames(wiki_coedit) <- paste0('wiki_coedit_', colnames(wiki_coedit))

wiki_talk <- read.csv('~/Projects/predicting_founders/data/wiki_talk_stats.csv')
wiki_talk <- subset(wiki_talk, select=-c(anon))
wiki_talk$page_rank <- wiki_talk$page_rank * wiki_talk$network_size
colnames(wiki_talk) <- paste0('wiki_talk_', colnames(wiki_talk))

# Select the founders of interest based on the founding date of the wiki
founders <- read.csv('~/Projects/predicting_founders/data/wiki_founder_list.csv')
founders$founding_date <- as.Date(founders$founding_date)
founders <- founders[(founders$founding_date > START_DATE) & (founders$founding_date < END_DATE),]


# Calculate the median number of total editors that a given founder's projects have
median_editors <- founders[,c('first_editor','second_editor','third_editor','total_editors', 'founding_date')]  %>%
    gather(key, editor, -founding_date, -total_editors) %>%
    group_by(editor) %>%
    summarise_each(funs(median(., na.rm=T)), -key)

median_editors$total_editors <- as.integer(median_editors$total_editors)

# This merge removes users who we don't have behavior stats for 
# (b/c they don't have an editor id - they are in the NaN/ dir)
# It also removes users from the talk network who were talked to, but who never talked
# i.e., those where a message was left on their talk page, but they have no activity in the period

full_stats <- merge(behavior_stats, coedit_stats, by.x = 'editor' ,by.y = 'coedit_editor', all.x = T)
full_stats <- merge(full_stats, talk_stats, by.x = 'editor', by.y = 'talk_editor', all.x = T)
full_stats <- merge(full_stats, wiki_coedit, by.x=c('editor','top_wiki'), by.y=c('wiki_coedit_editor','wiki_coedit_wiki_name'), all.x=T)
full_stats <- merge(full_stats, wiki_talk, by.x=c('editor','top_wiki'), by.y=c('wiki_talk_editor','wiki_talk_wiki_name'), all.x=T)
full_stats <- merge(full_stats, median_editors, by = 'editor', all.x = T)

# Remove obvious bots
full_stats <- full_stats[!grepl('[Bb][Oo][Tt]$', full_stats$editor),]
#full_stats <- full_stats[full_stats$edits_in_period > 50,]
full_stats <- full_stats[full_stats$wiki_count < 100,]

# Identify whether a user counts as a founder
founder_list <- c(as.character(founders$first_editor), 
                  as.character(founders$second_editor),
                  as.character(founders$third_editor)
                 )
full_stats['Founder'] <- full_stats$editor %in% founder_list


# Log extremely skewed measures for analyses
to_log <- c('first_edit','last_edit',
            'total_edits','edits_in_period','tokens_added','median_tokens_added',
            'wiki_count','founding_count','earliest_edit','coedit_coreness',
            'coedit_degree','talk_coreness','talk_degree','talk_indegree','talk_outdegree',
            'talk_undirected_coreness', 'wiki_coedit_coreness','wiki_coedit_degree','wiki_talk_coreness',
            'wiki_talk_degree','wiki_talk_indegree','wiki_talk_outdegree', 'wiki_talk_undirected_coreness',
            'wiki_coedit_network_size','wiki_talk_network_size'
           )

ratio_log <- function(x) {
    return(log(100 * x + 1, 2))
}

normal_log <- function(x) {
    return(log(x+1,2))
}

to_ratio_log <- c('admin_ratio','coedit_betweenness','coedit_page_rank','talk_betweenness','talk_page_rank',
                  'wiki_coedit_betweenness','wiki_coedit_page_rank','wiki_talk_betweenness',
                  'wiki_talk_page_rank'
                 )

full_stats[,c('first_edit','last_edit')] <- lapply(full_stats[,c('first_edit','last_edit')], as.Date)
full_stats[,c('first_edit','last_edit')] <- lapply(full_stats[,c('first_edit','last_edit')], 
                                                   (function (x) as.integer(as.Date('2009-05-01') - x)))

# New dataframe with logged values, used for regression                                                       
tr_stats <- full_stats
tr_stats[,to_log] <- lapply(tr_stats[,to_log], normal_log)
tr_stats[,to_ratio_log] <- lapply(tr_stats[,to_ratio_log], ratio_log)
#colnames(tr_stats)[c(to_log, to_ratio_log)]
#colnames(tr_stats)[c(to_log, to_ratio_log)] <- paste0('logged_', colnames(tr_stats)[c(to_log, to_ratio_log)])
# Drop unneeded columns
tr_stats <- subset(tr_stats, select=-c(editor, top_wiki, editor_id))
Warning message:
“attributes are not identical across measure variables; they will be dropped”

Modeling

After loading the stats, I looked at the distributions of the variables (omitted here for brevity). They all seem reasonable, if highly skewed.
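
To illustrate why the skewed measures are logged, here is a small sketch of the two transforms from the loading cell, log2(x + 1) for counts and log2(100x + 1) for ratios. The `edits` and `ratios` vectors are made-up values, not data from the project.

```r
# Same transforms as in the loading cell
normal_log <- function(x) log(x + 1, 2)        # for counts
ratio_log  <- function(x) log(100 * x + 1, 2)  # for ratios and centralities

# Hypothetical, heavily skewed edit counts
edits <- c(0, 1, 3, 7, 1023)
logged_edits <- normal_log(edits)
print(logged_edits)

# Hypothetical ratios between 0 and 1
ratios <- c(0, 0.01, 0.5, 1)
logged_ratios <- ratio_log(ratios)
print(logged_ratios)
```

Adding 1 before taking the log keeps zeros defined, and multiplying ratios by 100 first spreads out values that would otherwise all sit close to zero.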

So, we go on to regression modeling.

In [3]:
# Start by creating sets of variables that can be added or removed from models

behavior <- c('first_edit','last_edit',
              'total_edits','edits_in_period','edit_days',
              'tokens_added','median_tokens_added',
              'wiki_count','wiki_gini',
              'talk_ratio','admin_ratio','founding_count','earliest_edit',
              'page_creation_count',
              'page_creation_ratio',
              'top_wiki_edits','top_wiki_talk_ratio')


global_net <- c('coedit_betweenness','coedit_coreness','coedit_degree','coedit_page_rank',
                'talk_betweenness',
                'talk_coreness',
                #'talk_degree',
                'talk_indegree','talk_outdegree',
                'talk_page_rank','talk_undirected_coreness')

local_net <- c('wiki_coedit_betweenness','wiki_coedit_coreness','wiki_coedit_degree',
               'wiki_coedit_network_size','wiki_coedit_page_rank','wiki_talk_betweenness',
               'wiki_talk_coreness',
               #'wiki_talk_degree',
               'wiki_talk_indegree','wiki_talk_network_size','wiki_talk_outdegree',
               'wiki_talk_page_rank','wiki_talk_undirected_coreness')

When we look at the correlation between measures, we see some that are highly correlated. Future analysis should definitely look into either collapsing or removing some of these measures.
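
One way to start on that follow-up is to list the variable pairs whose correlation exceeds some cutoff. The sketch below uses simulated data as a stand-in for `tr_stats`; the column names are reused for illustration only, and the 0.8 cutoff is an arbitrary assumption.

```r
# Simulated stand-in for tr_stats (values are not from the project)
set.seed(1)
n <- 200
talk_indegree <- rnorm(n)
sim <- data.frame(
    talk_indegree  = talk_indegree,
    talk_outdegree = talk_indegree + rnorm(n, sd = 0.1),  # nearly collinear
    wiki_gini      = rnorm(n)
)

cors <- cor(sim, use = 'complete.obs')

# Look only at the upper triangle so that each pair appears once
high <- which(abs(cors) > 0.8 & upper.tri(cors), arr.ind = TRUE)
high_pairs <- data.frame(
    var1 = rownames(cors)[high[, 'row']],
    var2 = colnames(cors)[high[, 'col']],
    r    = cors[high]
)
print(high_pairs)
```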

In [7]:
library(corrplot)
corrplot(cor(tr_stats[,c(behavior, global_net, local_net)], use='complete.obs'), order = 'AOE', type = 'lower')

Despite these concerns, we press on :)

In [20]:
library(texreg)

#tr_stats <- tr_stats[!is.na(tr_stats$wiki_talk_betweenness),]

# Separate models for each set of variables
founder_predict_behavior <- glm(Founder ~ .,
                                data = tr_stats[, c('Founder', behavior)],
                                family = binomial(link = 'logit'))
founder_predict_global <- glm(Founder ~ .,
                              data = tr_stats[, c('Founder', global_net)],
                              family = binomial(link = 'logit'))
founder_predict_local <- glm(Founder ~ .,
                             data = tr_stats[, c('Founder', local_net)],
                             family = binomial(link = 'logit'))

# Full model with all three sets of variables
founder_predict_full <- glm(Founder ~ .,
                            data = tr_stats[, c('Founder', behavior, global_net, local_net)],
                            family = binomial(link = 'logit'))

IRdisplay::display_html(htmlreg(list(founder_predict_behavior, founder_predict_global, 
                                     founder_predict_local, founder_predict_full)))

#exp(founder_predict$coefficients)
Statistical models
Cells show coefficient (standard error).

                                Model 1            Model 2            Model 3            Model 4
(Intercept)                     -3.68*** (0.26)    -4.57*** (0.17)    -3.21*** (0.88)    -1.12 (1.06)
first_edit                      -0.07* (0.03)                                            -0.09* (0.04)
last_edit                       -0.29*** (0.03)                                          -0.39*** (0.05)
total_edits                     0.04 (0.03)                                              -0.05 (0.05)
edits_in_period                 0.19** (0.06)                                            0.30** (0.09)
edit_days                       -0.02** (0.01)                                           -0.01 (0.01)
tokens_added                    0.02 (0.03)                                              -0.05 (0.04)
median_tokens_added             -0.03 (0.02)                                             -0.07* (0.04)
wiki_count                      0.53*** (0.08)                                           0.35** (0.14)
wiki_gini                       0.62* (0.28)                                             0.55 (0.38)
talk_ratio                      0.63 (0.72)                                              -1.06 (1.19)
admin_ratio                     0.09** (0.03)                                            0.09* (0.04)
founding_count                  0.28 (0.18)                                              0.38 (0.28)
earliest_edit                   -0.08*** (0.01)                                          -0.07*** (0.02)
page_creation_count             0.00* (0.00)                                             0.00 (0.00)
page_creation_ratio             0.20 (0.13)                                              1.02*** (0.30)
top_wiki_edits                  -0.00* (0.00)                                            -0.00 (0.00)
top_wiki_talk_ratio             -0.31 (0.68)                                             1.13 (1.09)
coedit_betweenness                                 4.85*** (1.15)                        0.63 (1.29)
coedit_coreness                                    -0.38*** (0.11)                       0.20 (0.18)
coedit_degree                                      0.37*** (0.08)                        -0.07 (0.15)
coedit_page_rank                                   -88.12*** (15.63)                     -23.43 (16.35)
talk_betweenness                                   2.91*** (0.65)                        0.35 (0.76)
talk_coreness                                      0.87*** (0.22)                        0.48 (0.33)
talk_indegree                                      0.65*** (0.12)                        0.35 (0.20)
talk_outdegree                                     -0.05 (0.08)                          -0.31* (0.16)
talk_page_rank                                     -37.13*** (9.81)                      0.90 (11.12)
talk_undirected_coreness                           -1.10*** (0.21)                       -0.30 (0.31)
wiki_coedit_betweenness                                               0.13*** (0.04)     0.01 (0.04)
wiki_coedit_coreness                                                  0.07 (0.16)        -0.20 (0.19)
wiki_coedit_degree                                                    0.02 (0.17)        0.04 (0.20)
wiki_coedit_network_size                                              -0.30*** (0.06)    0.07 (0.07)
wiki_coedit_page_rank                                                 0.05 (0.14)        -0.13 (0.15)
wiki_talk_betweenness                                                 0.03 (0.03)        -0.01 (0.04)
wiki_talk_coreness                                                    0.05 (0.25)        -0.18 (0.35)
wiki_talk_indegree                                                    0.16 (0.15)        -0.01 (0.22)
wiki_talk_network_size                                                0.10 (0.07)        -0.06 (0.07)
wiki_talk_outdegree                                                   0.25** (0.09)      0.29 (0.16)
wiki_talk_page_rank                                                   -0.08 (0.12)       -0.18 (0.13)
wiki_talk_undirected_coreness                                         -0.02 (0.26)       -0.00 (0.35)
AIC                             6878.71            4430.31            3747.39            3239.24
BIC                             7041.03            4515.24            3846.56            3544.36
Log Likelihood                  -3421.36           -2204.15           -1860.70           -1579.62
Deviance                        6842.71            4408.31            3721.39            3159.24
Num. obs.                       60959              16663              15184              15183
***p < 0.001, **p < 0.01, *p < 0.05

In the full model, activity (recency of the last edit, number of edits made in the period), experience (earliest previous participation, page creation ratio), and diversity of experience (number of wikis contributed to) were all associated with founding (p < .01). The largest coefficient among these significant predictors was for the page creation ratio (1.02): a user who only created new pages had roughly 2.8 times the odds (exp(1.02) ≈ 2.77) of being a founder compared with one who never did. No network measure was significant at p < .01 in the full model (only talk_outdegree reached p < .05), yet the models with only network measures have a lower AIC than the model with only behavior measures. These AIC values are not directly comparable, because the models were fit to different numbers of observations (60,959 for the behavior model versus 16,663 and 15,184 for the network models). Still, this suggests that network measures may be important predictors whose standard errors are inflated by multicollinearity. When only complete observations are included, behavior items are better predictors than network measures, suggesting that users who are not in networks are much more difficult to predict.
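
Two points in this paragraph can be made concrete: converting a logit coefficient into an odds ratio, and comparing AIC only across models fit to the same rows. The sketch below uses simulated data. The data frame `sim`, its column names, and the share of missing values are assumptions for illustration and do not reproduce the project's results.

```r
# A logit coefficient b multiplies the odds by exp(b) per unit increase
odds_ratio <- exp(1.02)   # page_creation_ratio in Model 4; about 2.77
print(odds_ratio)

# Simulated data with missing network measures (all names and values hypothetical)
set.seed(42)
n <- 2000
sim <- data.frame(behavior_x = rnorm(n), network_x = rnorm(n))
sim$founder <- rbinom(n, 1, plogis(-2 + 0.8 * sim$behavior_x + 0.3 * sim$network_x))
sim$network_x[sample(n, 1200)] <- NA   # most users are absent from the network

# AIC is only comparable when both models use the same observations
complete <- sim[complete.cases(sim), ]
m_behavior <- glm(founder ~ behavior_x, data = complete, family = binomial(link = 'logit'))
m_network  <- glm(founder ~ network_x,  data = complete, family = binomial(link = 'logit'))

print(c(behavior = AIC(m_behavior), network = AIC(m_network)))
print(c(nobs(m_behavior), nobs(m_network)))   # identical sample sizes
```

Restricting every model to `complete.cases()` before fitting is one simple way to put them on a common sample, at the cost of discarding users who never appear in a network.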

In [21]:
library(MASS)

growth_behavior <- glm.nb(total_editors ~ ., data = tr_stats[,c('total_editors','founding_date',
                                                               behavior
                                                              )])

growth_global <- glm.nb(total_editors ~ ., data = tr_stats[,c('total_editors','founding_date',
                                                              global_net
                                                              )])

growth_local <- glm.nb(total_editors ~ ., data = tr_stats[,c('total_editors','founding_date',
                                                              local_net
                                                              )])

growth_full <- glm.nb(total_editors ~ ., data = tr_stats[,c('total_editors','founding_date',
                                                               behavior,
                                                              global_net,
                                                              local_net
                                                              )])
IRdisplay::display_html(htmlreg(list(growth_behavior, growth_global, growth_local, growth_full)))
Attaching package: ‘MASS’

The following object is masked from ‘package:dplyr’:

    select

Statistical models
Cells show coefficient (standard error).

                                Model 1            Model 2            Model 3            Model 4
(Intercept)                     57.75* (27.04)     33.70 (32.30)      11.22 (34.56)      6.44 (32.63)
founding_date                   -0.00* (0.00)      -0.00 (0.00)       -0.00 (0.00)       -0.00 (0.00)
first_edit                      -0.02 (0.02)                                             -0.03 (0.03)
last_edit                       0.11*** (0.03)                                           0.00 (0.04)
total_edits                     0.05* (0.02)                                             0.11** (0.04)
edits_in_period                 0.18*** (0.05)                                           0.00 (0.07)
edit_days                       -0.00 (0.00)                                             -0.01* (0.00)
tokens_added                    -0.09*** (0.02)                                          -0.05 (0.04)
median_tokens_added             0.10*** (0.02)                                           0.09** (0.03)
wiki_count                      -0.18** (0.06)                                           -0.02 (0.09)
wiki_gini                       0.45* (0.23)                                             0.23 (0.28)
talk_ratio                      0.18 (0.59)                                              1.24 (0.90)
admin_ratio                     -0.07** (0.02)                                           -0.05 (0.03)
founding_count                  -0.06 (0.13)                                             -0.10 (0.17)
earliest_edit                   -0.01 (0.01)                                             0.02 (0.01)
page_creation_count             0.00 (0.00)                                              0.00 (0.00)
page_creation_ratio             -0.63*** (0.12)                                          -0.13 (0.24)
top_wiki_edits                  -0.00 (0.00)                                             -0.00 (0.00)
top_wiki_talk_ratio             -0.15 (0.55)                                             -0.96 (0.82)
coedit_betweenness                                 0.50 (0.71)                           -0.81 (0.77)
coedit_coreness                                    0.23** (0.09)                         0.44*** (0.12)
coedit_degree                                      -0.05 (0.06)                          -0.18 (0.10)
coedit_page_rank                                   -2.55 (8.60)                          7.03 (9.43)
talk_betweenness                                   -1.80*** (0.44)                       -0.45 (0.46)
talk_coreness                                      -0.08 (0.17)                          0.36 (0.22)
talk_indegree                                      0.01 (0.08)                           -0.12 (0.13)
talk_outdegree                                     0.18*** (0.05)                        0.11 (0.09)
talk_page_rank                                     11.08 (6.41)                          3.63 (6.83)
talk_undirected_coreness                           -0.14 (0.16)                          -0.48* (0.21)
wiki_coedit_betweenness                                               -0.04 (0.03)       -0.03 (0.03)
wiki_coedit_coreness                                                  0.03 (0.12)        -0.36** (0.14)
wiki_coedit_degree                                                    -0.14 (0.13)       0.25 (0.14)
wiki_coedit_network_size                                              0.17*** (0.05)     0.09 (0.05)
wiki_coedit_page_rank                                                 0.16 (0.11)        0.02 (0.11)
wiki_talk_betweenness                                                 -0.03 (0.03)       -0.05 (0.03)
wiki_talk_coreness                                                    0.24 (0.20)        -0.12 (0.24)
wiki_talk_indegree                                                    0.37** (0.12)      0.41** (0.15)
wiki_talk_network_size                                                -0.21*** (0.05)    -0.18** (0.06)
wiki_talk_outdegree                                                   0.13* (0.06)       -0.02 (0.10)
wiki_talk_page_rank                                                   -0.16 (0.10)       -0.05 (0.10)
wiki_talk_undirected_coreness                                         -0.60** (0.21)     -0.11 (0.24)
AIC                             5183.79            3479.97            2962.44            2918.87
BIC                             5278.02            3536.10            3024.73            3093.28
Log Likelihood                  -2571.89           -1726.99           -1466.22           -1417.43
Deviance                        815.43             546.66             463.40             451.78
Num. obs.                       822                554                470                470
***p < 0.001, **p < 0.01, *p < 0.05

In the full growth model (Model 4), the experience of having many previous edits overall and the median edit size were positively associated with growth. Two indicators of network integration (coreness in the global collaboration network and indegree in the local communication network) also positively predicted growth, while the size of the local communication network and local collaboration coreness were negatively associated with growth (all p < .01).
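
To read these coefficients on a more intuitive scale: `glm.nb` uses a log link by default, and the count-type predictors were transformed with log2(x + 1), so a coefficient b implies that doubling x + 1 multiplies the expected number of editors by exp(b). The sketch below applies this to four full-model coefficients copied from the table above; it is an interpretation aid, not part of the original analysis.

```r
# Full-model (Model 4) coefficients copied from the table above
coefs <- c(total_edits = 0.11, coedit_coreness = 0.44,
           wiki_talk_indegree = 0.41, wiki_talk_network_size = -0.18)

# Multiplicative change in expected total_editors per doubling of (x + 1)
rate_ratios <- exp(coefs)
print(round(rate_ratios, 2))
```

Under this reading, for example, doubling a founder's local talk indegree (plus one) is associated with roughly a 50% larger expected community, holding the other measures fixed.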

Closer look

We can get a better sense of what's going on by looking at the data for a few of the measures. First, let's look at how many wikis a user edited.

In this density plot, we see that founders were likely to edit more wikis than non-founders.

In [24]:
ggplot(full_stats, aes(x=wiki_count)) + 
	geom_density(aes(group=Founder, colour=Founder, fill=Founder), 
                      bw=.15,
                      alpha=0.3) + 
    scale_x_continuous(trans = 'log1p', breaks = c(1,5,10,20)) +
    ylab('Density') + xlab('Number of different wikis edited in previous 2 months (log scale)')

However, when we look at the relationship between wikis edited and eventual community size, there is not a strong relationship.

In [26]:
ggplot(full_stats[!is.na(full_stats$total_editors),], aes(x=wiki_count, y=total_editors)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm) +
    scale_y_log10() + 
    scale_x_continuous(trans = 'log1p', breaks = c(1,5,10,20)) +
    ylab('Total contributors') + xlab('Number of different wikis edited in previous 2 months (log scale)')

Another interesting measure is talk indegree: those who are talked to more aren't much more likely to found new wikis.

In [28]:
ggplot(full_stats[!is.na(full_stats$wiki_talk_indegree),], aes(x=wiki_talk_indegree)) + 
	geom_density(aes(group=Founder, colour=Founder, fill=Founder), 
                      bw=.2, 
                      alpha=0.3) + 
    scale_x_continuous(trans = 'log1p', breaks = c(1,10,100,1000,5000)) +
    ylab('Density') + xlab('Indegree in top wiki\'s communication network (log scale)')

However, when we look at the correlation between indegree and growth, it appears that those with high indegree are able to start larger communities, on average.

In [31]:
ggplot(full_stats[!is.na(full_stats$total_editors) & !is.na(full_stats$wiki_talk_indegree),], 
                                                               aes(x=wiki_talk_indegree, y=total_editors)) +
    geom_point(shape=1) +    # Use hollow circles
    geom_smooth(method=lm) +
    scale_y_log10() + 
    scale_x_continuous(trans = 'log1p', breaks = c(1,10,100,1000,5000)) +
    ylab('Total Contributors') + xlab('Indegree in top wiki\'s communication network (log scale)')

Discussion

These results echo results from entrepreneurship literature in some respects - users who are more active, with more diverse activity, are more likely to become founders. There also appears to be a proclivity for creation - those who create more pages are also more likely to be founders, as are those who have contributed to other early wikis. I know from my other research that many wikis are started on a whim, and that is one mechanism by which greater activity might occur - those on the site more have more opportunities for "whims".

While founders are often new editors and those who create new pages, high-growth wikis are created by users who are integrated into the fabric of the site's social networks and who have lots of experience editing. In particular, they have both many edits and many words per edit. These relationships suggest two complementary pathways to growth: through social capital (i.e., recruiting members of your network) and through seeding the site with initial content. The negative association between local integration and growth complicates this story, and the relationship between global and local integration bears further exploration. For example, it is possible that high local integration indicates a founder who is more dedicated to their current project and thus less willing to contribute to their new wiki, while those with high global and low local integration have both social capital and a willingness to use it.

Limitations

There are many limitations. An especially important one is that founding is, empirically, often done by new users with no history. Very few of these users show up in this dataset, and it is impossible to predict anything about them, since they found a wiki before producing enough data to learn from.

Further work could look at how the wikis of these new users compare to those of the users in my sample.