Founding New Knowledge Commons

With the help of the Wikia organization, we surveyed over 600 founders of new communities on the wiki-hosting website Wikia.com.

Below, we explore some of their motivations and their plans for the future of their communities.

In [261]:
# First, load the libraries
%matplotlib inline

import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
matplotlib.rcParams['figure.figsize'] = (12.0, 8.0)
In [78]:
# Then, load and clean up the data a bit
d = pd.read_csv('/home/jeremy/Projects/founding_motivation/cleaned_survey.csv')
d.Progress = pd.to_numeric(d.Progress)

def to_numeric(n):
    '''Take the first character of a response (the numeric rating) and
    convert it to a number; return None for anything unparseable.'''
    try:
        return pd.to_numeric(n[0])
    except (TypeError, IndexError, ValueError):
        return None

# This regex selects the skill and motivation columns (excluding the
# free-text ones, which end in 'text') and makes them numeric.
numeric_cols = list(d.filter(regex=r'^skill|motiv.*(?<!text)$').columns)
d[numeric_cols] = d[numeric_cols].applymap(to_numeric)

# For the motivation items: if a respondent answered more than half of them,
# we treat their remaining blanks as "not a motivation" and fill them with 1.
# Respondents who answered fewer are left missing (and dropped later).
motiv_answered = d.loc[:,'motivation-playing_around/learning':'motivation-Other_weight'].apply(lambda x: sum(x.notnull()) > 6, axis=1)
d.loc[motiv_answered,'motivation-playing_around/learning':'motivation-Other_weight'] = d.loc[motiv_answered,'motivation-playing_around/learning':'motivation-Other_weight'].fillna(1)
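
As a quick sanity check, we can count how many respondents answered more than half of the motivation items and so had their remaining blanks filled:

In [ ]:
# Count respondents whose missing motivation answers were filled with 1
print('{} of {} respondents had motivation blanks filled'.format(
    motiv_answered.sum(), len(d)))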

Motivations

We start off by looking at motivations. We included a set of questions about what motivated founders to start their new community. Respondents were asked to rate 13 different possible motivations from 1 (not a motivation at all) to 5 (a primary motivation).

In [221]:
# Get the motivation columns and keep only complete responses
motivations = d.loc[:,'motivation-playing_around/learning':'motivation-Other_weight']
X = motivations.dropna()
print('{} total responses'.format(len(X)))
# For each motivation: mean, SD, and how often it was rated 5 (a primary motivation)
r = pd.DataFrame([X.mean(), X.std(), (X == 5).sum()]).T
r.columns = ['mean', 'SD', 'Primary Motivation Count']
r
328 total responses
Out[221]:
mean SD Primary Motivation Count
motivation-playing_around/learning 2.719512 1.528721 64.0
motivation-joke 1.533537 1.156858 22.0
motivation-personal_material 3.115854 1.640092 106.0
motivation-communicate_with_friends 2.939024 1.559013 84.0
motivation-publicity 2.420732 1.567970 56.0
motivation-new_product 2.262195 1.623086 62.0
motivation-no_existing_community 3.289634 1.730713 139.0
motivation-poor_quality_existing 2.134146 1.583649 57.0
motivation-spread_information 4.067073 1.403931 201.0
motivation-disagreed_with_existing 1.932927 1.453167 43.0
motivation-build_a_community 3.557927 1.511323 138.0
motivation-governing_the_conversation 3.015244 1.648764 100.0
motivation-Other_weight 1.984756 1.674529 73.0

The desire to spread information is the most common motivation, followed by the desire to build a community and the absence of an existing community. Creating a community as a joke, or because the founder disagreed with how an existing community was being run, were rare motivations overall, but each was listed as a primary motivation at least 22 times.

Correlation

We next look at the correlations among motivation responses and see that there are few strong relationships. Unsurprisingly, the two motivations related to existing wikis are correlated.

A little more surprisingly, there is a correlation between the motivation to build a community and to govern the conversation.

In [ ]:
from biokit.viz import corrplot
In [262]:
c = corrplot.Corrplot(X.corr())
c.plot(upper='square')
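
If the biokit package is not available, a plain matplotlib heatmap gives a similar (if less polished) view of the correlation matrix; a minimal sketch:

In [ ]:
# Fallback correlation heatmap using matplotlib directly (sketch; biokit's
# corrplot offers nicer formatting, but this needs no extra dependencies)
corr = X.corr()
fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap='RdBu_r')
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, label='correlation')
plt.show()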

Factor Analysis

We next perform a simple factor analysis. This is a technique for figuring out which responses are related to each other by summarizing them with a smaller number of latent factors. We apply it to motivations here, and to definitions of success below.

I haven't done much factor analysis, and it looks like R has better tools for complex factor analyses, but we can learn a little from this simple one.

In [227]:
from sklearn.decomposition import FactorAnalysis
from sklearn import preprocessing
# Fit the model
X_norm = preprocessing.scale(X)
fa = FactorAnalysis(n_components=3)
fit = fa.fit(X_norm)

pd.DataFrame(fit.components_,columns=X.columns)
Out[227]:
motivation-playing_around/learning motivation-joke motivation-personal_material motivation-communicate_with_friends motivation-publicity motivation-new_product motivation-no_existing_community motivation-poor_quality_existing motivation-spread_information motivation-disagreed_with_existing motivation-build_a_community motivation-governing_the_conversation motivation-Other_weight
0 0.304799 0.201451 0.225908 0.362026 0.367095 0.332778 0.131988 0.662338 0.401281 0.619156 0.525146 0.561466 0.084039
1 -0.071324 0.070387 -0.214251 -0.323388 -0.243195 0.036417 -0.135294 0.460029 -0.272511 0.327754 -0.414283 -0.291444 0.025427
2 0.285512 0.344092 0.449547 0.359698 0.161334 -0.047414 -0.208144 -0.030066 -0.343197 0.005151 -0.252651 -0.033233 -0.018148

For example, Factor 0 appears to represent a general desire to create and spread information, Factor 1 the desire to break away from existing wikis, and Factor 2 local motivations: learning, joking with friends, and creating and publicizing one's own content.
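
To make the loadings easier to scan than the wide table above, a small helper (a sketch using the fitted model) can list the strongest-loading items for each factor:

In [ ]:
# List the items with the largest absolute loadings on each factor
loadings = pd.DataFrame(fit.components_, columns=X.columns)
for i, row in loadings.iterrows():
    top = row.abs().sort_values(ascending=False).head(4)
    print('Factor {}:'.format(i))
    for name in top.index:
        print('  {:+.2f}  {}'.format(row[name], name))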

Text Analysis

We also asked a free-response question about why people started their wiki. I will hand-code these responses later, but we can also gain some insight from topic modeling, a technique that attempts to recover the latent topics assumed to lie behind the creation of a collection of texts.

In [161]:
# Get all of the responses that contain more than 5 words.
free_response = d.why_start_wiki
free_response = free_response[free_response.apply(lambda x: len(str(x).split()) > 5)]
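
Before fitting the model, it's worth checking how many responses survive the length filter:

In [ ]:
# Number of free responses with more than 5 words
print('{} free responses'.format(len(free_response)))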
In [178]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_features = 20000
n_topics = 6
n_top_words = 12


data_samples = free_response

# Use tf (raw term count) features for LDA.
tf_vectorizer = CountVectorizer(max_df=0.95, # Terms that show up in > max_df of documents are ignored
                                min_df=2, # Terms that show up in < min_df of documents are ignored
                                max_features=n_features, # Only use the top max_features 
                                stop_words='english',
                                ngram_range=(1,2))

tf = tf_vectorizer.fit_transform(data_samples)


lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0,
                                n_jobs=2)

# Fit the model once; fit_transform also returns the per-document topic weights
transformed_model = lda.fit_transform(tf)

tf_feature_names = tf_vectorizer.get_feature_names()

# Normalize each document's topic weights so they sum to one
topic_dist = [[topic/sum(curr_array) for topic in curr_array]
              for curr_array in transformed_model]

td = pd.DataFrame(topic_dist)

def get_top_words(model, feature_names, n_top_words):
    '''Takes the model, the words used, and the number of words requested.
    Returns a dataframe of the top n_top_words for each topic'''
    r = pd.DataFrame()
    # For each topic
    for i, topic in enumerate(model.components_):
        # Get the top feature names, and put them in that column
        r[i] = [add_quotes(feature_names[j])
                    for j in topic.argsort()[:-n_top_words - 1:-1]]
    return r

def add_quotes(s):
    if " " in s:
        s =  '"{}"'.format(s)
    return s

# Rearrange the columns by how often each topic is used
td = td.reindex(columns=sorted(td.columns, key=lambda x: td[x].sum(), reverse=True))
topic_words = get_top_words(lda, tf_feature_names, 20)
topic_words = topic_words.reindex(columns=sorted(topic_words.columns, key=lambda x: td[x].sum(), reverse=True))
topic_words.columns = ['Topic {}'.format(i) for i in range(1, len(topic_words.columns) + 1)]

print("\nTopics in LDA model, ordered by frequency:")
topic_words.head(15)
Topics in LDA model, ordered by frequency:
Out[178]:
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
0 information wiki make want wikia felt
1 wanted make wanted people thought wikis
2 game decided just help know info
3 wiki started wiki make shows reason
4 wikia people new characters games current
5 share wanted book place just walking
6 place wikis "wanted make" learn thing dead
7 people good series different wikis lack
8 help love like field admin players
9 create community world "want place" love just
10 just just clan story online different
11 site "decided make" website build like "walking dead"
12 ve friend "make wiki" way wanted knowledge
13 need learn idea ideas youtube role
14 start know let try "just thought" start

The results are not super intuitive, but we might describe Topic 1 as information sharing, Topics 2 and 3 as spur-of-the-moment decisions, and Topic 4 as creating a place for creativity. Topics 5 and 6 are less clear.
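
One way to sanity-check these labels is to read the response that loads most heavily on each topic; a quick sketch using the document-topic weights:

In [ ]:
# For each topic (in frequency order), print the response with the
# highest weight on that topic
for col in td.columns:
    best_doc = td[col].idxmax()
    print('Topic column {}: {}'.format(col, free_response.iloc[best_doc]))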

Defining Success

We apply a similar approach to look at how founders view success. We asked founders to "Think about how you will assess whether or not your wiki is successful, and rank which of the following are the most important measures of success for you," and provided a number of measures of success, such as the quality of the information, the number of contributors, etc. Because these are rankings, a lower number means a more important measure.

As with motivation, we begin with descriptives.

In [240]:
# Get the success-ranking columns and keep only complete responses
success = d.loc[:,'successful-large_number_of_contributors':'successful-Other']
X = success.dropna()
print('{} total responses'.format(len(X)))
# For each measure: mean rank, SD, and how often it was ranked first
r = pd.DataFrame([X.mean(), X.std(), (X == 1).sum()]).T
r.columns = ['mean', 'SD', 'Top Measure Count']
r
268 total responses
Out[240]:
mean SD Top Measure Count
successful-large_number_of_contributors 4.660448 1.514104 10.0
successful-large_amount_of_information 2.988806 1.585239 31.0
successful-High_quality_information_ 2.302239 1.651828 124.0
successful-long_time_community 3.585821 1.495344 28.0
successful-active_community 3.776119 1.502267 28.0
successful-Meeting_new_people 4.634328 1.726814 20.0
successful-Other 6.052239 1.971015 27.0

We see that producing high-quality information is by far the most important criterion (these are ranks, so lower means more important), followed by producing a large amount of information and then the community measures. The fact that the "Other" category is consistently ranked last suggests that we did a pretty good job of capturing the most important ways founders measure success.

Correlation

We next look at the correlations among success responses and see an interesting pattern. Valuing a large amount of information and valuing high-quality information are positively correlated, but both are negatively correlated with all of the other goals. Some negative correlation is expected with rankings, since ranking one measure as more important necessarily pushes the others down.

In [263]:
c = corrplot.Corrplot(X.corr())
c.plot(upper='square')
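
To get a sense of how much of that negative correlation is mechanical, we can simulate uniformly random rankings of seven items (purely illustrative):

In [ ]:
# With random rankings of 7 items, every pairwise correlation is roughly
# -1/6 by construction; correlations more negative than that suggest
# genuine trade-offs between goals
rng = np.random.RandomState(0)
fake_ranks = pd.DataFrame([rng.permutation(7) + 1 for _ in range(1000)])
print(fake_ranks.corr().round(2))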

Factor Analysis

As with motivation, we perform a simple factor analysis. In this case, Factor 0 also appears to represent a general desire for success, and Factor 1 community-building. Factor 2 is interesting because it appears to show a desire for a large number of people producing a lot of information, with longevity unimportant. Finally, Factor 3 is all about information creation.

In [229]:
X_norm = preprocessing.scale(X)
# Fit the model
fa = FactorAnalysis(n_components=4)
fit = fa.fit(X_norm)

pd.DataFrame(fit.components_,columns=X.columns)
#print(fit.noise_variance_init)
Out[229]:
successful-large_number_of_contributors successful-large_amount_of_information successful-High_quality_information_ successful-long_time_community successful-active_community successful-Meeting_new_people successful-Other
0 0.169222 0.121574 0.135705 0.329810 0.210516 0.263791 -0.983277
1 0.098072 0.184010 0.242966 0.506480 0.023101 -0.895266 -0.044462
2 0.608608 0.267216 0.159714 -0.709807 0.098904 -0.281060 -0.106926
3 -0.756043 0.614027 0.654464 -0.340754 0.108272 -0.193855 -0.115713

Topic Modeling

Topic modeling follows a similar strategy. Here, we look only at the text that people entered in the "Other" field as a measure of success; in that sense, this is an analysis of unexpected success definitions, ones not captured by our original survey options.

In [212]:
n_features = 20000
n_topics = 3
#n_top_words = 12


# Keep only the "Other" free-text responses that contain more than 4 words
data_samples = d.loc[d['successful-Other_text'].apply(lambda x: len(str(x).split()) > 4),'successful-Other_text']

# Use tf (raw term count) features for LDA.
tf_vectorizer = CountVectorizer(max_df=0.95, # Terms that show up in > max_df of documents are ignored
                                min_df=2, # Terms that show up in < min_df of documents are ignored
                                max_features=n_features, # Only use the top max_features 
                                stop_words='english',
                                ngram_range=(1,2))

tf = tf_vectorizer.fit_transform(data_samples)


lda = LatentDirichletAllocation(n_topics=n_topics, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0,
                                n_jobs=2)

# Fit the model once; fit_transform also returns the per-document topic weights
transformed_model = lda.fit_transform(tf)

tf_feature_names = tf_vectorizer.get_feature_names()

topic_dist = [[topic/sum(curr_array) for topic in curr_array]
              for curr_array in transformed_model]

td = pd.DataFrame(topic_dist)

# Rearrange the columns by how often each topic is used
td = td.reindex(columns=sorted(td.columns, key=lambda x: td[x].sum(), reverse=True))
topic_words = get_top_words(lda, tf_feature_names, 20)
topic_words = topic_words.reindex(columns=sorted(topic_words.columns, key=lambda x: td[x].sum(), reverse=True))
topic_words.columns = ['Topic {}'.format(i) for i in range(1, len(topic_words.columns) + 1)]

print("\nTopics in LDA model, ordered by frequency:")
topic_words.head(15)
Topics in LDA model, ordered by frequency:
Out[212]:
Topic 1 Topic 2 Topic 3
0 people want info
1 information groups useful
2 provide actually place
3 topic topic actually
4 actually place provide
5 groups people want
6 useful information groups
7 place useful topic
8 want info people
9 info provide information

These topics don't appear to provide any intuitive help. One limitation of topic modeling is that it can be noisy on a small corpus, and this corpus is quite small. In this case, we can simply read the responses directly.

In [239]:
for x in data_samples:
    print(x)
A safe place for me and my "clanmates"
unit members find it useful and help populate it
everyone interis Light man story.
A groups of people which cares about me
Information is added, updated and following. Nothing else matters.
spreading a ton of info to ma frends witout dem bein bord
Wish I could answer this about my main wiki...
People that actually benefit from my information
Empowering people to fight Congress
A complete 'encyclopedia' if you will
Allowing users to use their imaginations and write about what they want, if related to Venture
to learn about others and they learn about me
People who disagree and provide original opinion.
people having fun and showing it in some way
A place where people can go to find what they want about any topic of Azure Mines
Curiousity about the history of PiraĆ­ City
More publicity for my work
A large amount of contribution
Coherent, interesting, informative, magnanimous, and engaging information about characters, presentations of the fights, and discussions about the hypothetical encounters in question.
How useful other players find the info
seeing the material actually used by Democracy Spring groups and individuals
The medium being rich enough to house the information I want it to
Interaction with the topic in other areas (roleplay, fanfiction, fanart)
Helping people with their animals
attraction of potential audience from all over the world
High page views, being the #1 source on the topic
To keep track of the campaign
To provide people with my creations

We can quickly see why topic modeling had a difficult time. The responses are quite short, and often difficult to decipher. A few themes that do emerge are the desire for the wiki to be useful to others (e.g., "seeing the material actually used by Democracy Spring groups and individuals", "People that actually benefit from my information"), the desire for audience (e.g., "High page views", "More publicity for my work"), and the desire for relationships (e.g., "A groups of people which cares about me", "A safe place for me and my 'clanmates'").

Motivations and Success

An examination of the correlations between motivations and success measures shows very few relationships. The one exception is that those whose goal is to meet new people are less likely to be motivated by learning, joking, or communicating with friends.

In [281]:
motiv_success = motivations.join(success)
c = corrplot.Corrplot(motiv_success.corr())
c.plot(upper='square')
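
Since a 20-variable correlation plot is hard to read, we can also extract the strongest motivation-success pairs numerically; a quick sketch:

In [ ]:
# Show the motivation x success pairs with the largest absolute correlations
cross = motiv_success.corr().loc[motivations.columns, success.columns]
flat = cross.stack()
print(flat.reindex(flat.abs().sort_values(ascending=False).index).head(8))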

Regression

We could extend this analysis by using regression to predict certain outcomes.

For example, below I predict the planned commitment to the project, based on motivations. I could also gather data to look at, for example, the success of a project based on motivations, or the activity of a founder based on motivations (i.e., "Which motivations actually motivate?")

In [321]:
import statsmodels.api as sm

# Binarize planned time commitment for logistic regression (an ordinal model
# would probably be better). Responses whose text begins with 'M' or '3'
# correspond to the higher-commitment categories.
motivations['high_commitment'] = d.time_dedicated.str.match(r'[M3]')
motivations = motivations.dropna()
Y = motivations.high_commitment.astype('float')
X = motivations.drop('high_commitment', axis=1)
X['intercept'] = 1.0
logit = sm.Logit(Y, X).fit()
logit.summary()
Optimization terminated successfully.
         Current function value: 0.665872
         Iterations 5
Out[321]:
Logit Regression Results
Dep. Variable: high_commitment No. Observations: 310
Model: Logit Df Residuals: 296
Method: MLE Df Model: 13
Date: Thu, 11 Aug 2016 Pseudo R-squ.: 0.03923
Time: 11:31:30 Log-Likelihood: -206.42
converged: True LL-Null: -214.85
LLR p-value: 0.2058
coef std err z P>|z| [95.0% Conf. Int.]
motivation-playing_around/learning -0.2801 0.088 -3.171 0.002 -0.453 -0.107
motivation-joke 0.0936 0.110 0.854 0.393 -0.121 0.308
motivation-personal_material 0.0732 0.081 0.903 0.367 -0.086 0.232
motivation-communicate_with_friends 0.0724 0.087 0.832 0.406 -0.098 0.243
motivation-publicity -0.0649 0.086 -0.759 0.448 -0.233 0.103
motivation-new_product 0.0806 0.080 1.014 0.311 -0.075 0.236
motivation-no_existing_community 0.0152 0.073 0.210 0.834 -0.127 0.157
motivation-poor_quality_existing -0.0768 0.094 -0.814 0.416 -0.262 0.108
motivation-spread_information -0.0737 0.098 -0.754 0.451 -0.265 0.118
motivation-disagreed_with_existing 0.1462 0.101 1.444 0.149 -0.052 0.345
motivation-build_a_community 0.1065 0.098 1.085 0.278 -0.086 0.299
motivation-governing_the_conversation 0.0639 0.086 0.741 0.458 -0.105 0.233
motivation-Other_weight -0.0294 0.073 -0.401 0.689 -0.173 0.114
intercept -0.2807 0.521 -0.539 0.590 -1.302 0.741

In this case, we see that only the playing around/learning motivation has a significant relationship with planned commitment: founders motivated by playing around or learning are less likely to plan a high level of commitment to the community.
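
For easier interpretation, the coefficients can be exponentiated into odds ratios, where values below 1 mean the motivation is associated with lower odds of planned high commitment:

In [ ]:
# Convert logit coefficients to odds ratios
print(np.exp(logit.params).sort_values())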

Summary

Overall, the results of this survey show that peer production founders have diverse motivations. Many of these are built around the desire to be useful: to create, share, and organize information for others. Others are built around a desire for community and relationships. Some communities are founded in response to the perceived poor management of other knowledge commons. A surprising number of new wikis are founded with local goals: spreading one's own content, providing space for oneself and one's friends to be creative, or identifying like-minded others.