An Introduction

I am a Master's student at Purdue, working on my thesis project. For my project, I am studying the edits made in the online genealogy community WeRelate, using the framework of social network analysis.

My overall theory is that people in communities like WeRelate move through different patterns of behavior. For example, at first they are novices: they work mostly alone and contribute relatively little. As they learn more about the community and the technology, they start to collaborate more, do more work, and take on specialized work.

My goal is to use machine learning (specifically clustering) to identify different "behavioral signatures", and then to track how people move through those behavioral patterns. Specifically, I want to test how interactions with others affect whether, when, and how people change their behavioral patterns.

Purpose of this Blog

To study this, I am using a number of tools that I have used only casually before: PostgreSQL, R, and RSiena. While I have quite a lot of experience writing Python scripts for data manipulation, this data set is much larger than anything I've worked with before. There are 15.5 million tracked edits, stored in one giant XML document. I initially tried to manipulate the document directly with Python, but after far too much time spent waiting for my scripts to work through that file, I realized that approach wouldn't work, and I moved everything into a PostgreSQL database.

I have a few goals for this blog:

  1. To write down what I'm working on, which will hopefully motivate me to keep going.
  2. To provide a resource for others who are trying to do similar things. For RSiena in particular, it's been very tough to find beginner-level resources.
  3. Ideally, to find some people who can give me advice and suggestions.
  4. To prove to my wife  (and committee) how hard I'm working. :)
So, for the most part, my posts will be quite technical, outlining what I'm working on, and what I (have or haven't) learned.