Eventually, I will use social network analysis to explain why people move through roles, but I am starting by identifying what the roles are and computing some summary statistics of how people move through them.
In order to identify roles, I created monthly activity snapshots for each user, and then used a clustering algorithm in R to automatically identify different "behavioral roles". There is some evidence that the data don't cluster cleanly, but clustering isn't a central component of my research, so I am moving forward anyway.
I decided to use the k-medoids (aka "partitioning around medoids" or "pam") algorithm in R. I used the silhouette function to pick the best k, which turned out to be 3 clusters.
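For reference, here is a rough Python sketch of the same idea: k-medoids, with the mean silhouette width used to choose k. It is not the R pam code used above. The greedy start followed by simple alternating updates is a stand-in for pam's build and swap phases, and the toy points are made up rather than real activity snapshots.

```python
from math import dist  # Euclidean distance, Python 3.8+

def distance_matrix(points):
    return [[dist(p, q) for q in points] for p in points]

def total_cost(d, medoids):
    # Sum of each point's distance to its nearest medoid
    return sum(min(row[m] for m in medoids) for row in d)

def k_medoids(d, k, max_iter=100):
    n = len(d)
    # Greedy start: the most central point first, then repeatedly add
    # the point that lowers the total cost the most.
    medoids = [min(range(n), key=lambda i: sum(d[i]))]
    while len(medoids) < k:
        rest = [i for i in range(n) if i not in medoids]
        medoids.append(min(rest, key=lambda c: total_cost(d, medoids + [c])))
    for _ in range(max_iter):
        # Assign every point to its nearest medoid
        labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            # Keep the old medoid if a cluster ends up empty
            updated.append(min(members, key=lambda i: sum(d[i][j] for j in members))
                           if members else medoids[c])
        if updated == medoids:
            break
        medoids = updated
    labels = [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]
    return labels, medoids

def mean_silhouette(d, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(d[i][j] for j in own) / (len(own) - 1)  # mean distance within own cluster
        b = min(sum(d[i][j] for j in other) / len(other)  # nearest other cluster
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Made-up 2-D points forming three well-separated groups
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 0), (20, 1), (21, 0), (21, 1)]
d = distance_matrix(points)
scores = {k: mean_silhouette(d, k_medoids(d, k)[0]) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

The k with the highest mean silhouette width is taken as the best one. On real, messier data the scores can be close together, which is one way the "doesn't cluster cleanly" problem shows up.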
The data I used to create the clusters only included those months where a user made at least 5 edits. So, I created a 4th cluster to represent months where a user made fewer than 5 edits, and used a Python program to add the cluster results to the original stats file.
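The merge step could look something like the sketch below. The field names (user, month, edits), the in-memory rows, and the use of 0 as the label for the low-activity cluster are all assumptions for illustration, not the actual layout of the stats file.

```python
MIN_EDITS = 5      # months below this were left out of the clustering
LOW_ACTIVITY = 0   # assumed label for the extra "fewer than 5 edits" cluster

def label_months(stats_rows, cluster_of):
    """Attach a role to every user-month.

    stats_rows: list of dicts with hypothetical keys user, month, edits.
    cluster_of: {(user, month): cluster} for the months that were clustered.
    """
    labeled = []
    for row in stats_rows:
        if row["edits"] < MIN_EDITS:
            role = LOW_ACTIVITY
        else:
            role = cluster_of[(row["user"], row["month"])]
        labeled.append({**row, "role": role})
    return labeled

# Made-up example rows
stats = [
    {"user": 1, "month": "2009-01", "edits": 12},
    {"user": 1, "month": "2009-02", "edits": 3},
    {"user": 2, "month": "2009-01", "edits": 40},
]
clusters = {(1, "2009-01"): 2, (2, "2009-01"): 1}
labeled = label_months(stats, clusters)
roles = [row["role"] for row in labeled]
print(roles)
```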
I then wrote another Python script to rearrange this data, so that it is in the format:
ID Month1 Month2 Month3 ...
1 1 2 2
2 1 0 2
so that Month1 is the user's role in their first month, Month2 the role in that user's 2nd month, etc.
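A minimal sketch of that rearrangement, assuming the labeled data is a list of (user, month, role) records with sortable month strings and one record for every month of a user's tenure. The record layout is hypothetical, and the values are chosen to reproduce the two example rows above.

```python
def to_wide(records):
    """Turn (user, month, role) records into {user: [role in 1st month, 2nd, ...]}."""
    by_user = {}
    for user, month, role in records:
        by_user.setdefault(user, {})[month] = role
    # Sorting the months puts each user's roles in tenure order
    return {user: [months[m] for m in sorted(months)]
            for user, months in by_user.items()}

# Made-up records; note the two users start in different calendar months
records = [
    (1, "2009-01", 1), (1, "2009-02", 2), (1, "2009-03", 2),
    (2, "2009-05", 1), (2, "2009-07", 2), (2, "2009-06", 0),
]
wide = to_wide(records)
for user, seq in wide.items():
    print(user, *seq)
```

Because the columns are indexed by tenure rather than by calendar month, users who joined at different times line up on Month1. If some months are missing from the records, this sketch silently closes the gap, so it only works when every month is present.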
For way, way too long today I've been trying to get R to display this data as a stacked area graph of the proportion of users in each role by month. It's been a huge pain to figure out how to reshape it, etc.
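The reshaping itself comes down to counting, for each tenure month, what share of the users who have that month are in each role. Here is a small Python sketch of that step rather than the R version, assuming the wide data is a dict of per-user role sequences and that roles are labeled 0 to 3 (both are assumptions).

```python
from collections import Counter

def role_proportions(wide, roles):
    """Return one row per tenure month: the share of users in each role."""
    n_months = max(len(seq) for seq in wide.values())
    table = []
    for m in range(n_months):
        # Only count users whose tenure reaches month m
        counts = Counter(seq[m] for seq in wide.values() if len(seq) > m)
        total = sum(counts.values())
        table.append([counts[r] / total for r in roles])
    return table

wide = {1: [1, 2, 2], 2: [1, 0, 2]}   # the two example users
roles = [0, 1, 2, 3]
table = role_proportions(wide, roles)
for month, row in enumerate(table, start=1):
    print(month, row)
```

Each row sums to 1, so the table is already in the shape a stacked area plot needs: one series per role (the columns), with months along the x axis. The denominator shrinks as users drop out, which is worth keeping in mind when reading the later months.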
Once I can get that, I want to compare that graph to the graph of those who were in each role at least X times during their tenure.
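Picking out that comparison group is a simple filter on the wide data. A sketch, again assuming a dict of per-user role sequences (the function name and arguments are hypothetical):

```python
def users_with_role(wide, role, min_times):
    """Keep users who were in the given role in at least min_times months."""
    return {user: seq for user, seq in wide.items()
            if seq.count(role) >= min_times}

wide = {1: [1, 2, 2], 2: [1, 0, 2]}   # the two example users
twice_in_role_2 = users_with_role(wide, role=2, min_times=2)
once_in_role_1 = users_with_role(wide, role=1, min_times=1)
print(sorted(twice_in_role_2), sorted(once_in_role_1))
```

The filtered dict has the same shape as the full one, so the same monthly proportions can be computed on it and the two graphs compared side by side.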