COM 674

Dad Joke

Interviewer: “Where do you see yourself in 5 years?”

Me: “Listening. I would say listening is my biggest weakness.”


  • Proposal feedback is in
  • Self evals overview
    • Please share confusions!
  • Data collection should probably be underway by now
  • JupyterLab
  • The rest of the class
    • Git tonight
    • Stats
    • Screen scraping
    • Additional topics

Paper review

Sara Klingenstein, Tim Hitchcock, and Simon DeDeo. 2014. The civilizing process in London’s Old Bailey. Proceedings of the National Academy of Sciences.

How can we detect changes in culture?

  • Digitization of historical records allows for comparison across time
    • Google N-gram viewer
    • Censuses
  • This paper uses court transcripts from London’s Old Bailey courthouse, covering the 1760s to the 1910s

Violent and non-violent offenses are talked about differently

Some questions

  • What are some of the dangers of doing this kind of work?
  • What are some computational methods that are becoming “normalized” in your discipline?

Cleaning data

  • Why is data ‘dirty’?
    • Errors in transcription
    • Bugs in software that produced it
    • Missing data (e.g., when a date is unknown it’s recoded as Jan 1)
    • Can’t be read by software
      • Wrong date format
      • Multiple age formats - e.g., ‘4’, ‘4 yo’, ‘4 years’
    • Observations that shouldn’t be in the analysis
    • Inappropriate for statistical tests
      • log-transformations
      • Coding groups (e.g., high-risk and low-risk)
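
A minimal Python sketch of a few of the cleaning steps listed above: normalizing multiple age formats, reading more than one date format, flagging Jan 1 dates that may stand in for unknown dates, log-transforming a count, and coding groups. The records, field names (`age`, `joined`, `posts`), the two date formats, and the cutoff of 40 posts are all invented for illustration; real data will need its own rules.

```python
import math
import re
from datetime import datetime

# Hypothetical raw records; field names and values are made up for this sketch.
raw_records = [
    {"age": "4", "joined": "2014-03-09", "posts": "120"},
    {"age": "4 yo", "joined": "03/09/2014", "posts": "3"},
    {"age": "4 years", "joined": "2014-01-01", "posts": "0"},
    {"age": "unknown", "joined": "not recorded", "posts": "45"},
]

# Assumption: only these two date formats occur in the data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")

# Assumption: an arbitrary cutoff for coding activity groups.
HIGH_ACTIVITY_CUTOFF = 40


def parse_age(text):
    """Return the leading integer in '4', '4 yo', '4 years'; None if absent."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def parse_date(text):
    """Try each known format in turn; return None if none of them fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def is_placeholder_date(d):
    """Flag Jan 1 dates, which may be a recoded 'unknown' rather than a real date."""
    return d is not None and d.month == 1 and d.day == 1


def clean(record):
    """Return a cleaned copy of one raw record."""
    joined = parse_date(record["joined"])
    posts = int(record["posts"])
    return {
        "age": parse_age(record["age"]),
        "joined": joined,
        "joined_suspect": is_placeholder_date(joined),
        # log(1 + x) keeps zero counts defined, unlike a plain log
        "log_posts": math.log1p(posts),
        "activity": "high" if posts >= HIGH_ACTIVITY_CUTOFF else "low",
    }


cleaned = [clean(r) for r in raw_records]
for row in cleaned:
    print(row)
```

Unparseable values become `None` instead of being guessed, and suspicious dates are flagged rather than dropped, so the decision about which observations belong in the analysis stays visible and can be revisited.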


  • Making a construct measurable
    • Constructs are not empirical and can’t be tested directly
    • We must argue that our measures represent or at least are correlated with the concepts we really care about
    • Hypotheses relate concepts together, e.g., “socially cohesive groups are more willing to contribute to shared goals”
    • To test this, we need to argue that we have one measure that represents social cohesion and another that represents willingness to contribute to shared goals.

Online data

  • Online data is “raw”
    • This is generally wonderful - we have actual conversations, full text, etc.
  • However, it isn’t made for researchers
    • It isn’t designed to measure a construct
    • We have to do the work to create measures of the constructs we care about (and recognize when we can’t)

HW Review