Distinguishing Family Sites from Adult Sites

I compared the full set of sites classified as Family with those classified as 'XXX Adult' (labeled Porn in the data), using basic demographic information and site statistics.

We begin by loading the data file and training a shallow decision tree.

In [2]:
import pandas as pd
from sklearn import tree

# Import the data, drop the last column, and get rid of rows with missing data
p = pd.read_csv('./all_sites.csv')
d = p.iloc[:, :-1].dropna().copy()

# The first column holds the class label; the remaining columns are features
x = d.iloc[:, 1:]
y = d.iloc[:, 0]

# Train a shallow decision tree
clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(x, y)

# Save the tree in Graphviz DOT format
with open('tree.dot', 'w') as f:
    tree.export_graphviz(clf, out_file=f, feature_names=list(x.columns))
In [3]:
# Render the tree to PNG with Graphviz's dot tool and display it
from subprocess import check_call
from IPython.display import Image
check_call(['dot','-Tpng','tree.dot','-o','tree.png'])
Image("./tree.png")
Out[3]:
[Image: tree.png, the rendered decision tree]

Analysis

The decision tree provides "rules" for classifying a given site. The bottom-most boxes are the leaf nodes, and the "value" in each node gives the number of Family and Adult (Porn) sites it contains, respectively. The most salient feature is Median Age, which is actually higher for Adult sites. After that, Average Minutes per Page is important, with more minutes per page associated with Family sites.
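To make the idea of a "rule" concrete, here is a minimal sketch of how a single split threshold can be chosen by minimizing weighted Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The helper names (`gini`, `best_threshold`) and the toy ages are hypothetical and are not taken from all_sites.csv; the real implementation is more elaborate.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = (None, gini(labels))  # start from the impurity of the unsplit node
    # Candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical toy data for illustration only
median_age = [38, 40, 42, 44, 52, 54, 56, 41]
site_type = ["Family", "Family", "Family", "Family", "Porn", "Porn", "Porn", "Porn"]

threshold, impurity = best_threshold(median_age, site_type)
print(f"split: Median Age <= {threshold}, weighted Gini = {impurity:.3f}")
```

A tree of depth 3 repeats this search up to three times along each path, each time over every feature, which is why the first split tends to be the feature that separates the two classes most cleanly.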

The group medians tell a similar story: Median Age is 54 for Porn sites versus 40 for Family sites, and Average Minutes per Page is 0.392 versus 0.822.

In [4]:
# Median of every numeric column per site type, transposed so each measure is a row
p.groupby('Site_Type').median(numeric_only=True).T
Out[4]:
Site_Type                           Family       Porn
Target Audience (000)             175.7345    87.3480
% Reach                              0.093      0.046
% Composition Unique Visitors          100        100
Composition Index                      100        100
Target Lift Index                      100        100
Average Daily Visitors (000)       10.7290     4.7095
Total Minutes (MM)                   1.097      0.206
Average Minutes per Usage Day       3.1015     1.2555
Total Pages Viewed (MM)              1.494      0.517
Average Pages per Usage Day          4.001      3.031
Average Minutes per Page             0.822      0.392
Average Usage Days per Visitor       1.603      1.562
Average Minutes per Visitor         4.9595     2.0010
Average Pages per Visitor            6.459      4.926
Median Age                              40         54
Mean Age                              41.3       51.7
Median Income                        72500      67500
Mean Income                       81660.75   76055.05
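One rough way to read the medians is to look at the Family/Porn ratio for each measure. The sketch below copies a handful of values from the table above and ranks them by how far the ratio is from 1 on a log scale. The choice of measures and the ranking rule are my own assumptions for illustration; a ratio of medians ignores the spread within each group, so it is not the same thing as the split quality the tree optimizes.

```python
import math

# (Family median, Porn median), copied from the table above
medians = {
    "Median Age": (40, 54),
    "Average Minutes per Page": (0.822, 0.392),
    "Average Minutes per Visitor": (4.9595, 2.0010),
    "Average Usage Days per Visitor": (1.603, 1.562),
    "Median Income": (72500, 67500),
}

# Ratio above 1 means the Family median is larger; below 1 means Porn is larger
ratios = {name: family / porn for name, (family, porn) in medians.items()}

# Rank by distance from 1 on a log scale, so 2x and 0.5x count equally
ranked = sorted(ratios, key=lambda name: abs(math.log(ratios[name])), reverse=True)

for name in ranked:
    print(f"{name:32s} Family/Porn = {ratios[name]:.2f}")
```

Median Age is the only measure in this subset where the Porn median is the larger one, which is consistent with the observation in the Analysis section.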