In [1]:
import sys
import pickle
sys.path.append("../tools/")

from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
/home/willemolding/repos/scikit-learn/sklearn/cross_validation.py:42: DeprecationWarning: This module has been deprecated in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [2]:
with open("final_project_dataset.pkl", "r") as data_file:
    data_dict = pickle.load(data_file)
df = pd.DataFrame.from_dict(data_dict, orient='index')
df = df.replace('NaN', np.nan)

Understanding the Dataset

In [3]:
# statistics about the data set
print 'Number of people:', df['poi'].count()
print 'Number of POIs:', df.loc[df.poi == True, 'poi'].count()
print 'Fraction of examples that are POIs:', \
    float(df.loc[df.poi == True, 'poi'].count()) / df['poi'].count()
print 'Number of features:', df.shape[1]
Number of people: 146
Number of POIs: 18
Fraction of examples that are POIs: 0.123287671233
Number of features: 21
In [4]:
#The counts of NaNs in each column
print 'Missing data in each column'
df.isnull().sum()
Missing data in each column
Out[4]:
salary                        51
to_messages                   60
deferral_payments            107
total_payments                21
exercised_stock_options       44
bonus                         64
restricted_stock              36
shared_receipt_with_poi       60
restricted_stock_deferred    128
total_stock_value             20
expenses                      51
loan_advances                142
from_messages                 60
other                         53
from_this_person_to_poi       60
poi                            0
director_fees                129
deferred_income               97
long_term_incentive           80
email_address                 35
from_poi_to_this_person       60
dtype: int64

Looking at the distribution of the missing data we can see that the counts for all the email features are identical (60 each). This suggests that one process causing the missing data is a complete lack of email records for certain persons.

This suggests a possible new feature to me. It may be the case that persons involved in the fraud were able to encrypt or otherwise prevent their emails from coming into the hands of the courts. To test this hypothesis I will create the feature and look at the counts of POIs and non-POIs.

In [5]:
df['email_withheld'] = df.to_messages.isnull()
df.groupby(['email_withheld', 'poi'])['poi'].count()
Out[5]:
email_withheld  poi  
False           False    72
                True     14
True            False    56
                True      4
Name: poi, dtype: int64

From inspection this feature does not appear to be particularly informative; however, I will include it in the model as it could still be a useful feature for growing decision trees.

The pattern of missing values in the financial features is less predictable. One way of interpreting this data is that a NaN value is equivalent to zero: for example, if an employee did not receive a bonus then this value could be recorded as NaN. This view is supported by how the totals are calculated in the insider-pay PDF document. I decided to replace all the NaNs in the financial features with zero.
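As a quick sanity check of this interpretation (a sketch; the component list below is my reading of the insider-pay document, not something defined elsewhere in this notebook), the individual payment fields should sum to total_payments once NaNs are treated as zero:

payment_components = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
                      'deferral_payments', 'loan_advances', 'other', 'expenses',
                      'director_fees']
# With NaN == 0, the components should add up to the reported total for
# (almost) every person; a handful of data-entry errors may not match.
component_sum = df[payment_components].fillna(0).sum(axis=1)
print (component_sum == df['total_payments'].fillna(0)).sum(), 'of', len(df), 'rows match'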

In [6]:
financial_features = ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']
In [7]:
df[financial_features] = df[financial_features].fillna(0)

As for the email features, I first decided to drop the email address column, as it is uninformative for a machine learning algorithm. Some possibilities for how to deal with the remaining missing values are:

  • Drop rows that contain missing values
  • Impute the missing values with some function of the columns
  • Develop two independent classifiers for the email and financial features and combine their outputs

If we drop the rows containing NaNs we lose almost half of our training data, which is not acceptable. Creating two classifiers could be an elegant solution, but its implementation may be too complex for this project, and a learner would be unable to model interactions between financial and email features; a rough sketch of the idea is shown below anyway.
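Purely for illustration, a minimal sketch of the two-classifier idea (not pursued further; averaging the two predicted probabilities is a hypothetical choice, and the email model simply skips people with missing email records):

from sklearn.naive_bayes import GaussianNB

# Financial model: trained on everyone (financial NaNs were filled with zero above).
fin_clf = GaussianNB().fit(df[financial_features], df.poi)

# Email model: trained only on people with complete email records.
has_email = df[email_features].notnull().all(axis=1)
email_clf = GaussianNB().fit(df.loc[has_email, email_features],
                             df.loc[has_email, 'poi'])

# Average the two predicted POI probabilities where both models apply;
# fall back on the financial model alone where email data is missing.
p_fin = fin_clf.predict_proba(df[financial_features])[:, 1]
p_combined = p_fin.copy()
p_email = email_clf.predict_proba(df.loc[has_email, email_features])[:, 1]
p_combined[has_email.values] = (p_fin[has_email.values] + p_email) / 2.0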

I decided to go with the imputation method using the median of the columns.

In [8]:
df[email_features] = df[email_features].fillna(df[email_features].median())

Identifying Outliers

Now that the raw data has been processed and no longer contains NaN values, we need to remove any outlier records that may cause our algorithms to perform poorly further down the track. Probably the best way to identify outliers at this stage is by producing plots. The seaborn library is used to make a pair plot of a few variables of interest. Points are coloured red for POIs and blue for non-POIs.

In [9]:
import seaborn as sns
g = sns.PairGrid(df, vars=['salary','bonus','total_stock_value','total_payments'],
                hue='poi')
g.map_lower(plt.scatter)
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=1)

One particular point stands out in all of the scatter plots as being far greater than the others. It is also interesting to note that this point is not a POI. This point will be investigated further.

In [10]:
df[df.salary > 2.5e7]
Out[10]:
         salary  to_messages  total_payments  total_stock_value  ...    poi  email_withheld
TOTAL  26704229         1211       309886585          434509511  ...  False            True

1 rows × 22 columns

On closer inspection it is revealed that this point is in fact the 'TOTAL' row, containing the sums over all the other rows. It was likely created during the data wrangling process and accidentally propagated through to this stage. As it does not represent an employee, this row should definitely be removed.

In [11]:
df = df.drop('TOTAL')

I also decided to investigate some of the email features to see if any outliers existed there, particularly the to_messages and from_messages features.

In [12]:
pois = df[df.poi]
plt.scatter(pois.from_messages, pois.to_messages, c='red');
non_pois = df[~df.poi]
plt.scatter(non_pois.from_messages, non_pois.to_messages, c='blue');

plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['POIs', 'non-POIs'])
Out[12]:
<matplotlib.legend.Legend at 0x7feadb234990>

This plot does appear to contain some outliers: persons who send or receive far more emails than the average user. The point on the far right in particular sent over 14000 emails, the next closest being less than half of that. The three points that lie outside the main cluster of points are identified and examined in the code below.

In [13]:
# extract the outlier points
outliers = df[np.logical_or(df.from_messages > 6000, df.to_messages > 10000)]

#plot them in red with the originals
plt.scatter(df.from_messages, df.to_messages, c='blue');
plt.scatter(outliers.from_messages, outliers.to_messages, c='red')
plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['Inliers', 'Potential Outliers'])

#display in the notebook
outliers
Out[13]:
                     salary  to_messages  from_messages  ...    poi              email_address
KAMINSKI WINCENTY J  275101         4607          14368  ...  False   vince.kaminski@enron.com
KEAN STEVEN J        404338        12754           6759  ...  False      steven.kean@enron.com
SHAPIRO RICHARD S    269076        15149           1215  ...  False  richard.shapiro@enron.com

3 rows × 22 columns

There does not appear to be anything strange about these points other than their sent/received email counts being far above average. As the email counts were generated automatically by a script, it is unlikely these values are due to an error. These points do not capture the general trend of email usage; they are most likely people whose job required a great deal of communication by email. Additionally, none of them are POIs, so a large email volume is not indicative of being a POI. For these reasons I deemed it unlikely that these points would help a machine learning algorithm generalize and decided to remove them.

In [14]:
# remove the outlier candidates from the data set
df = df[df.from_messages < 6000]
df = df[df.to_messages < 10000]

First Attempt at Classifying the Dataset

In [15]:
# Convert the data frame into a sklearn compatible data set
X = df.drop(['poi', 'email_address'], axis=1)
y = df.poi
In [16]:
# firstly get an estimate of the kind of accuracies we can expect with the unprocessed features
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB

from sklearn.cross_validation import cross_val_score, StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import classification_report

from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score

def assess_classifier(clf, X, y):
    # 100 stratified shuffle splits give stable score estimates on a
    # data set this small and imbalanced
    cv = StratifiedShuffleSplit(y, 100, random_state=42)

    print 'accuracy', cross_val_score(clf, X, y, cv=cv).mean()
    print 'precision', cross_val_score(clf, X, y, scoring=make_scorer(precision_score), cv=cv).mean()
    print 'recall', cross_val_score(clf, X, y, scoring=make_scorer(recall_score), cv=cv).mean()
    
    
def assess_features(X, y):
    clf = RandomForestClassifier(random_state=42)
    assess_classifier(clf, X, y)
    
    print
    
    clf = GaussianNB()
    assess_classifier(clf, X, y)    
    print 
    
    clf = AdaBoostClassifier(random_state=42)
    assess_classifier(clf, X, y)
In [17]:
assess_features(X, y)
accuracy 0.867333333333
precision 0.238333333333
recall 0.16

accuracy 0.735333333333
precision 0.233980519481
recall 0.32

accuracy 0.834666666667
precision 0.268166666667
recall 0.265

Straight out of the box the random forest does much better in terms of accuracy. Looking at recall, however, it is a totally different story. To pass this project we require >0.3 for both precision and recall, so clearly some work is needed here.
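For context, a quick baseline sketch (not part of the original analysis): with only ~12% of examples being POIs, a trivial classifier that always predicts non-POI already scores high accuracy with zero recall, which is why accuracy alone is a poor yardstick here.

# Majority-class baseline: high accuracy, zero precision and recall,
# purely as a consequence of the class imbalance.
from sklearn.dummy import DummyClassifier
assess_classifier(DummyClassifier(strategy='most_frequent'), X, y)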

Feature Engineering

Based on the lessons, I decided to implement the features fraction_of_messages_[to|from]_poi and see how they influence classification performance.

In [18]:
X_new = X.copy()
X_new['fracion_of_messages_to_poi'] = X.from_this_person_to_poi / X.from_messages
X_new['fracion_of_messages_from_poi'] = X.from_poi_to_this_person / X.to_messages
X_new = X_new.drop('from_this_person_to_poi', axis=1)
X_new = X_new.drop('from_poi_to_this_person', axis=1)
In [19]:
assess_features(X_new, y)
accuracy 0.866666666667
precision 0.215
recall 0.16

accuracy 0.734666666667
precision 0.233980519481
recall 0.32

accuracy 0.861333333333
precision 0.452333333333
recall 0.41

While this new feature made little difference to the naive Bayes and random forest classifiers, it yielded a significant performance increase for AdaBoost.

Feature Selection

Many of the provided features are likely redundant, and the algorithms may perform better with them removed. The random forest is able to provide an estimate of feature importance directly, so we will begin by looking at that.

In [20]:
# http://stackoverflow.com/questions/16010869/python-plot-a-bar-using-matplotlib-using-a-dictionary

import operator, collections

def plot_feature_importances(clf, X, y):
    clf.fit(X, y)
    importances = sorted(zip(X.columns, clf.feature_importances_), key=operator.itemgetter(1))[::-1]
    importances = collections.OrderedDict(importances)
    plt.bar(range(len(importances)), importances.values(), align='center')
    plt.xticks(range(len(importances)), importances.keys(), rotation='vertical')
    plt.title('Feature Importances')
    plt.ylabel('Importance')
    plt.show()
    return importances
In [21]:
print 'Random Forest Feature Importances'
rf_importances = plot_feature_importances(RandomForestClassifier(), X_new, y)
print 'AdaBoost Feature Importances'
ada_importances = plot_feature_importances(AdaBoostClassifier(), X_new, y)
Random Forest Feature Importances
AdaBoost Feature Importances

This plot gives us some very useful information. The email_withheld feature we created is extremely uninformative and should be dropped. It seems safe to say that the lowest 6 features could also be dropped.

It is also interesting to see which features are informative to the classifiers and the similarities between the models. There appears to be a good mix of financial features (expenses, deferred_income, long_term_incentive) and email features (shared_receipt_with_poi) in the top 5.

I decided to drop the 6 lowest-ranked features as rated by the AdaBoost classifier and re-evaluate the model.

In [22]:
X_new_selected = X_new.drop(ada_importances.keys()[-6:], axis = 1)
X_new_selected.columns
Out[22]:
Index([u'salary', u'to_messages', u'deferral_payments', u'total_payments',
       u'exercised_stock_options', u'bonus', u'restricted_stock',
       u'shared_receipt_with_poi', u'expenses', u'from_messages', u'other',
       u'long_term_incentive', u'fracion_of_messages_to_poi',
       u'fracion_of_messages_from_poi'],
      dtype='object')
In [23]:
assess_features(X_new_selected, y)
accuracy 0.869333333333
precision 0.21
recall 0.125

accuracy 0.822
precision 0.183333333333
recall 0.175

accuracy 0.865333333333
precision 0.436166666667
recall 0.385

This feature selection gave a small accuracy improvement for all three classifiers, at the cost of slightly lower precision and recall. The AdaBoost classifier nevertheless remains above the desired precision and recall levels for a submission.

For all the classifiers considered, feature scaling is not required: tree-based models split on feature thresholds, so they are invariant to monotonic rescaling, and Gaussian naive Bayes fits each feature independently.
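A quick check of that claim (a sketch, using min-max scaling as an arbitrary example): rescaling the features should leave a tree-based model's cross-validated accuracy essentially unchanged.

from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

# Tree splits depend only on the ordering of feature values, so a monotonic
# rescaling such as min-max scaling should not change the scores.
cv = StratifiedShuffleSplit(y, 100, random_state=42)
print 'unscaled:', cross_val_score(RandomForestClassifier(random_state=42),
                                   X_new_selected, y, cv=cv).mean()
print 'scaled:  ', cross_val_score(make_pipeline(MinMaxScaler(),
                                                 RandomForestClassifier(random_state=42)),
                                   X_new_selected, y, cv=cv).mean()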

Parameter tuning for RF and AdaBoost

Due to the consistently poor performance of the naive Bayes classifier, and given that it has no parameters to tune, I decided to abandon it for the remainder of the analysis.

To select values for the criterion, max_depth and min_samples_leaf of the trees in the random forest and of the base trees in AdaBoost, I decided to apply a cross-validated grid search using the same grid for both algorithms.

The question is which metric to maximize. As the submitted classifier must score well on both precision and recall, it makes sense to optimize for both at once. The F1 score is the harmonic mean of the two, F1 = 2 * precision * recall / (precision + recall), so I decided to use it as the metric to optimize the parameters for.

In [24]:
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

grid = {
    'criterion':('gini', 'entropy'),
    'min_samples_leaf':range(1, 50, 5),
    'max_depth': range(1, 10)
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      grid, make_scorer(f1_score), cv=StratifiedKFold(y), n_jobs=-1)
search.fit(X_new_selected, y)
print search.best_score_
print search.best_params_
0.260060362173
{'criterion': 'entropy', 'max_depth': 6, 'min_samples_leaf': 1}
In [25]:
assess_classifier(search.best_estimator_, X_new_selected, y)
accuracy 0.876666666667
precision 0.3375
recall 0.195

The scikit-learn documentation suggests:

"The main parameters to tune to obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or minimum required number of samples at a leaf min_samples_leaf in case of decision trees)."

so I decided to grid search over these parameters.

In [30]:
grid = {
    'base_estimator__criterion': ('gini', 'entropy'),
    'base_estimator__min_samples_leaf':range(1, 50, 5),
    'base_estimator__max_depth': range(1, 10),
    'n_estimators': range(1,10)
}

search = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier(random_state=42), random_state=42), 
                      grid, make_scorer(f1_score), cv=StratifiedKFold(y), n_jobs=-1)
search.fit(X_new_selected, y)
print search.best_score_
print search.best_params_
0.555164319249
{'n_estimators': 4, 'base_estimator__criterion': 'gini', 'base_estimator__max_depth': 3, 'base_estimator__min_samples_leaf': 11}
In [31]:
assess_classifier(search.best_estimator_, X_new_selected, y)
accuracy 0.854666666667
precision 0.369166666667
recall 0.305

After optimization the AdaBoost classifier performed much better on the precision and recall scores for this problem. I decided to use this classifier for my submission.

Finally, I wanted to see if I could get away with using fewer base estimators in the boosting algorithm. If so, the whole algorithm would run faster at no cost. To assess this I used a validation curve over the number of estimators.

In [32]:
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html

from sklearn.learning_curve import validation_curve
def plot_validation_curve(clf, X, y, param, param_range):

    train_scores, test_scores = validation_curve(
        clf, X, y, param_name=param, param_range=param_range,
        cv=StratifiedKFold(y), scoring=make_scorer(f1_score), n_jobs=1)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    plt.ylabel("Score")
    plt.xlabel(param)
    plt.ylim(0.0, 1.1)
    plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2, color="r")
    plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                 color="g")
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2, color="g")
    plt.legend(loc="best")
    plt.show()
In [33]:
plot_validation_curve(search.best_estimator_, X_new_selected, y, 'n_estimators', range(1,100,1))

This plot suggests that reducing the number of estimators below its current value is likely to result in a large decrease in F1 score, so I chose to leave it at the optimized value.
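The chosen model and data set then need to be exported for the project's tester script. A minimal sketch of that final step, assuming the standard Udacity tester interface (dump_classifier_and_data, imported at the top, expects a classifier, a dict-of-dicts data set, and a feature list beginning with 'poi'):

# Rebuild a dict-of-dicts data set containing exactly the selected
# features plus the 'poi' label, then dump everything for tester.py.
features_list = ['poi'] + list(X_new_selected.columns)
export_df = X_new_selected.copy()
export_df['poi'] = y
my_dataset = export_df.to_dict(orient='index')
dump_classifier_and_data(search.best_estimator_, my_dataset, features_list)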