import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
from tester import dump_classifier_and_data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
with open("final_project_dataset.pkl", "r") as data_file:
data_dict = pickle.load(data_file)
df = pd.DataFrame.from_dict(data_dict, orient='index')
df = df.replace('NaN', np.nan)
# statistics about the data set
print 'Number of people:', df['poi'].count()
print 'Number of POIs:', df.loc[df.poi == True, 'poi'].count()
print 'Fraction of examples that are POIs:', \
float(df.loc[df.poi == True, 'poi'].count()) / df['poi'].count()
print 'Number of features:', df.shape[1]
#The counts of NaNs in each column
print 'Missing data in each column'
df.isnull().sum()
Looking at the distribution of the missing data we can see that the counts for all the email features are the same. This suggests that one of the processes causing the missing data is a lack of email data for certain persons.
This suggests the possibility of a new feature to me. It may be the case that persons involved in the fraud were able to encrypt or otherwise prevent their emails from coming into the hands of the courts. To test this hypothesis I will create the feature and look at the counts of POIs/non-POIs.
df['email_withheld'] = df.to_messages.isnull()
df.groupby(['email_withheld', 'poi'])['poi'].count()
From inspection this feature does not appear to be particularly informative; however, I will include it in the model as it could still be a useful feature for growing decision trees.
The financial features are less predictable in how their values are missing. One way of interpreting this data is that a NaN value is equivalent to zero: for example, if an employee did not receive a bonus then this value could be recorded as NaN. This view is supported by how the totals are calculated in the insider-pay PDF document, so I decided to replace all the NaNs in the financial features with zero.
financial_features = ['salary', 'deferral_payments', 'total_payments', 'loan_advances', 'bonus', 'restricted_stock_deferred', 'deferred_income', 'total_stock_value', 'expenses', 'exercised_stock_options', 'other', 'long_term_incentive', 'restricted_stock', 'director_fees']
email_features = ['to_messages', 'from_poi_to_this_person', 'from_messages', 'from_this_person_to_poi', 'shared_receipt_with_poi']
df[financial_features] = df[financial_features].fillna(0)
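As a quick sanity check of this interpretation, the payment components should now sum to total_payments for each person. This is only a sketch: the component list below is my reading of the insider-pay document.
# assuming the insider-pay breakdown, these components should sum to total_payments
payment_components = ['salary', 'bonus', 'long_term_incentive', 'deferred_income',
                      'deferral_payments', 'loan_advances', 'other', 'expenses',
                      'director_fees']
# count the rows where the zero-for-NaN interpretation breaks down
mismatches = df[df[payment_components].sum(axis=1) != df['total_payments']]
print 'Rows where components do not sum to total_payments:', len(mismatches)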
As for the email features, firstly I decided to drop the email address column as it is uninformative for a machine learning algorithm. Some possibilities for how to deal with the remaining missing values are:
- drop the rows containing NaNs
- train two separate classifiers, one for people with email data and one for people without
- impute the missing values
If we drop the rows containing NaNs we lose almost half of our training data, which is not acceptable. Creating two classifiers could be an elegant solution but its implementation may be too complex for this project; additionally, such a learner would be unable to learn interactions between financial and email features.
I decided to go with the imputation method using the median of the columns.
df[email_features] = df[email_features].fillna(df[email_features].median())
Now that we have processed the raw data so it no longer contains NaN values, we need to remove any outlier records that could cause our algorithms to perform poorly further down the track. Probably the best way to identify outliers at this stage is by producing plots. The seaborn library is used to make a pair plot of a few variables of interest. Points are coloured red for POIs and blue for non-POIs.
import seaborn as sns
g = sns.PairGrid(df, vars=['salary', 'bonus', 'total_stock_value', 'total_payments'],
                 hue='poi')
g.map_lower(plt.scatter)
g.map_upper(plt.scatter)
g.map_diag(sns.kdeplot, lw=1)
One particular point stands out in all of the scatter plots as being far greater than the others. It is also interesting to note that this point is not a POI. This point will be investigated further.
df[df.salary > 2.5e7]
On closer inspection it is revealed that this point is in fact a 'TOTAL' row containing the sums of all the other rows. It was likely created during the data wrangling process and accidentally propagated through to this stage. As it does not represent an employee, this row should definitely be removed.
df = df.drop('TOTAL')
I also decided to investigate some of the email features to see if any outliers existed here. Particularly the to_messages and from_messages features.
pois = df[df.poi]
plt.scatter(pois.from_messages, pois.to_messages, c='red');
non_pois = df[~df.poi]
plt.scatter(non_pois.from_messages, non_pois.to_messages, c='blue');
plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['POIs', 'non-POIs'])
This plot does appear to contain some outliers: persons who send or receive far more emails than the average user. The point on the far right in particular sent over 14000 emails, the next closest being less than half of that. The three points that lie outside the main cluster are identified and examined in the code below.
# extract the outlier points
outliers = df[np.logical_or(df.from_messages > 6000, df.to_messages > 10000)]
#plot them in red with the originals
plt.scatter(df.from_messages, df.to_messages, c='blue');
plt.scatter(outliers.from_messages, outliers.to_messages, c='red')
plt.xlabel('From messages')
plt.ylabel('To messages')
plt.legend(['Inliers', 'Potential Outliers'])
#display in the notebook
outliers
There does not appear to be anything strange about these points other than their sent/received email counts being far above average. Also, as the counts of emails were generated automatically by a script, it is unlikely these values are due to an error. These points do not capture the general trend of email usage; they are most likely people in some special job position that requires a lot of communication by email. Additionally, none of these points are POIs, so a large number of emails is not indicative of being one. For these reasons I deemed it unlikely that these points would help a machine learning algorithm generalize, and decided to remove them.
# remove the outlier candidates from the data set
df = df[df.from_messages < 6000]
df = df[df.to_messages < 10000]
# Convert the data frame into a sklearn compatible data set
X = df.drop(['poi', 'email_address'], axis=1)
y = df.poi
# firstly get an estimate of the kind of accuracies we can expect with the unprocessed features
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.cross_validation import cross_val_score, StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import classification_report
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
def assess_classifier(clf, X, y):
    # score accuracy, precision and recall over 100 stratified shuffle splits
    cv = StratifiedShuffleSplit(y, 100, random_state=42)
    print 'accuracy', cross_val_score(clf, X, y, cv=cv).mean()
    print 'precision', cross_val_score(clf, X, y, scoring=make_scorer(precision_score), cv=cv).mean()
    print 'recall', cross_val_score(clf, X, y, scoring=make_scorer(recall_score), cv=cv).mean()

def assess_features(X, y):
    # compare the three candidate classifiers on the given feature set
    clf = RandomForestClassifier(random_state=42)
    assess_classifier(clf, X, y)
    print
    clf = GaussianNB()
    assess_classifier(clf, X, y)
    print
    clf = AdaBoostClassifier(random_state=42)
    assess_classifier(clf, X, y)
assess_features(X, y)
Straight out of the box the random forest is doing much better in terms of accuracy. Looking at recall, however, it is a totally different story. To pass this project we require > 0.3 for both precision and recall, so clearly some work is needed here.
Based on the lessons, I decided to implement the features fraction_of_messages_[to|from]_poi and see how they influence classification performance.
X_new = X.copy()
X_new['fraction_of_messages_to_poi'] = X.from_this_person_to_poi / X.from_messages
X_new['fraction_of_messages_from_poi'] = X.from_poi_to_this_person / X.to_messages
X_new = X_new.drop('from_this_person_to_poi', axis=1)
X_new = X_new.drop('from_poi_to_this_person', axis=1)
assess_features(X_new, y)
While these new features made little difference to the naive Bayes classifier, they yielded significant performance increases in the random forest and AdaBoost classifiers.
It is likely that many of the provided features are redundant and that the algorithms would perform better with them removed. The random forest can provide an estimate of feature importance directly, so we will begin by looking at that.
# http://stackoverflow.com/questions/16010869/python-plot-a-bar-using-matplotlib-using-a-dictionary
import operator, pprint, collections
def plot_feature_importances(clf, X, y):
    # fit the classifier, then plot its feature importances in descending order
    clf.fit(X, y)
    importances = sorted(zip(X.columns, clf.feature_importances_), key=operator.itemgetter(1))[::-1]
    importances = collections.OrderedDict(importances)
    plt.bar(range(len(importances)), importances.values(), align='center')
    plt.xticks(range(len(importances)), importances.keys(), rotation='vertical')
    plt.title('Feature Importances')
    plt.ylabel('Importance')
    plt.show()
    return importances
print 'Random Forest Feature Importances'
rf_importances = plot_feature_importances(RandomForestClassifier(random_state=42), X_new, y)
print 'AdaBoost Feature Importances'
ada_importances = plot_feature_importances(AdaBoostClassifier(random_state=42), X_new, y)
This plot gives us some very useful information. The email_withheld feature we created is extremely uninformative and should be dropped, and it seems safe to say that the other features at the bottom of the ranking can go with it.
It is also interesting to see which features are informative to the classifiers and the similarities between the models. There appears to be a good mix of financial features (expenses, deferred_income, long_term_incentive) and email features (shared_receipt_with_poi) in the top five.
I decided to drop the six lowest features as rated by the AdaBoost classifier and re-evaluate the model.
X_new_selected = X_new.drop(ada_importances.keys()[-6:], axis=1)
X_new_selected.columns
assess_features(X_new_selected, y)
This yielded an increase in all classification metrics across the classifiers. At this stage the AdaBoost classifier has reached the level of precision and recall required for a submission.
For all the classifiers considered, feature scaling is not required: tree-based models (the random forest and the boosted decision trees) split on one feature at a time and are invariant to monotonic rescaling, and Gaussian naive Bayes models each feature's distribution independently.
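If a scale-sensitive algorithm such as an SVM were added to the mix, scaling could be bolted on with a Pipeline. A minimal sketch, not part of the final model; the SVC choice here is just an assumption for illustration:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
# scale each feature to [0, 1] before the scale-sensitive classifier sees it
scaled_clf = Pipeline([('scale', MinMaxScaler()),
                       ('svm', SVC())])
assess_classifier(scaled_clf, X_new_selected, y)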
Due to the consistently poor performance of the naive Bayes classifier, despite it having no parameters to tune, I decided to abandon it for the remainder of the analysis.
To select values for the criterion, min_samples_leaf and max_depth of the trees in the random forest and of the base trees in AdaBoost, I decided to apply a cross-validated grid search using the same grid for both algorithms.
The question is which metric to maximize. As the submitted classifier must score well on both precision and recall, it makes sense to optimize for both at once. The F1 score is the harmonic mean of precision and recall, so I decided to use it as the metric to optimize the parameters for.
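To make this concrete, F1 = 2 * (precision * recall) / (precision + recall). The toy example below (hypothetical labels, not project data) checks this formula against scikit-learn's f1_score.
from sklearn.metrics import precision_score, recall_score, f1_score
# purely illustrative labels
y_true = [0, 1, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 0]
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
# both lines print the same value (~0.667)
print f1_score(y_true, y_pred)
print 2 * p * r / (p + r)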
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
grid = {
    'criterion': ('gini', 'entropy'),
    'min_samples_leaf': range(1, 50, 5),
    'max_depth': range(1, 10)
}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      grid, make_scorer(f1_score), cv=StratifiedKFold(y), n_jobs=-1)
search.fit(X_new_selected, y)
print search.best_score_
print search.best_params_
assess_classifier(search.best_estimator_, X_new_selected, y)
The scikit-learn documentation suggests:
"The main parameters to tune to obtain good results are n_estimators and the complexity of the base estimators (e.g., its depth max_depth or minimum required number of samples at a leaf min_samples_leaf in case of decision trees)."
so I decided to grid search over these parameters.
grid = {
    'base_estimator__criterion': ('gini', 'entropy'),
    'base_estimator__min_samples_leaf': range(1, 50, 5),
    'base_estimator__max_depth': range(1, 10),
    'n_estimators': range(1, 10)
}
search = GridSearchCV(AdaBoostClassifier(DecisionTreeClassifier(random_state=42), random_state=42),
                      grid, make_scorer(f1_score), cv=StratifiedKFold(y), n_jobs=-1)
search.fit(X_new_selected, y)
print search.best_score_
print search.best_params_
assess_classifier(search.best_estimator_, X_new_selected, y)
After optimization the AdaBoost classifier performed much better on the precision and recall scores for this problem, so I decided to use it for my submission.
Finally, I wanted to see whether I could get away with using fewer base estimators in the boosting algorithm; if so, the whole algorithm would run faster at no cost. To assess this I used a validation curve over the number of estimators.
#http://scikit-learn.org/stable/auto_examples/model_selection/plot_validation_curve.html
from sklearn.learning_curve import validation_curve
def plot_validation_curve(clf, X, y, param, param_range):
    # compute train/test F1 scores across the parameter range
    train_scores, test_scores = validation_curve(
        clf, X, y, param_name=param, param_range=param_range,
        cv=StratifiedKFold(y), scoring=make_scorer(f1_score), n_jobs=1)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.ylabel("Score")
    plt.xlabel(param)
    plt.ylim(0.0, 1.1)
    plt.semilogx(param_range, train_scores_mean, label="Training score", color="r")
    plt.fill_between(param_range, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.2, color="r")
    plt.semilogx(param_range, test_scores_mean, label="Cross-validation score",
                 color="g")
    plt.fill_between(param_range, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.2, color="g")
    plt.legend(loc="best")
    plt.show()
plot_validation_curve(search.best_estimator_, X_new_selected, y, 'n_estimators', range(1,100,1))
This plot suggests that reducing the number of estimators below its current value is likely to result in a large decrease in F1 score, so I chose to leave it at the optimized value.
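Finally the classifier, dataset and feature list are exported for the grader. This is a sketch assuming the course tester's dump_classifier_and_data(clf, dataset, feature_list) convention: a dict-of-dicts dataset and a feature list with 'poi' first.
# rebuild the dict-of-dicts format expected by tester.py, with the engineered features
my_dataset = pd.concat([y, X_new_selected], axis=1).to_dict(orient='index')
# the label 'poi' must come first in the feature list
feature_list = ['poi'] + list(X_new_selected.columns)
dump_classifier_and_data(search.best_estimator_, my_dataset, feature_list)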