Hidden - Census Income - Data Analysis

Data Science, Classification Analysis

Data Cleaning, Feature Engineering, Imputation, and Classification.

This Notebook has been designed to run on top of the Jupyter TensorFlow Docker instance found in the link below:

Checking Number of CPUs available to Docker container

Ideally, and for this Notebook to run in a reasonable time, your Docker container should have 4 cores or more available.

!cat /proc/cpuinfo | awk '/^processor/{print $3}' | tail -1
5
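
The `awk` pipeline prints the index of the last `processor` entry in `/proc/cpuinfo`, and those indices start at 0, so the `5` above corresponds to six logical CPUs. The same check can be sketched from Python itself. This is only a minimal sketch: `available_cpus` is a hypothetical helper name that the notebook does not define.

```python
import os

def available_cpus():
    """Return the number of logical CPUs this process may run on."""
    try:
        # Linux only: honours the CPU affinity mask of the current process
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the total number of logical CPUs
        return os.cpu_count() or 1

n_cpus = available_cpus()
print(n_cpus)
```

Unlike the shell one-liner, this returns a count rather than a zero-based index, so it can be compared directly against the recommended minimum of 4.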

Import Standard Python Libraries

import io, os, sys, types, time, datetime, math, random, requests, subprocess, tempfile
from io import StringIO, BytesIO

Packages Install

We’ll now install a few more libraries. Installing them through conda means they are recognised and managed by conda. Do this once and then comment it out for subsequent runs.

#!conda install --yes -c conda-forge missingno
#!conda install --yes -c anaconda requests

Packages Update

Many packages are available to us, and most of them were installed when the Dockerfile that created the Docker instance was run. Let’s make sure they are all up to date. Do this once and then comment it out for subsequent runs.

#!conda update --yes --all

Packages Import

These are all the packages we’ll be using. Importing individual classes and functions lets us use them without having to reference the parent library each time.

# Data Manipulation 
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
from pandas.plotting import scatter_matrix
from mpl_toolkits.mplot3d import Axes3D

# Feature Selection and Encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
import tensorflow as tf

# Grid and Random Search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Metrics
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc

# Managing Warnings
import warnings
warnings.filterwarnings('ignore')

# Plot the Figures Inline
%matplotlib inline

Listing Installed Packages

We could list all installed packages to check whether a package has already been installed.

# Run `conda list` and capture its output as bytes
conda_packages_list = BytesIO(subprocess.Popen(["conda", "list"],
                                               stdout=subprocess.PIPE).communicate()[0])
# Parse the whitespace-separated listing, skipping the three header lines
conda_packages_list = pd.read_csv(conda_packages_list,
                                  names=['Package Name','Version','Python Version','Repo','Other'],
                                  sep=r'\s+', engine='python', skiprows=3)
conda_packages_list.head(5)

   Package Name   Version  Python Version  Repo         Other
0  _libgcc_mutex  0.1      conda_forge     conda-forge  NaN
1  _openmp_mutex  4.5      1_llvm          conda-forge  NaN
2  absl-py        0.10.0   pypi_0          pypi         NaN
3  aiohttp        3.6.2    pypi_0          pypi         NaN
4  alembic        1.4.3    pyh9f0ad1d_0    conda-forge  NaN

Objective

In this Jupyter Notebook, we will use the Census Income Dataset to predict whether an individual’s income exceeds $50K/yr based on census data.

The dataset can be found here: https://archive.ics.uci.edu/ml/datasets/adult

Data Download and Loading

Let’s download the data and save it to a folder in our local directory called ‘dataset’. Download it once, and then comment the code out for subsequent runs.

After downloading the data, we load it directly from Disk into a Pandas Dataframe in Memory. Depending on the memory available to the Docker instance, this may be a problem.

The data comes separated into the Training and Test datasets. We will join the two for data exploration, and then separate them again before running our algorithms.

# Download
DATASET = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names",
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test",
)

def download_data(path='dataset', urls=DATASET):
    # Create the target folder on first use
    if not os.path.exists(path):
        os.mkdir(path)

    # Save each file under its original name
    for url in urls:
        response = requests.get(url)
        name = os.path.basename(url)
        with open(os.path.join(path, name), 'wb') as f:
            f.write(response.content)

#download_data()

# Load Training and Test Data Sets
headers = ['age', 'workclass', 'fnlwgt',
           'education', 'education-num',
           'marital-status', 'occupation',
           'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss',
           'hours-per-week', 'native-country',
           'predclass']
training_raw = pd.read_csv('dataset/adult.data',
                           header=None,
                           names=headers,
                           sep=r',\s',
                           na_values=["?"],
                           engine='python')
test_raw = pd.read_csv('dataset/adult.test',
                       header=None,
                       names=headers,
                       sep=r',\s',
                       na_values=["?"],
                       engine='python',
                       skiprows=1)

# Join Datasets
# DataFrame.append was removed in pandas 2.0; pd.concat with ignore_index=True also rebuilds the index
dataset_raw = pd.concat([training_raw, test_raw], ignore_index=True)

# Displaying the size of the Dataframe in Memory
def convert_size(size_bytes):
    if size_bytes == 0:
        return "0B"
    size_name = ("Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB")
    # Pick the largest unit that keeps the value at 1 or above
    i = int(math.floor(math.log(size_bytes, 1024)))
    p = math.pow(1024, i)
    s = round(size_bytes / p, 2)
    return "%s %s" % (s, size_name[i])
convert_size(dataset_raw.memory_usage().sum())
'5.59 MB'
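
Because the joined frame has to be separated into Training and Test sets again before modelling, one possible way to keep that split trivial is to tag each row with its origin before concatenating. The sketch below uses toy stand-ins for `training_raw` and `test_raw`, and the `source` column is a hypothetical name that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for training_raw and test_raw (made-up values)
train = pd.DataFrame({"age": [39, 50], "predclass": ["<=50K", ">50K"]})
test = pd.DataFrame({"age": [25], "predclass": ["<=50K."]})

# Tag each row with its origin; `source` is a hypothetical helper column
combined = pd.concat(
    [train.assign(source="train"), test.assign(source="test")],
    ignore_index=True,
)

# ... exploration and cleaning would happen on `combined` here ...

# Recover the two partitions and drop the helper column again
train_again = combined[combined["source"] == "train"].drop(columns="source")
test_again = combined[combined["source"] == "test"].drop(columns="source")

print(len(train_again), len(test_again))
```

The alternative is to remember the length of the training frame and slice by position, which works as long as no rows are dropped or reordered in between.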

Data Exploration - Univariate

When exploring our dataset and its features, we have many options available to us. We can explore each feature individually, or compare pairs of features and look at the correlation between them. Let’s start with some simple Univariate (one feature) analysis.

Features can be of multiple types:

  • Nominal: mutually exclusive categories with no order.
  • Ordinal: categories where the order matters but not the difference between values.
  • Interval: a measurement where the difference between two values is meaningful.
  • Ratio: has all the properties of an interval variable, and also has a clear definition of 0.0.

There are multiple ways of manipulating each feature type, but for simplicity, we’ll define only two feature types:

  • Numerical: any feature that contains numeric values.
  • Categorical: any feature that contains categories, or text.
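
The two-way split described above can be sketched with `select_dtypes`. This is an illustrative example on a made-up three-row frame, not the notebook's own code; the variable names `numerical` and `categorical` are assumptions.

```python
import pandas as pd

# Made-up sample with two numeric columns and one text column
df = pd.DataFrame({
    "age": [39, 50, 38],
    "hours-per-week": [40, 13, 40],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# Numerical: every column with a numeric dtype
numerical = df.select_dtypes(include="number").columns.tolist()
# Categorical: everything else (text columns in this example)
categorical = df.select_dtypes(exclude="number").columns.tolist()

print(numerical)
print(categorical)
```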

# Describing all the Numerical Features
dataset_raw.describe()

                age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
count  48842.000000  4.884200e+04   48842.000000  48842.000000  48842.000000    48842.000000
mean      38.643585  1.896641e+05      10.078089   1079.067626     87.502314       40.422382
std       13.710510  1.056040e+05       2.570973   7452.019058    403.004552       12.391444
min       17.000000  1.228500e+04       1.000000      0.000000      0.000000        1.000000
25%       28.000000  1.175505e+05       9.000000      0.000000      0.000000       40.000000
50%       37.000000  1.781445e+05      10.000000      0.000000      0.000000       40.000000
75%       48.000000  2.376420e+05      12.000000      0.000000      0.000000       45.000000
max       90.000000  1.490400e+06      16.000000  99999.000000   4356.000000       99.000000

# Describing all the Categorical Features
dataset_raw.describe(include=['O'])

        workclass  education  marital-status      occupation      relationship  race   sex    native-country  predclass
count   46043      48842      48842               46033           48842         48842  48842  47985           48842
unique  8          16         7                   14              6             5      2      41              4
top     Private    HS-grad    Married-civ-spouse  Prof-specialty  Husband       White  Male   United-States   <=50K
freq    33906      15784      22379               6172            19716         41762  32650  43832           24720

# Let's have a quick look at our data
dataset_raw.head()

   age  workclass         fnlwgt  education  education-num  marital-status      occupation         relationship   race   sex     capital-gain  capital-loss  hours-per-week  native-country  predclass
0  39   State-gov         77516   Bachelors  13             Never-married       Adm-clerical       Not-in-family  White  Male    2174          0             40              United-States   <=50K
1  50   Self-emp-not-inc  83311   Bachelors  13             Married-civ-spouse  Exec-managerial    Husband        White  Male    0             0             13              United-States   <=50K
2  38   Private           215646  HS-grad    9              Divorced            Handlers-cleaners  Not-in-family  White  Male    0             0             40              United-States   <=50K
3  53   Private           234721  11th       7              Married-civ-spouse  Handlers-cleaners  Husband        Black  Male    0             0             40              United-States   <=50K
4  28   Private           338409  Bachelors  13             Married-civ-spouse  Prof-specialty     Wife           Black  Female  0             0             40              Cuba            <=50K

# Let’s plot the distribution of each feature
def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width,height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        # np.object was removed in NumPy 1.24; the builtin `object` is equivalent
        if dataset.dtypes[column] == object:
            # Categorical feature: count plot with labels truncated to 18 characters
            g = sns.countplot(y=column, data=dataset)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # Numerical feature: histogram with a density estimate
            g = sns.distplot(dataset[column])
            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)

# How many missing values are there in our dataset?
missingno.matrix(dataset_raw, figsize = (30,5))
<AxesSubplot:>

missingno.bar(dataset_raw, sort='ascending', figsize = (30,5))
<AxesSubplot:>

Feature Cleaning, Engineering, and Imputation

Cleaning:
To clean our data, we’ll need to work with:

  • Missing values: Either omit elements from a dataset that contain missing values or impute them (fill them in).
  • Special values: Numeric variables are endowed with several formalized special values including ±Inf, NA and NaN. Calculations involving special values often result in special values, and need to be handled/cleaned.
  • Outliers: They should be detected, but not necessarily removed. Their inclusion in the analysis is a statistical decision.
  • Obvious inconsistencies: A person’s age cannot be negative, a man cannot be pregnant and an under-aged person cannot possess a driver’s license. Find the inconsistencies and plan for them.

Engineering:
There are multiple techniques for feature engineering:

  • Decompose: Converting 2014-09-20T20:45:40Z into categorical attributes like hour_of_the_day, part_of_day, etc.

  • Discretization: We can either discretize the continuous variables we have or leave them as they are; some algorithms run faster on discretized data. We are going to do both, and compare the results of the ML algorithms on both the discretised and the non-discretised datasets. We’ll call these datasets:

    • dataset_bin => where Continuous variables are Discretised

    • dataset_con => where Continuous variables are Continuous

  • Reframe Numerical Quantities: Changing from grams to kg, and losing detail might be both wanted and efficient for calculation

  • Feature Crossing: Creating new features as a combination of existing features. Could be multiplying numerical features, or combining categorical variables. This is a great way to add domain expertise knowledge to the dataset.
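
As an illustration of the Decompose technique listed above, the following sketch splits the example timestamp into a few categorical attributes. The function name `decompose`, the `day_of_week` attribute and the hour boundaries chosen for `part_of_day` are all assumptions made for this example.

```python
from datetime import datetime, timezone

def decompose(timestamp):
    """Split an ISO-8601 UTC timestamp into simple categorical attributes."""
    dt = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    hour = dt.hour
    # Hour boundaries are an arbitrary choice for this sketch
    if 5 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 17:
        part = "afternoon"
    elif 17 <= hour < 21:
        part = "evening"
    else:
        part = "night"
    return {
        "hour_of_the_day": hour,
        "part_of_day": part,
        "day_of_week": dt.weekday(),  # Monday is 0, Sunday is 6
    }

features = decompose("2014-09-20T20:45:40Z")
print(features)
```

The Census Income data has no timestamp column, so this technique is not used in the rest of the Notebook.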

Imputation:
We can impute missing values in a number of different ways:

  • Hot-Deck: In its simplest form (last observation carried forward), the technique finds each missing value and imputes it with the cell value immediately prior to it.
  • Cold-Deck: Selects donors from another dataset to complete missing data.
  • Mean-substitution: Another imputation technique involves replacing any missing value with the mean of that variable for all other cases, which has the benefit of not changing the sample mean for that variable.
  • Regression: A regression model is estimated to predict observed values of a variable based on other variables, and that model is then used to impute values in cases where that variable is missing.
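
Two of the techniques listed above can be sketched in a couple of lines of pandas. This is a toy example on a made-up Series, using forward-fill as the simplest form of hot-deck imputation; the names `hot_deck` and `mean_sub` are assumptions.

```python
import pandas as pd

# Made-up series with two missing values
s = pd.Series([1.0, None, 3.0, None, 5.0])

# Hot-deck (last observation carried forward): reuse the previous value
hot_deck = s.ffill()

# Mean substitution: fill with the mean of the observed values
mean_sub = s.fillna(s.mean())

print(hot_deck.tolist())
print(mean_sub.tolist())
```

Mean substitution leaves the mean of the variable unchanged, but it shrinks its variance, which is worth keeping in mind when choosing between the techniques.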

# To perform our data analysis, let's create new dataframes.
dataset_bin = pd.DataFrame() # To contain our dataframe with our discretised continuous variables
dataset_con = pd.DataFrame() # To contain our dataframe with our continuous variables

Feature Predclass

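This is the feature we are trying to predict. We’ll change the string to a binary 0/1, with 1 signifying an income over $50K. The test file writes its labels with a trailing period (`>50K.`, `<=50K.`), so both spellings are mapped.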

# Let's fix the Class Feature
dataset_raw.loc[dataset_raw['predclass'] == '>50K', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass'] == '>50K.', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass'] == '<=50K', 'predclass'] = 0
dataset_raw.loc[dataset_raw['predclass'] == '<=50K.', 'predclass'] = 0

dataset_bin['predclass'] = dataset_raw['predclass']
dataset_con['predclass'] = dataset_raw['predclass']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,1))
sns.countplot(y="predclass", data=dataset_bin);

Feature: Age

We will use the Pandas Cut function to bin the data into 10 equal-width buckets (each bucket spans the same range of ages, not the same number of rows). We will also add our original feature to the dataset_con dataframe.
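
To make the behaviour of `pd.cut` concrete, the sketch below compares it with `pd.qcut` on a small made-up series of ages. `pd.cut` splits the value range into intervals of equal width, whereas `pd.qcut` picks the edges so that each bin holds roughly the same number of rows. The notebook only uses `pd.cut`; `pd.qcut` is shown purely for contrast.

```python
import pandas as pd

# Made-up ages, deliberately skewed towards the low end
ages = pd.Series([17, 18, 19, 20, 21, 22, 23, 24, 60, 90])

# Two equal-width bins over the range 17..90
equal_width = pd.cut(ages, 2)
# Two equal-frequency bins split at the median
equal_freq = pd.qcut(ages, 2)

width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()

print(width_counts)  # rows per equal-width bin
print(freq_counts)   # rows per equal-frequency bin
```

On skewed features such as capital-gain, equal-width bins put almost every row into the first bucket, which is visible in the count plots further down.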

dataset_bin['age'] = pd.cut(dataset_raw['age'], 10) # discretised 
dataset_con['age'] = dataset_raw['age'] # non-discretised

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1, 2, 1)
sns.countplot(y="age", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 1]['age'], kde_kws={"label": ">$50K"});
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 0]['age'], kde_kws={"label": "<=$50K"});

Feature: Workclass

# Can we bucket some of these groups?
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,3))
sns.countplot(y="workclass", data=dataset_raw);

# There are too many groups here, we can group some of them together.
# Create buckets for Workclass
dataset_raw.loc[dataset_raw['workclass'] == 'Without-pay', 'workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Never-worked', 'workclass'] = 'Not Working'
dataset_raw.loc[dataset_raw['workclass'] == 'Federal-gov', 'workclass'] = 'Fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'State-gov', 'workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Local-gov', 'workclass'] = 'Non-fed-gov'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-not-inc', 'workclass'] = 'Self-emp'
dataset_raw.loc[dataset_raw['workclass'] == 'Self-emp-inc', 'workclass'] = 'Self-emp'

dataset_bin['workclass'] = dataset_raw['workclass']
dataset_con['workclass'] = dataset_raw['workclass']
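
The repeated `.loc` assignments used for bucketing can also be expressed as a single dictionary lookup. The sketch below is one possible alternative, not the notebook's method: the mapping is copied from the assignments above, while the toy Series and the names `buckets` and `bucketed` are made up for the example.

```python
import pandas as pd

# Toy stand-in for dataset_raw['workclass']
workclass = pd.Series(["Private", "State-gov", "Local-gov", "Without-pay"])

# Original category -> bucket, taken from the assignments above
buckets = {
    "Without-pay": "Not Working",
    "Never-worked": "Not Working",
    "Federal-gov": "Fed-gov",
    "State-gov": "Non-fed-gov",
    "Local-gov": "Non-fed-gov",
    "Self-emp-not-inc": "Self-emp",
    "Self-emp-inc": "Self-emp",
}

# Values that are not keys of the dictionary (here 'Private') are left unchanged
bucketed = workclass.replace(buckets)
print(bucketed.tolist())
```

Keeping the mapping in one dictionary makes it easier to review and to reuse for the other categorical features bucketed below.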

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,2))
sns.countplot(y="workclass", data=dataset_bin);

Feature: Occupation

# Can we bucket some of these groups?
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5))
sns.countplot(y="occupation", data=dataset_raw);

# Create buckets for Occupation
dataset_raw.loc[dataset_raw['occupation'] == 'Adm-clerical', 'occupation'] = 'Admin'
dataset_raw.loc[dataset_raw['occupation'] == 'Armed-Forces', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Craft-repair', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Exec-managerial', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Farming-fishing', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Handlers-cleaners', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Machine-op-inspct', 'occupation'] = 'Manual Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Other-service', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Priv-house-serv', 'occupation'] = 'Service'
dataset_raw.loc[dataset_raw['occupation'] == 'Prof-specialty', 'occupation'] = 'Professional'
dataset_raw.loc[dataset_raw['occupation'] == 'Protective-serv', 'occupation'] = 'Military'
dataset_raw.loc[dataset_raw['occupation'] == 'Sales', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Tech-support', 'occupation'] = 'Office Labour'
dataset_raw.loc[dataset_raw['occupation'] == 'Transport-moving', 'occupation'] = 'Manual Labour'

dataset_bin['occupation'] = dataset_raw['occupation']
dataset_con['occupation'] = dataset_raw['occupation']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="occupation", data=dataset_bin);

Feature: Native Country

# Can we bucket some of these groups?
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,10))
sns.countplot(y="native-country", data=dataset_raw);

dataset_raw.loc[dataset_raw['native-country'] == 'Cambodia' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Canada' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'China' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Columbia' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Cuba' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Dominican-Republic' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Ecuador' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'El-Salvador' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'England' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'France' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Germany' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Greece' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Guatemala' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Haiti' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Holand-Netherlands' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Honduras' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Hong' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Hungary' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'India' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Iran' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Ireland' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'Italy' , 'native-country'] = 'Euro_Group_1'
dataset_raw.loc[dataset_raw['native-country'] == 'Jamaica' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Japan' , 'native-country'] = 'APAC'
dataset_raw.loc[dataset_raw['native-country'] == 'Laos' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Mexico' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Nicaragua' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Outlying-US(Guam-USVI-etc)' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Peru' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Philippines' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Poland' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Portugal' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Puerto-Rico' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'Scotland' , 'native-country'] = 'British-Commonwealth'
dataset_raw.loc[dataset_raw['native-country'] == 'South' , 'native-country'] = 'Euro_Group_2'
dataset_raw.loc[dataset_raw['native-country'] == 'Taiwan' , 'native-country'] = 'China'
dataset_raw.loc[dataset_raw['native-country'] == 'Thailand' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Trinadad&Tobago' , 'native-country'] = 'South-America'
dataset_raw.loc[dataset_raw['native-country'] == 'United-States' , 'native-country'] = 'United-States'
dataset_raw.loc[dataset_raw['native-country'] == 'Vietnam' , 'native-country'] = 'SE-Asia'
dataset_raw.loc[dataset_raw['native-country'] == 'Yugoslavia' , 'native-country'] = 'Euro_Group_2'

dataset_bin['native-country'] = dataset_raw['native-country']
dataset_con['native-country'] = dataset_raw['native-country']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="native-country", data=dataset_bin);

Feature: Education

# Can we bucket some of these groups?
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5))
sns.countplot(y="education", data=dataset_raw);

dataset_raw.loc[dataset_raw['education'] == '10th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '11th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '12th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '1st-4th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '5th-6th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '7th-8th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == '9th' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-acdm' , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Assoc-voc' , 'education'] = 'Associate'
dataset_raw.loc[dataset_raw['education'] == 'Bachelors' , 'education'] = 'Bachelors'
dataset_raw.loc[dataset_raw['education'] == 'Doctorate' , 'education'] = 'Doctorate'
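# Note: the raw value is spelled 'HS-grad', so the comparison below matches no rows and 'HS-grad' passes through unchanged
dataset_raw.loc[dataset_raw['education'] == 'HS-Grad' , 'education'] = 'HS-Graduate'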
dataset_raw.loc[dataset_raw['education'] == 'Masters' , 'education'] = 'Masters'
dataset_raw.loc[dataset_raw['education'] == 'Preschool' , 'education'] = 'Dropout'
dataset_raw.loc[dataset_raw['education'] == 'Prof-school' , 'education'] = 'Professor'
dataset_raw.loc[dataset_raw['education'] == 'Some-college' , 'education'] = 'HS-Graduate'

dataset_bin['education'] = dataset_raw['education']
dataset_con['education'] = dataset_raw['education']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="education", data=dataset_bin);

Feature: Marital Status

# Can we bucket some of these groups?
plt.figure(figsize=(20,3))
sns.countplot(y="marital-status", data=dataset_raw);

dataset_raw.loc[dataset_raw['marital-status'] == 'Never-married' , 'marital-status'] = 'Never-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-AF-spouse' , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-civ-spouse' , 'marital-status'] = 'Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Married-spouse-absent', 'marital-status'] = 'Not-Married'
dataset_raw.loc[dataset_raw['marital-status'] == 'Separated' , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Divorced' , 'marital-status'] = 'Separated'
dataset_raw.loc[dataset_raw['marital-status'] == 'Widowed' , 'marital-status'] = 'Widowed'

dataset_bin['marital-status'] = dataset_raw['marital-status']
dataset_con['marital-status'] = dataset_raw['marital-status']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
sns.countplot(y="marital-status", data=dataset_bin);

Feature: Final Weight

# Let's use the Pandas Cut function to bin the data into equal-width buckets
dataset_bin['fnlwgt'] = pd.cut(dataset_raw['fnlwgt'], 10)
dataset_con['fnlwgt'] = dataset_raw['fnlwgt']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
sns.countplot(y="fnlwgt", data=dataset_bin);

Feature: Education Number

# Let's use the Pandas Cut function to bin the data into equal-width buckets
dataset_bin['education-num'] = pd.cut(dataset_raw['education-num'], 10)
dataset_con['education-num'] = dataset_raw['education-num']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y="education-num", data=dataset_bin);

Feature: Hours per Week

# Let's use the Pandas Cut function to bin the data into equal-width buckets
dataset_bin['hours-per-week'] = pd.cut(dataset_raw['hours-per-week'], 10)
dataset_con['hours-per-week'] = dataset_raw['hours-per-week']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
plt.subplot(1, 2, 1)
sns.countplot(y="hours-per-week", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['hours-per-week']);

Feature: Capital Gain

# Let's use the Pandas Cut function to bin the data into equal-width buckets
dataset_bin['capital-gain'] = pd.cut(dataset_raw['capital-gain'], 5)
dataset_con['capital-gain'] = dataset_raw['capital-gain']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
plt.subplot(1, 2, 1)
sns.countplot(y="capital-gain", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-gain']);

Feature: Capital Loss

# Let's use the Pandas Cut function to bin the data into equal-width buckets
dataset_bin['capital-loss'] = pd.cut(dataset_raw['capital-loss'], 5)
dataset_con['capital-loss'] = dataset_raw['capital-loss']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,3))
plt.subplot(1, 2, 1)
sns.countplot(y="capital-loss", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con['capital-loss']);

Features: Race, Sex, Relationship

# Some features we'll consider to be in good enough shape as to pass through
dataset_con['sex'] = dataset_bin['sex'] = dataset_raw['sex']
dataset_con['race'] = dataset_bin['race'] = dataset_raw['race']
dataset_con['relationship'] = dataset_bin['relationship'] = dataset_raw['relationship']

Bi-variate Analysis

So far, we have analysed all features individually. Let’s now start combining some of these features to obtain further insight into the interactions between them.
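
Before plotting, a quick numeric view of one such interaction can be obtained with a cross-tabulation. The sketch below uses a made-up five-row frame with the notebook's `sex` and `predclass` column names; the variable `rates` is an assumption.

```python
import pandas as pd

# Made-up sample; predclass is 1 for income over $50K and 0 otherwise
df = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female", "Male"],
    "predclass": [1, 0, 0, 0, 1],
})

# Share of each income class within each sex (every row sums to 1)
rates = pd.crosstab(df["sex"], df["predclass"], normalize="index")
print(rates)
```

The count plots below show the same kind of information graphically for every categorical feature at once.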

# Plot a count of the categories from each categorical feature split by our prediction class: salary - predclass.
def plot_bivariate_bar(dataset, hue, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    # Keep only the categorical (object) columns; np.object was removed in NumPy 1.24
    dataset = dataset.select_dtypes(include=[object])
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width,height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset.dtypes[column] == object:
            # Count plot per category, split by the hue column, labels truncated to 10 characters
            g = sns.countplot(y=column, hue=hue, data=dataset)
            substrings = [s.get_text()[:10] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)

plot_bivariate_bar(dataset_con, hue='predclass', cols=3, width=20, height=12, hspace=0.4, wspace=0.5)

# Effect of Education on Income, across Marital Status.
plt.style.use('seaborn-whitegrid')
# `size` was renamed to `height` in seaborn 0.9
g = sns.FacetGrid(dataset_con, col='marital-status', height=4, aspect=.7)
g = g.map(sns.boxplot, 'predclass', 'education-num')

# Historical Trends on the Sex, Education, HPW and Age impact on Income.
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,4))
plt.subplot(1, 3, 1)
sns.violinplot(x='sex', y='education-num', hue='predclass', data=dataset_con, split=True, scale='count');

plt.subplot(1, 3, 2)
sns.violinplot(x='sex', y='hours-per-week', hue='predclass', data=dataset_con, split=True, scale='count');

plt.subplot(1, 3, 3)
sns.violinplot(x='sex', y='age', hue='predclass', data=dataset_con, split=True, scale='count');

# Interaction between pairs of features.
# `size` was renamed to `height` in seaborn 0.9
sns.pairplot(dataset_con[['age','education-num','hours-per-week','predclass','capital-gain','capital-loss']],
             hue="predclass",
             diag_kind="kde",
             height=4);

Feature Crossing: Age + Hours Per Week

So far, we have modified and cleaned features that existed in our dataset. However, we can go further and create new variables, adding human knowledge about the interaction between features.

# Crossing Numerical Features
dataset_con['age-hours'] = dataset_con['age'] * dataset_con['hours-per-week']

# dataset_con keeps the continuous product; dataset_bin gets it cut into 10 equal-width buckets
dataset_bin['age-hours'] = pd.cut(dataset_con['age-hours'], 10)

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
plt.subplot(1, 2, 1)
sns.countplot(y="age-hours", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 1]['age-hours'], kde_kws={"label": ">$50K"});
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 0]['age-hours'], kde_kws={"label": "<=$50K"});

# Crossing Categorical Features
dataset_bin['sex-marital'] = dataset_con['sex-marital'] = dataset_con['sex'] + dataset_con['marital-status']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20,5))
sns.countplot(y="sex-marital", data=dataset_bin);

Feature Encoding

Remember that Machine Learning algorithms perform Linear Algebra on Matrices, which means all features need to have numeric values. The process of converting Categorical Features into numeric values is called Encoding.

Here we perform only One-Hot encoding, not Label encoding.

Additional Resources: http://pbpython.com/categorical-encoding.html
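
The difference between the two encodings can be sketched on a made-up single-column frame. One-Hot encoding creates one indicator column per category, while Label encoding maps each category to an integer code. The category codes from pandas are used here as a stand-in for Label encoding; the names `one_hot` and `label` are assumptions.

```python
import pandas as pd

# Made-up sample with a single categorical column
df = pd.DataFrame({"sex": ["Male", "Female", "Male"]})

# One-Hot: one indicator column per category
one_hot = pd.get_dummies(df, columns=["sex"])

# Label: one integer per category, assigned in sorted category order
label = df["sex"].astype("category").cat.codes

print(list(one_hot.columns))
print(label.tolist())
```

Label encoding imposes an artificial order on the categories, which is one reason to prefer One-Hot encoding for nominal features such as these.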

# One Hot Encodes all labels before Machine Learning
one_hot_cols = dataset_bin.columns.tolist()
one_hot_cols.remove('predclass')
dataset_bin_enc = pd.get_dummies(dataset_bin, columns=one_hot_cols)

dataset_bin_enc.head()

predclass age_(16.927, 24.3] age_(24.3, 31.6] age_(31.6, 38.9] age_(38.9, 46.2] age_(46.2, 53.5] age_(53.5, 60.8] age_(60.8, 68.1] age_(68.1, 75.4] age_(75.4, 82.7] ... sex-marital_FemaleMarried sex-marital_FemaleNever-Married sex-marital_FemaleNot-Married sex-marital_FemaleSeparated sex-marital_FemaleWidowed sex-marital_MaleMarried sex-marital_MaleNever-Married sex-marital_MaleNot-Married sex-marital_MaleSeparated sex-marital_MaleWidowed
0 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
2 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 116 columns

# 'dataset_con' is the original input dataset for this section

# Build a new dataframe containing only the object columns

#obj_df = dataset_con.select_dtypes(include=['object']).copy()
#obj_df.head()

# Use dropna() to delete the rows that contain NaN

#obj_df = obj_df.dropna(axis=0)

# Alternatively, use the most frequent value to fill in the null values
# (NaN workclass -> Private)

#obj_df[obj_df.isnull().any(axis=1)]
#obj_df["workclass"].value_counts()
#obj_df = obj_df.fillna({"workclass": "Private"})

#dataset_con.dtypes

# Delete the rows that contain NaN values
dataset_con_enc = dataset_con.dropna(axis=0)
print(dataset_con_enc)
dataset_con_enc[dataset_con_enc.isnull().any(axis=1)]
      predclass  age    workclass     occupation native-country  education  \
0             0   39  Non-fed-gov          Admin  United-States  Bachelors   
1             0   50     Self-emp  Office Labour  United-States  Bachelors   
2             0   38      Private  Manual Labour  United-States    HS-grad   
3             0   53      Private  Manual Labour  United-States    Dropout   
4             0   28      Private   Professional  South-America  Bachelors   
...         ...  ...          ...            ...            ...        ...   
48836         0   33      Private   Professional  United-States  Bachelors   
48837         0   39      Private   Professional  United-States  Bachelors   
48839         0   38      Private   Professional  United-States  Bachelors   
48840         0   44      Private          Admin  United-States  Bachelors   
48841         1   35     Self-emp  Office Labour  United-States  Bachelors   

      marital-status  fnlwgt  education-num  hours-per-week  capital-gain  \
0      Never-Married   77516             13              40          2174   
1            Married   83311             13              13             0   
2          Separated  215646              9              40             0   
3            Married  234721              7              40             0   
4            Married  338409             13              40             0   
...              ...     ...            ...             ...           ...   
48836  Never-Married  245211             13              40             0   
48837      Separated  215419             13              36             0   
48839        Married  374983             13              50             0   
48840      Separated   83891             13              40          5455   
48841        Married  182148             13              60             0   

       capital-loss     sex                race   relationship  age-hours  \
0                 0    Male               White  Not-in-family       1560   
1                 0    Male               White        Husband        650   
2                 0    Male               White  Not-in-family       1520   
3                 0    Male               Black        Husband       2120   
4                 0  Female               Black           Wife       1120   
...             ...     ...                 ...            ...        ...   
48836             0    Male               White      Own-child       1320   
48837             0  Female               White  Not-in-family       1404   
48839             0    Male               White        Husband       1900   
48840             0    Male  Asian-Pac-Islander      Own-child       1760   
48841             0    Male               White        Husband       2100   

             sex-marital  
0      MaleNever-Married  
1            MaleMarried  
2          MaleSeparated  
3            MaleMarried  
4          FemaleMarried  
...                  ...  
48836  MaleNever-Married  
48837    FemaleSeparated  
48839        MaleMarried  
48840      MaleSeparated  
48841        MaleMarried  

[45222 rows x 17 columns]

predclass age workclass occupation native-country education marital-status fnlwgt education-num hours-per-week capital-gain capital-loss sex race relationship age-hours sex-marital
# Label Encode all labels
le = preprocessing.LabelEncoder()
dataset_con_enc = dataset_con_enc.apply(le.fit_transform)

dataset_con_enc.head()

predclass age workclass occupation native-country education marital-status fnlwgt education-num hours-per-week capital-gain capital-loss sex race relationship age-hours sex-marital
0 0 22 1 0 7 1 1 3217 12 39 26 0 1 4 1 655 6
1 0 33 4 3 7 1 0 3519 12 12 0 0 1 4 0 302 5
2 0 21 3 1 7 5 3 17196 8 39 0 0 1 4 1 644 8
3 0 36 3 1 7 3 0 18738 6 39 0 0 1 2 0 847 5
4 0 11 3 4 6 1 0 23828 12 39 0 0 0 2 5 494 0

Feature Reduction / Selection

Once we have our features ready to use, we might find that the number of features available is too large for our machine learning algorithms to run in a reasonable timeframe. There are a number of options available to us for feature reduction and feature selection.

  • Dimensionality Reduction:
    • Principal Component Analysis (PCA): Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.
    • Singular Value Decomposition (SVD): SVD is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any m×n matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.
  • Feature Importance/Relevance:
    • Filter Methods: Filter-type methods select features based only on general metrics, such as the correlation with the variable to predict. They suppress the least interesting variables; the remaining variables become part of the classification or regression model. These methods are computationally cheap and robust to overfitting.
    • Wrapper Methods: Wrapper methods evaluate subsets of variables, which, unlike filter approaches, makes it possible to detect interactions between variables. Their two main disadvantages are an increased risk of overfitting when the number of observations is insufficient, and significant computation time when the number of variables is large.
    • Embedded Methods: Embedded methods try to combine the advantages of both previous methods. A learning algorithm takes advantage of its own variable selection process and performs feature selection and classification simultaneously.
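As a rough illustration of the filter and embedded families listed above, the sketch below runs both on synthetic data. The data, the choice of `f_classif` as the filter metric, and the use of a random forest as the embedded selector are all assumptions made for the example; they are not what this notebook uses on the census data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic data: only columns 0 and 1 carry signal about the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 5))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# Filter method: score each feature on its own, keep the two best
filt = SelectKBest(score_func=f_classif, k=2).fit(X_toy, y_toy)
filter_idx = np.flatnonzero(filt.get_support())

# Embedded method: the model's own importances decide which features survive
forest = RandomForestClassifier(n_estimators=50, random_state=0)
emb = SelectFromModel(forest).fit(X_toy, y_toy)
embedded_idx = np.flatnonzero(emb.get_support())

print("Filter keeps columns:  ", filter_idx)
print("Embedded keeps columns:", embedded_idx)
```

The filter never trains a model, which is why it is cheap; the embedded selector gets its ranking as a by-product of fitting the forest.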

Feature Correlation

Correlation is a measure of how much two random variables change together. Ideally, features should be uncorrelated with each other and highly correlated with the feature we’re trying to predict.
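One way to act on this idea is to scan the correlation matrix for feature pairs that are nearly redundant. The sketch below does so on a made-up frame; the column names and the 0.9 threshold are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: "a_copy" is a linear function of "a", "noise" is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 1,
    "noise": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag any column that is highly correlated with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['a_copy']
```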

# Create a correlation plot of both datasets.
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(25,10))

plt.subplot(1, 2, 1)
# Generate a mask for the upper triangle
# (np.bool was removed in NumPy 1.24; the builtin bool works on all versions)
mask = np.zeros_like(dataset_bin_enc.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_bin_enc.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r", 100),
            mask=mask,
            linewidths=.5);

plt.subplot(1, 2, 2)
mask = np.zeros_like(dataset_con_enc.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_con_enc.corr(),
            vmin=-1, vmax=1,
            square=True,
            cmap=sns.color_palette("RdBu_r", 100),
            mask=mask,
            linewidths=.5);

Feature Importance

Random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. When training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. This is the feature importance measure exposed in sklearn’s Random Forest implementations.
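The impurity decrease of a single split can be worked out by hand. The sketch below uses Gini impurity for a binary target on a made-up node; the helper names `gini` and `impurity_decrease` are hypothetical and only serve this example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels: 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# A node with 4 samples of each class, split into two children of 4 samples
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left = [0, 0, 0, 1]
right = [0, 1, 1, 1]

decrease = impurity_decrease(parent, left, right)
print(decrease)  # 0.5 - 0.375 = 0.125
```

Summing such decreases over every node that splits on a given feature, and averaging across trees, gives the kind of importance score described above.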

# Using Random Forest to gain an insight on Feature Importance
clf = RandomForestClassifier()
clf.fit(dataset_con_enc.drop('predclass', axis=1), dataset_con_enc['predclass'])

plt.style.use('seaborn-whitegrid')
importance = clf.feature_importances_
importance = pd.DataFrame(importance, index=dataset_con_enc.drop('predclass', axis=1).columns, columns=["Importance"])
importance.sort_values(by='Importance', ascending=True).plot(kind='barh', figsize=(20,len(importance)/2));

PCA

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

We can use PCA to reduce the number of features to use in our ML algorithms, and graphing the variance gives us an idea of how many features we really need to represent our dataset fully.
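A common way to read such a variance plot is to accumulate the explained variance and pick the smallest number of components that reaches a chosen share. The sketch below does this on synthetic data; the 95% cut-off and the data itself are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: six observed features mixed from two latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 6))
X_toy = latent @ mixing + 0.05 * rng.normal(size=(300, 6))

# Scale first, as in the notebook, then fit PCA with all components
X_scaled = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_scaled)

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest k whose cumulative share reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_needed)
```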

# Calculating PCA for both datasets, and graphing the Variance for each feature, per dataset
std_scale = preprocessing.StandardScaler().fit(dataset_bin_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_bin_enc.drop('predclass', axis=1))
pca1 = PCA(n_components=len(dataset_bin_enc.columns)-1)
fit1 = pca1.fit(X)

std_scale = preprocessing.StandardScaler().fit(dataset_con_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_con_enc.drop('predclass', axis=1))
pca2 = PCA(n_components=len(dataset_con_enc.columns)-2)
fit2 = pca2.fit(X)

# Graphing the variance per feature
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(25,7))

plt.subplot(1, 2, 1)
plt.xlabel('PCA Feature')
plt.ylabel('Variance')
plt.title('PCA for Discretised Dataset')
plt.bar(range(0, fit1.explained_variance_ratio_.size), fit1.explained_variance_ratio_);

plt.subplot(1, 2, 2)
plt.xlabel('PCA Feature')
plt.ylabel('Variance')
plt.title('PCA for Continuous Dataset')
plt.bar(range(0, fit2.explained_variance_ratio_.size), fit2.explained_variance_ratio_);

# PCA's components graphed in 2D and 3D
# Apply Scaling
std_scale = preprocessing.StandardScaler().fit(dataset_con_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_con_enc.drop('predclass', axis=1))
y = dataset_con_enc['predclass']

# Formatting
target_names = [0,1]
colors = ['navy','darkorange']
lw = 2
alpha = 0.3
# 2 Components PCA
plt.style.use('seaborn-whitegrid')
plt.figure(2, figsize=(20, 8))

plt.subplot(1, 2, 1)
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
for color, i, target_name in zip(colors, [0, 1], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1],
                color=color,
                alpha=alpha,
                lw=lw,
                label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('First two PCA directions');

# 3 Components PCA
ax = plt.subplot(1, 2, 2, projection='3d')

pca = PCA(n_components=3)
X_reduced = pca.fit(X).transform(X)
for color, i, target_name in zip(colors, [0, 1], target_names):
    ax.scatter(X_reduced[y == i, 0], X_reduced[y == i, 1], X_reduced[y == i, 2],
               color=color,
               alpha=alpha,
               lw=lw,
               label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")

# rotate the axes
ax.view_init(30, 10)

Recursive Feature Elimination

Feature ranking with recursive feature elimination and cross-validated selection of the best number of features.
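The elimination step itself can be seen in isolation with plain RFE, without the cross-validation wrapper. The sketch below runs it on synthetic data where only one column is informative; the data and the choice of `n_features_to_select=1` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: only column 2 determines the label
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(400, 4))
y_toy = (X_toy[:, 2] > 0).astype(int)

# Repeatedly fit the model and drop the feature with the smallest coefficient
rfe = RFE(LogisticRegression(), n_features_to_select=1, step=1).fit(X_toy, y_toy)

# ranking_ is 1 for the surviving feature; larger values were eliminated earlier
print(rfe.ranking_)
```

RFECV repeats this procedure inside cross-validation and keeps the number of features with the best average score.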

# Calculating RFE for non-discretised dataset, and graphing the Importance for each feature, per dataset
selector1 = RFECV(LogisticRegression(), step=1, cv=5, n_jobs=-1)
selector1 = selector1.fit(dataset_con_enc.drop('predclass', axis=1).values, dataset_con_enc['predclass'].values)
print("Feature Ranking For Non-Discretised: %s" % selector1.ranking_)
print("Optimal number of features : %d" % selector1.n_features_)
# Plot number of features VS. cross-validation scores
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20,5))
plt.xlabel("Number of features selected - Non-Discretised")
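plt.ylabel("Cross validation score (accuracy)")
# Note: grid_scores_ was removed in scikit-learn 1.2; on newer versions use
# selector1.cv_results_['mean_test_score'] instead.
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_);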

# Feature space could be subsetted like so:
dataset_con_enc = dataset_con_enc[dataset_con_enc.columns[np.insert(selector1.support_, 0, True)]]
Feature Ranking For Non-Discretised: [1 1 1 1 3 1 4 1 1 1 1 1 1 1 2 1]
Optimal number of features : 13

Selecting Dataset

We now have two datasets to choose from when applying our ML algorithms: the one-hot-encoded and the label-encoded. For now, we have decided not to use feature reduction or selection algorithms.

# OPTIONS: 
# - dataset_bin_enc
# - dataset_con_enc

# Change the dataset to test how the algorithms would perform under a differently encoded dataset.

selected_dataset = dataset_bin_enc
selected_dataset.head(2)

predclass age_(16.927, 24.3] age_(24.3, 31.6] age_(31.6, 38.9] age_(38.9, 46.2] age_(46.2, 53.5] age_(53.5, 60.8] age_(60.8, 68.1] age_(68.1, 75.4] age_(75.4, 82.7] ... sex-marital_FemaleMarried sex-marital_FemaleNever-Married sex-marital_FemaleNot-Married sex-marital_FemaleSeparated sex-marital_FemaleWidowed sex-marital_MaleMarried sex-marital_MaleNever-Married sex-marital_MaleNot-Married sex-marital_MaleSeparated sex-marital_MaleWidowed
0 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

2 rows × 116 columns

Splitting Data into Training and Testing Datasets

We need to split the data back into the training and testing datasets. Remember we joined both right at the beginning.

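# Splitting the Training and Test data sets
# Note: .loc slices by label and includes both endpoints, so row 32560 lands in both
# train and test here; starting the test slice at 32561 would avoid the overlap.
train = selected_dataset.loc[0:32560,:]
test = selected_dataset.loc[32560:,:]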
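Because label-based slicing with `.loc` includes both endpoints, it is worth checking where the boundary row ends up. The sketch below shows the effect on a six-row toy frame (made up for illustration) and contrasts it with position-based `.iloc` slicing, which excludes the stop position.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(6)})  # default integer index 0..5

# Label-based slices: both endpoints are included
train_loc = toy.loc[0:3, :]   # rows 0, 1, 2, 3
test_loc = toy.loc[3:, :]     # rows 3, 4, 5 -> row 3 appears in both

# Starting the second slice one label later removes the overlap
test_fixed = toy.loc[4:, :]   # rows 4, 5

# Position-based slices exclude the stop position, so they never overlap
train_iloc, test_iloc = toy.iloc[:4], toy.iloc[4:]

overlap = set(train_loc.index) & set(test_loc.index)
print(overlap)  # {3}
```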

Removing Samples with Missing data

We could have removed rows with missing data during feature cleaning, but we’re choosing to do it at this point. It’s easier to do it this way, right after we split the data into Training and Testing. Otherwise we would have had to keep track of the number of deleted rows in our data and take that into account when deciding on a splitting boundary for our joined data.

# Given missing fields are a small percentage of the overall dataset, 
# we have chosen to delete them.
train = train.dropna(axis=0)
test = test.dropna(axis=0)

Rename datasets before Machine Learning algos

X_train_w_label = train
X_train = train.drop(['predclass'], axis=1)
y_train = train['predclass'].astype('int64')
X_test = test.drop(['predclass'], axis=1)
y_test = test['predclass'].astype('int64')

Machine Learning Algorithms

Data Review

Let’s take one last peek at our data before we start running the Machine Learning algorithms.

X_train.shape
(32561, 115)
X_train.head()

age_(16.927, 24.3] age_(24.3, 31.6] age_(31.6, 38.9] age_(38.9, 46.2] age_(46.2, 53.5] age_(53.5, 60.8] age_(60.8, 68.1] age_(68.1, 75.4] age_(75.4, 82.7] age_(82.7, 90.0] ... sex-marital_FemaleMarried sex-marital_FemaleNever-Married sex-marital_FemaleNot-Married sex-marital_FemaleSeparated sex-marital_FemaleWidowed sex-marital_MaleMarried sex-marital_MaleNever-Married sex-marital_MaleNot-Married sex-marital_MaleSeparated sex-marital_MaleWidowed
0 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 0
1 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 0 1 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0

5 rows × 115 columns

y_train.head()
0    0
1    0
2    0
3    0
4    0
Name: predclass, dtype: int64
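# Seed Python's and NumPy's global random number generators so that repeated
# runs are more reproducible. scikit-learn estimators draw from NumPy's global
# state unless random_state is set, so random.seed alone does not affect them.
random.seed(1)
np.random.seed(1)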
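A more targeted way to get repeatable results from a scikit-learn estimator is to pass `random_state` to it directly. The sketch below checks this on synthetic data; the helper `fit_and_predict` and the toy data are hypothetical and only serve the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

def fit_and_predict(seed):
    """Fit a small forest with a fixed random_state and return its probabilities."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    return model.fit(X_toy, y_toy).predict_proba(X_toy)[:, 1]

# Two independent fits with the same random_state give identical outputs
same = np.array_equal(fit_and_predict(1), fit_and_predict(1))
print(same)  # True
```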

Algorithms

From here, we will be running the following algorithms.

  • KNN
  • Logistic Regression
  • Random Forest
  • Naive Bayes
  • Stochastic Gradient Descent
  • Linear SVC
  • Decision Tree
  • Gradient Boosted Trees

Because the code for each is highly repetitive, we’ll create a custom function that fits a model and collects its metrics.

For some algorithms, we have also chosen to run a Random Hyperparameter search, to select the best hyperparameters for a given algorithm.

# calculate the fpr and tpr for all thresholds of the classification
def plot_roc_curve(y_test, preds):
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([-0.01, 1.01])
    plt.ylim([-0.01, 1.01])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
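To see what the AUC number on such a plot means, the sketch below computes it for four made-up scores in two ways: with the same `metrics.roc_curve` and `metrics.auc` calls used above, and as the fraction of (positive, negative) pairs in which the positive sample gets the higher score.

```python
from sklearn import metrics

# Four toy samples: true labels and predicted scores (illustrative values)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Area under the ROC curve via scikit-learn
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
auc = metrics.auc(fpr, tpr)

# Same quantity as a ranking statistic over all positive/negative pairs
pos = [s for s, t in zip(scores, y_true) if t == 1]
neg = [s for s, t in zip(scores, y_true) if t == 0]
pairwise = sum(p > n for p in pos for n in neg) / (len(pos) * len(neg))

print(auc, pairwise)  # 0.75 0.75
```

Three of the four pairs are ranked correctly (0.35 loses to 0.4), hence 0.75.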
# Function that runs the requested algorithm and returns the accuracy metrics
# Note: y_test is read from the notebook's global scope, not passed in.
def fit_ml_algo(algo, X_train, y_train, X_test, cv):
    # One Pass
    model = algo.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    if (isinstance(algo, (LogisticRegression,
                          KNeighborsClassifier,
                          GaussianNB,
                          DecisionTreeClassifier,
                          RandomForestClassifier,
                          GradientBoostingClassifier))):
        probs = model.predict_proba(X_test)[:,1]
    else:
        probs = "Not Available"
    acc = round(model.score(X_test, y_test) * 100, 2)
    # CV
    train_pred = model_selection.cross_val_predict(algo,
                                                   X_train,
                                                   y_train,
                                                   cv=cv,
                                                   n_jobs = -1)
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)
    return train_pred, test_pred, acc, acc_cv, probs
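The cross-validated accuracy returned by this function is built from out-of-fold predictions: every training row is predicted by a model that did not see it during fitting. A minimal sketch of that step on synthetic data (the data and the 5-fold setting are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

# Synthetic data: label depends on column 0 plus some noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 3))
y_toy = (X_toy[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)

# One prediction per row, each made by a model trained on the other folds
oof_pred = cross_val_predict(LogisticRegression(), X_toy, y_toy, cv=5)

# Accuracy over all out-of-fold predictions, as a percentage
acc_cv = round(accuracy_score(y_toy, oof_pred) * 100, 2)
print(oof_pred.shape, acc_cv)
```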
# Logistic Regression - Random Search for Hyperparameters

# Utility function to report best scores
def report(results, n_top=5):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# Specify parameters and distributions to sample from
param_dist = {'penalty': ['l2', 'l1'],
              'class_weight': [None, 'balanced'],
              'C': np.logspace(-20, 20, 10000),
              'intercept_scaling': np.logspace(-20, 20, 10000)}

# Run Randomized Search
n_iter_search = 10
lrc = LogisticRegression()
random_search = RandomizedSearchCV(lrc,
                                   n_jobs=-1,
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time.time()
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time.time() - start), n_iter_search))
report(random_search.cv_results_)
RandomizedSearchCV took 6.84 seconds for 10 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.844 (std: 0.004)
Parameters: {'penalty': 'l2', 'intercept_scaling': 42370413880.09742, 'class_weight': None, 'C': 6.248554728170629e+17}

Model with rank: 2
Mean validation score: 0.800 (std: 0.004)
Parameters: {'penalty': 'l2', 'intercept_scaling': 5.356398592977186e-12, 'class_weight': 'balanced', 'C': 318529980510.9508}

Model with rank: 2
Mean validation score: 0.800 (std: 0.004)
Parameters: {'penalty': 'l2', 'intercept_scaling': 6.741908876164404e-13, 'class_weight': 'balanced', 'C': 1.753171420878381e+18}

Model with rank: 4
Mean validation score: 0.800 (std: 0.004)
Parameters: {'penalty': 'l2', 'intercept_scaling': 0.03646331805309427, 'class_weight': 'balanced', 'C': 431085.5408791511}

Model with rank: 5
Mean validation score: 0.759 (std: 0.000)
Parameters: {'penalty': 'l2', 'intercept_scaling': 52853324182.66478, 'class_weight': None, 'C': 3.311707756163145e-20}
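The search above samples C from a pre-computed logarithmic grid. An alternative, sketched below on synthetic data, is to hand RandomizedSearchCV a continuous log-uniform distribution from `scipy.stats`. The range 1e-3 to 1e3, the toy data, and the fixed `random_state` are assumptions chosen for the example, not the settings used in this notebook.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

# C is drawn from a continuous log-uniform distribution on [1e-3, 1e3]
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": loguniform(1e-3, 1e3)},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

A narrower range than 1e-20 to 1e20 keeps more of the sampled candidates in a region where the regularisation strength actually changes the fit.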
# Logistic Regression
start_time = time.time()
train_pred_log, test_pred_log, acc_log, acc_cv_log, probs_log = fit_ml_algo(LogisticRegression(n_jobs = -1),
X_train,
y_train,
X_test,
10)
log_time = (time.time() - start_time)
print("Accuracy: %s" % acc_log)
print("Accuracy CV 10-Fold: %s" % acc_cv_log)
print("Running Time: %s" % datetime.timedelta(seconds=log_time))
Accuracy: 84.47
Accuracy CV 10-Fold: 84.33
Running Time: 0:00:09.857440
print(metrics.confusion_matrix(y_test, test_pred_log))
[[11501   934]
 [ 1595  2252]]
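The figures in the classification reports can be recovered from this confusion matrix by hand. scikit-learn lays it out with true classes as rows and predicted classes as columns, so the four cells are TN, FP, FN and TP. The sketch below redoes the arithmetic for class 1 using the numbers printed above.

```python
# Cells of the confusion matrix printed above: [[TN, FP], [FN, TP]]
tn, fp = 11501, 934
fn, tp = 1595, 2252

total = tn + fp + fn + tp

# Share of all test rows classified correctly
accuracy = (tp + tn) / total

# Of the rows predicted as class 1, how many really are class 1
precision_1 = tp / (tp + fp)

# Of the rows that really are class 1, how many were found
recall_1 = tp / (tp + fn)

# Harmonic mean of precision and recall
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy * 100, 2))                                   # 84.47
print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))  # 0.71 0.59 0.64
```

These match the reported test accuracy of 84.47 and the class 1 row of the test classification report.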
print(metrics.classification_report(y_train, train_pred_log))
              precision    recall  f1-score   support

           0       0.88      0.93      0.90     24720
           1       0.71      0.58      0.64      7841

    accuracy                           0.84     32561
   macro avg       0.79      0.75      0.77     32561
weighted avg       0.84      0.84      0.84     32561
print(metrics.classification_report(y_test, test_pred_log))
              precision    recall  f1-score   support

           0       0.88      0.92      0.90     12435
           1       0.71      0.59      0.64      3847

    accuracy                           0.84     16282
   macro avg       0.79      0.76      0.77     16282
weighted avg       0.84      0.84      0.84     16282
plot_roc_curve(y_test, probs_log)

# k-Nearest Neighbors
start_time = time.time()
train_pred_knn, test_pred_knn, acc_knn, acc_cv_knn, probs_knn = fit_ml_algo(KNeighborsClassifier(n_neighbors = 3,
n_jobs = -1),
X_train,
y_train,
X_test,
10)
knn_time = (time.time() - start_time)
print("Accuracy: %s" % acc_knn)
print("Accuracy CV 10-Fold: %s" % acc_cv_knn)
print("Running Time: %s" % datetime.timedelta(seconds=knn_time))
Accuracy: 81.02
Accuracy CV 10-Fold: 81.13
Running Time: 0:02:21.181324
print(metrics.classification_report(y_train, train_pred_knn))
              precision    recall  f1-score   support

           0       0.86      0.89      0.88     24720
           1       0.62      0.56      0.59      7841

    accuracy                           0.81     32561
   macro avg       0.74      0.73      0.73     32561
weighted avg       0.81      0.81      0.81     32561
print(metrics.classification_report(y_test, test_pred_knn))
              precision    recall  f1-score   support

           0       0.87      0.89      0.88     12435
           1       0.61      0.56      0.58      3847

    accuracy                           0.81     16282
   macro avg       0.74      0.72      0.73     16282
weighted avg       0.81      0.81      0.81     16282
plot_roc_curve(y_test, probs_knn)

# Gaussian Naive Bayes
start_time = time.time()
train_pred_gaussian, test_pred_gaussian, acc_gaussian, acc_cv_gaussian, probs_gau = fit_ml_algo(GaussianNB(),
X_train,
y_train,
X_test,
10)
gaussian_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gaussian)
print("Accuracy CV 10-Fold: %s" % acc_cv_gaussian)
print("Running Time: %s" % datetime.timedelta(seconds=gaussian_time))
Accuracy: 75.59
Accuracy CV 10-Fold: 74.51
Running Time: 0:00:00.479271
print(metrics.classification_report(y_train, train_pred_gaussian)) 
              precision    recall  f1-score   support

           0       0.95      0.70      0.81     24720
           1       0.48      0.88      0.62      7841

    accuracy                           0.75     32561
   macro avg       0.72      0.79      0.72     32561
weighted avg       0.84      0.75      0.76     32561
print(metrics.classification_report(y_test, test_pred_gaussian))
              precision    recall  f1-score   support

           0       0.94      0.72      0.82     12435
           1       0.49      0.86      0.63      3847

    accuracy                           0.76     16282
   macro avg       0.72      0.79      0.72     16282
weighted avg       0.84      0.76      0.77     16282
plot_roc_curve(y_test, probs_gau)

# Linear SVC
start_time = time.time()
train_pred_svc, test_pred_svc, acc_linear_svc, acc_cv_linear_svc, _ = fit_ml_algo(LinearSVC(),
X_train,
y_train,
X_test,
10)
linear_svc_time = (time.time() - start_time)
print("Accuracy: %s" % acc_linear_svc)
print("Accuracy CV 10-Fold: %s" % acc_cv_linear_svc)
print("Running Time: %s" % datetime.timedelta(seconds=linear_svc_time))
Accuracy: 84.42
Accuracy CV 10-Fold: 84.46
Running Time: 0:00:07.630441
print(metrics.classification_report(y_train, train_pred_svc))
              precision    recall  f1-score   support

           0       0.88      0.93      0.90     24720
           1       0.72      0.58      0.64      7841

    accuracy                           0.84     32561
   macro avg       0.80      0.76      0.77     32561
weighted avg       0.84      0.84      0.84     32561
print(metrics.classification_report(y_test, test_pred_svc)) 
              precision    recall  f1-score   support

           0       0.88      0.93      0.90     12435
           1       0.71      0.58      0.64      3847

    accuracy                           0.84     16282
   macro avg       0.79      0.75      0.77     16282
weighted avg       0.84      0.84      0.84     16282
# Stochastic Gradient Descent
start_time = time.time()
train_pred_sgd, test_pred_sgd, acc_sgd, acc_cv_sgd, _ = fit_ml_algo(SGDClassifier(n_jobs = -1),
X_train,
y_train,
X_test,
10)
sgd_time = (time.time() - start_time)
print("Accuracy: %s" % acc_sgd)
print("Accuracy CV 10-Fold: %s" % acc_cv_sgd)
print("Running Time: %s" % datetime.timedelta(seconds=sgd_time))
Accuracy: 84.15
Accuracy CV 10-Fold: 83.74
Running Time: 0:00:02.039138
print(metrics.classification_report(y_train, train_pred_sgd))
              precision    recall  f1-score   support

           0       0.88      0.91      0.89     24720
           1       0.69      0.60      0.64      7841

    accuracy                           0.84     32561
   macro avg       0.78      0.76      0.77     32561
weighted avg       0.83      0.84      0.83     32561
print(metrics.classification_report(y_test, test_pred_sgd))
              precision    recall  f1-score   support

           0       0.88      0.91      0.90     12435
           1       0.69      0.61      0.64      3847

    accuracy                           0.84     16282
   macro avg       0.78      0.76      0.77     16282
weighted avg       0.84      0.84      0.84     16282
# Decision Tree Classifier
start_time = time.time()
train_pred_dt, test_pred_dt, acc_dt, acc_cv_dt, probs_dt = fit_ml_algo(DecisionTreeClassifier(),
X_train,
y_train,
X_test,
10)
dt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_dt)
print("Accuracy CV 10-Fold: %s" % acc_cv_dt)
print("Running Time: %s" % datetime.timedelta(seconds=dt_time))
Accuracy: 79.93
Accuracy CV 10-Fold: 80.44
Running Time: 0:00:01.417276
print(metrics.confusion_matrix(y_test, test_pred_dt))
[[10956  1479]
 [ 1788  2059]]
print(metrics.classification_report(y_train, train_pred_dt))
              precision    recall  f1-score   support

           0       0.86      0.89      0.87     24720
           1       0.60      0.54      0.57      7841

    accuracy                           0.80     32561
   macro avg       0.73      0.72      0.72     32561
weighted avg       0.80      0.80      0.80     32561
print(metrics.classification_report(y_test, test_pred_dt))
              precision    recall  f1-score   support

           0       0.86      0.88      0.87     12435
           1       0.58      0.54      0.56      3847

    accuracy                           0.80     16282
   macro avg       0.72      0.71      0.71     16282
weighted avg       0.79      0.80      0.80     16282
plot_roc_curve(y_test, probs_dt)

# Random Forest Classifier - Random Search for Hyperparameters

# Utility function to report best scores
def report(results, n_top=5):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})".format(
                  results['mean_test_score'][candidate],
                  results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# Specify parameters and distributions to sample from
param_dist = {"max_depth": [10, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 20),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Run Randomized Search
n_iter_search = 10
rfc = RandomForestClassifier(n_estimators=10)
random_search = RandomizedSearchCV(rfc,
                                   n_jobs = -1,
                                   param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time.time()
random_search.fit(X_train, y_train)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time.time() - start), n_iter_search))
report(random_search.cv_results_)
RandomizedSearchCV took 2.68 seconds for 10 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.839 (std: 0.004)
Parameters: {'bootstrap': False, 'criterion': 'gini', 'max_depth': None, 'max_features': 4, 'min_samples_leaf': 2, 'min_samples_split': 13}

Model with rank: 2
Mean validation score: 0.838 (std: 0.005)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': None, 'max_features': 10, 'min_samples_leaf': 5, 'min_samples_split': 2}

Model with rank: 3
Mean validation score: 0.838 (std: 0.005)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': None, 'max_features': 7, 'min_samples_leaf': 9, 'min_samples_split': 4}

Model with rank: 4
Mean validation score: 0.838 (std: 0.004)
Parameters: {'bootstrap': True, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 10, 'min_samples_leaf': 2, 'min_samples_split': 13}

Model with rank: 5
Mean validation score: 0.834 (std: 0.004)
Parameters: {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 10, 'max_features': 7, 'min_samples_leaf': 5, 'min_samples_split': 2}
# Random Forest Classifier
start_time = time.time()
rfc = RandomForestClassifier(n_estimators=10,
min_samples_leaf=2,
min_samples_split=17,
criterion='gini',
max_features=8)
train_pred_rf, test_pred_rf, acc_rf, acc_cv_rf, probs_rf = fit_ml_algo(rfc,
X_train,
y_train,
X_test,
10)
rf_time = (time.time() - start_time)
print("Accuracy: %s" % acc_rf)
print("Accuracy CV 10-Fold: %s" % acc_cv_rf)
print("Running Time: %s" % datetime.timedelta(seconds=rf_time))
Accuracy: 84.07
Accuracy CV 10-Fold: 84.05
Running Time: 0:00:01.423032
print(metrics.classification_report(y_train, train_pred_rf))
              precision    recall  f1-score   support

           0       0.87      0.93      0.90     24720
           1       0.71      0.57      0.63      7841

    accuracy                           0.84     32561
   macro avg       0.79      0.75      0.76     32561
weighted avg       0.83      0.84      0.83     32561
print(metrics.classification_report(y_test, test_pred_rf))
              precision    recall  f1-score   support

           0       0.87      0.93      0.90     12435
           1       0.70      0.56      0.63      3847

    accuracy                           0.84     16282
   macro avg       0.79      0.74      0.76     16282
weighted avg       0.83      0.84      0.83     16282
plot_roc_curve(y_test, probs_rf)

# Gradient Boosting Trees
start_time = time.time()
train_pred_gbt, test_pred_gbt, acc_gbt, acc_cv_gbt, probs_gbt = fit_ml_algo(
    GradientBoostingClassifier(), X_train, y_train, X_test, 10)
gbt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gbt)
print("Accuracy CV 10-Fold: %s" % acc_cv_gbt)
print("Running Time: %s" % datetime.timedelta(seconds=gbt_time))
Accuracy: 84.53
Accuracy CV 10-Fold: 84.34
Running Time: 0:00:18.993168
print(metrics.classification_report(y_train, train_pred_gbt))
              precision    recall  f1-score   support

           0       0.87      0.93      0.90     24720
           1       0.72      0.57      0.64      7841

    accuracy                           0.84     32561
   macro avg       0.80      0.75      0.77     32561
weighted avg       0.84      0.84      0.84     32561
print(metrics.classification_report(y_test, test_pred_gbt))
              precision    recall  f1-score   support

           0       0.88      0.93      0.90     12435
           1       0.71      0.58      0.64      3847

    accuracy                           0.85     16282
   macro avg       0.79      0.75      0.77     16282
weighted avg       0.84      0.85      0.84     16282
plot_roc_curve(y_test, probs_gbt)

[Figure: ROC curve for Gradient Boosting Trees]

Ranking Results

Let’s rank the results for all the algorithms we have used, first by accuracy and then by 10-fold cross-validated accuracy.

# Rank the models by accuracy
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree', 'Gradient Boosting Trees'],
    'Score': [
        acc_knn,
        acc_log,
        acc_rf,
        acc_gaussian,
        acc_sgd,
        acc_linear_svc,
        acc_dt,
        acc_gbt
    ]})
models.sort_values(by='Score', ascending=False)

                         Model  Score
7      Gradient Boosting Trees  84.53
1          Logistic Regression  84.47
5                   Linear SVC  84.42
4  Stochastic Gradient Descent  84.15
2                Random Forest  84.07
0                          KNN  81.02
6                Decision Tree  79.93
3                  Naive Bayes  75.59
# Rank the models by 10-fold cross-validated accuracy
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree', 'Gradient Boosting Trees'],
    'Score': [
        acc_cv_knn,
        acc_cv_log,
        acc_cv_rf,
        acc_cv_gaussian,
        acc_cv_sgd,
        acc_cv_linear_svc,
        acc_cv_dt,
        acc_cv_gbt
    ]})
models.sort_values(by='Score', ascending=False)

                         Model  Score
5                   Linear SVC  84.46
7      Gradient Boosting Trees  84.34
1          Logistic Regression  84.33
2                Random Forest  84.05
4  Stochastic Gradient Descent  83.74
0                          KNN  81.13
6                Decision Tree  80.44
3                  Naive Bayes  74.51
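
The two rankings are easier to compare side by side. The sketch below is one possible way to do that in plain Python, so that it runs without the notebook's state: the scores are copied from the two tables above, and the `combined` list and its `Gap` column are illustrative additions rather than part of the original analysis.

```python
# Scores copied from the two ranking tables above (percent accuracy).
acc = {
    'KNN': 81.02, 'Logistic Regression': 84.47, 'Random Forest': 84.07,
    'Naive Bayes': 75.59, 'Stochastic Gradient Descent': 84.15,
    'Linear SVC': 84.42, 'Decision Tree': 79.93,
    'Gradient Boosting Trees': 84.53,
}
acc_cv = {
    'KNN': 81.13, 'Logistic Regression': 84.33, 'Random Forest': 84.05,
    'Naive Bayes': 74.51, 'Stochastic Gradient Descent': 83.74,
    'Linear SVC': 84.46, 'Decision Tree': 80.44,
    'Gradient Boosting Trees': 84.34,
}

# One row per model: (name, accuracy, CV accuracy, accuracy minus CV accuracy)
combined = [(m, acc[m], acc_cv[m], round(acc[m] - acc_cv[m], 2)) for m in acc]
combined.sort(key=lambda row: row[2], reverse=True)  # rank by CV accuracy

print('%-28s %8s %8s %6s' % ('Model', 'Acc', 'Acc CV', 'Gap'))
for name, a, cv, gap in combined:
    print('%-28s %8.2f %8.2f %+6.2f' % (name, a, cv, gap))
```

The top five models sit within about one percentage point of each other on both measures, so the exact order among them should not be over-interpreted.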
# On Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(10,10))

models = [
    'KNN',
    'Logistic Regression',
    'Random Forest',
    'Naive Bayes',
    'Decision Tree',
    'Gradient Boosting Trees'
]
probs = [
    probs_knn,
    probs_log,
    probs_rf,
    probs_gau,
    probs_dt,
    probs_gbt
]
colors = [
    'blue',
    'green',
    'red',
    'cyan',
    'magenta',
    'yellow',
]

plt.title('Receiver Operating Characteristic')
plt.plot([0, 1], [0, 1], 'r--')  # chance-level diagonal
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

def plot_roc_curves(y_test, prob, model, color):
    # Plot one model's ROC curve, labelled with its AUC
    fpr, tpr, threshold = metrics.roc_curve(y_test, prob)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, label=model + ' AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')

# Pass the colour explicitly instead of relying on a global loop index
for model, prob, color in zip(models, probs, colors):
    plot_roc_curves(y_test, prob, model, color)

plt.show()

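[Figure: ROC curves for KNN, Logistic Regression, Random Forest, Naive Bayes, Decision Tree and Gradient Boosting Trees]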
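
The AUC shown in each legend entry is the area under that model's ROC curve, which `metrics.auc` computes with the trapezoidal rule. As a rough illustration of what happens under the hood, here is a dependency-free sketch; `roc_points` and `trapezoid_auc` are hypothetical helper names, and the labels and scores are made-up toy data, not values from this notebook.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # Sort by descending score; each distinct score is a candidate threshold
    order = sorted(zip(scores, y_true), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for idx, (score, label) in enumerate(order):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied score
        if idx == len(order) - 1 or order[idx + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
               for k in range(1, len(fpr)))

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(trapezoid_auc(fpr, tpr))  # 0.75
```

An AUC of 0.5 corresponds to the dashed diagonal in the plot (random guessing), and 1.0 to a classifier that ranks every positive above every negative.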

# On Matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(10,10))

models = [
    'Logistic Regression',
    'Decision Tree',
]
probs = [
    probs_log,
    probs_dt,
]
colors = [
    'blue',
    'green',
]

plt.title('Receiver Operating Characteristic')
plt.plot([0, 1], [0, 1], 'r--')  # chance-level diagonal
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

def plot_roc_curves(y_test, prob, model, color):
    # Plot one model's ROC curve, labelled with its AUC
    fpr, tpr, threshold = metrics.roc_curve(y_test, prob)
    roc_auc = metrics.auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, label=model + ' AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')

# Pass the colour explicitly instead of relying on a global loop index
for model, prob, color in zip(models, probs, colors):
    plot_roc_curves(y_test, prob, model, color)

plt.show()

[Figure: ROC curves for Logistic Regression and Decision Tree]
