Machine Learning Project: Titanic (Kaggle)
Exploratory data analysis
In this section, we’ll do the following:
- Data extraction : we’ll load the dataset and have a first look at it.
- Cleaning : we’ll fill in missing values.
- Plotting : we’ll create some interesting charts that will (hopefully) reveal correlations and hidden insights in the data.
- Assumptions : we’ll formulate hypotheses from the charts.
from IPython.core.display import HTML
HTML("""
<style>
/* center image outputs in the notebook */
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""");
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=DeprecationWarning)
import pandas as pd
pd.options.display.max_columns = 100
from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns
import pylab as plot
params = {
    'axes.labelsize': "large",
    'xtick.labelsize': 'x-large',
    'legend.fontsize': 20,
    'figure.dpi': 150,
    'figure.figsize': [25, 7]
}
plot.rcParams.update(params)
We will use the training data set to build our predictive model and the testing data set to score it and generate an output file to submit on the Kaggle evaluation system.
data = pd.read_csv('./data/train.csv')
print (data.shape)
(891, 12)
We have:
- 891 rows
- 12 columns
Peek at our data.
data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Pandas gives you a simple, high-level statistical description of the numerical features. This can be done with the describe method.
data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
We check if we have any missing data.
data.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
The Age column has 177 missing values.
A simple fix is to fill these missing values with the median age. We prefer the median over the mean because it is less affected by outliers.
data['Age'] = data['Age'].fillna(data['Age'].median())
Let’s check the result.
data.isna().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
Let us see how the median has affected our summary.
data.describe()
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.361582 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 13.019697 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 35.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Perfect.
Let’s now make some charts.
Let’s visualize survival based on the gender.
data['Died'] = 1 - data['Survived']
data.groupby('Sex').agg('sum')[['Survived', 'Died']].plot(kind='bar', figsize=(25, 7), stacked=True, color=['b', 'tab:pink']);
![Survival counts by sex](./images/article_1/1.png)
Looks like male passengers are more likely to die.
Ratios would provide a clearer picture.
data.groupby('Sex').agg('mean')[['Survived', 'Died']].plot(kind='bar', figsize=(25, 7), stacked=True, color=['b', 'tab:pink']);
“Sex” seems to be a discriminative feature. Women are more likely to survive.
Let’s now correlate the survival with the age variable.
fig = plt.figure(figsize=(25, 7))
sns.violinplot(x='Sex', y='Age',
hue='Survived', data=data,
split=True,
palette={0: "tab:pink", 1: "b"}
);
From the chart above we can draw the following conclusions:
- Women survive more often than men, as shown by the larger blue area of the violin plot.
- Looking at how age affects male passengers:
  - Young boys tend to survive.
  - A large number of passengers between 20 and 40 die.
- Age doesn't seem to have a direct impact on female survival.
- These violin plots confirm that sailors and captains followed the protocol of "Women and children first!" in emergencies.
We will now look at ticket fare and see how it could impact survival.
figure = plt.figure(figsize=(25, 7))
plt.hist([data[data['Survived'] == 1]['Fare'], data[data['Survived'] == 0]['Fare']],
stacked=True, color = ['tab:pink','b'],
bins = 50, label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend();
Passengers with cheaper ticket fares are more likely to die. Put differently, passengers with more expensive tickets, and therefore higher social status, seem to have higher priority.
In fact, the ticket fare correlates with the class, as we can see in the chart below.
ax = plt.subplot()
ax.set_ylabel('Average fare')
data.groupby('Pclass').mean()['Fare'].plot(kind='bar', figsize=(25, 7), ax = ax);
Let’s now see how the embarkation site affects the survival.
fig = plt.figure(figsize=(25, 7))
sns.violinplot(x='Embarked', y='Fare', hue='Survived', data=data, split=True, palette={0: "tab:pink", 1: "b"});
It seems that site C has a wider range of ticket fares and might be a metropolitan site.
Embarkation site S has a slightly smaller range, followed by site Q, which looks like a blue-collar site since all its tickets are under 100.
Feature engineering
In the previous part, we looked at the data and spotted some interesting correlations. We’ll see how to process and transform these variables in such a way the data becomes manageable by a machine learning algorithm. We’ll also create, or “engineer” additional features that will be useful in building the model.
Let’s define a print function that determines whether or not a feature has been processed.
def status(feature):
    print('Processing', feature, ': ok')
Loading the data
One common trick when starting a machine learning problem is to concatenate the training set and the test set.
We'll engineer new features using statistics computed on the training set only, to prevent information leakage, and then apply the same transformations to the test set.
Let’s load the train and test sets and append them together.
def get_combined_data():
    # read training data
    train = pd.read_csv('./data/train.csv')
    # read test data
    test = pd.read_csv('./data/test.csv')
    # extracting and then removing the targets from the training data
    targets = train.Survived
    train.drop('Survived', axis=1, inplace=True)
    # merging train data and test data for future feature engineering
    # we'll also remove the PassengerId since this is not an informative feature
    combined = pd.concat([train, test])
    combined.reset_index(inplace=True)
    combined.drop(['index', 'PassengerId'], inplace=True, axis=1)
    return combined
combined = get_combined_data()
Let’s have a look at the shape :
print (combined.shape)
(1309, 10)
The training and testing sets are now combined.
You will notice that the total number of rows (1309) is the exact sum of the number of rows in the train set (891) and the test set (418).
combined.head()
| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Extracting the passenger titles
Each name contains a title. We can extract additional information about social/marital status by parsing the name, pulling out the title, and later encoding it as dummy (binary) variables.
Let's see what the different titles are in the train set.
titles = set()
for name in data['Name']:
    titles.add(name.split(',')[1].split('.')[0].strip())
print(titles)
{'the Countess', 'Lady', 'Don', 'Capt', 'Mr', 'Dr', 'Master', 'Ms', 'Mlle', 'Col', 'Miss', 'Mme', 'Sir', 'Jonkheer', 'Mrs', 'Major', 'Rev'}
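To make the parsing expression concrete, here is what it returns for one of the names shown earlier (a quick illustration, outside the pipeline):
name = 'Braund, Mr. Owen Harris'
print(name.split(',')[1].split('.')[0].strip())
# prints: Mr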
Title_Dictionary = {
"Capt": "Officer",
"Col": "Officer",
"Major": "Officer",
"Jonkheer": "Royalty",
"Don": "Royalty",
"Sir" : "Royalty",
"Dr": "Officer",
"Rev": "Officer",
"the Countess":"Royalty",
"Mme": "Mrs",
"Mlle": "Miss",
"Ms": "Mrs",
"Mr" : "Mr",
"Mrs" : "Mrs",
"Miss" : "Miss",
"Master" : "Master",
"Lady" : "Royalty"
}
def get_titles():
    # we extract the title from each name
    combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
    # a map of more aggregated titles
    # we map each title
    combined['Title'] = combined.Title.map(Title_Dictionary)
    status('Title')
    return combined
This function parses each name, extracts the title, and maps it to one of the aggregated categories we selected:
- Officer
- Royalty
- Mr
- Mrs
- Miss
- Master
combined = get_titles()
Processing Title : ok
combined.head()
| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Mr |
1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs |
2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Miss |
3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | Mrs |
4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Mr |
We need to check whether the titles have been filled in correctly.
combined[combined['Title'].isnull()]
| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
---|---|---|---|---|---|---|---|---|---|---|---|
1305 | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9 | C105 | C | NaN |
We have a null value at row 1305. The corresponding name is Oliva y Ocana, Dona. Fermina: the title 'Dona' does not appear in our Title_Dictionary, so the mapping returned NaN.
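As a quick optional check, we can list the titles appearing in the combined names that are not covered by Title_Dictionary; the expected result is {'Dona'}:
all_titles = set(combined['Name'].map(lambda n: n.split(',')[1].split('.')[0].strip()))
print(all_titles - set(Title_Dictionary))
# expected: {'Dona'}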
Processing the ages
We saw in the first part that the Age variable was missing 177 values in the training set. This is a large number (about 20% of the training rows). Simply replacing them with the overall mean or median age might not be the best solution, since the typical age differs across categories of passengers.
To understand why, let's group the dataset by Sex, Pclass and Title, and compute the median age for each subset.
To avoid data leakage from the test set, we compute these medians on the training data only.
Number of missing ages in training data set
print (combined.iloc[:891].Age.isnull().sum())
177
Number of missing ages in testing set
print (combined.iloc[891:].Age.isnull().sum())
86
grouped_train = combined.iloc[:891].groupby(['Sex','Pclass','Title'])
grouped_median_train = grouped_train.median()
grouped_median_train = grouped_median_train.reset_index()[['Sex', 'Pclass', 'Title', 'Age']]
grouped_median_train.head()
| | Sex | Pclass | Title | Age |
---|---|---|---|---|
0 | female | 1 | Miss | 30.0 |
1 | female | 1 | Mrs | 40.0 |
2 | female | 1 | Officer | 49.0 |
3 | female | 1 | Royalty | 40.5 |
4 | female | 2 | Miss | 24.0 |
This dataframe will help us impute missing ages based on these criteria.
Let's create a function that fills in the missing ages in combined based on them.
def fill_age(row):
    condition = (
        (grouped_median_train['Sex'] == row['Sex']) &
        (grouped_median_train['Title'] == row['Title']) &
        (grouped_median_train['Pclass'] == row['Pclass'])
    )
    return grouped_median_train[condition]['Age'].values[0]
def process_age():
    global combined
    # fill in the missing values of the Age variable
    combined['Age'] = combined.apply(lambda row: fill_age(row) if np.isnan(row['Age']) else row['Age'], axis=1)
    status('age')
    return combined
combined = process_age()
Processing age : ok
The missing ages have been replaced.
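A quick optional check confirms that no missing ages remain in the combined dataset:
print(combined['Age'].isnull().sum())
# expected: 0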
Processing Names
def process_names():
    global combined
    # we clean the Name variable
    combined.drop('Name', axis=1, inplace=True)
    # encoding in dummy variables
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    # removing the Title variable
    combined.drop('Title', axis=1, inplace=True)
    status('names')
    return combined
This function drops the Name column, since we won't use it anymore now that we have a Title column.
It then encodes the title values using dummy encoding.
You can learn more about dummy encoding and how to do it easily in pandas in the pd.get_dummies documentation.
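As a minimal illustration of dummy encoding (toy data, outside the pipeline), pd.get_dummies turns a categorical column into one indicator column per category:
toy = pd.Series(['Mr', 'Miss', 'Mr'], name='Title')
print(pd.get_dummies(toy, prefix='Title'))
# two indicator columns, Title_Miss and Title_Mr, with a 1 (True in recent pandas) marking each row's category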
combined = process_names()
Processing names : ok
combined.head()
| | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | 0 | 1 | 0 | 0 | 0 |
1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | 0 | 1 | 0 | 0 | 0 |
As you can see:
- there is no longer a Name feature.
- new variables (Title_X) appeared. These features are binary.
- For example, if Title_Mr = 1, the corresponding Title is Mr.
Processing Fare
We impute the missing fare value with the mean fare computed on the training set.
def process_fares():
    global combined
    # there's one missing fare value - replacing it with the mean computed on the train set
    combined.Fare.fillna(combined.iloc[:891].Fare.mean(), inplace=True)
    status('fare')
    return combined
This function simply replaces one missing Fare value by the mean.
combined = process_fares()
Processing fare : ok
Processing Embarked
def process_embarked():
    global combined
    # two missing embarked values - filling them with the most frequent one in the train set (S)
    combined.Embarked.fillna('S', inplace=True)
    # dummy encoding
    embarked_dummies = pd.get_dummies(combined['Embarked'], prefix='Embarked')
    combined = pd.concat([combined, embarked_dummies], axis=1)
    combined.drop('Embarked', axis=1, inplace=True)
    status('embarked')
    return combined
This function replaces the two missing values of Embarked with the most frequent embarkation port.
combined = process_embarked()
Processing embarked : ok
combined.head()
| | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
2 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
4 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
Processing Cabin
train_cabin, test_cabin = set(), set()
for c in combined.iloc[:891]['Cabin']:
    try:
        train_cabin.add(c[0])
    except:
        train_cabin.add('U')
for c in combined.iloc[891:]['Cabin']:
    try:
        test_cabin.add(c[0])
    except:
        test_cabin.add('U')
print (train_cabin)
{'D', 'E', 'A', 'G', 'C', 'T', 'U', 'F', 'B'}
print (test_cabin)
{'D', 'E', 'A', 'G', 'C', 'U', 'F', 'B'}
We don’t have any cabin letters in the test data set that are not present in the training data set.
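We can verify that claim with a one-line set difference on the two sets we just built:
print(test_cabin - train_cabin)
# expected: set() - no cabin letter appears only in the test set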
def process_cabin():
    global combined
    # replacing missing cabins with U (for Unknown)
    combined.Cabin.fillna('U', inplace=True)
    # mapping each Cabin value to its first letter
    combined['Cabin'] = combined['Cabin'].map(lambda c: c[0])
    # dummy encoding
    cabin_dummies = pd.get_dummies(combined['Cabin'], prefix='Cabin')
    combined = pd.concat([combined, cabin_dummies], axis=1)
    combined.drop('Cabin', axis=1, inplace=True)
    status('cabin')
    return combined
This function replaces null cabin values with U (for Unknown), maps each Cabin value to its first letter, and then encodes the resulting values with dummy encoding.
combined = process_cabin()
Processing cabin : ok
Ok no missing values now.
combined.head()
| | Pclass | Sex | Age | SibSp | Parch | Ticket | Fare | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Embarked_C | Embarked_Q | Embarked_S | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | 1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 3 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 3 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Processing Sex
def process_sex():
    global combined
    # mapping string values to numerical ones
    combined['Sex'] = combined['Sex'].map({'male': 1, 'female': 0})
    status('Sex')
    return combined
This function maps the string values male and female to 1 and 0 respectively.
combined = process_sex()
Processing Sex : ok
Processing Pclass
def process_pclass():
    global combined
    # encoding into 3 categories:
    pclass_dummies = pd.get_dummies(combined['Pclass'], prefix="Pclass")
    # adding dummy variables
    combined = pd.concat([combined, pclass_dummies], axis=1)
    # removing "Pclass"
    combined.drop('Pclass', axis=1, inplace=True)
    status('Pclass')
    return combined
This function encodes the values of Pclass (1,2,3) using a dummy encoding.
combined = process_pclass()
Processing Pclass : ok
Processing Ticket
Let's first see how many different ticket prefixes we have in our dataset.
def cleanTicket(ticket):
    ticket = ticket.replace('.', '')
    ticket = ticket.replace('/', '')
    ticket = ticket.split()
    ticket = map(lambda t: t.strip(), ticket)
    ticket = list(filter(lambda t: not t.isdigit(), ticket))
    if len(ticket) > 0:
        return ticket[0]
    else:
        return 'XXX'

tickets = set()
for t in combined['Ticket']:
    tickets.add(cleanTicket(t))
print(len(tickets))
37
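A couple of illustrative calls on tickets we saw earlier show what cleanTicket returns:
print(cleanTicket('A/5 21171'))  # 'A5'  - separators removed, prefix kept
print(cleanTicket('113803'))     # 'XXX' - purely numeric ticket, no prefix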
def process_ticket():
    global combined
    # extract the ticket prefix; tickets that contain only digits get 'XXX'
    # (the prefixes could equally be obtained with the cleanTicket helper above)
    combined['Ticket'] = combined['Ticket'].str.replace(r'\W', '', regex=True).str.extract(r'(\D+)', expand=False).fillna('XXX')
    # dummy encoding of the prefixes
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', inplace=True, axis=1)
    status('Ticket')
    return combined
combined = process_ticket()
Processing Ticket : ok
Processing Family
This part includes creating new variables based on the size of the family (family size is, by the way, another variable we create).
The creation of these new variables rests on a realistic assumption: large families are grouped together, so they are more likely to get priority for the lifeboats than people travelling alone.
def process_family():
    global combined
    # introducing a new feature: the size of families (including the passenger)
    combined['FamilySize'] = combined['Parch'] + combined['SibSp'] + 1
    # introducing other features based on the family size
    combined['Singleton'] = combined['FamilySize'].map(lambda s: 1 if s == 1 else 0)
    combined['SmallFamily'] = combined['FamilySize'].map(lambda s: 1 if 2 <= s <= 4 else 0)
    combined['LargeFamily'] = combined['FamilySize'].map(lambda s: 1 if 5 <= s else 0)
    status('family')
    return combined
This function introduces 4 new features:
- FamilySize : the total number of relatives including the passenger (him/her)self.
- Singleton : a boolean variable that describes families of size = 1
- SmallFamily : a boolean variable that describes families of 2 <= size <= 4
- LargeFamily : a boolean variable that describes families of size >= 5
combined = process_family()
Processing family : ok
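As an optional sanity check, every passenger should fall into exactly one of the three family-size buckets, since FamilySize is always at least 1:
assert (combined[['Singleton', 'SmallFamily', 'LargeFamily']].sum(axis=1) == 1).all()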
print (combined.shape)
(1309, 63)
We end up with a total of 63 features/columns.
combined.head()
Sex | Age | SibSp | Parch | Fare | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Embarked_C | Embarked_Q | Embarked_S | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T | Cabin_U | Pclass_1 | Pclass_2 | Pclass_3 | Ticket_A | Ticket_AQ | Ticket_AS | Ticket_C | Ticket_CA | Ticket_CASOTON | Ticket_FC | Ticket_FCC | Ticket_Fa | Ticket_LINE | Ticket_LP | Ticket_PC | Ticket_PP | Ticket_PPP | Ticket_SC | Ticket_SCA | Ticket_SCAH | Ticket_SCAHBasle | Ticket_SCOW | Ticket_SCPARIS | Ticket_SCParis | Ticket_SOC | Ticket_SOP | Ticket_SOPP | Ticket_SOTONO | Ticket_SOTONOQ | Ticket_SP | Ticket_STONO | Ticket_STONOQ | Ticket_SWPP | Ticket_WC | Ticket_WEP | Ticket_XXX | FamilySize | Singleton | SmallFamily | LargeFamily | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 22.0 | 1 | 0 | 7.2500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
1 | 0 | 38.0 | 1 | 0 | 71.2833 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 1 | 0 |
2 | 0 | 26.0 | 0 | 0 | 7.9250 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
3 | 0 | 35.0 | 1 | 0 | 53.1000 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 | 0 |
4 | 1 | 35.0 | 0 | 0 | 8.0500 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
Modeling
In this part, we use the features we created to build a statistical model of survival. There is a wide variety of models to choose from, from logistic regression to decision trees and more sophisticated ones such as random forests and gradient boosted trees.
We'll be using random forests, which have proven very effective in Kaggle competitions.
For more details about why ensemble methods perform well, you can refer to these posts:
- http://mlwave.com/kaggle-ensembling-guide/
- http://www.overkillanalytics.net/more-is-always-better-the-power-of-simple-ensembles/
Back to our problem, we now have to:
- Break the combined dataset into a train set and a test set.
- Use the train set to build a predictive model.
- Evaluate the model using the train set.
- Test the model using the test set and generate an output file for the submission.
This is an iterative process.
Let’s start by importing the useful libraries.
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
To evaluate our model, we'll use 5-fold cross-validation with accuracy, since that's the metric the competition uses on the leaderboard.
To do that, we’ll define a small scoring function.
def compute_score(clf, X, y, scoring='accuracy'):
    xval = cross_val_score(clf, X, y, cv=5, scoring=scoring)
    return np.mean(xval)
Recovering the train set and the test set from the combined dataset is an easy task.
def recover_train_test_target():
    global combined
    targets = pd.read_csv('./data/train.csv', usecols=['Survived'])['Survived'].values
    train = combined.iloc[:891]
    test = combined.iloc[891:]
    return train, test, targets
train, test, targets = recover_train_test_target()
Feature selection
We've ended up with 63 features so far.
When feature engineering is done, we usually tend to decrease the dimensionality by selecting the “right” number of features that capture the essentials.
In fact, feature selection comes with many benefits:
- It decreases redundancy among the data
- It speeds up the processing
- It reduces overfitting
Tree-based estimators can be used to compute feature importances, which in turn can be used to discard irrelevant features.
clf = RandomForestClassifier(n_estimators=50, max_features='sqrt')
clf = clf.fit(train, targets)
Let’s have a look at the importance of each feature.
features = pd.DataFrame()
features['feature'] = train.columns
features['importance'] = clf.feature_importances_
features.sort_values(by=['importance'], ascending=True, inplace=True)
features.set_index('feature', inplace=True)
features.plot(kind='barh', figsize=(25, 25))
As you may notice, Title_Mr, Age, Fare and Sex carry the most importance. Let's now transform our training and testing sets into more compact datasets.
model = SelectFromModel(clf, prefit=True)
train_reduced = model.transform(train)
print (train_reduced.shape)
(891, 13)
test_reduced = model.transform(test)
print (test_reduced.shape)
(418, 13)
Let’s try different base models
logreg = LogisticRegression()
logreg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
gboost = GradientBoostingClassifier()
models = [logreg, logreg_cv, rf, gboost]
for model in models:
    print('Cross-validation of : {0}'.format(model.__class__))
    score = compute_score(clf=model, X=train_reduced, y=targets, scoring='accuracy')
    print('CV score = {0}'.format(score))
    print('****')
Cross-validation of : <class 'sklearn.linear_model.logistic.LogisticRegression'>
CV score = 0.8125832554019151
****
Cross-validation of : <class 'sklearn.linear_model.logistic.LogisticRegressionCV'>
CV score = 0.8204360116561997
****
Cross-validation of : <class 'sklearn.ensemble.forest.RandomForestClassifier'>
CV score = 0.8092691752958645
****
Cross-validation of : <class 'sklearn.ensemble.gradient_boosting.GradientBoostingClassifier'>
CV score = 0.8249176976842577
****
Hyperparameter tuning
As mentioned at the beginning of the Modeling part, we will be using a random forest model. It may not be the best model for this task, but we'll show how to tune it.
# turn run_gs to True if you want to run the gridsearch again.
run_gs = True

if run_gs:
    parameter_grid = {
        'max_depth': [4, 6, 8],
        'n_estimators': [50, 10],
        'max_features': ['sqrt', 'auto', 'log2'],
        'min_samples_split': [2, 3, 10],
        'min_samples_leaf': [1, 3, 10],
        'bootstrap': [True, False],
    }
    forest = RandomForestClassifier()
    cross_validation = StratifiedKFold(n_splits=5)
    grid_search = GridSearchCV(forest,
                               scoring='accuracy',
                               param_grid=parameter_grid,
                               cv=cross_validation,
                               verbose=1)
    grid_search.fit(train, targets)
    model = grid_search
    parameters = grid_search.best_params_
    print('Best score: {}'.format(grid_search.best_score_))
    print('Best parameters: {}'.format(grid_search.best_params_))
else:
    parameters = {'bootstrap': False, 'min_samples_leaf': 3, 'n_estimators': 50,
                  'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 6}
    model = RandomForestClassifier(**parameters)
    model.fit(train, targets)
Fitting 5 folds for each of 324 candidates, totalling 1620 fits
Best score: 0.8451178451178452
Best parameters: {'bootstrap': True, 'max_depth': 6, 'max_features': 'auto', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 10}
[Parallel(n_jobs=1)]: Done 1620 out of 1620 | elapsed: 1.5min finished
Now that the model is built by scanning several combinations of the hyperparameters, we can generate an output file to submit on Kaggle.
output = model.predict(test).astype(int)
df_output = pd.DataFrame()
aux = pd.read_csv('./data/test.csv')
df_output['PassengerId'] = aux['PassengerId']
df_output['Survived'] = output
df_output[['PassengerId','Survived']].to_csv('./predictions/gridsearch_rf.csv', index=False)
Blending different models
I haven't personally uploaded a submission based on model blending, but here's how you could do it.
trained_models = []
for model in models:
    model.fit(train, targets)
    trained_models.append(model)

predictions = []
for model in trained_models:
    predictions.append(model.predict_proba(test)[:, 1])

predictions_df = pd.DataFrame(predictions).T
predictions_df['out'] = predictions_df.mean(axis=1)
predictions_df['PassengerId'] = aux['PassengerId']
predictions_df['out'] = predictions_df['out'].map(lambda s: 1 if s >= 0.5 else 0)
predictions_df = predictions_df[['PassengerId', 'out']]
predictions_df.columns = ['PassengerId', 'Survived']
predictions_df.to_csv('./predictions/blending_base_models.csv', index=False)
To get a good blending submission, the base models should be different and their predictions should be as uncorrelated as possible.
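One simple way to gauge this, reusing the trained_models and predictions lists from the snippet above, is to look at the pairwise correlation of the base models' predicted probabilities; the lower the off-diagonal values, the more the models complement each other:
pred_corr = pd.DataFrame({m.__class__.__name__: p for m, p in zip(trained_models, predictions)}).corr()
print(pred_corr)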