Predicting Survival of Titanic Passengers in Python

In this project, I am going to build Machine Learning models that predict the survival of the Titanic passengers, using the Python programming language and the Titanic data set. I will include both my code and the output of each code chunk.

I am going to perform a classification, the aim of which is to build a model able to predict whether a passenger survived (1) or not (0). I will base my analysis on the “titanic.csv” dataset. The ‘survived’ column is used as the true outcome (i.e. the label), while the remaining variables can be used as inputs to the model.

IMPORT LIBRARIES


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

from sklearn.preprocessing import LabelEncoder
from sklearn.neural_network import MLPClassifier   # Multilayer Perceptron
from sklearn.neighbors import KNeighborsClassifier # K Nearest Neighbours
from sklearn.svm import SVC                        # Support Vector Machines
from sklearn.gaussian_process import GaussianProcessClassifier # Gaussian Process
from sklearn.naive_bayes import GaussianNB         # Naive Bayes
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis 

from warnings import filterwarnings
filterwarnings('ignore')

from sklearn.model_selection import train_test_split
from sklearn import preprocessing

LOAD AND EXAMINE DATA

df = pd.read_csv('titanic.csv')

Explore the data

Check top 10 rows of the dataframe
df.head(10) 

'''
   pclass  survived  ...    cabin embarked
0       1         1  ...       B5        S
1       1         1  ...  C22 C26        S
2       1         0  ...  C22 C26        S
3       1         0  ...  C22 C26        S
4       1         0  ...  C22 C26        S
5       1         1  ...      E12        S
6       1         1  ...       D7        S
7       1         0  ...      A36        S
8       1         1  ...     C101        S
9       1         0  ...      NaN        C

[10 rows x 11 columns]
'''

Display names of the columns
df.columns

'''
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked'], 
      dtype='object')

'''

Display data types of each column
df.dtypes

'''
pclass        int64
survived      int64
name         object
sex          object
age         float64
sibsp         int64
parch         int64
ticket       object
fare        float64
cabin        object
embarked     object
dtype: object
'''

Summary statistics

Maximum values for all the columns

maxi = df.max()
print('The maximum values for each column are:\n' + str(maxi))

'''
The maximum values for each column are:
pclass                                3
survived                              1
name        van Melkebeke, Mr. Philemon
sex                                male
age                                  80
sibsp                                 8
parch                                 9
ticket                        WE/P 5735
fare                            512.329
dtype: object
'''

Minimum values for all the columns

mini = df.min()
print('The minimum values for each column are:\n' + str(mini))

'''
The minimum values for each column are:
pclass                        1
survived                      0
name        Abbing, Mr. Anthony
sex                      female
age                      0.1667
sibsp                         0
parch                         0
ticket                   110152
fare                          0
dtype: object
'''

Search for NaN values in each column

df.isnull().sum()

'''
pclass         0
survived       0
name           0
sex            0
age          263
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64
'''

The most missing values are located in columns ‘age’ (263) and ‘cabin’ (1014). I will take care of them later, during pre-processing.


Drop two columns that will not be useful for the model - ticket & embarked

df.drop(['ticket', 'embarked'], axis = 1, inplace = True)

Examine survival based on gender (each gender's share of all survivors)

survivedBySex = df.groupby(['sex'])['survived'].sum()  # number of survivors per gender
percGenderSurvived = survivedBySex / survivedBySex.sum()  # share of all survivors
print("Percentage of passengers survived based on their gender is as follows:\n" + str(percGenderSurvived))

'''
Percentage of passengers survived based on their gender is as follows:
sex
female    0.678
male      0.322
Name: survived, dtype: float64
'''
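The figures above are each gender's share of all survivors, not the survival rate within each gender. A minimal sketch of the difference, assuming a small made-up data frame (the `toy` values are hypothetical and are not taken from titanic.csv):

```python
import pandas as pd

# Hypothetical toy data standing in for the 'sex' and 'survived' columns.
toy = pd.DataFrame({
    'sex':      ['female'] * 3 + ['male'] * 5,
    'survived': [1, 1, 0, 1, 1, 0, 0, 0],
})

# Share of all survivors that each gender accounts for.
survivors_by_sex = toy.groupby('sex')['survived'].sum()
share_of_survivors = survivors_by_sex / survivors_by_sex.sum()

# Survival rate within each gender: the mean of a 0/1 column is the fraction of 1s.
survival_rate = toy.groupby('sex')['survived'].mean()

print(share_of_survivors)
print(survival_rate)
```

On the real data, the second quantity could be obtained with df.groupby('sex')['survived'].mean().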

DATA PRE-PROCESSING

Encode ‘sex’ column to be female - 0, and male - 1

Display top 5 rows of the column
df.sex.head(5)

'''
0    female
1      male
2    female
3      male
4    female
Name: sex, dtype: object
'''

Transformation to bool type
df.sex = df.sex == "male"
Display top 5 rows of the column for confirmation
df.sex.head(5)

'''
0    False
1     True
2    False
3     True
4    False
Name: sex, dtype: bool
'''
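The column is now boolean (False for female, True for male). If explicit 0/1 integers are preferred, one possible sketch is the following, assuming a short made-up sample (the `sex` series here is hypothetical):

```python
import pandas as pd

# Hypothetical sample standing in for the 'sex' column.
sex = pd.Series(['female', 'male', 'female'], name='sex')

is_male = sex == 'male'          # boolean column, as in the step above
sex_code = is_male.astype(int)   # female -> 0, male -> 1

print(sex_code.tolist())
```

scikit-learn treats booleans as 0 and 1 anyway, so this conversion is optional.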

Extract only titles from names

Display top 5 rows of the column ‘name’
df.name.head(5)
Extract the titles and display unique values
titles = df.name.str.extract(r'([A-Za-z]+)\.', expand = False) # raw string so that '\.' means a literal dot
titles = np.unique(titles)
print(titles)

'''
array(['Capt', 'Col', 'Countess', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady',
       'Major', 'Master', 'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev',
       'Sir'], dtype=object)
'''

Check for titles type

type(titles)

'''
numpy.ndarray
'''

Overwrite ‘name’ column values

df.name = df.name.str.extract(r'([A-Za-z]+)\.', expand = False) # expand = False returns a Series

Display 5 rows of the column ‘name’ for confirmation

df.name.head(5)

'''
0      Miss
1    Master
2      Miss
3        Mr
4       Mrs
Name: name, dtype: object
'''
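To illustrate what the regular expression does, here is a small sketch on three sample names written in the data set's "Surname, Title. Given names" style (the short `names` series is an assumption made for illustration). The pattern captures the first run of letters that is immediately followed by a dot:

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format.
names = pd.Series([
    'Allen, Miss. Elisabeth Walton',
    'Allison, Master. Hudson Trevor',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# ([A-Za-z]+) captures letters only; \. requires a literal dot right after them.
titles = names.str.extract(r'([A-Za-z]+)\.', expand=False)

print(titles.tolist())
```

The surname is skipped because it is followed by a comma rather than a dot.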

Display survival of differently titled individuals

fig = plt.figure(figsize=(12,8))
counter = 1
col = ['blue','orange']
for title in df['name'].unique():
    fig.add_subplot(3, 6, counter)
    plt.title('Title : {}'.format(title))
    s = df.survived[df['name'] == title].value_counts() # series object
    if len(s) > 1: 
        s.sort_index().plot(kind = 'pie', colors = col)
    else: 
        i = s.index[0]
        s.sort_index().plot(kind = 'pie', colors = [col[i]])
    counter += 1
In-depth look at survival by title (percentage within each title)
survivalTitles = df.groupby(['name', 'survived']).agg({'survived': 'count'})
survivalInTitle = df.groupby(['name']).agg({'survived': 'count'})
finalTitles = survivalTitles.div(survivalInTitle, level='name') * 100
print(finalTitles)

'''

name     survived            
Capt     0         100.000000
Col      0          50.000000
         1          50.000000
Countess 1         100.000000
Don      0         100.000000
Dona     1         100.000000
Dr       0          50.000000
         1          50.000000
Jonkheer 0         100.000000
Lady     1         100.000000
Major    0          50.000000
         1          50.000000
Master   0          49.180328
         1          50.819672
Miss     0          32.307692
         1          67.692308
Mlle     1         100.000000
Mme      1         100.000000
Mr       0          83.751651
         1          16.248349
Mrs      0          21.319797
         1          78.680203
Ms       0          50.000000
         1          50.000000
Rev      0         100.000000
Sir      1         100.000000

'''

Impute missing entries in the ‘age’ column with the median age for each title.
val = df.groupby('name')['age'].transform('median') # Median age of each passenger's title group
df['age'] = df['age'].fillna(val) # Fill in missing values
Check if the age missing values were fixed
df.isnull().sum()

'''
pclass         0
survived       0
name           0
sex            0
age            0
sibsp          0
parch          0
ticket         0
fare           1
cabin       1014
embarked       2
dtype: int64
'''
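The group-wise median fill used for ‘age’ can be sketched on a tiny made-up frame (the `toy` values are hypothetical). transform('median') returns one value per row, aligned with the original index, so it can be passed straight to fillna:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two titles, one missing age in each.
toy = pd.DataFrame({
    'name': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss', 'Miss'],
    'age':  [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# Median age of each row's title group (NaN values are ignored by median).
group_median = toy.groupby('name')['age'].transform('median')

# Only the missing entries are replaced; known ages stay untouched.
toy['age'] = toy['age'].fillna(group_median)

print(toy)
```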

Change titles to numbers

Check the median of age by title
df.groupby('name').age.median()

'''
name
Capt        70.0
Col         54.5
Countess    33.0
Don         40.0
Dona        39.0
Dr          49.0
Jonkheer    38.0
Lady        48.0
Major       48.5
Master       4.0
Miss        22.0
Mlle        24.0
Mme         24.0
Mr          29.0
Mrs         35.5
Ms          28.0
Rev         41.5
Sir         49.0
Name: age, dtype: float64
'''

Count how many passengers hold each title
df.groupby(['name'])['name'].count()

'''
name
Capt          1 -----------> 0 Captain is much older than the rest of the crew
Col           4 -----------> 7 Army title
Countess      1 -----------> 3 Royal female
Don           1 -----------> 6 Mature male
Dona          1 -----------> 3 Royal female
Dr            8 -----------> 6 Mature male
Jonkheer      1 -----------> 1 Member of the crew
Lady          1 -----------> 3 Royal female
Major         2 -----------> 7 Army title
Master       61 -----------> 5 Master
Miss        260 -----------> 2 Young female probably unmarried
Mlle          2 -----------> 2 Young female probably unmarried
Mme           1 -----------> 2 Young female probably unmarried
Mr          757 -----------> 4 Mister
Mrs         197 -----------> 8 Married female
Ms            2 -----------> 2 Young female probably unmarried
Rev           8 -----------> 1 Member of the crew
Sir           1 -----------> 6 Mature male
Name: name, dtype: int64

To sum up:
    0 - Captain
    1 - Other crew members
    2 - Miss + unmarried young females
    3 - Royal female
    4 - Mr
    5 - Master
    6 - 7 out of 8 Dr ( 1 female) + mature males Sir & Don
    7 - Army title
    8 - Mrs + female Dr
'''

Change titles to numerical values

df['name'] = df['name'].replace(['Capt'],0)
df['name'] = df['name'].replace(['Rev','Jonkheer'],1)
df['name'] = df['name'].replace(['Miss','Mlle','Mme','Ms' ],2)
df['name'] = df['name'].replace(['Lady','Dona', 'Countess'],3)
df['name'] = df['name'].replace(['Mr'],4)
df['name'] = df['name'].replace(['Master'],5)
df['name'] = df['name'].replace(['Sir', 'Don', 'Dr'],6)
df['name'] = df['name'].replace(['Major','Col'],7)
df['name'] = df['name'].replace(['Mrs'],8)
Checking for female Doctor
df.loc[df['name'] == 6, 'sex']

'''
40     True
93     True
100    True
119    True
181    False
206    True
278    True
299    True
508    True
525    True
Name: sex, dtype: bool
'''

df.loc[181, 'name'] = 8 # move the female Dr to group 8; .loc avoids chained assignment

Check unique values for titles

df.name.unique()

'''
array([2, 5, 4, 8, 7, 6, 0, 3, 1])
'''
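The same grouping can also be written as a single dictionary lookup. This is only an alternative sketch, not the code used above; the `title_to_code` dictionary restates the grouping from the summary, and the short `titles` series is a hypothetical sample:

```python
import pandas as pd

# Title -> group code, restating the grouping summarised above.
title_to_code = {
    'Capt': 0,
    'Rev': 1, 'Jonkheer': 1,
    'Miss': 2, 'Mlle': 2, 'Mme': 2, 'Ms': 2,
    'Lady': 3, 'Dona': 3, 'Countess': 3,
    'Mr': 4,
    'Master': 5,
    'Sir': 6, 'Don': 6, 'Dr': 6,
    'Major': 7, 'Col': 7,
    'Mrs': 8,
}

# Hypothetical sample of extracted titles.
titles = pd.Series(['Mr', 'Mrs', 'Dr', 'Mlle'])

# map() looks every title up in the dictionary; an unknown title would become NaN.
codes = titles.map(title_to_code)

print(codes.tolist())
```

A title that is missing from the dictionary shows up as NaN, which makes an incomplete mapping easy to spot.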

Impute the missing ticket fare with the median fare of the passenger's class

val = df.groupby('pclass')['fare'].transform('median') # Median fare for each class
df['fare'] = df['fare'].fillna(val) # Fill in missing values

Check if the ticket fare missing values were fixed

df.isnull().sum()

'''
pclass         0
survived       0
name           0
sex            0
age            0
sibsp          0
parch          0
fare           0
cabin       1014
dtype: int64
'''

Replace ‘cabin’ identification by only one letter
df.cabin.unique()

cabinClass = df.cabin.str.extract(pat = '([A-Z])')
print(cabinClass)
Overwrite ‘cabin’ column values
df.cabin = df.cabin.str.extract(pat = '([A-Z])', expand = False)
Check the unique values for now
df.cabin.unique()

'''
array(['B', 'C', 'E', 'D', 'A', nan, 'T', 'F', 'G'], dtype=object)
'''

Fill the missing values with ‘Z’
df.cabin = df['cabin'].fillna(value = 'Z')
Check the unique values after filling missing values
df.cabin.unique()

'''
array(['B', 'C', 'E', 'D', 'A', 'Z', 'T', 'F', 'G'], dtype=object)
'''

Change to numeric values
df['cabin'] = LabelEncoder().fit_transform(df['cabin'].astype(str))
Check the unique values
df['cabin'].unique()

'''
array([1, 2, 4, 3, 0, 8, 7, 5, 6])
'''
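LabelEncoder assigns codes in sorted order of the labels, which is why ‘A’ becomes 0 and the placeholder ‘Z’ becomes 8 in the output above. A short sketch that recovers the letter-to-code mapping (the `decks` list simply restates the unique values shown earlier):

```python
from sklearn.preprocessing import LabelEncoder

# The unique cabin letters after filling missing values with 'Z'.
decks = ['B', 'C', 'E', 'D', 'A', 'Z', 'T', 'F', 'G']

encoder = LabelEncoder().fit(decks)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

print(mapping)
```

Note that the codes impose an order on the letters (A < B < ... < Z), which a model may interpret as meaningful.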

Examine the unique values in the ‘sibsp’ column to confirm that they are numeric
df.sibsp.unique()

'''
array([0, 1, 2, 3, 4, 5, 8])

'''


# Only numeric values, so no encoding is needed

Make sure there are not any more missing values in the dataframe
df.isnull().sum()

'''
pclass      0
survived    0
name        0
sex         0
age         0
sibsp       0
parch       0
fare        0
cabin       0
dtype: int64
'''

# No more missing values in the data set

MACHINE LEARNING ALGORITHM TRAINING AND TESTING

Seed the random number generators for reproducible results
random.seed(1234)
np.random.seed(1234) # scikit-learn draws from NumPy's global generator when random_state is not set
Split the data into features and label (true outcome, i.e. survived)
label = df['survived'] # initialise label
feature = df.drop(['survived'], axis=1)  # initialise features
Sanity check
label

'''
0       1
1       1
2       0
3       0
4       0
       ..
1304    0
1305    0
1306    0
1307    0
1308    0
Name: survived, Length: 1309, dtype: int64
'''

Feature column
feature.columns

'''
Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'fare', 'cabin'], dtype='object')
'''

Split the data & make sure it is randomised (shuffle=True)
X_train, X_test, y_train, y_test = train_test_split(feature, label, test_size = 0.25, shuffle = True, random_state = 1234) # random_state makes the split repeatable
Scale the data
X_train_scaled = preprocessing.scale(X_train, with_mean = True, with_std = True)
scaler = preprocessing.StandardScaler().fit(X_train) # scaler fitted on the training data, used to scale the test data as well
Sanity check
X_train_scaled[0]

'''
array([ 0.8273289 , -0.12183624,  0.73980985, -0.02862179, -0.47325356,
       -0.45944255, -0.49725754,  0.50891413])
       
'''

Standardise X_test
X_test_scaled = scaler.transform(X_test)

X_test_scaled[0]

'''
array([ 0.8273289 , -1.21836239, -1.35169869, -0.55685232,  0.4247382 ,
       -0.45944255, -0.34078308,  0.50891413])
'''
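The fit-on-train, transform-on-test pattern above can also be bundled with a model in a scikit-learn pipeline. The following is a sketch only, run on synthetic data (the arrays `X` and `y` are made up and stand in for the Titanic features and label), with GaussianNB chosen arbitrarily from the models imported earlier:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and label (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
y = (X[:, 0] > 5.0).astype(int)
X_train, X_test = X[:60], X[60:]
y_train, y_test = y[:60], y[60:]

# fit() learns the scaling from the training data only;
# score() reuses those training statistics to scale the test data.
pipe = make_pipeline(StandardScaler(), GaussianNB())
pipe.fit(X_train, y_train)
score = pipe.score(X_test, y_test)

fitted_scaler = pipe.named_steps['standardscaler']
print('Test Set Score:', '%.4f' % score)
```

Keeping the scaler inside the pipeline means the test data can never influence the scaling by accident.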

Specify models as elements of a list
models = [MLPClassifier(), 
          KNeighborsClassifier(n_neighbors = 5), 
          SVC(kernel = 'poly', gamma = 'auto', degree = 5),
          GaussianProcessClassifier(),
          GaussianNB(),
          QuadraticDiscriminantAnalysis()]
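Loop over models, train and test
np.random.seed(1234) # scikit-learn draws from NumPy's global generator when random_state is not set
names = ['MLP', 'KNN', 'SVC', 'GPC', 'GNB', 'QDA'] # short labels, in the same order as the models list
for name, model in zip(names, models):
    model.fit(X_train_scaled, y_train)
    score = model.score(X_test_scaled, y_test)
    print('%-8s' % (name + ':'), 'Test Set Score:', '%.4f' % score)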

'''
MLP:     Test Set Score: 0.8323
KNN:     Test Set Score: 0.7805
SVC:     Test Set Score: 0.7744
GPC:     Test Set Score: 0.8262
GNB:     Test Set Score: 0.7744
QDA:     Test Set Score: 0.7988
'''