Titanic: Machine Learning from Disaster¶

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

Getting Started¶

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style('whitegrid')

train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')

train.head(4)

From the data, we can see the features recorded for each passenger on the ship:

Survived: the target variable (0 = did not survive, 1 = survived)
PassengerId: the ID of each passenger
Pclass: ticket class, a proxy for socio-economic class (1 = upper, 2 = middle, 3 = lower)
Name: the name of each passenger
Sex: the sex of each passenger
Age: the age of each passenger
SibSp: number of siblings/spouses aboard the Titanic
Parch: number of parents/children aboard the Titanic
Ticket: ticket number
Fare: passenger fare
Cabin: cabin number
Embarked: port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

train.describe()

For numeric features, the describe() function reports the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. The count row shows that Age has only 714 non-missing values out of 891, so 177 values are missing.
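The count row only covers numeric columns, so it can help to count missing values for every column directly. The following is a minimal sketch; the helper name missing_summary and the toy frame are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    pct_missing = 100.0 * n_missing / len(df)
    summary = pd.DataFrame({'n_missing': n_missing, 'pct_missing': pct_missing})
    # Keep only columns that actually have missing values, worst first
    return summary[summary['n_missing'] > 0].sort_values('n_missing', ascending=False)

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Cabin': [None, 'C85', None, None],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

On the real data the same call would be missing_summary(train), which also covers non-numeric columns such as Cabin.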

print(train.shape)
print("----------------")
print(train.dtypes)

(891, 12)
----------------
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

The shape attribute returns the dimensions of the data frame (891 rows, 12 columns), and the dtypes attribute returns the data type of each feature.
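Since numeric and non-numeric features are handled differently later on, the dtypes can also be used to split the column names into two groups. This is only a sketch; split_columns_by_kind is a hypothetical helper and the toy frame is illustrative.

```python
import numpy as np
import pandas as pd

def split_columns_by_kind(df):
    """Split column names into numeric and non-numeric lists, keeping column order."""
    numeric = df.select_dtypes(include=[np.number]).columns.tolist()
    other = df.select_dtypes(exclude=[np.number]).columns.tolist()
    return numeric, other

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['A', 'B'],
    'Age': [22.0, 38.0],
    'Sex': ['male', 'female'],
})
numeric_cols, other_cols = split_columns_by_kind(toy)
print(numeric_cols, other_cols)
```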

Some visualization is better for understanding the data¶

sns.countplot(x='Survived', data=train, order=[0,1]);
plt.xlabel('Survived or not (0 for not survived, 1 for survived)');
plt.ylabel('Count of passengers');

## Age figure
age_data_full = train[['Age','Survived']]
# Split the rows by whether Age is known
age_data = age_data_full.loc[age_data_full.Age.notnull(),:]
age_data_missing = age_data_full.loc[age_data_full.Age.isnull(),:]

plt.figure(figsize=[12,12])
plt.subplots_adjust(hspace=0.4)
# Top panel: histogram of the known ages
sns.distplot(age_data.Age,ax=plt.subplot(311),kde=False);
plt.xlim([0,80])
plt.ylabel('Count of Passengers')
plt.title('Age distribution (177 missing values)');

# Middle panel: survival rate for each distinct age (Age is treated as a category here)
sns.barplot(x='Age',y='Survived',data=age_data,ax=plt.subplot(312),ci=None);
plt.xticks([0,88],[0,80])
# Bottom panel: survival counts for passengers whose Age is missing
sns.countplot(x='Survived',data=age_data_missing,ax=plt.subplot(313))
plt.title("Survival counts among passengers with missing Age")
plt.show();


From the figures, age appears to be a useful indicator for the target variable: the youngest and oldest passengers have visibly higher survival rates. The passengers with missing Age values do not provide any additional useful information.
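The per-age bar plot is noisy because many ages have only a few passengers. One way to check the same pattern numerically is to group ages into bands and take the mean of Survived in each band. The sketch below assumes arbitrary band edges (0, 12, 18, 40, 60, 80) chosen for illustration; the helper name and toy data are not from the original notebook.

```python
import numpy as np
import pandas as pd

def survival_rate_by_age_band(df, bins=(0, 12, 18, 40, 60, 80)):
    """Mean of Survived within each age band; rows with missing Age are dropped."""
    known = df.dropna(subset=['Age'])
    # pd.cut builds right-closed intervals: (0, 12], (12, 18], ...
    bands = pd.cut(known['Age'], bins=list(bins))
    return known.groupby(bands, observed=False)['Survived'].mean()

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Age': [4, 10, 30, 35, 70, np.nan],
    'Survived': [1, 1, 0, 1, 0, 1],
})
rates = survival_rate_by_age_band(toy)
print(rates)
```

Bands with no passengers show up as NaN, which makes sparse regions of the age range easy to spot.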

### Sex figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Sex',data=train,ax=plt.subplot(121));
plt.title('Sex distribution')
plt.subplots_adjust(wspace=0.5)
sns.barplot(x='Sex',y='Survived',ax=plt.subplot(122),
            data=train,ci=None);
plt.title('Survival rate by sex')


From the figure, we can see that female passengers have a much higher survival rate than male passengers.

### Pclass figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Pclass',data=train,ax=plt.subplot(121))
plt.title('Pclass distribution')
plt.subplots_adjust(wspace=0.8)
sns.barplot(x='Pclass',y='Survived',data=train,ci=None,
           ax=plt.subplot(122))
plt.title('Survival rate by socio-economic class')


Hah, socio-economic status clearly affects the survival rate.
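Sex and Pclass each show a strong effect on their own, so it is natural to look at them together. A minimal sketch, assuming a pivot table of the mean of Survived is an acceptable summary; survival_table is a hypothetical helper and the toy rows are illustrative only.

```python
import pandas as pd

def survival_table(df):
    """Survival rate for each combination of Pclass (rows) and Sex (columns)."""
    return df.pivot_table(values='Survived', index='Pclass',
                          columns='Sex', aggfunc='mean')

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 0, 1, 0, 0, 0],
})
table = survival_table(toy)
print(table)
```

Calling survival_table(train) on the real data would give one survival rate per class and sex, which shows whether the two effects reinforce each other.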

### Relatives aboard 
fig = plt.figure(figsize=[9,6])
plt.subplots_adjust(hspace=0.8,wspace=0.8)
sns.countplot(x='Parch',data=train,ax=plt.subplot(231))
plt.title('Parch distribution')
sns.countplot(x='SibSp',data=train,ax=plt.subplot(232))
plt.title('SibSp distribution')
plt.subplot(233)
sns.regplot(x='Parch',y='SibSp',data=train, fit_reg=False)
plt.title("The scatter plot of SibSp and Parch.\n \
The correlation coefficient is {0:.2}"
          .format(np.corrcoef(train['Parch'],train['SibSp'])[0,1]))
sns.barplot(x='Parch',y='Survived',data=train,
            ci=None,ax=plt.subplot(234))
plt.title('Survival rate for Parch')
sns.barplot(x='SibSp',y='Survived',data=train,
            ci=None,ax=plt.subplot(235))
plt.title('Survival rate for SibSp')
# Total number of relatives aboard (siblings/spouses plus parents/children)
sns.barplot(x=train['SibSp']+train['Parch'],y=train['Survived'],
            ci=None,ax=plt.subplot(236))
plt.title('Survival rate for SibSp + Parch')


### cabin figure
# Preprocess Cabin: replace each entry with the number of cabins listed (0 if missing)
train['Cabin'] = train['Cabin'].fillna('').str.split().str.len()

plt.figure(figsize=[8,3])
plt.subplots_adjust(wspace=0.8)
sns.countplot(x='Cabin',data=train,ax=plt.subplot(121))
plt.title("Distribution of the number of cabins")
sns.barplot(x='Cabin',y='Survived',data=train,
            ci=None,ax=plt.subplot(122))
plt.title("Survival rate by number of cabins")


Passengers with more cabins have a higher survival rate. However, we need to explore the collinearity between the number of cabins, Fare and Pclass.

### Fare figure
plt.figure(figsize=[6,8])
plt.subplots_adjust(hspace=0.4)
sns.distplot(train.Fare,ax=plt.subplot(211))
plt.title("The distribution of Fare")
sns.violinplot(x='Survived',y='Fare',data=train,ax=plt.subplot(212))
plt.title("The Fare distribution versus survived or not")


As expected, passengers who paid higher fares have a higher survival rate.
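Because the Fare distribution is heavily right-skewed, the violin plot can be hard to read, and a per-group summary may make the comparison clearer. A small sketch, assuming the median is a reasonable robust summary; fare_summary is a hypothetical helper and the toy values are illustrative.

```python
import pandas as pd

def fare_summary(df):
    """Median, mean and count of Fare for non-survivors (0) and survivors (1)."""
    return df.groupby('Survived')['Fare'].agg(['median', 'mean', 'count'])

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1],
    'Fare': [7.25, 8.05, 13.0, 71.28, 53.1],
})
summary = fare_summary(toy)
print(summary)
```

A large gap between the mean and the median within a group is a sign that a few very high fares dominate that group.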

### bivariate visualization
plt.figure(figsize=[9,12])
plt.subplots_adjust(hspace=0.4)
sns.violinplot(x='Pclass',y='Fare',data=train,
              ax=plt.subplot(311))
sns.violinplot(x='Pclass',y='Cabin',data=train,
             ax=plt.subplot(312))
sns.violinplot(x='Cabin',y='Fare',data=train,
             ax=plt.subplot(313))


np.corrcoef(train[['Pclass','Cabin','Fare']].astype(float).values.T)

array([[ 1.        , -0.6471164 , -0.54949962],
       [-0.6471164 ,  1.        ,  0.59617121],
       [-0.54949962,  0.59617121,  1.        ]])

These three features are moderately correlated (absolute coefficients between about 0.55 and 0.65), but the correlation is not very strong.
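The array above is a Pearson correlation without row or column labels. Since Pclass is ordinal and Fare is skewed, a rank-based (Spearman) correlation with labels may be a useful cross-check. This is a sketch under that assumption, not the method used in the notebook; correlation_table is a hypothetical helper and the toy values are illustrative.

```python
import pandas as pd

def correlation_table(df, cols, method='spearman'):
    """Labeled correlation matrix for the given columns."""
    return df[cols].astype(float).corr(method=method)

# Toy frame for illustration only (not the Titanic data)
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Cabin':  [2, 1, 0, 0, 0],   # number of cabins, as computed earlier
    'Fare':   [80.0, 60.0, 20.0, 8.0, 7.0],
})
corr = correlation_table(toy, ['Pclass', 'Cabin', 'Fare'])
print(corr)
```

On the real data, correlation_table(train, ['Pclass', 'Cabin', 'Fare'], method='pearson') should give the same coefficients as the np.corrcoef call, with the feature names attached.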

Output of train.head(4):

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S

Output of train.describe():

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200