The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')
train.head(4)
From the data, we can see the features recorded for each passenger on the ship:
train.describe()
For numeric features, the describe() function reports the count, mean, standard deviation, minimum, first quartile, median, third quartile, and maximum. From the count row, we can see that the Age feature has 177 missing values.
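describe() only summarizes numeric columns, so a direct per-column tally of missing values can serve as a cross-check. The sketch below runs on a small hypothetical frame (`toy`, not the real training data); on the real data the equivalent call would presumably be `train.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the training data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
    'Cabin': [None, 'C85', None, None],
})

# Number of missing entries in each column (numeric or not)
missing_counts = toy.isnull().sum()
print(missing_counts)

# describe() reports the non-missing count, so rows minus count gives the same number
age_missing = len(toy) - toy['Age'].count()
print(age_missing)
```

Unlike the count row of describe(), this also covers text columns such as Cabin.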
print(train.shape)
print("----------------")
print(train.dtypes)
The shape attribute returns the dimensions of the data frame (rows, columns). The dtypes attribute returns the data type of each feature.
sns.countplot(x='Survived', data=train, order=[0,1]);
plt.xlabel('Survived or not (0 for not survived, 1 for survived)');
plt.ylabel('Count of passengers');
## Age figure
age_data_full = train[['Age','Survived']]
# Keep only the rows where Age is known
age_data = age_data_full.loc[age_data_full.Age.notnull(),:]
plt.figure(figsize=[12,12])
sns.distplot(age_data.Age,ax=plt.subplot(311),kde=False);
plt.xlim([0,80])
plt.ylabel('Count of Passengers')
plt.title('Age distribution (177 missing values)');
sns.barplot(x='Age',y='Survived',data=age_data,ax=plt.subplot(312),ci=None);
plt.xticks([0,88],[0,80])
# Rows where Age is missing
age_data_missing = age_data_full.loc[age_data_full.Age.isnull(),:]
plt.subplots_adjust(hspace=0.4)
plt.subplot(313)
sns.countplot(x=age_data_missing['Survived'])
plt.title("Survival counts for passengers with missing Age")
plt.show();
From the figures, age looks like a useful indicator for the target variable: the youngest and the oldest passengers have visibly higher survival rates. The passengers with a missing Age do not provide any additional useful information.
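With one bar per distinct age, the trend can be hard to read. One option, sketched here on hypothetical toy values rather than the real data, is to bin ages with `pd.cut` and average `Survived` within each bin. The bin edges and labels are illustrative assumptions, not choices made in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the real columns would come from age_data
toy = pd.DataFrame({
    'Age':      [2, 5, 19, 24, 31, 40, 47, 63, 70, 8],
    'Survived': [1, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})

# Illustrative bin edges (an assumption); intervals are right-closed by default
bins = [0, 12, 30, 50, 80]
labels = ['child', 'young adult', 'adult', 'senior']
toy['AgeGroup'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# The mean of a 0/1 column is the survival rate within each group
rate_by_group = toy.groupby('AgeGroup')['Survived'].mean()
print(rate_by_group)
```

Coarser bins trade detail for stability, since each group then averages over more passengers.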
### Sex figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Sex',data=train,ax=plt.subplot(121));
plt.title('Sex distribution')
plt.subplots_adjust(wspace=0.5)
sns.barplot(x='Sex',y='Survived',ax=plt.subplot(122),
            data=train,ci=None);
plt.title('Survival rate by sex')
From the figure, we can see that female passengers have a much higher survival rate than male passengers.
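The bar heights in the survival-rate panel are simply group means of `Survived`. As a rough sketch on hypothetical toy rows (not the real passengers), the same numbers can be read off with `groupby`, and `pd.crosstab` gives the underlying counts.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the training data
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'male', 'female'],
    'Survived': [0, 1, 1, 0, 1, 0],
})

# Survival rate per group: the mean of the 0/1 target
rate_by_sex = toy.groupby('Sex')['Survived'].mean()
print(rate_by_sex)

# Raw counts behind the rates (rows: Sex, columns: Survived)
counts = pd.crosstab(toy['Sex'], toy['Survived'])
print(counts)
```

The same pattern should apply to other categorical features such as Pclass by swapping the grouping column.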
### Pclass figure
fig = plt.figure(figsize=[6,4])
sns.countplot(x='Pclass',data=train,ax=plt.subplot(121))
plt.title('Pclass distribution')
plt.subplots_adjust(wspace=0.8)
sns.barplot(x='Pclass',y='Survived',data=train,ci=None,
            ax=plt.subplot(122))
plt.title('Survival rate for different socio-economic class')
Economic status clearly has a strong effect on the survival rate.
### Relatives aboard
fig = plt.figure(figsize=[9,6])
plt.subplots_adjust(hspace=0.8,wspace=0.8)
sns.countplot(x='Parch',data=train,ax=plt.subplot(231))
plt.title('Parch distribution')
sns.countplot(x='SibSp',data=train,ax=plt.subplot(232))
plt.title('SibSp distribution')
plt.subplot(233)
sns.regplot(x='Parch',y='SibSp',data=train, fit_reg=False)
plt.title("The scatter plot of SibSp and Parch.\n \
The correlation coefficient is {0:.2}"
.format(np.corrcoef(train['Parch'],train['SibSp'])[0,1]))
sns.barplot(x='Parch',y='Survived',data=train,
            ci=None,ax=plt.subplot(234))
plt.title('Survival rate for Parch')
sns.barplot(x='SibSp',y='Survived',data=train,
            ci=None,ax=plt.subplot(235))
plt.title('Survival rate for SibSp')
# Total number of relatives aboard
sns.barplot(x=train['SibSp']+train['Parch'],y=train['Survived'],
            ci=None,ax=plt.subplot(236))
plt.title('Survival rate for SibSp + Parch')
### cabin figure
# Preprocess Cabin: replace each entry with the number of cabins listed (0 if missing)
train['Cabin'] = train['Cabin'].fillna('').str.split().str.len()
plt.figure(figsize=[8,3])
plt.subplots_adjust(wspace=0.8)
sns.countplot(x='Cabin',data=train,ax=plt.subplot(121))
plt.title("Cabin number distribution")
sns.barplot(x='Cabin',y='Survived',data=train,
            ci=None,ax=plt.subplot(122))
plt.title("Survival rate for different Cabin number")
Passengers with more cabins have a higher survival rate. However, we still need to explore the collinearity between Cabin number, Fare, and Pclass.
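For reference, the "Cabin number" used here is the count of cabins listed for each passenger, with 0 for a missing entry. A minimal sketch of that conversion on a few illustrative strings (the series `cabins` is hypothetical, not read from the data file):

```python
import numpy as np
import pandas as pd

# Illustrative Cabin entries: one cabin, missing, three cabins, two cabins
cabins = pd.Series(['C85', np.nan, 'C23 C25 C27', 'B57 B59'])

# Missing entries become empty strings, which split into empty lists (length 0)
cabin_count = cabins.fillna('').str.split().str.len()
print(cabin_count.tolist())  # -> [1, 0, 3, 2]
```

Because the result is an integer column, it can be fed directly into plots and correlation computations.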
### Fare figure
plt.figure(figsize=[6,8])
plt.subplots_adjust(hspace=0.4)
sns.distplot(train.Fare,ax=plt.subplot(211))
plt.title("The distribution of Fare")
sns.violinplot(x='Survived',y='Fare',data=train,ax=plt.subplot(212))
plt.title("Fare distribution by survival outcome")
As expected, passengers who paid a higher fare have a higher survival rate.
### bivariate visualization
plt.figure(figsize=[9,12])
plt.subplots_adjust(hspace=0.4)
sns.violinplot(x='Pclass',y='Fare',data=train,
               ax=plt.subplot(311))
sns.violinplot(x='Pclass',y='Cabin',data=train,
               ax=plt.subplot(312))
sns.violinplot(x='Cabin',y='Fare',data=train,
               ax=plt.subplot(313))
# np.corrcoef treats each row as a variable, so transpose; cast to float in case Cabin is stored as object
np.corrcoef(train[['Pclass','Cabin','Fare']].astype(float).T)
There is some correlation among these three features, but it is not very strong.
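Since Pclass and the cabin count are ordinal with few distinct values, a rank-based coefficient may be worth checking alongside Pearson's. The sketch below uses `DataFrame.corr` on hypothetical toy values (not the real data), which also labels the rows and columns of the matrix. The Spearman comparison is a suggestion, not something the analysis above did.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; column names mirror the three features above
toy = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3, 3, 1],
    'Cabin':  [2, 1, 0, 1, 0, 0, 0, 3],
    'Fare':   [80.0, 52.0, 13.0, 26.0, 7.9, 8.1, 7.3, 263.0],
})

pearson = toy.corr()                    # default method is Pearson
spearman = toy.corr(method='spearman')  # rank-based alternative
print(pearson.round(2))
print(spearman.round(2))

# np.corrcoef treats each row as a variable, hence the transpose
same = np.allclose(pearson.values, np.corrcoef(toy.values.T))
print(same)
```

A large gap between the two matrices would hint that a few extreme fares dominate the Pearson value.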