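The data used in this project contain demographic and basic passenger information for 891 of the 2,224 passengers and crew aboard the Titanic.
Through a simple data analysis, we hope to answer the following questions:
- Which factors affected a passenger's chance of survival?
- In which direction did each factor move the survival rate?
- How strong was each factor's effect?
Candidate factors and the hypotheses about their influence on survival:
- Economic status: the higher a passenger's socioeconomic status, the more likely they were to survive. Three variables are relevant: Pclass, Fare, Cabin
- Sex: women had a higher survival rate than men. Variable: Sex
- Age: the older the passenger, the lower the chance of survival. Variable: Age
- Companions: the more people a passenger travelled with, the higher the chance of survival. Variables: SibSp, Parch
- Port: survival rates may differ by port of embarkation; this needs further exploration. Variable: Embarked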
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
titanic_df = pd.read_csv("titanic-data.csv")
titanic_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
titanic_df.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_df.describe()
 | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
The table above shows summary statistics for the numeric variables in the dataset.
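'The Titanic dataset contains {} records; {} of these people survived, an overall survival rate of {}%'.format(titanic_df['PassengerId'].count(),
    titanic_df['Survived'].sum(),
    round((100*titanic_df['Survived'].sum())/titanic_df['Survived'].count(),2))
'The Titanic dataset contains 891 records; 342 of these people survived, an overall survival rate of 38.38%'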
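The basic information shown above indicates that the dataset is fairly complete, so an initial exploration can proceed; any anomalies found will be handled just before the analysis they affect. The Age column has only 714 non-null values (Cabin, with 204, and Embarked, with 889, are also incomplete); the records with a missing age are examined before the Age analysis.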
titanic_df.groupby('Survived')['Survived'].count().plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0xb01cc88>
The plot shows that clearly fewer passengers survived than died.
Pclass_ct=(titanic_df.groupby('Pclass')['PassengerId'].count()) # Group the dataset by Pclass and count the passengers in each class
Pclass_ct.plot(kind='bar') # Plot the grouped counts
<matplotlib.axes._subplots.AxesSubplot at 0xb1fcba8>
Pclass_ct
Pclass
1 216
2 184
3 491
Name: PassengerId, dtype: int64
Third class has the most passengers in the dataset (491); second class has the fewest (184).
# factorplot was renamed catplot in seaborn 0.9
sns.catplot(x='Sex',data=titanic_df,kind='count')
<seaborn.axisgrid.FacetGrid at 0xb2570b8>
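There are more male than female passengers.
As noted earlier, 177 passenger records are missing an age (891 - 714); let's look at those records.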
titanic_df[titanic_df['Age'].isnull()]
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S |
31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | NaN | 1 | 0 | PC 17569 | 146.5208 | B78 | C |
32 | 33 | 1 | 3 | Glynn, Miss. Mary Agatha | female | NaN | 0 | 0 | 335677 | 7.7500 | NaN | Q |
36 | 37 | 1 | 3 | Mamee, Mr. Hanna | male | NaN | 0 | 0 | 2677 | 7.2292 | NaN | C |
42 | 43 | 0 | 3 | Kraeff, Mr. Theodor | male | NaN | 0 | 0 | 349253 | 7.8958 | NaN | C |
45 | 46 | 0 | 3 | Rogers, Mr. William John | male | NaN | 0 | 0 | S.C./A.4. 23567 | 8.0500 | NaN | S |
46 | 47 | 0 | 3 | Lennon, Mr. Denis | male | NaN | 1 | 0 | 370371 | 15.5000 | NaN | Q |
47 | 48 | 1 | 3 | O'Driscoll, Miss. Bridget | female | NaN | 0 | 0 | 14311 | 7.7500 | NaN | Q |
48 | 49 | 0 | 3 | Samaan, Mr. Youssef | male | NaN | 2 | 0 | 2662 | 21.6792 | NaN | C |
55 | 56 | 1 | 1 | Woolner, Mr. Hugh | male | NaN | 0 | 0 | 19947 | 35.5000 | C52 | S |
64 | 65 | 0 | 1 | Stewart, Mr. Albert A | male | NaN | 0 | 0 | PC 17605 | 27.7208 | NaN | C |
65 | 66 | 1 | 3 | Moubarek, Master. Gerios | male | NaN | 1 | 1 | 2661 | 15.2458 | NaN | C |
76 | 77 | 0 | 3 | Staneff, Mr. Ivan | male | NaN | 0 | 0 | 349208 | 7.8958 | NaN | S |
77 | 78 | 0 | 3 | Moutal, Mr. Rahamin Haim | male | NaN | 0 | 0 | 374746 | 8.0500 | NaN | S |
82 | 83 | 1 | 3 | McDermott, Miss. Brigdet Delia | female | NaN | 0 | 0 | 330932 | 7.7875 | NaN | Q |
87 | 88 | 0 | 3 | Slocovski, Mr. Selman Francis | male | NaN | 0 | 0 | SOTON/OQ 392086 | 8.0500 | NaN | S |
95 | 96 | 0 | 3 | Shorney, Mr. Charles Joseph | male | NaN | 0 | 0 | 374910 | 8.0500 | NaN | S |
101 | 102 | 0 | 3 | Petroff, Mr. Pastcho ("Pentcho") | male | NaN | 0 | 0 | 349215 | 7.8958 | NaN | S |
107 | 108 | 1 | 3 | Moss, Mr. Albert Johan | male | NaN | 0 | 0 | 312991 | 7.7750 | NaN | S |
109 | 110 | 1 | 3 | Moran, Miss. Bertha | female | NaN | 1 | 0 | 371110 | 24.1500 | NaN | Q |
121 | 122 | 0 | 3 | Moore, Mr. Leonard Charles | male | NaN | 0 | 0 | A4. 54510 | 8.0500 | NaN | S |
126 | 127 | 0 | 3 | McMahon, Mr. Martin | male | NaN | 0 | 0 | 370372 | 7.7500 | NaN | Q |
128 | 129 | 1 | 3 | Peter, Miss. Anna | female | NaN | 1 | 1 | 2668 | 22.3583 | F E69 | C |
140 | 141 | 0 | 3 | Boulos, Mrs. Joseph (Sultana) | female | NaN | 0 | 2 | 2678 | 15.2458 | NaN | C |
154 | 155 | 0 | 3 | Olsen, Mr. Ole Martin | male | NaN | 0 | 0 | Fa 265302 | 7.3125 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
718 | 719 | 0 | 3 | McEvoy, Mr. Michael | male | NaN | 0 | 0 | 36568 | 15.5000 | NaN | Q |
727 | 728 | 1 | 3 | Mannion, Miss. Margareth | female | NaN | 0 | 0 | 36866 | 7.7375 | NaN | Q |
732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S |
738 | 739 | 0 | 3 | Ivanoff, Mr. Kanio | male | NaN | 0 | 0 | 349201 | 7.8958 | NaN | S |
739 | 740 | 0 | 3 | Nankoff, Mr. Minko | male | NaN | 0 | 0 | 349218 | 7.8958 | NaN | S |
740 | 741 | 1 | 1 | Hawksford, Mr. Walter James | male | NaN | 0 | 0 | 16988 | 30.0000 | D45 | S |
760 | 761 | 0 | 3 | Garfirth, Mr. John | male | NaN | 0 | 0 | 358585 | 14.5000 | NaN | S |
766 | 767 | 0 | 1 | Brewe, Dr. Arthur Jackson | male | NaN | 0 | 0 | 112379 | 39.6000 | NaN | C |
768 | 769 | 0 | 3 | Moran, Mr. Daniel J | male | NaN | 1 | 0 | 371110 | 24.1500 | NaN | Q |
773 | 774 | 0 | 3 | Elias, Mr. Dibo | male | NaN | 0 | 0 | 2674 | 7.2250 | NaN | C |
776 | 777 | 0 | 3 | Tobin, Mr. Roger | male | NaN | 0 | 0 | 383121 | 7.7500 | F38 | Q |
778 | 779 | 0 | 3 | Kilgannon, Mr. Thomas J | male | NaN | 0 | 0 | 36865 | 7.7375 | NaN | Q |
783 | 784 | 0 | 3 | Johnston, Mr. Andrew G | male | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
790 | 791 | 0 | 3 | Keane, Mr. Andrew "Andy" | male | NaN | 0 | 0 | 12460 | 7.7500 | NaN | Q |
792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
793 | 794 | 0 | 1 | Hoyt, Mr. William Fisher | male | NaN | 0 | 0 | PC 17600 | 30.6958 | NaN | C |
815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
825 | 826 | 0 | 3 | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q |
826 | 827 | 0 | 3 | Lam, Mr. Len | male | NaN | 0 | 0 | 1601 | 56.4958 | NaN | S |
828 | 829 | 1 | 3 | McCormack, Mr. Thomas Joseph | male | NaN | 0 | 0 | 367228 | 7.7500 | NaN | Q |
832 | 833 | 0 | 3 | Saad, Mr. Amin | male | NaN | 0 | 0 | 2671 | 7.2292 | NaN | C |
837 | 838 | 0 | 3 | Sirota, Mr. Maurice | male | NaN | 0 | 0 | 392092 | 8.0500 | NaN | S |
839 | 840 | 1 | 1 | Marechal, Mr. Pierre | male | NaN | 0 | 0 | 11774 | 29.7000 | C47 | C |
846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
849 | 850 | 1 | 1 | Goldenberg, Mrs. Samuel L (Edwiga Grabowska) | female | NaN | 1 | 0 | 17453 | 89.1042 | C92 | C |
859 | 860 | 0 | 3 | Razi, Mr. Raihed | male | NaN | 0 | 0 | 2629 | 7.2292 | NaN | C |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
177 rows × 12 columns
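The most direct way to handle these missing values is to drop them during the analysis; the plotting calls used here ignore NaN automatically. Note, however, that the background of these records should be explored further, to rule out the possibility that ages are missing mainly for a large group of passengers with similar characteristics.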
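One way to check whether ages are missing disproportionately in some group is to compare the share of missing values across a categorical column such as Pclass. The sketch below is written in plain Python so it runs without the CSV file; the helper name `missing_share_by` and the six sample rows are hypothetical and are not taken from titanic-data.csv.

```python
from collections import defaultdict

def missing_share_by(records, group_key, value_key):
    """Return {group: fraction of records whose value_key is None}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in records:
        group = row[group_key]
        total[group] += 1
        if row[value_key] is None:
            missing[group] += 1
    return {group: missing[group] / total[group] for group in total}

# Hypothetical mini-sample, NOT rows from the real dataset.
sample = [
    {"Pclass": 1, "Age": 38.0},
    {"Pclass": 1, "Age": None},
    {"Pclass": 3, "Age": None},
    {"Pclass": 3, "Age": None},
    {"Pclass": 3, "Age": 22.0},
    {"Pclass": 3, "Age": None},
]

shares = missing_share_by(sample, "Pclass", "Age")
print(shares)  # {1: 0.5, 3: 0.75}
```

On the real DataFrame, `titanic_df['Age'].isnull().groupby(titanic_df['Pclass']).mean()` should give the same kind of table. Shares that differ a lot between groups would suggest that dropping the rows is not neutral.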
titanic_df['Age'].hist(bins=80)
<matplotlib.axes._subplots.AxesSubplot at 0xb33da90>
For passengers older than about 10, the age distribution is roughly normal. We can compute some summary statistics for age.
print("Mean",titanic_df['Age'].mean())
print("Median",titanic_df['Age'].median())
print("Max",titanic_df['Age'].max())
print("Min",titanic_df['Age'].min())
print("Std",titanic_df['Age'].std())
Mean 29.69911764705882
Median 28.0
Max 80.0
Min 0.42
Std 14.526497332334044
# Distribution of ticket fares
titanic_df['Fare'].hist(bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0xce68438>
In the fare histogram, tickets are heavily concentrated at the low end, so equal-width bins would probably not be a convenient way to group fares.
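To illustrate why equal-width bins work poorly on a right-skewed variable, the sketch below compares them with rank-based (quantile-style) groups. The fare values are hypothetical, chosen only to mimic the skew, and `rank_groups` is a simplification: unlike `pd.qcut`, it can place tied values in different groups.

```python
def equal_width_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin, hence the min().
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

def rank_groups(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // n_bins:(i + 1) * n // n_bins]
            for i in range(n_bins)]

# Hypothetical, right-skewed fares (not taken from the dataset).
fares = [7.25, 7.9, 8.05, 8.05, 13.0, 14.5, 26.0, 31.0, 71.3, 512.3]

print(equal_width_counts(fares, 5))              # [9, 0, 0, 0, 1]
print([len(g) for g in rank_groups(fares, 5)])   # [2, 2, 2, 2, 2]
```

With equal widths, almost every value lands in the first bin and the remaining bins are nearly empty; rank-based groups keep the group sizes comparable, which makes per-group rates easier to compare.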
The variables hypothesized to reflect economic status are Pclass, Fare, Cabin and Embarked; we first explore how these variables relate to one another.
titanic_df.groupby(titanic_df['Embarked'])['PassengerId'].count()
Embarked
C 168
Q 77
S 644
Name: PassengerId, dtype: int64
P1 = sns.catplot(x='Embarked',y='Fare',data=titanic_df, order=["C","S","Q"], kind='point')
P2 = sns.boxplot(x='Embarked', y='Fare', hue='Pclass', data=titanic_df,order=["C","S","Q"])
P2.set(ylim=(0,200))
[(0, 200)]
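Overall, passengers who embarked at port C paid higher fares than those who embarked at S, who in turn paid more than those who embarked at Q; within each port, first-class fares are higher than second- and third-class fares.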
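# Helper: survival rate = number of survivors / number of records
def survival_rate(data):
    return data.sum()/data.count()
# Group by Pclass, select the Survived column, then compute the survival rate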
Pclass_group=titanic_df.groupby('Pclass')['Survived']
Pclass_group_rate=Pclass_group.apply(survival_rate)
Pclass_group_rate.plot(kind='bar')
plt.xlabel('Passenger Class')
plt.ylabel('Survival Rate')
<matplotlib.text.Text at 0x20545c18>
Pclass_group_rate.head()
Pclass
1 0.629630
2 0.472826
3 0.242363
Name: Survived, dtype: float64
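The plot above suggests that socioeconomic status is an important factor in survival: the higher the class, the higher the survival rate. First-class passengers had a survival rate of about 63%, third-class passengers only about 24%.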
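The same group-then-average pattern applies to any categorical column, for example Embarked, whose survival rate is not tabulated in this analysis. A minimal sketch in plain Python follows; the helper `survival_rate_by` and the sample rows are hypothetical and the printed rates say nothing about the real data.

```python
from collections import defaultdict

def survival_rate_by(records, key):
    """Return {group: survivors / records} for the grouping column `key`."""
    survived = defaultdict(int)
    total = defaultdict(int)
    for row in records:
        total[row[key]] += 1
        survived[row[key]] += row["Survived"]  # Survived is 0 or 1
    return {group: survived[group] / total[group] for group in total}

# Hypothetical mini-sample, NOT rows from the real dataset.
sample = [
    {"Embarked": "C", "Survived": 1},
    {"Embarked": "C", "Survived": 0},
    {"Embarked": "S", "Survived": 0},
    {"Embarked": "S", "Survived": 0},
    {"Embarked": "S", "Survived": 1},
    {"Embarked": "S", "Survived": 0},
    {"Embarked": "Q", "Survived": 1},
]

rates = survival_rate_by(sample, "Embarked")
print(rates)  # {'C': 0.5, 'S': 0.25, 'Q': 1.0}
```

On the real DataFrame, `titanic_df.groupby('Embarked')['Survived'].mean()` computes the same quantity, since the mean of a 0/1 column is the share of ones.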
# Group ticket fares by quantile
Fare_group = titanic_df.groupby(pd.qcut(titanic_df['Fare'],5, precision=0))['Survived']
Fare_group_rate=Fare_group.apply(survival_rate)
Fare_group_rate.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x21653780>
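Given the shape of the fare distribution, we split fares into 5 quantile groups (equal numbers of passengers rather than equal fare ranges) and compute the survival rate within each group. The result shows that the higher the fare, the higher the survival rate.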
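# To study the effect of cabin on survival, first transform the data: extract the first letter of Cabin, which denotes the cabin class
def cabin_class(data):
    # str(NaN) is 'nan', so passengers with no recorded cabin end up in class 'n'
    return str(data)[0]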
titanic_df['Cabin_class']=titanic_df['Cabin'].apply(cabin_class)
cabin_group=titanic_df.groupby(titanic_df['Cabin_class'])['Survived']
cabin_group_rate=cabin_group.apply(survival_rate)
print(cabin_group.count())
print("")
print(cabin_group_rate)
Cabin_class
A 15
B 47
C 59
D 33
E 32
F 13
G 4
T 1
n 687
Name: Survived, dtype: int64
Cabin_class
A 0.466667
B 0.744681
C 0.593220
D 0.757576
E 0.750000
F 0.615385
G 0.500000
T 0.000000
n 0.299854
Name: Survived, dtype: float64
Cabin class T looks like an outlier: it contains a single passenger.
# Find the row that holds the outlier
a=0
for i in titanic_df['Cabin']:
    if str(i)[0]=='T':
        print(titanic_df.loc[a])
    a=a+1
# Bar chart with the outlier excluded
cabin_group_rate[["A","B","C","D","E","F","G","n"]].plot(kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x2054b358>
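Passengers with a recorded cabin had a higher survival rate than those without one (the 'n' group), and among passengers with a cabin, classes B, D and E had the highest survival rates.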
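Some cabin classes contain very few passengers (G has 4, T has 1), so their rates are much less certain than the rate for a class like B with 47. One hedged way to show this is a Wilson score interval for each proportion. This is a technique added here for illustration, not part of the original analysis. The survivor counts below are recovered from the output above as rate times count (for example 0.744681 x 47 = 35).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, center - half), min(1.0, center + half)

# (survivors, passengers) per cabin class, derived from the output above.
groups = {"B": (35, 47), "G": (2, 4), "T": (0, 1)}

for name, (k, n) in groups.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} survived, interval {lo:.2f} to {hi:.2f}")
```

The interval for G spans most of the range from 0 to 1, so its 50% rate is compatible with both a low and a high underlying survival rate, while the interval for B is much narrower.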
Sex_group=titanic_df.groupby('Sex')['Survived']
Sex_group_rate=Sex_group.apply(survival_rate)
Sex_group_rate.plot(kind='bar')
plt.xlabel('Sex')
plt.ylabel('Survival Rate')
<matplotlib.text.Text at 0x21752a90>
Sex_group_rate
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
Sex is an important factor in whether a passenger survived: women had a survival rate above 74%, men only 18.9%.
# Drop rows with a missing (NaN) age
Aged = titanic_df[['Survived', 'Age']].dropna()
# Bin ages into 8 equal-width groups
Grouped_Age=Aged.groupby(pd.cut(Aged['Age'],8,labels=["0~10","10~20","20~30","30~40","40~50","50~60","60~70","70~80"]))
# Compute the survival rate within each age group
Age_group=Grouped_Age['Survived']
Age_group_rate = Age_group.apply(survival_rate)
Age_group_rate
Age
0~10 0.593750
10~20 0.382609
20~30 0.365217
30~40 0.445161
40~50 0.383721
50~60 0.404762
60~70 0.235294
70~80 0.200000
Name: Survived, dtype: float64
Age_group_rate.plot()
plt.xlabel("Age")
plt.ylabel("Survival Rate")
plt.title("Influence of Age on the Survival Prob")
<matplotlib.text.Text at 0x24c624e0>
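Overall, the survival rate declines as age increases, though not monotonically: it is highest for ages 0~10 (about 59%) and lowest above 60 (20% to 24%).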
sns.catplot(x='SibSp',y='Survived',data=titanic_df,kind='point')
sns.catplot(x='Parch',y='Survived',data=titanic_df,kind='point')
<seaborn.axisgrid.FacetGrid at 0x24848f28>
SibSp and Parch have broadly similar effects, so we combine them into a single variable, Compa, the number of companions.
titanic_df['Compa']=titanic_df['SibSp'] + titanic_df['Parch']
sns.catplot(x='Compa',y='Survived',data=titanic_df,kind='point')
<seaborn.axisgrid.FacetGrid at 0x245f8320>
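The plot shows that passengers with 1~3 companions had a higher survival rate than those travelling alone, while passengers with more than 3 companions had a very low survival rate.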
Sex_Pclass_group = titanic_df.groupby(['Sex','Pclass'])['Survived']
Sex_Pclass_group_rate = Sex_Pclass_group.apply(survival_rate)
Sex_Pclass_group_rate
Sex Pclass
female 1 0.968085
2 0.921053
3 0.500000
male 1 0.368852
2 0.157407
3 0.135447
Name: Survived, dtype: float64
Sex_Pclass_group_rate.plot(kind='bar')
plt.xlabel("Sex and Passenger Class")
plt.ylabel("Survival Rate")
plt.title("Influence of Sex and Class on the Survival Prob")
plt.show()
# Shows much the same information as the plot above, in a simpler and more direct way.
sns.catplot(x='Pclass',y='Survived',hue='Sex',data=titanic_df, order=[3,2,1], kind='point')
<seaborn.axisgrid.FacetGrid at 0x24bce630>
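The plot above shows that sex matters more than socioeconomic status: even third-class women (50%) had a higher survival rate than first-class men (about 37%).
The results of the analysis largely agree with the hypotheses.
- Economic status: the higher the economic status, the more likely a passenger was to survive. Economic status is reflected in four factors: passenger class, ticket fare, whether a cabin is recorded, and port of embarkation.
  a. The higher the passenger class, the higher the survival rate;
  b. The higher the ticket fare, the higher the survival rate;
  c. Passengers with a recorded cabin had a higher survival rate than those without;
  d. Passengers who embarked at port C had a higher survival rate than those who embarked at S or Q;
- Sex: women had a higher survival rate than men.
- Age: the older the passenger, the lower the chance of survival.
- Companions: passengers with 1~3 companions were more likely to survive.
From these conclusions we can roughly sketch a passenger with a high chance of survival: a girl aged 0~10, travelling first class with 3 companions, on a ticket costing more than 40, in a D cabin, who embarked at Cherbourg.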
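- In the conclusion, the "0~10-year-old female" group is not a very reasonable category: intuitively, whether a child was rescued should depend little on sex. When classifying passengers by personal characteristics, age and sex could be combined into child (0~16), female, male and elderly (over 60), which may be more sensible.
- The 177 records with a missing age may share some age-related characteristic. This missingness may affect the results, and the analysis has not yet ruled that out.
- In the companion analysis, there are too few records at high companion counts, so the conclusions for that range are subject to considerable chance.
- The relative weight of each factor on survival has not been quantified, so the conclusions remain fairly general; building a model to predict survival more precisely would be a natural next step.
- The visualizations are limited in variety; more chart types should be tried to show the features of the data.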
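The combined age-and-sex grouping proposed in the first point above could be sketched as follows. The function name `passenger_category`, the English labels, and the exact boundary handling (16 counts as a child, "over 60" means strictly greater than 60) are assumptions made for illustration; passengers with no recorded age fall back to their sex.

```python
def passenger_category(age, sex):
    """Classify a passenger as 'child', 'elderly', 'female' or 'male'.

    age may be None when it is not recorded. A NaN age also falls through
    to the sex-based label, because comparisons with NaN are False.
    """
    if age is not None:
        if age <= 16:
            return "child"
        if age > 60:
            return "elderly"
    return sex

print(passenger_category(8, "female"))     # child
print(passenger_category(35, "female"))    # female
print(passenger_category(71, "male"))      # elderly
print(passenger_category(None, "male"))    # male
```

On the DataFrame this could be applied row by row, for example with `titanic_df.apply(lambda r: passenger_category(r['Age'], r['Sex']), axis=1)`, and the resulting column grouped in the same way as Pclass or Sex.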