Question 8: Build a machine learning model that predicts the type of people who survived the Titanic shipwreck using passenger data (name, age, gender, etc.). (Mini Project)
How do you preprocess the Titanic dataset to predict passenger survival?
Preprocessing the Titanic dataset involves cleaning and preparing the data before training a model:
- Handle Missing Values:
- Fill missing Age values using median or predictive models.
- Fill Embarked with the most common port.
- Drop or impute missing Cabin values.
- Convert Categorical Features:
- Use one-hot encoding for Sex, Embarked, and Pclass.
- Extract titles from the Name column (e.g., Mr, Mrs, Miss).
- Feature Engineering:
- Create features like FamilySize = SibSp + Parch + 1.
- Bin Age and Fare into ranges to reduce skewness.
- Split the dataset into training and testing sets.
This cleaned data can then be used for model training.
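For reference, a minimal preprocessing sketch in pandas (column names follow the standard Kaggle Titanic train.csv; the file path, bin counts, and split ratio are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("train.csv")   # standard Kaggle Titanic training file (path is illustrative)
# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(['Cabin'], axis = 1)   # Cabin is too sparse to impute reliably
# Convert categorical features (one-hot encode Sex, Embarked, Pclass and the extracted Title)
df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand = False)
df = pd.get_dummies(df, columns = ['Sex', 'Embarked', 'Pclass', 'Title'], drop_first = True)
# Feature engineering
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['AgeBin'] = pd.cut(df['Age'], bins = 5, labels = False)
df['FareBin'] = pd.qcut(df['Fare'], q = 4, labels = False)
# Split into training and testing sets
X = df.drop(['Survived', 'Name', 'Ticket', 'PassengerId'], axis = 1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)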
Which classification models are suitable for predicting survival, and how are they evaluated?
The Titanic dataset is well-suited for binary classification (Survived: Yes or No). Suitable models include:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machines (SVM)
- XGBoost
- K-Nearest Neighbors
Model Evaluation Metrics:
- Accuracy: overall correctness.
- Precision: true positives over predicted positives.
- Recall: true positives over actual positives.
- F1 Score: harmonic mean of precision and recall.
- Confusion Matrix: detailed breakdown of classification results.
Using these metrics ensures the model is reliable, especially on imbalanced datasets.
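A minimal training-and-evaluation sketch, assuming the X_train/X_test/y_train/y_test split from the preprocessing sketch above (the model choices and hyperparameters shown are illustrative, not the project's tuned values):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
models = {
    'Logistic Regression': LogisticRegression(max_iter = 1000),
    'Random Forest': RandomForestClassifier(n_estimators = 100, random_state = 42),
}
for name, model in models.items():
    model.fit(X_train, y_train)          # X_train / y_train from the preprocessing sketch
    y_pred = model.predict(X_test)
    print(name)
    print("  Accuracy :", round(accuracy_score(y_test, y_pred), 3))
    print("  Precision:", round(precision_score(y_test, y_pred), 3))
    print("  Recall   :", round(recall_score(y_test, y_pred), 3))
    print("  F1 Score :", round(f1_score(y_test, y_pred), 3))
    print("  Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))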
Programming Code: Write the following code in ML_P08.py:
# ML Project Program 08
# machine learning model that predicts who survived the Titanic shipwreck using passenger data
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import dataset
train_data = pd.read_csv("./Titanic_shipwreck_dataset/train.csv")
test_data = pd.read_csv("./Titanic_shipwreck_dataset/test.csv")
train_data.info()
train_data.shape
test_data.info()
test_data.shape
# check null values
train_data.isnull().sum()
# Age having 177 null values
# Cabin having 687 null values
# Embarked having 2 null values
train_data['Age'].fillna(train_data['Age'].mean(), inplace = True)
# Cabin is mostly missing, so drop the column instead of losing those rows
train_data = train_data.drop(['Cabin'], axis = 1)
# drop the 2 rows with a missing Embarked value
train_data = train_data.dropna()
train_data.isnull().sum()
train_data.isnull().any()
train_data.loc[train_data['Sex'] != 'male', 'Sex'] = 0
train_data.loc[train_data['Sex'] == 'male', 'Sex'] = 1
train_data.describe()
# Graph Plotting
fig = plt.figure()
ax0 = fig.add_subplot(1, 3, 1)
ax1 = fig.add_subplot(1, 3, 2)
ax2 = fig.add_subplot(1, 3, 3)
# Survive rate based on Fare
z = train_data.loc[:, ['Fare', 'Survived']]
z = z.sort_values(['Fare'], ascending = True, axis=0)
z['group'] = pd.cut(z['Fare'], 9, labels = ['1', '2', '3', '4', '5', '6', '7', '8', '9'])
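# pd.cut splits the full Fare range into 9 equal-width bins labelled '1'-'9' (only bins 1-5 and 9 are summarised below)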
t_f1 = train_data.Survived[z.group == '1'].value_counts().sort_index()
t_f2 = train_data.Survived[z.group == '2'].value_counts().sort_index()
t_f3 = train_data.Survived[z.group == '3'].value_counts().sort_index()
t_f4 = train_data.Survived[z.group == '4'].value_counts().sort_index()
t_f5 = train_data.Survived[z.group == '5'].value_counts().sort_index()
t_f6 = train_data.Survived[z.group == '6'].value_counts().sort_index()
t_f7 = train_data.Survived[z.group == '7'].value_counts().sort_index()
t_f8 = train_data.Survived[z.group == '8'].value_counts().sort_index()
t_f9 = train_data.Survived[z.group == '9'].value_counts().sort_index()
z1 = pd.concat([t_f1, t_f2, t_f3, t_f4, t_f5, t_f9], axis = 1)
z1 = z1.fillna(0)
z1.index = ['not survived', 'survived']
z1.columns = ['low', 'median-low', 'median', 'median-high', 'high', 'extreme-high']
z1 = z1.T
z1['survive rate'] = z1['survived'] / (z1['survived'] + z1['not survived']) * 100
z2 = z1.drop(['not survived', 'survived'], axis = 1)
z2.plot(kind = 'line', ax=ax0, figsize = (15, 5))
ax0.set_title("Survived Rate based on Fare")
ax0.set_ylabel('Rate %')
# Survive Rate based on Siblings
t_s0 = train_data.Survived[train_data.SibSp == 0].value_counts().sort_index()
t_s1 = train_data.Survived[train_data.SibSp == 1].value_counts().sort_index()
t_s2 = train_data.Survived[train_data.SibSp == 2].value_counts().sort_index()
t_s3 = train_data.Survived[train_data.SibSp == 3].value_counts().sort_index()
t_s4 = train_data.Survived[train_data.SibSp == 4].value_counts().sort_index()
t_s5 = train_data.Survived[train_data.SibSp == 5].value_counts().sort_index()
t_s8 = train_data.Survived[train_data.SibSp == 8].value_counts().sort_index()
d = pd.concat([t_s0, t_s1, t_s2, t_s3, t_s4, t_s5, t_s8], axis = 1)
d.index = ['not survived', 'survived']
d.columns = ['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S8']
d = d.fillna(0)
d = d.T
d['survived rate'] = d['survived'] / (d['survived'] + d['not survived']) * 100
d1 = d.drop(['not survived', 'survived'], axis = 1)
d1.plot(kind = 'line', ax = ax1 , figsize = (15, 5))
ax1.set_title('Survived Rate based on Siblings')
ax1.set_ylabel('Rate %')
# Survive Rate based on Parents
t_p0 = train_data.Survived[train_data.Parch == 0].value_counts().sort_index()
t_p1 = train_data.Survived[train_data.Parch == 1].value_counts().sort_index()
t_p2 = train_data.Survived[train_data.Parch == 2].value_counts().sort_index()
t_p3 = train_data.Survived[train_data.Parch == 3].value_counts().sort_index()
t_p4 = train_data.Survived[train_data.Parch == 4].value_counts().sort_index()
t_p5 = train_data.Survived[train_data.Parch == 5].value_counts().sort_index()
t_p6 = train_data.Survived[train_data.Parch == 6].value_counts().sort_index()
f = pd.concat([t_p0, t_p1, t_p2, t_p3, t_p4, t_p5, t_p6], axis = 1)
f = f.fillna(0)
f.index = ['not survived', 'survived']
f.columns = ['P0', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6']
f = f.T
f['survived rate'] = f['survived'] / (f['survived'] + f['not survived']) * 100
f1 = f.drop(['not survived', 'survived'], axis = 1)
f1.plot(kind = 'line', ax = ax2, figsize = (15, 5))
ax2.set_title('Survived based on Parents')
ax2.set_ylabel('Rate %')
fig = plt.figure()
ax0 = fig.add_subplot(1, 4, 1)
ax1 = fig.add_subplot(1, 4, 2)
ax2 = fig.add_subplot(1, 4, 3)
ax3 = fig.add_subplot(1, 4, 4)
# Plot Survived information
a = pd.DataFrame(train_data.Survived.value_counts())
a.index = ['not survived', 'survived']
a.columns = ['number']
a.plot(kind = 'bar', ax = ax0)
ax0.set_title('Number of Survival')
# Plot Survived info of class Factor
t_p1 = train_data.Survived[train_data.Pclass == 1].value_counts().sort_index()
t_p2 = train_data.Survived[train_data.Pclass == 2].value_counts().sort_index()
t_p3 = train_data.Survived[train_data.Pclass == 3].value_counts().sort_index()
b = pd.concat([t_p1, t_p2, t_p3], axis = 1)
b.index = ['not survived', 'survived']
b.columns = ['class1', 'class2', 'class3']
b.plot(kind = 'bar', stacked = True, ax = ax1, figsize = (15, 5))
ax1.set_title('Number of Survival for class')
plt.xticks(rotation = 0)
# Plot Survived info of Sex Factor
tm = train_data.Survived[train_data.Sex == 1].value_counts().sort_index()
tf = train_data.Survived[train_data.Sex == 0].value_counts().sort_index()
e = pd.concat([tm, tf], axis = 1)
e.index = ['not survived', 'survived']
e.columns = ['male', 'female']
e.plot(kind = 'bar', stacked=True, ax=ax2)
ax2.set_title("Survived by Gender")
# Plot Survived info of Embarked Factors
t_es = train_data.Survived[train_data.Embarked == "S"].value_counts().sort_index()
t_ec = train_data.Survived[train_data.Embarked == "C"].value_counts().sort_index()
t_eq = train_data.Survived[train_data.Embarked == "Q"].value_counts().sort_index()
c = pd.concat([t_es, t_ec, t_eq], axis = 1)
c.index = ['not survived', 'survived']
c.columns = ['Embarked S', 'Embarked C', 'Embarked Q']
c.plot(kind = 'bar', stacked=True, ax=ax3, figsize = (15, 5))
ax3.set_title("Survived by Embarked")
train_data.info()
train_data = train_data.drop(['Name', 'PassengerId', 'Ticket'], axis=1)  # Cabin was dropped earlier
train_data.info()
# compose correlation map
float_columns = [x for x in train_data.columns if x not in ['Embarked', 'Sex']]
corr_max = train_data[float_columns].corr()
for x in range(len(float_columns)):
    corr_max.iloc[x, x] = 0.0
corr_max
corr_max.abs().idxmax()
import seaborn as sns
# construct the pair plot for the scaled variables
sns.set_context('notebook')
sns.pairplot(train_data[float_columns], hue = 'Survived', hue_order = [0, 1])
# skew data
skew_columns = (train_data[float_columns].skew().sort_values(ascending=False))
skew_columns = skew_columns.loc[skew_columns > 0.75]
skew_columns
# plot Skew data
fig = plt.figure()
ax0 = fig.add_subplot(1, 3, 1)
ax1 = fig.add_subplot(1, 3, 2)
ax2 = fig.add_subplot(1, 3, 3)
train_data.Fare.plot(kind='bar', ax = ax0, figsize = (15, 5))
ax0.set_title('Fare Distribution')
train_data.SibSp.plot(kind = 'bar', ax = ax1)
ax1.set_title("Sibling Number Distribution")
train_data.Parch.plot(kind='bar', ax = ax2)
ax2.set_title("Parent Number Distribution")
# Take log of skewed data
for col in skew_columns.index.tolist():
    train_data[col] = np.log1p(train_data[col])
train_data.head()
# Pre-Processing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.model_selection import learning_curve
mm = MinMaxScaler()
# Non-descriptive columns (PassengerId, Ticket, Cabin, Name) are dropped from the testing data below (test3)
# clean testing data set
edt = pd.get_dummies(test_data.Embarked)
test_data.loc[test_data['Sex'] != 'male', 'Sex'] = 0
test_data.loc[test_data['Sex'] == 'male', 'Sex'] = 1
test_data['Age'].fillna(test_data['Age'].mean(), inplace = True)
test1 = pd.concat([test_data, edt], axis = 1)
test2 = test1.drop(['Embarked'], axis=1)
test3 = test2.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis = 1)
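# fill the remaining missing Fare value(s) in the test set with the mean fare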
mean_value = test_data['Fare'].mean()
test3.fillna(value = mean_value, inplace = True)
test4 = mm.fit_transform(test3)
print(f"The Range of feature inputs are within {test4.min()} to {test4.max()}")
train_data.Embarked.unique()
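# the first column (Survived) is the target; columns 1-7 are the feature columns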
y_train = train_data.iloc[:, 0:1]
y_train.head()
x_train = train_data.iloc[:, 1:8]
x_train.head()
# get dummies for embark column
ed = pd.get_dummies(x_train.Embarked)
ed.head()
# concat the Embarked dummy columns with the training data
x_train2 = pd.concat([x_train, ed], axis = 1)
x_train2 = x_train2.drop(['Embarked'], axis=1)
x_train2
x = mm.fit_transform(x_train2)
# make sure all dataset is range from 0 to 1
print(f" The Range feature inputs are within {x.min()} to {x.max()}")
# Machine Learning Model - Logistic Regression
# import Lib.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_fscore_support
LG = LogisticRegression(solver = "liblinear")  # liblinear supports both the l1 and l2 penalties
# set up a cross-validated grid search to find the best hyperparameters for the Logistic Regression model
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
log_model_cv = GridSearchCV(LG, grid, cv=10)
log_model_cv.fit(x, y_train)
log_model_cv.best_params_
# The best hyperparameters were found to be C = 1 and penalty = 'l2'
lgtuned = LogisticRegression(C = 1, penalty="l2")
lgtuned.fit(x, y_train)
y_pred = lgtuned.predict(x)
print(classification_report(y_train, y_pred))
print("Accuracy score: ", round(accuracy_score(y_train, y_pred), 2))
print("F1 score: ", round(f1_score(y_train, y_pred), 2))
y_pred = lgtuned.predict(test4)
y_test = pd.read_csv("./Titanic_shipwreck_dataset/gender_submission.csv")
y_test = y_test.iloc[:, -1]
# Print test score, Accuracy for Logistic Regression
print(classification_report(y_test, y_pred))
print("Accuracy score:", round(accuracy_score(y_test, y_pred), 2))
print("F1 Score: ", round(f1_score(y_test, y_pred), 2))
# Coefficient for each factor for survival rate
coef_dict = lgtuned.coef_
coef = pd.DataFrame(coef_dict).T
coef.columns = ['Coefficient']
coef.index = ['Pclass', 'Sex (male = 1)', 'Age', '# Siblings', '# Parents', 'Fare', 'Embarked C',
              'Embarked Q', 'Embarked S']
coef = coef.sort_values(['Coefficient'], ascending = True)
coef
coef.plot(kind = 'barh')
plt.title("Coefficient of Survival")
confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.set_palette(sns.color_palette())
_, ax = plt.subplots(figsize = (12, 12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot = True, fmt='d', annot_kws = {"size": 40,
"weight": "bold"})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize=25)
ax.set_yticklabels(labels, fontsize=25)
ax.set_ylabel('Ground Truth', fontsize=30)
ax.set_xlabel('Prediction', fontsize=30)
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
# Search for best K number for modeling
max_k = 100
error_rates = list()
f1_scores = list()
for k in range(1, max_k):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn = knn.fit(x, y_train)
    y_pred = knn.predict(test4)
    f1_scores.append((k, round(f1_score(y_test, y_pred), 4)))
    error = 1 - round(accuracy_score(y_test, y_pred), 4)
    error_rates.append((k, error))
f1_results = pd.DataFrame(f1_scores, columns=['K', 'F1 Score'])
error_results = pd.DataFrame(error_rates, columns=['K', 'Error Rate'])
f1_results
error_results
fig = plt.figure()
ax0 = fig.add_subplot(2, 1, 1)
ax1 = fig.add_subplot(2, 1, 2)
f1_results.plot(kind='line', x='K', y='F1 Score', ax=ax0, figsize=(20, 10))
ax0.set_title('F1 Score', fontsize=15)
xinterval = np.arange(0, 101, 2)
ax0.set_xticks(xinterval)
error_results.plot(kind='line', x='K', y="Error Rate", ax = ax1, figsize=(20, 10))
ax1.set_title('Error Rate', fontsize=15)
# KNN performance score
knn = KNeighborsClassifier(n_neighbors=12, weights='distance')
knn = knn.fit(x, y_train)
y_pred = knn.predict(test4)
print(classification_report(y_test, y_pred))
print("Accuracy score:", round(accuracy_score(y_test, y_pred), 2))
print("F1 Score: ", round(f1_score(y_test, y_pred), 2))
# Plot confusion matrix
sns.set_palette(sns.color_palette())
_, ax= plt.subplots(figsize=(12, 12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
annot_kws={'size': 40, 'weight':'bold'})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize = 25)
ax.set_yticklabels(labels, fontsize = 25)
ax.set_xlabel("Prediction", fontsize=30)
ax.set_ylabel("Ground Truth", fontsize=30)
# Random Forest Model
# import lib.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text, export_graphviz, plot_tree
RM = RandomForestClassifier()
# set up a cross-validated grid search for the best hyperparameters
param_grid = {'n_estimators': [2*n+1 for n in range(20)],
              'max_depth': [2*n+1 for n in range(10)],
              'max_features': ['sqrt', 'log2']}
search = GridSearchCV(RM, param_grid, cv = 5)
search = search.fit(x, y_train.values.ravel())
# Best score and parameters
print(search.best_score_)
print(search.best_params_)
# the best hyperparameters were found to be max_depth = 7 and max_features = 'sqrt'
random_forest = RandomForestClassifier(n_estimators=15, max_depth=10, max_features='sqrt')
random_forest = random_forest.fit(x, y_train)
y_pred = random_forest.predict(test4)
print(classification_report(y_test, y_pred))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred), 2))
print("F1 score: ", round(f1_score(y_test, y_pred), 2))
# plot heat map
sns.set_palette(sns.color_palette())
_, ax= plt.subplots(figsize=(12,12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', annot_kws={
'size': 40, 'weight': 'bold'
})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize=25)
ax.set_yticklabels(labels, fontsize=25)
ax.set_xlabel("Prediction", fontsize=30)
ax.set_ylabel("Ground Truth", fontsize=30)
# yellowbrick provides the feature-importance plot below
# (install it first: "pip install yellowbrick" from the shell, or "!pip install yellowbrick" in a notebook)
from yellowbrick.model_selection import FeatureImportances
visual = FeatureImportances(random_forest)
visual.fit(x_train2, y_train)
visual.show()
y_test_pred = lgtuned.predict(test4)
submission = pd.read_csv("./Titanic_shipwreck_dataset/gender_submission.csv" ,
index_col ='PassengerId')
submission['Survived'] = y_test_pred
submission
submission.to_csv("submission.csv")
# submission.csv stores the survival prediction for each PassengerId
# Thanks For Reading.