Question 8: Build a machine learning model that predicts the type of people who survived the Titanic shipwreck using passenger data (name, age, gender, etc.). (Mini Project)
How do you preprocess the Titanic dataset to predict passenger survival?
Preprocessing the Titanic dataset involves cleaning and preparing the data before training a model:
- Handle Missing Values:
- Fill missing Age values using median or predictive models.
- Fill Embarked with the most common port.
- Drop or impute missing Cabin values.
- Convert Categorical Features:
- Use one-hot encoding for Sex, Embarked, and Pclass.
- Extract titles from the Name column (e.g., Mr, Mrs, Miss).
- Feature Engineering:
- Create features like FamilySize = SibSp + Parch + 1.
- Bin Age and Fare into ranges to reduce skewness.
- Split the dataset into training and testing sets.
This cleaned data can then be used for model training.
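For reference, a minimal preprocessing sketch in pandas (column names follow the standard Kaggle Titanic train.csv; the file path, bin counts, and split ratio are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("train.csv")   # standard Kaggle Titanic training file (path is illustrative)
# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(['Cabin'], axis = 1)   # Cabin is too sparse to impute reliably
# Convert categorical features (one-hot encode Sex, Embarked, Pclass and the extracted Title)
df['Title'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand = False)
df = pd.get_dummies(df, columns = ['Sex', 'Embarked', 'Pclass', 'Title'], drop_first = True)
# Feature engineering
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['AgeBin'] = pd.cut(df['Age'], bins = 5, labels = False)
df['FareBin'] = pd.qcut(df['Fare'], q = 4, labels = False)
# Split into training and testing sets
X = df.drop(['Survived', 'Name', 'Ticket', 'PassengerId'], axis = 1)
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)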
Which classification models are suitable for predicting survival, and how are they evaluated?
The Titanic dataset is well-suited for binary classification (Survived: Yes or No). Suitable models include:
- Logistic Regression
- Random Forest Classifier
- Support Vector Machines (SVM)
- XGBoost
- K-Nearest Neighbors
Model Evaluation Metrics:
- Accuracy: overall correctness.
- Precision: true positives over predicted positives.
- Recall: true positives over actual positives.
- F1 Score: harmonic mean of precision and recall.
- Confusion Matrix: detailed breakdown of classification results.
Using these metrics ensures the model is reliable, especially on imbalanced datasets.
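A minimal training-and-evaluation sketch, assuming the X_train/X_test/y_train/y_test split from the preprocessing sketch above (the model choices and hyperparameters shown are illustrative, not the project's tuned values):
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
models = {
    'Logistic Regression': LogisticRegression(max_iter = 1000),
    'Random Forest': RandomForestClassifier(n_estimators = 100, random_state = 42),
}
for name, model in models.items():
    model.fit(X_train, y_train)          # X_train / y_train from the preprocessing sketch
    y_pred = model.predict(X_test)
    print(name)
    print("  Accuracy :", round(accuracy_score(y_test, y_pred), 3))
    print("  Precision:", round(precision_score(y_test, y_pred), 3))
    print("  Recall   :", round(recall_score(y_test, y_pred), 3))
    print("  F1 Score :", round(f1_score(y_test, y_pred), 3))
    print("  Confusion matrix:")
    print(confusion_matrix(y_test, y_pred))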
Programming Code: Write the following code in ML_P08.py:
# ML Project Program 08
# machine learning model that predicts who survived the Titanic shipwreck using passenger data
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# import dataset
train_data = pd.read_csv("./Titanic_shipwreck_dataset/train.csv")
test_data = pd.read_csv("./Titanic_shipwreck_dataset/test.csv")
train_data.info()
train_data.shape
test_data.info()
test_data.shape
# check null values
train_data.isnull().sum()
# Age having 177 null values
# Cabin having 687 null values
# Embarked having 2 null values
train_data['Age'].fillna(train_data['Age'].mean(), inplace = True)
# Cabin is mostly missing, so drop the column instead of losing those rows
train_data = train_data.drop(['Cabin'], axis = 1)
# drop the 2 rows with a missing Embarked value
train_data = train_data.dropna()
train_data.isnull().sum()
train_data.isnull().any()
train_data.loc[train_data['Sex'] != 'male', 'Sex'] = 0
train_data.loc[train_data['Sex'] == 'male', 'Sex'] = 1
train_data.describe()
# Graph Plotting
fig = plt.figure()
ax0 = fig.add_subplot(1, 3, 1)
ax1 = fig.add_subplot(1, 3, 2)
ax2 = fig.add_subplot(1, 3, 3)
# Survive rate based on Fare
z = train_data.loc[:, ['Fare', 'Survived']]
z = z.sort_values(['Fare'], ascending = True, axis=0)
z['group'] = pd.cut(z['Fare'], 9, labels = ['1', '2', '3', '4', '5', '6', '7', '8', '9'])
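# pd.cut splits the full Fare range into 9 equal-width bins labelled '1'-'9' (only bins 1-5 and 9 are summarised below)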
t_f1 = train_data.Survived[z.group == '1'].value_counts().sort_index()
t_f2 = train_data.Survived[z.group == '2'].value_counts().sort_index()
t_f3 = train_data.Survived[z.group == '3'].value_counts().sort_index()
t_f4 = train_data.Survived[z.group == '4'].value_counts().sort_index()
t_f5 = train_data.Survived[z.group == '5'].value_counts().sort_index()
t_f6 = train_data.Survived[z.group == '6'].value_counts().sort_index()
t_f7 = train_data.Survived[z.group == '7'].value_counts().sort_index()
t_f8 = train_data.Survived[z.group == '8'].value_counts().sort_index()
t_f9 = train_data.Survived[z.group == '9'].value_counts().sort_index()
z1 = pd.concat([t_f1, t_f2, t_f3, t_f4, t_f5, t_f9], axis = 1)
z1 = z1.fillna(0)
z1.index = ['not survived', 'survived']
z1.columns = ['low', 'median-low', 'median', 'median-high', 'high', 'extreme-high']
z1 = z1.T
z1['survive rate'] = z1['survived'] / (z1['survived'] + z1['not survived']) * 100
z2 = z1.drop(['not survived', 'survived'], axis = 1)
z2.plot(kind = 'line', ax=ax0, figsize = (15, 5))
ax0.set_title("Survived Rate based on Fare")
ax0.set_ylabel('Rate %')
# Survive Rate based on Siblings
t_s0 = train_data.Survived[train_data.SibSp == 0].value_counts().sort_index()
t_s1 = train_data.Survived[train_data.SibSp == 1].value_counts().sort_index()
t_s2 = train_data.Survived[train_data.SibSp == 2].value_counts().sort_index()
t_s3 = train_data.Survived[train_data.SibSp == 3].value_counts().sort_index()
t_s4 = train_data.Survived[train_data.SibSp == 4].value_counts().sort_index()
t_s5 = train_data.Survived[train_data.SibSp == 5].value_counts().sort_index()
t_s8 = train_data.Survived[train_data.SibSp == 8].value_counts().sort_index()
d = pd.concat([t_s0, t_s1, t_s2, t_s3, t_s4, t_s5, t_s8], axis = 1)
d.index = ['not survived', 'survived']
d.columns = ['S0', 'S1', 'S2', 'S3', 'S4', 'S5', 'S8']
d = d.fillna(0)
d = d.T
d['survived rate'] = d['survived'] / (d['survived'] + d['not survived']) * 100
d1 = d.drop(['not survived', 'survived'], axis = 1)
d1.plot(kind = 'line', ax = ax1 , figsize = (15, 5))
ax1.set_title('Survived Rate based on Siblings')
ax1.set_ylabel('Rate %')
# Survive Rate based on Parents
t_p0 = train_data.Survived[train_data.Parch == 0].value_counts().sort_index()
t_p1 = train_data.Survived[train_data.Parch == 1].value_counts().sort_index()
t_p2 = train_data.Survived[train_data.Parch == 2].value_counts().sort_index()
t_p3 = train_data.Survived[train_data.Parch == 3].value_counts().sort_index()
t_p4 = train_data.Survived[train_data.Parch == 4].value_counts().sort_index()
t_p5 = train_data.Survived[train_data.Parch == 5].value_counts().sort_index()
t_p6 = train_data.Survived[train_data.Parch == 6].value_counts().sort_index()
f = pd.concat([t_p0, t_p1, t_p2, t_p3, t_p4, t_p5, t_p6], axis = 1)
f = f.fillna(0)
f.index = ['not survived', 'survived']
f.columns = ['P0', 'P1', 'P2', 'P3', 'P4', 'P5', 'P6']
f = f.T
f['survived rate'] = f['survived'] / (f['survived'] + f['not survived']) * 100
f1 = f.drop(['not survived', 'survived'], axis = 1)
f1.plot(kind = 'line', ax = ax2, figsize = (15, 5))
ax2.set_title('Survived based on Parents')
ax2.set_ylabel('Rate %')
fig = plt.figure()
ax0 = fig.add_subplot(1, 4, 1)
ax1 = fig.add_subplot(1, 4, 2)
ax2 = fig.add_subplot(1, 4, 3)
ax3 = fig.add_subplot(1, 4, 4)
# Plot Survived information
a = pd.DataFrame(train_data.Survived.value_counts())
a.index = ['not survived', 'survived']
a.columns = ['number']
a.plot(kind = 'bar', ax = ax0)
ax0.set_title('Number of Survival')
# Plot Survived info of class Factor
t_p1 = train_data.Survived[train_data.Pclass == 1].value_counts().sort_index()
t_p2 = train_data.Survived[train_data.Pclass == 2].value_counts().sort_index()
t_p3 = train_data.Survived[train_data.Pclass == 3].value_counts().sort_index()
b = pd.concat([t_p1, t_p2, t_p3], axis = 1)
b.index = ['not survived', 'survived']
b.columns = ['class1', 'class2', 'class3']
b.plot(kind = 'bar', stacked = True, ax = ax1, figsize = (15, 5))
ax1.set_title('Number of Survival for class')
plt.xticks(rotation = 0)
# Plot Survived info of Sex Factor
tm = train_data.Survived[train_data.Sex == 1].value_counts().sort_index()
tf = train_data.Survived[train_data.Sex == 0].value_counts().sort_index()
e = pd.concat([tm, tf], axis = 1)
e.index = ['not survived', 'survived']
e.columns = ['male', 'female']
e.plot(kind = 'bar', stacked=True, ax=ax2)
ax2.set_title("Survived by Gender")
# Plot Survived info of Embarked Factors
t_es = train_data.Survived[train_data.Embarked == "S"].value_counts().sort_index()
t_ec = train_data.Survived[train_data.Embarked == "C"].value_counts().sort_index()
t_eq = train_data.Survived[train_data.Embarked == "Q"].value_counts().sort_index()
c = pd.concat([t_es, t_ec, t_eq], axis = 1)
c.index = ['not survived', 'survived']
c.columns = ['Embarked S', 'Embarked C', 'Embarked Q']
c.plot(kind = 'bar', stacked=True, ax=ax3, figsize = (15, 5))
ax3.set_title("Survived by Embarked")
train_data.info()
train_data = train_data.drop(['Name', 'PassengerId', 'Ticket'], axis=1)  # Cabin was dropped earlier
train_data.info()
# compose correlation map
float_columns = [x for x in train_data.columns if x not in ['Embarked', 'Sex']]
corr_max = train_data[float_columns].corr()
for x in range(len(float_columns)):
    corr_max.iloc[x, x] = 0.0
corr_max
corr_max.abs().idxmax()
import seaborn as sns
# construct the pair plot for the scaled variables
sns.set_context('notebook')
sns.pairplot(train_data[float_columns], hue = 'Survived', hue_order = [0, 1])
# skew data
skew_columns = (train_data[float_columns].skew().sort_values(ascending=False))
skew_columns = skew_columns.loc[skew_columns > 0.75]
skew_columns
# plot Skew data
fig = plt.figure()
ax0 = fig.add_subplot(1, 3, 1)
ax1 = fig.add_subplot(1, 3, 2)
ax2 = fig.add_subplot(1, 3, 3)
train_data.Fare.plot(kind='bar', ax = ax0, figsize = (15, 5))
ax0.set_title('Fare Distribution')
train_data.SibSp.plot(kind = 'bar', ax = ax1)
ax1.set_title("Sibling Number Distribution")
train_data.Parch.plot(kind='bar', ax = ax2)
ax2.set_title("Parent Number Distribution")
# Take log of skewed data
for col in skew_columns.index.tolist():
    train_data[col] = np.log1p(train_data[col])
train_data.head()
# Pre-Processing
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, MinMaxScaler
from sklearn.model_selection import learning_curve
mm = MinMaxScaler()
# Non-descriptive columns (PassengerId, Ticket, Cabin, Name) are dropped from the testing data below (test3)
# clean testing data set
edt = pd.get_dummies(test_data.Embarked)
test_data.loc[test_data['Sex'] != 'male', 'Sex'] = 0
test_data.loc[test_data['Sex'] == 'male', 'Sex'] = 1
test_data['Age'].fillna(test_data['Age'].mean(), inplace = True)
test1 = pd.concat([test_data, edt], axis = 1)
test2 = test1.drop(['Embarked'], axis=1)
test3 = test2.drop(['PassengerId', 'Ticket', 'Cabin', 'Name'], axis = 1)
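# fill the remaining missing Fare value(s) in the test set with the mean fare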
mean_value = test_data['Fare'].mean()
test3.fillna(value = mean_value, inplace = True)
test4 = mm.fit_transform(test3)
print(f"The Range of feature inputs are within {test4.min()} to {test4.max()}")
train_data.Embarked.unique()
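# the first column (Survived) is the target; columns 1-7 are the feature columns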
y_train = train_data.iloc[:, 0:1]
y_train.head()
x_train = train_data.iloc[:, 1:8]
x_train.head()
# get dummies for embark column
ed = pd.get_dummies(x_train.Embarked)
ed.head()
# concat the Embarked dummy columns with the training data
x_train2 = pd.concat([x_train, ed], axis = 1)
x_train2 = x_train2.drop(['Embarked'], axis=1)
x_train2
x = mm.fit_transform(x_train2)
# make sure all dataset is range from 0 to 1
print(f" The Range feature inputs are within {x.min()} to {x.max()}")
# Machine Learning Model - Logistic Regression
# import Lib.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.metrics import precision_score, recall_score, f1_score, precision_recall_fscore_support
LG = LogisticRegression(solver = "liblinear")  # liblinear supports both the l1 and l2 penalties
# set up a cross-validated grid search to find the best hyperparameters for the Logistic Regression model
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]}
log_model_cv = GridSearchCV(LG, grid, cv=10)
log_model_cv.fit(x, y_train)
log_model_cv.best_params_
# The best hyperparameters were found to be C = 1 and penalty = 'l2'
lgtuned = LogisticRegression(C = 1, penalty="l2")
lgtuned.fit(x, y_train)
y_pred = lgtuned.predict(x)
print(classification_report(y_train, y_pred))
print("Accuracy score: ", round(accuracy_score(y_train, y_pred), 2))
print("F1 score: ", round(f1_score(y_train, y_pred), 2))
y_pred = lgtuned.predict(test4)
y_test = pd.read_csv("./Titanic_shipwreck_dataset/gender_submission.csv")
y_test = y_test.iloc[:, -1]
# Print test score, Accuracy for Logistic Regression
print(classification_report(y_test, y_pred))
print("Accuracy score:", round(accuracy_score(y_test, y_pred), 2))
print("F1 Score: ", round(f1_score(y_test, y_pred), 2))
# Coefficient for each factor for survival rate
coef_dict = lgtuned.coef_
coef = pd.DataFrame(coef_dict).T
coef.columns = ['Coefficient']
coef.index = ['Pclass', 'Sex (male = 1)', 'Age', '# Siblings', '# Parents', 'Fare', 'Embarked C',
              'Embarked Q', 'Embarked S']
coef = coef.sort_values(['Coefficient'], ascending = True)
coef
coef.plot(kind = 'barh')
plt.title("Coefficient of Survival")
confusion_matrix(y_test, y_pred)
import seaborn as sns
sns.set_palette(sns.color_palette())
_, ax = plt.subplots(figsize = (12, 12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot = True, fmt='d', annot_kws = {"size": 40,
"weight": "bold"})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize=25)
ax.set_yticklabels(labels, fontsize=25)
ax.set_ylabel('Ground Truth', fontsize=30)
ax.set_xlabel('Prediction', fontsize=30)
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
# Search for best K number for modeling
max_k = 100
error_rates = list()
f1_scores = list()
for k in range(1, max_k):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn = knn.fit(x, y_train)
    y_pred = knn.predict(test4)
    f1_scores.append((k, round(f1_score(y_test, y_pred), 4)))
    error = 1 - round(accuracy_score(y_test, y_pred), 4)
    error_rates.append((k, error))
f1_results = pd.DataFrame(f1_scores, columns=['K', 'F1 Score'])
error_results = pd.DataFrame(error_rates, columns=['K', 'Error Rate'])
f1_results
error_results
fig = plt.figure()
ax0 = fig.add_subplot(2, 1, 1)
ax1 = fig.add_subplot(2, 1, 2)
f1_results.plot(kind='line', x='K', y='F1 Score', ax=ax0, figsize=(20, 10))
ax0.set_title('F1 Score', fontsize=15)
xinterval = np.arange(0, 101, 2)
ax0.set_xticks(xinterval)
error_results.plot(kind='line', x='K', y="Error Rate", ax = ax1, figsize=(20, 10))
ax1.set_title('Error Rate', fontsize=15)
# KNN performance score
knn = KNeighborsClassifier(n_neighbors=12, weights='distance')
knn = knn.fit(x, y_train)
y_pred = knn.predict(test4)
print(classification_report(y_test, y_pred))
print("Accuracy score:", round(accuracy_score(y_test, y_pred), 2))
print("F1 Score: ", round(f1_score(y_test, y_pred), 2))
# Plot confusion matrix
sns.set_palette(sns.color_palette())
_, ax= plt.subplots(figsize=(12, 12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
annot_kws={'size': 40, 'weight':'bold'})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize = 25)
ax.set_yticklabels(labels, fontsize = 25)
ax.set_xlabel("Prediction", fontsize=30)
ax.set_ylabel("Ground Truth", fontsize=30)
# Random Forest Model
# import lib.
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text, export_graphviz, plot_tree
RM = RandomForestClassifier()
# set up a cross-validated grid search for the best hyperparameters
param_grid = {'n_estimators': [2*n+1 for n in range(20)],
              'max_depth': [2*n+1 for n in range(10)],
              'max_features': ['sqrt', 'log2']}
search = GridSearchCV(RM, param_grid, cv = 5)
search = search.fit(x, y_train.values.ravel())
# Best score and parameters
print(search.best_score_)
print(search.best_params_)
# the best hyperparameters were found to be max_depth = 7 and max_features = 'sqrt'
random_forest = RandomForestClassifier(n_estimators=15, max_depth=10, max_features='sqrt')
random_forest = random_forest.fit(x, y_train)
y_pred = random_forest.predict(test4)
print(classification_report(y_test, y_pred))
print('Accuracy score: ', round(accuracy_score(y_test, y_pred), 2))
print("F1 score: ", round(f1_score(y_test, y_pred), 2))
# plot heat map
sns.set_palette(sns.color_palette())
_, ax= plt.subplots(figsize=(12,12))
ax = sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', annot_kws={
'size': 40, 'weight': 'bold'
})
labels = ['False', 'True']
ax.set_xticklabels(labels, fontsize=25)
ax.set_yticklabels(labels, fontsize=25)
ax.set_xlabel("Prediction", fontsize=30)
ax.set_ylabel("Ground Truth", fontsize=30)
# yellowbrick provides the feature-importance plot below
# (install it first: "pip install yellowbrick" from the shell, or "!pip install yellowbrick" in a notebook)
from yellowbrick.model_selection import FeatureImportances
visual = FeatureImportances(random_forest)
visual.fit(x_train2, y_train)
visual.show()
y_test_pred = lgtuned.predict(test4)
submission = pd.read_csv("./Titanic_shipwreck_dataset/gender_submission.csv" ,
index_col ='PassengerId')
submission['Survived'] = y_test_pred
submission
submission.to_csv("submission.csv")
# submission.csv stores the survival prediction for each PassengerId
# Thanks For Reading.