Question 06: Implement K-Means clustering / hierarchical clustering on the sales_data_sample.csv dataset. Determine the number of clusters using the elbow method.
How can K-Means Clustering be applied to the `sales_data_sample.csv` dataset, and how is the Elbow Method used to determine the optimal number of clusters?
To apply K-Means Clustering on sales_data_sample.csv:
- Preprocess the data: select relevant numerical features (e.g., SALES, QUANTITYORDERED), handle missing values, and scale the features using StandardScaler.
- Apply the Elbow Method:
  - Run K-Means for a range of k (e.g., 1 to 10).
  - For each k, calculate the inertia (sum of squared distances of samples to their closest cluster center).
  - Plot k against inertia.
  - The elbow point, where the rate of decrease changes sharply, marks the optimal k.
This approach helps segment the data into meaningful clusters such as sales performance tiers or customer groupings.
Write a short Python snippet to implement the Elbow Method for determining the number of clusters in the sales dataset using K-Means?
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Load and preprocess data
data = pd.read_csv('sales_data_sample.csv', encoding='latin1')  # the file is not UTF-8 encoded
X = data[['SALES', 'QUANTITYORDERED']].dropna()
X_scaled = StandardScaler().fit_transform(X)
# Elbow method
inertia = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertia.append(kmeans.inertia_)
# Plot
plt.plot(K_range, inertia, 'bo-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal k')
plt.show()
This code helps visualize the elbow point for optimal k selection, which can then be used to apply final clustering.
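If the elbow appears at, say, k = 3 (an assumed value; read it off your own plot), a minimal sketch of the final clustering step could look like this, reusing X and X_scaled from the snippet above:
# Final clustering with the k suggested by the elbow plot (k = 3 is assumed here)
final_k = 3
kmeans_final = KMeans(n_clusters=final_k, random_state=42)
labels = kmeans_final.fit_predict(X_scaled)
# Attach the cluster labels and summarise each cluster
clustered = X.copy()
clustered['CLUSTER'] = labels
print(clustered.groupby('CLUSTER')[['SALES', 'QUANTITYORDERED']].mean())
Grouping by the cluster label and averaging SALES and QUANTITYORDERED gives a quick read on which cluster corresponds to high-, mid-, and low-value orders.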
Programming Code: The following code goes in ML_P06.py:
# ML Project Program 06
# K-Means clustering/ hierarchical clustering on sales_data_sample.csv dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data = pd.read_csv("./salesdata_sample_dataset/sales_data_sample.csv", encoding = 'latin1')
data
data.info()
data.describe()
data.columns
data.shape
# keep two numeric features for clustering and drop rows with missing values
data = data[['QUANTITYORDERED', 'ORDERLINENUMBER']]
new_data = data.dropna(axis=0)
from sklearn.cluster import KMeans
import seaborn as sns
wcss = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering.fit(new_data)
    wcss.append(clustering.inertia_)
ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x = ks, y = wcss)
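# Reading the plot: WCSS drops quickly at first and then flattens; the k at the
# bend ("elbow") is the value to use for the final model.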
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (15, 5))
sns.scatterplot(ax = axes[0], data = new_data, x = 'QUANTITYORDERED', y = 'ORDERLINENUMBER').set_title('without clustering')
sns.scatterplot(ax = axes[1], data = new_data, x = 'QUANTITYORDERED', y = 'ORDERLINENUMBER', hue = clustering.labels_).set_title('Using Elbow Clustering Method')
new_data.describe().T
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
scaled = ss.fit_transform(new_data)  # standardize the features before re-running the elbow method
wcss_sc = []
for i in range(1, 11):
    clustering_sc = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering_sc.fit(scaled)
    wcss_sc.append(clustering_sc.inertia_)
ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x = ks, y = wcss_sc)
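# Note: the scatter plots below colour points with clustering.labels_ /
# clustering_sc.labels_ from the last loop iteration, i.e. k = 10. If the elbow
# plot suggests a smaller k (say 3 -- an assumed value), refit before plotting:
# clustering = KMeans(n_clusters=3, init='k-means++', random_state=42).fit(new_data)
# clustering_sc = KMeans(n_clusters=3, init='k-means++', random_state=42).fit(scaled)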
fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize = (15, 5))
sns.scatterplot(ax = axes[0], data = new_data, x = 'QUANTITYORDERED', y = 'ORDERLINENUMBER').set_title('without clustering')
sns.scatterplot(ax = axes[2], data = new_data, x = 'QUANTITYORDERED', y = 'ORDERLINENUMBER', hue = clustering.labels_).set_title('Using Elbow Clustering Method')
sns.scatterplot(ax = axes[1], data = new_data, x = 'QUANTITYORDERED', y = 'ORDERLINENUMBER', hue = clustering_sc.labels_).set_title('Using Elbow Clustering Method & Scaled Data')
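# The assignment also mentions hierarchical clustering. A minimal sketch using
# scipy's dendrogram and scikit-learn's AgglomerativeClustering is shown below;
# the cut at 3 clusters is an assumption, not a value derived from this dataset.
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
# Build the linkage matrix on the scaled features and draw a dendrogram
linked = linkage(scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked, truncate_mode='lastp', p=20)
plt.title('Hierarchical Clustering Dendrogram (Ward linkage)')
plt.xlabel('Merged clusters')
plt.ylabel('Distance')
plt.show()
# Cut the tree into an assumed 3 clusters and compare with the K-Means plots
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(scaled)
sns.scatterplot(data=new_data, x='QUANTITYORDERED', y='ORDERLINENUMBER', hue=agg_labels).set_title('Agglomerative clustering (scaled data)')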
# Thanks For Reading.