Using K-Means, Hierarchical Clustering, and DBSCAN to Group Customers

Emine Bozkus
Dec 23, 2022


Customer segmentation is a powerful tool for businesses looking to optimize their marketing and sales efforts. By dividing customers into smaller groups based on shared characteristics or behaviors, businesses can tailor their campaigns to specific segments, potentially increasing their effectiveness. One common approach to customer segmentation is the use of clustering algorithms. Clustering algorithms group data points into clusters based on similarities between the points. There are several popular clustering algorithms, including K-Means, hierarchical clustering, and DBSCAN.

K-Means is an iterative algorithm that divides a dataset into a specified number of clusters based on each point's distance from the cluster centroids. To use K-Means for customer segmentation, businesses input customer data into the algorithm and specify the number of clusters they want to create; the algorithm then groups the customers by similarity.

Hierarchical clustering groups data into a tree-like structure based on similarities between points. To use it for customer segmentation, businesses input customer data and choose the level of granularity at which the tree is cut into clusters.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks isolated points as noise. To use it for customer segmentation, businesses input customer data and specify the minimum number of points required to form a cluster and the maximum distance between neighboring points.

Using K-Means, hierarchical clustering, and DBSCAN for customer segmentation can help businesses understand their customer base and tailor their marketing and sales efforts accordingly. By grouping customers into meaningful segments, businesses can increase the effectiveness of their campaigns and ultimately improve their bottom line.
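Before turning to the case study below, here is a minimal, self-contained sketch of all three algorithms on synthetic data. Everything in it (toy features, parameter values) is illustrative rather than drawn from the FLO dataset.

# A minimal sketch: the three clustering algorithms applied to the same toy
# data. All names and parameter values here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic "customers" described by two numeric features.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)  # distance-based methods need scaled inputs

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_toy)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_toy)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_toy)  # -1 marks noise points

print(np.unique(km_labels), np.unique(hc_labels), np.unique(db_labels))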

Let’s now move from theory to practice in Python.

Business Problem: Segmentation of a Customer Portfolio

FLO wants to segment its customers and determine marketing strategies according to these segments. To this end, customer behaviors will be characterized, and groups will be formed around the clusters that emerge in those behaviors.

Dataset Story: Purchasing Behavior of FLO Customers

The dataset consists of information on customers who made their most recent purchases from FLO as OmniChannel shoppers (both online and offline) in 2020–2021. It contains 12 variables and 19,945 observations, and is about 2.7 MB.

  • master_id: Unique client number
  • order_channel: Which channel of the shopping platform is used (Android, iOS, Desktop, Mobile)
  • last_order_channel: The channel where the last purchase was made
  • first_order_date: The date of the first purchase made by the customer
  • last_order_date: The date of the customer’s last purchase
  • last_order_date_online: The date of the last purchase made by the customer on the online platform
  • last_order_date_offline: The date of the last purchase made by the customer on the offline platform
  • order_num_total_ever_online: The total number of purchases made by the customer on the online platform
  • order_num_total_ever_offline: Total number of purchases made by the customer offline
  • customer_value_total_ever_offline: The total price paid by the customer for offline purchases
  • customer_value_total_ever_online: The total price paid by the customer for their online shopping
  • interested_in_categories_12: List of categories the customer has purchased from in the last 12 months

1. Loading required libraries and data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans # K-Means
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer # Elbow method
from scipy.cluster.hierarchy import linkage, dendrogram # hierarchical clustering
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import AgglomerativeClustering # hierarchical clustering

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
import warnings
warnings.simplefilter(action='ignore', category=Warning)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('/kaggle/input/flodata/flo_data_20k.csv')

# RFM-style features. Note: pd.to_datetime('today') makes tenure and recency
# depend on the day the notebook is run; for reproducible results, a fixed
# analysis date is often preferred.
df['tenure'] = (pd.to_datetime('today') - pd.to_datetime(df['first_order_date'])).dt.days
df['recency'] = (pd.to_datetime('today') - pd.to_datetime(df['last_order_date'])).dt.days
df['frequency'] = df['order_num_total_ever_online'] + df['order_num_total_ever_offline']
df['monetary'] = df['customer_value_total_ever_online'] + df['customer_value_total_ever_offline']

# Channel totals (identical to frequency and monetary above, kept under the
# names used later in the analysis).
df["order_num_total"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
df["customer_value_total"] = df["customer_value_total_ever_offline"] + df["customer_value_total_ever_online"]

2. Exploratory Data Analysis

def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)
##################### Shape #####################
(19945, 18)
##################### Types #####################
master_id object
order_channel object
last_order_channel object
first_order_date object
last_order_date object
last_order_date_online object
last_order_date_offline object
order_num_total_ever_online float64
order_num_total_ever_offline float64
customer_value_total_ever_offline float64
customer_value_total_ever_online float64
interested_in_categories_12 object
tenure int64
recency int64
frequency float64
monetary float64
order_num_total float64
customer_value_total float64
dtype: object
##################### Head #####################
master_id order_channel last_order_channel first_order_date last_order_date last_order_date_online last_order_date_offline order_num_total_ever_online order_num_total_ever_offline customer_value_total_ever_offline customer_value_total_ever_online interested_in_categories_12 tenure recency frequency monetary order_num_total customer_value_total
0 cc294636-19f0-11eb-8d74-000d3a38a36f Android App Offline 2020-10-30 2021-02-26 2021-02-21 2021-02-26 4.0 1.0 139.99 799.38 [KADIN] 784 665 5.0 939.37 5.0 939.37
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f Android App Mobile 2017-02-08 2021-02-16 2021-02-16 2020-01-10 19.0 2.0 159.97 1853.58 [ERKEK, COCUK, KADIN, AKTIFSPOR] 2144 675 21.0 2013.55 21.0 2013.55
2 69b69676-1a40-11ea-941b-000d3a38a36f Android App Android App 2019-11-27 2020-11-27 2020-11-27 2019-12-01 3.0 2.0 189.97 395.35 [ERKEK, KADIN] 1122 756 5.0 585.32 5.0 585.32
3 1854e56c-491f-11eb-806e-000d3a38a36f Android App Android App 2021-01-06 2021-01-17 2021-01-17 2021-01-06 1.0 1.0 39.99 81.98 [AKTIFCOCUK, COCUK] 716 705 2.0 121.97 2.0 121.97
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f Desktop Desktop 2019-08-03 2021-03-07 2021-03-07 2019-08-03 1.0 1.0 49.99 159.99 [AKTIFSPOR] 1238 656 2.0 209.98 2.0 209.98
##################### Tail #####################
master_id order_channel last_order_channel first_order_date last_order_date last_order_date_online last_order_date_offline order_num_total_ever_online order_num_total_ever_offline customer_value_total_ever_offline customer_value_total_ever_online interested_in_categories_12 tenure recency frequency monetary order_num_total customer_value_total
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f Android App Offline 2019-09-21 2020-07-05 2020-06-05 2020-07-05 1.0 2.0 289.98 111.98 [ERKEK, AKTIFSPOR] 1189 901 3.0 401.96 3.0 401.96
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f Desktop Desktop 2020-03-01 2020-12-22 2020-12-22 2020-03-01 1.0 1.0 150.48 239.99 [AKTIFSPOR] 1027 731 2.0 390.47 2.0 390.47
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f Ios App Ios App 2019-09-11 2021-05-24 2021-05-24 2019-09-11 2.0 1.0 139.98 492.96 [AKTIFSPOR] 1199 578 3.0 632.94 3.0 632.94
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f Android App Android App 2019-03-27 2021-02-13 2021-02-13 2021-01-08 1.0 5.0 711.79 297.98 [ERKEK, AKTIFSPOR] 1367 678 6.0 1009.77 6.0 1009.77
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f Android App Android App 2019-09-03 2020-06-06 2020-06-06 2019-09-03 1.0 1.0 39.99 221.98 [KADIN, AKTIFSPOR] 1207 930 2.0 261.97 2.0 261.97
##################### NA #####################
master_id 0
order_channel 0
last_order_channel 0
first_order_date 0
last_order_date 0
last_order_date_online 0
last_order_date_offline 0
order_num_total_ever_online 0
order_num_total_ever_offline 0
customer_value_total_ever_offline 0
customer_value_total_ever_online 0
interested_in_categories_12 0
tenure 0
recency 0
frequency 0
monetary 0
order_num_total 0
customer_value_total 0
dtype: int64
##################### Quantiles #####################
0.00 0.05 0.50 0.95 0.99 1.00
order_num_total_ever_online 1.00 1.00 2.00 10.000 20.0000 200.00
order_num_total_ever_offline 1.00 1.00 1.00 4.000 7.0000 109.00
customer_value_total_ever_offline 10.00 39.99 179.98 694.222 1219.9468 18119.14
customer_value_total_ever_online 12.99 63.99 286.46 1556.726 3143.8104 45220.13
tenure 575.00 779.20 1221.00 2644.000 3175.0000 3630.00
recency 572.00 579.00 681.00 905.000 930.0000 937.00
frequency 2.00 2.00 4.00 12.000 22.0000 202.00
monetary 44.98 175.48 545.27 1921.924 3606.3556 45905.10
order_num_total 2.00 2.00 4.00 12.000 22.0000 202.00
customer_value_total 44.98 175.48 545.27 1921.924 3606.3556 45905.10
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)


def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)


def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")


def target_summary_with_cat(dataframe, target, categorical_col):
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")


def correlation_matrix(df, cols):
    fig = plt.gcf()
    fig.set_size_inches(10, 8)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    fig = sns.heatmap(df[cols].corr(), annot=True, linewidths=0.5, annot_kws={'size': 12}, linecolor='w', cmap='RdBu')
    plt.show(block=True)
  • df: This is the dataframe that contains the columns for which you want to calculate the correlations.
  • cols: This is a list of the column names that you want to include in the correlation matrix.
  • fig: This variable stores a reference to the current figure object. The figure object represents the overall window or page that the plot will be drawn on.
  • fig.set_size_inches: This method sets the size of the figure object in inches.
  • plt.xticks and plt.yticks: These functions control the appearance of the x-axis and y-axis tick labels.
  • sns.heatmap: This function from the seaborn library creates a heatmap visualization of the correlations between the columns in the dataframe. The annot parameter specifies whether to annotate the cells with the correlations, and the annot_kws parameter controls the appearance of the annotations. The linewidths parameter controls the width of the lines between cells, and the linecolor parameter sets the color of the lines. The cmap parameter specifies the color map to use for the heatmap.
  • plt.show: This function displays the plot. The block parameter specifies whether to block the script until the plot window is closed.
def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Returns the names of the categorical, numerical, and categorical-but-cardinal
    variables in the dataset.
    Note: numerical-looking categorical variables are also included in the
    categorical variables.

    Parameters
    ------
        dataframe: dataframe
            The dataframe from which variable names are to be retrieved
        cat_th: int, optional
            Class threshold for numerical-but-categorical variables
        car_th: int, optional
            Class threshold for categorical-but-cardinal variables

    Returns
    ------
        cat_cols: list
            List of categorical variables
        num_cols: list
            List of numerical variables
        cat_but_car: list
            List of categorical-looking cardinal variables

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    Notes
    ------
        num_but_cat is included in cat_cols.
        The three returned lists together cover all variables:
        cat_cols + num_cols + cat_but_car = total number of variables.

    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    # print(f"Observations: {dataframe.shape[0]}")
    # print(f"Variables: {dataframe.shape[1]}")
    # print(f'cat_cols: {len(cat_cols)}')
    # print(f'num_cols: {len(num_cols)}')
    # print(f'cat_but_car: {len(cat_but_car)}')
    # print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df, cat_th=5, car_th=20)
print(cat_cols)
print(num_cols)
print(cat_but_car)
  1. The function first creates a list cat_cols of columns whose data type is "O" (object), i.e., categorical variables.
  2. It then creates a list num_but_cat of columns that are numerical in type but categorical in practice (fewer than cat_th unique values).
  3. It then creates a list cat_but_car of object-type columns that are categorical but cardinal (more than car_th unique values).
  4. It adds the num_but_cat list to the cat_cols list and removes any column names that are also in the cat_but_car list.
  5. Finally, it creates a list num_cols of numerical columns that are not in the num_but_cat list.

The function also has optional parameters cat_th and car_th that control the thresholds for determining which columns are considered numerical but categorical or categorical but cardinal.

cat_cols
['order_channel', 'last_order_channel']
num_cols
['order_num_total_ever_online',
'order_num_total_ever_offline',
'customer_value_total_ever_offline',
'customer_value_total_ever_online',
'tenure',
'recency',
'frequency',
'monetary',
'order_num_total',
'customer_value_total']
# Correlation of numerical variables with each other
correlation_matrix(df, num_cols)

3. Data Preprocessing & Feature Engineering

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit


def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit


def check_outlier(dataframe, col_name, q1=0.25, q3=0.75):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False


def one_hot_encoder(dataframe, categorical_cols, drop_first=True):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe


def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe


binary_cols = [col for col in df.columns if df[col].dtype not in [int, float]
               and df[col].nunique() == 2]
for col in binary_cols:
    df = label_encoder(df, col)

ohe_cols = [col for col in df.columns if 25 >= df[col].nunique() > 2]

df = one_hot_encoder(df, ohe_cols)
cat_cols, num_cols, cat_but_car = grab_col_names(df)
cat_cols, num_cols, cat_but_car = grab_col_names(df)
  • outlier_thresholds: This function takes in a dataframe, a column name, and optional parameters q1 and q3 (default values are 0.25 and 0.75, respectively). It calculates the interquartile range (IQR) of the column and returns the upper and lower limits for detecting outliers based on the IQR. An outlier is defined as a value that is more than 1.5 times the IQR above the upper quartile or below the lower quartile. (A short usage sketch for these outlier helpers follows this list.)
  • replace_with_thresholds: This function takes in a dataframe and a column name. It calls the outlier_thresholds function to get the upper and lower limits for detecting outliers in the column. It then replaces any values in the column that are above the upper limit or below the lower limit with the respective limit.
  • check_outlier: This function takes in a dataframe, a column name, and optional parameters q1 and q3 (default values are 0.25 and 0.75, respectively). It calls the outlier_thresholds function to get the upper and lower limits for detecting outliers in the column. It then checks if there are any values in the column that are above the upper limit or below the lower limit, and returns True if any are found, or False if none are found.
  • one_hot_encoder: This function takes in a dataframe, a list of column names, and an optional parameter drop_first (default value is True). It converts the categorical variables in the specified columns into dummy variables using the pd.get_dummies function from the pandas library. The drop_first parameter specifies whether to drop the first dummy variable in each column to avoid the dummy variable trap.
  • label_encoder: This function takes in a dataframe and a column name. It converts the values in the specified column into numerical values using the LabelEncoder class from the sklearn library.
  • binary_cols: This line of code creates a list of column names for binary variables (i.e., variables with only two unique values) in the dataframe.
  • label_encoder: This line of code calls the label_encoder function on each column in the binary_cols list, converting the values in these columns into numerical values.
  • one_hot_encoder: This line of code calls the one_hot_encoder function on the dataframe, using a list of column names for categorical variables with a relatively small number of unique values.
  • grab_col_names: This line of code calls the grab_col_names function on the dataframe, returning lists of column names for categorical variables, numerical variables, and categorical variables that are also cardinal (have a large number of unique values).
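The outlier helpers above are defined but not invoked later in this walkthrough. The sketch below shows how they would typically be applied to the numerical columns before scaling; treat it as an optional step rather than part of the original pipeline.

# A hedged usage sketch: cap IQR-based outliers in the numeric columns.
for col in num_cols:
    if check_outlier(df, col):  # any values outside the IQR-based limits?
        replace_with_thresholds(df, col)  # cap them at the limits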
df.shape
(19945, 23)

4. Customer Segmentation with K-Means

K-Means is a clustering algorithm used to divide a dataset into a chosen number of clusters. Because it works on the coordinates of the points, it can only be applied to numerical variables. The algorithm maintains a center (centroid) for each cluster, computed as the mean of the coordinates of the points assigned to that cluster, and it assigns each point to the nearest center, alternating these two steps until the assignments stop changing.
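As a quick illustration of these two alternating steps (an added sketch, not part of the original workflow; all names and values here are illustrative), the core of K-Means can be written directly in NumPy:

# A toy NumPy sketch of the assignment and update steps of K-Means.
rng = np.random.default_rng(42)
points = rng.random((100, 2))  # 100 toy customers with 2 numeric features
k = 3
centroids = points[rng.choice(len(points), k, replace=False)]  # random initialization

for _ in range(10):  # a handful of iterations suffices on this toy data
    # Step 1: assign each point to its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it.
    # (A production implementation would also handle empty clusters.)
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)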

Figure. K-Means Clustering Algorithm
sc = StandardScaler()
X = sc.fit_transform(df[num_cols])
X = pd.DataFrame(X, columns=num_cols)
X.head()

For the K-Means algorithm to work well, the optimal number of clusters must be determined. The Elbow method can be used for this: the WCSS (within-cluster sum of squares) is computed for a range of cluster counts. WCSS decreases as the number of clusters grows, but beyond some point the decrease slows down sharply, and adding further clusters yields little benefit. That point is called the elbow point.

wcss = []  # list to hold the WCSS value for each k
for k in range(1, 15):  # try cluster counts k = 1, ..., 14
    kmeans = KMeans(n_clusters=k).fit(X)  # fit K-Means with k clusters
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS

plt.plot(range(1, 15), wcss, 'bx-')  # plot WCSS against k
plt.xlabel('k values')
plt.ylabel('WCSS')
plt.title('The Elbow Method')
plt.show()
kmeans = KMeans()
elbow = KElbowVisualizer(kmeans, k=(2, 20))
elbow.fit(X)
elbow.show(block=True)

elbow.elbow_value_
kmeans = KMeans(n_clusters=elbow.elbow_value_, init='k-means++').fit(X)
kmeans.cluster_centers_ # Indicates the centers of clusters.
kmeans.n_clusters  # Indicates the number of clusters.
8
kmeans.labels_
array([6, 4, 2, ..., 6, 7, 2], dtype=int32)
kmeans.inertia_  # Displays the WCSS value.
65766.97770803791
kmeans.get_params()  # With get_params() we can see the parameters of the kmeans model.
{'algorithm': 'auto',
'copy_x': True,
'init': 'k-means++',
'max_iter': 300,
'n_clusters': 8,
'n_init': 10,
'random_state': None,
'tol': 0.0001,
'verbose': 0}
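Beyond the elbow method, a common sanity check on the chosen k is the silhouette score (not used in the original workflow; added here as a suggestion). Scores closer to 1 indicate denser, better-separated clusters.

# Optional sanity check on the chosen k (assumption, not in the original).
from sklearn.metrics import silhouette_score

score = silhouette_score(X, kmeans.labels_)  # mean silhouette over all points
print(f"Silhouette score for k={kmeans.n_clusters}: {score:.3f}")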

Build your model and segment your customers

clusters_kmeans = kmeans.labels_  # cluster label assigned to each observation
X["cluster"] = clusters_kmeans  # add a variable named "cluster" to X
X.head()
X.groupby('cluster').agg(['mean', 'median', 'count', 'std']).T
X['cluster'] = X['cluster'] + 1  # shift labels so clusters are numbered from 1
X.head()
X["cluster"].value_counts()  # number of observations in each cluster
X["cluster"].value_counts() / len(X) * 100
7    43.474555
3    27.480572
4     9.917272
8     8.578591
1     8.373026
5     2.070694
6     0.070193
2     0.035097
Name: cluster, dtype: float64
sns.countplot(x='cluster', data=X)
plt.show()
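One possible follow-up (an assumption, not shown in the original article) is to attach the K-Means labels to the unscaled dataframe so each segment can be profiled in interpretable units such as days, order counts, and spend; the kmeans_cluster column name below is hypothetical.

# Profile the segments in original units (illustrative addition).
df["kmeans_cluster"] = clusters_kmeans + 1  # match the 1-based labels used above
print(df.groupby("kmeans_cluster").agg({"recency": "mean",
                                        "frequency": "mean",
                                        "monetary": "mean"}))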

5. Customer Segmentation with Hierarchical Clustering

Hierarchical clustering (HC) is a method used to separate data points into groups with similar characteristics by relating points to one another; each of the resulting groups is called a cluster.

HC is often described in terms of a linkage matrix. The method measures the degree of connectivity between data points, records in the linkage matrix the distances at which points and clusters merge, and groups the data points according to these distances.

linkage_matrix = linkage(X, method='ward')

# Create the dendrogram
dend = dendrogram(linkage_matrix)

# Show the dendrogram
plt.show()

In this snippet, the linkage() function from the scipy library builds a linkage matrix that records the distances at which data points merge. A dendrogram is then drawn from this matrix with scipy's dendrogram() function and displayed with matplotlib's show(). The method parameter is set to 'ward' here, so Ward linkage is used; it can also take values such as 'single', 'complete', or 'average', each of which produces a dendrogram from a different linkage rule.

hc_average = linkage(X, "average")    # linkage matrix with average linkage
hc_ward = linkage(X, "ward")          # linkage matrix with Ward linkage
hc_complete = linkage(X, "complete")  # linkage matrix with complete linkage
hc_single = linkage(X, "single")      # linkage matrix with single linkage
hc_centroid = linkage(X, "centroid")  # linkage matrix with centroid linkage

plt.figure(figsize=(7, 5))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Observation Units")
plt.ylabel("Distances")
dendrogram(hc_average,
           truncate_mode="lastp",  # show only the last p merged clusters
           p=10,
           show_contracted=True,
           leaf_font_size=10)
plt.show()
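As an alternative to refitting with scikit-learn below, flat cluster labels can also be cut directly from a scipy linkage matrix with fcluster (an added aside; the choice of 3 clusters is illustrative):

# Cut the hierarchical tree into a fixed number of flat clusters.
from scipy.cluster.hierarchy import fcluster

labels_from_tree = fcluster(hc_ward, t=3, criterion="maxclust")  # force 3 clusters
print(pd.Series(labels_from_tree).value_counts())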

We are using the AgglomerativeClustering class from scikit-learn to perform hierarchical clustering on a dataset X. We are setting the number of clusters to 3 and using the Euclidean distance and the Ward linkage method for clustering.

The fit_predict method fits the model to the data and returns the cluster labels for each sample in the dataset. The cluster labels are stored in the clusters_hc array.

After running this code, we can use the clusters_hc array to see the cluster labels for each sample in the dataset. For example, if the first sample has a cluster label of 0, it belongs to the first cluster. If the second sample has a cluster label of 1, it belongs to the second cluster, and so on.

hc = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
# Note: in scikit-learn >= 1.2 the 'affinity' parameter is renamed to 'metric'.
# Drop the K-Means label column so only the features are clustered.
clusters_hc = hc.fit_predict(X.drop(columns=["cluster"], errors="ignore"))
X["cluster_hc"] = clusters_hc + 1  # attach the labels, numbered from 1 as before
clusters_hc
X.groupby('cluster_hc').agg(['mean', 'median', 'count', 'std']).T
X["cluster_hc"].value_counts()                 # It shows the number of observations belonging to each cluster.
X["cluster_hc"].value_counts() / len(X) * 100
2    52.053146
3    35.853597
1    12.093256
Name: cluster_hc, dtype: float64
sns.countplot(x='cluster_hc', data=X)
plt.show()

6. Customer Segmentation with DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that is based on the density of data points. It can be used for customer segmentation by finding groups of similar customers based on their characteristics or attributes.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.8, min_samples=15)
# Drop the label columns added earlier so only the scaled features are clustered.
clusters = dbscan.fit_predict(X.drop(columns=["cluster", "cluster_hc"], errors="ignore"))
print(clusters)
X["cluster_dbscan"] = clusters  # attach the labels; noise points, if any, are labeled -1
df['dbscan_cluster'] = clusters
df.head()

Use the DBSCAN class from scikit-learn to fit the model to your data. You’ll need to specify two important parameters: eps and min_samples. eps is the maximum distance between two points for one to count as a neighbor of the other, and min_samples is the minimum number of neighbors a point needs in order to be treated as a core point of a cluster.
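A common heuristic for choosing eps (not part of the original article; added as a suggestion) is the k-distance plot: sort each point's distance to its k-th nearest neighbor, with k equal to min_samples, and look for a knee where the curve bends sharply.

# k-distance plot for eps selection (illustrative addition).
from sklearn.neighbors import NearestNeighbors

k = 15  # same value as min_samples above
features = X.drop(columns=["cluster", "cluster_hc", "cluster_dbscan"], errors="ignore")
nn = NearestNeighbors(n_neighbors=k).fit(features)
distances, _ = nn.kneighbors(features)  # distances to the k nearest neighbors
plt.plot(sorted(distances[:, k - 1]))  # k-th neighbor distance, sorted ascending
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}th nearest neighbor")
plt.show()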

X["cluster_dbscan"].value_counts()
1    8662
2    5451
0    2215
5    1289
3    1170
4    1148
6      10
Name: cluster_dbscan, dtype: int64
sns.countplot(x='cluster_dbscan', data=X)
plt.show()

Conclusion

Segmenting customers into different groups using their characteristics and behaviors has always been an important topic. Customer segmentation can lead to better customer understanding and targeting, which in turn leads to more effective product tailoring and marketing strategies. Data mining methods are powerful techniques that can be used in customer segmentation to find customers with similar characteristics.

Thanks for reading this article. You can access the full code for this project and my other projects on my GitHub or Kaggle account. Happy coding!

Please feel free to contact me if you need any further information.

