Using K-Means, Hierarchical Clustering, and DBSCAN to Group Customers

Emine Bozkus
Dec 23, 2022


Customer segmentation is a powerful tool for businesses looking to optimize their marketing and sales efforts. By dividing customers into smaller groups based on shared characteristics or behaviors, businesses can tailor their campaigns to specific segments, potentially increasing their effectiveness. One common approach to customer segmentation is the use of clustering algorithms. Clustering algorithms group data points into clusters based on similarities between the points. There are several popular clustering algorithms, including K-Means, hierarchical clustering, and DBSCAN.

K-Means is an iterative algorithm that divides a dataset into a specified number of clusters based on each point's distance from the cluster centroids. To use K-Means for customer segmentation, businesses input customer data into the algorithm and specify the number of clusters they want to create; the algorithm then groups the customers by similarity.

Hierarchical clustering groups data into a tree-like structure based on similarities between points. To use it for customer segmentation, businesses input customer data and choose the level of granularity at which the tree is cut into clusters.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks isolated points as noise. To use it for customer segmentation, businesses input customer data and specify the minimum number of points required to form a cluster and the maximum distance between neighboring points.

Using K-Means, hierarchical clustering, and DBSCAN for customer segmentation can help businesses understand their customer base and tailor their marketing and sales efforts accordingly. By grouping customers into meaningful segments, businesses can increase the effectiveness of their campaigns and ultimately improve their bottom line.
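Before turning to the case study below, here is a minimal, self-contained sketch of all three algorithms on synthetic data. Everything in it (toy features, parameter values) is illustrative rather than drawn from the FLO dataset.

# A minimal sketch: the three clustering algorithms applied to the same toy
# data. All names and parameter values here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic "customers" described by two numeric features.
X_toy, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)  # distance-based methods need scaled inputs

km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_toy)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_toy)
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_toy)  # -1 marks noise points

print(np.unique(km_labels), np.unique(hc_labels), np.unique(db_labels))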

Let’s now move from theory to practice in Python.

Business Problem: Segmentation of a Customer Portfolio

FLO wants to segment its customers and determine marketing strategies according to these segments. To this end, customer behaviors will be characterized, and groups will be formed around the clusters that emerge in those behaviors.

Dataset Story: Purchasing Behavior of FLO Customers

The dataset consists of information on customers who made their most recent purchases from FLO as OmniChannel shoppers (both online and offline) in 2020–2021. It contains 12 variables and 19,945 observations, and is about 2.7 MB.

  • master_id: Unique client number
  • order_channel: Which channel of the shopping platform is used (Android, iOS, Desktop, Mobile)
  • last_order_channel: The channel where the last purchase was made
  • first_order_date: The date of the first purchase made by the customer
  • last_order_date: The date of the customer’s last purchase
  • last_order_date_online: The date of the last purchase made by the customer on the online platform
  • last_order_date_offline: The date of the last purchase made by the customer on the offline platform
  • order_num_total_ever_online: The total number of purchases made by the customer on the online platform
  • order_num_total_ever_offline: Total number of purchases made by the customer offline
  • customer_value_total_ever_offline: The total price paid by the customer for offline purchases
  • customer_value_total_ever_online: The total price paid by the customer for their online shopping
  • interested_in_categories_12: List of categories the customer has purchased from in the last 12 months

1. Loading required libraries and data

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans # K-Means
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer # Elbow method
from scipy.cluster.hierarchy import linkage, dendrogram # hierarchical clustering
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import AgglomerativeClustering # hierarchical clustering

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
import warnings
warnings.simplefilter(action='ignore', category=Warning)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

df = pd.read_csv('/kaggle/input/flodata/flo_data_20k.csv')

# RFM-style features. Note: pd.to_datetime('today') makes tenure and recency
# depend on the day the notebook is run; for reproducible results, a fixed
# analysis date is often preferred.
df['tenure'] = (pd.to_datetime('today') - pd.to_datetime(df['first_order_date'])).dt.days
df['recency'] = (pd.to_datetime('today') - pd.to_datetime(df['last_order_date'])).dt.days
df['frequency'] = df['order_num_total_ever_online'] + df['order_num_total_ever_offline']
df['monetary'] = df['customer_value_total_ever_online'] + df['customer_value_total_ever_offline']

# Channel totals (identical to frequency and monetary above, kept under the
# names used later in the analysis).
df["order_num_total"] = df["order_num_total_ever_online"] + df["order_num_total_ever_offline"]
df["customer_value_total"] = df["customer_value_total_ever_offline"] + df["customer_value_total_ever_online"]

2. Exploratory Data Analysis

def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(df)
##################### Shape #####################
(19945, 18)
##################### Types #####################
master_id object
order_channel object
last_order_channel object
first_order_date object
last_order_date object
last_order_date_online object
last_order_date_offline object
order_num_total_ever_online float64
order_num_total_ever_offline float64
customer_value_total_ever_offline float64
customer_value_total_ever_online float64
interested_in_categories_12 object
tenure int64
recency int64
frequency float64
monetary float64
order_num_total float64
customer_value_total float64
dtype: object
##################### Head #####################
master_id order_channel last_order_channel first_order_date last_order_date last_order_date_online last_order_date_offline order_num_total_ever_online order_num_total_ever_offline customer_value_total_ever_offline customer_value_total_ever_online interested_in_categories_12 tenure recency frequency monetary order_num_total customer_value_total
0 cc294636-19f0-11eb-8d74-000d3a38a36f Android App Offline 2020-10-30 2021-02-26 2021-02-21 2021-02-26 4.0 1.0 139.99 799.38 [KADIN] 784 665 5.0 939.37 5.0 939.37
1 f431bd5a-ab7b-11e9-a2fc-000d3a38a36f Android App Mobile 2017-02-08 2021-02-16 2021-02-16 2020-01-10 19.0 2.0 159.97 1853.58 [ERKEK, COCUK, KADIN, AKTIFSPOR] 2144 675 21.0 2013.55 21.0 2013.55
2 69b69676-1a40-11ea-941b-000d3a38a36f Android App Android App 2019-11-27 2020-11-27 2020-11-27 2019-12-01 3.0 2.0 189.97 395.35 [ERKEK, KADIN] 1122 756 5.0 585.32 5.0 585.32
3 1854e56c-491f-11eb-806e-000d3a38a36f Android App Android App 2021-01-06 2021-01-17 2021-01-17 2021-01-06 1.0 1.0 39.99 81.98 [AKTIFCOCUK, COCUK] 716 705 2.0 121.97 2.0 121.97
4 d6ea1074-f1f5-11e9-9346-000d3a38a36f Desktop Desktop 2019-08-03 2021-03-07 2021-03-07 2019-08-03 1.0 1.0 49.99 159.99 [AKTIFSPOR] 1238 656 2.0 209.98 2.0 209.98
##################### Tail #####################
master_id order_channel last_order_channel first_order_date last_order_date last_order_date_online last_order_date_offline order_num_total_ever_online order_num_total_ever_offline customer_value_total_ever_offline customer_value_total_ever_online interested_in_categories_12 tenure recency frequency monetary order_num_total customer_value_total
19940 727e2b6e-ddd4-11e9-a848-000d3a38a36f Android App Offline 2019-09-21 2020-07-05 2020-06-05 2020-07-05 1.0 2.0 289.98 111.98 [ERKEK, AKTIFSPOR] 1189 901 3.0 401.96 3.0 401.96
19941 25cd53d4-61bf-11ea-8dd8-000d3a38a36f Desktop Desktop 2020-03-01 2020-12-22 2020-12-22 2020-03-01 1.0 1.0 150.48 239.99 [AKTIFSPOR] 1027 731 2.0 390.47 2.0 390.47
19942 8aea4c2a-d6fc-11e9-93bc-000d3a38a36f Ios App Ios App 2019-09-11 2021-05-24 2021-05-24 2019-09-11 2.0 1.0 139.98 492.96 [AKTIFSPOR] 1199 578 3.0 632.94 3.0 632.94
19943 e50bb46c-ff30-11e9-a5e8-000d3a38a36f Android App Android App 2019-03-27 2021-02-13 2021-02-13 2021-01-08 1.0 5.0 711.79 297.98 [ERKEK, AKTIFSPOR] 1367 678 6.0 1009.77 6.0 1009.77
19944 740998d2-b1f7-11e9-89fa-000d3a38a36f Android App Android App 2019-09-03 2020-06-06 2020-06-06 2019-09-03 1.0 1.0 39.99 221.98 [KADIN, AKTIFSPOR] 1207 930 2.0 261.97 2.0 261.97
##################### NA #####################
master_id 0
order_channel 0
last_order_channel 0
first_order_date 0
last_order_date 0
last_order_date_online 0
last_order_date_offline 0
order_num_total_ever_online 0
order_num_total_ever_offline 0
customer_value_total_ever_offline 0
customer_value_total_ever_online 0
interested_in_categories_12 0
tenure 0
recency 0
frequency 0
monetary 0
order_num_total 0
customer_value_total 0
dtype: int64
##################### Quantiles #####################
0.00 0.05 0.50 0.95 0.99 1.00
order_num_total_ever_online 1.00 1.00 2.00 10.000 20.0000 200.00
order_num_total_ever_offline 1.00 1.00 1.00 4.000 7.0000 109.00
customer_value_total_ever_offline 10.00 39.99 179.98 694.222 1219.9468 18119.14
customer_value_total_ever_online 12.99 63.99 286.46 1556.726 3143.8104 45220.13
tenure 575.00 779.20 1221.00 2644.000 3175.0000 3630.00
recency 572.00 579.00 681.00 905.000 930.0000 937.00
frequency 2.00 2.00 4.00 12.000 22.0000 202.00
monetary 44.98 175.48 545.27 1921.924 3606.3556 45905.10
order_num_total 2.00 2.00 4.00 12.000 22.0000 202.00
customer_value_total 44.98 175.48 545.27 1921.924 3606.3556 45905.10
def cat_summary(dataframe, col_name, plot=False):
    print(pd.DataFrame({col_name: dataframe[col_name].value_counts(),
                        "Ratio": 100 * dataframe[col_name].value_counts() / len(dataframe)}))
    print("##########################################")
    if plot:
        sns.countplot(x=dataframe[col_name], data=dataframe)
        plt.show(block=True)


def num_summary(dataframe, numerical_col, plot=False):
    quantiles = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
    print(dataframe[numerical_col].describe(quantiles).T)

    if plot:
        dataframe[numerical_col].hist(bins=20)
        plt.xlabel(numerical_col)
        plt.title(numerical_col)
        plt.show(block=True)


def target_summary_with_num(dataframe, target, numerical_col):
    print(dataframe.groupby(target).agg({numerical_col: "mean"}), end="\n\n\n")


def target_summary_with_cat(dataframe, target, categorical_col):
    print(pd.DataFrame({"TARGET_MEAN": dataframe.groupby(categorical_col)[target].mean()}), end="\n\n\n")


def correlation_matrix(df, cols):
    fig = plt.gcf()
    fig.set_size_inches(10, 8)
    plt.xticks(fontsize=10)
    plt.yticks(fontsize=10)
    fig = sns.heatmap(df[cols].corr(), annot=True, linewidths=0.5, annot_kws={'size': 12}, linecolor='w', cmap='RdBu')
    plt.show(block=True)
  • df: This is the dataframe that contains the columns for which you want to calculate the correlations.
  • cols: This is a list of the column names that you want to include in the correlation matrix.
  • fig: This variable stores a reference to the current figure object. The figure object represents the overall window or page that the plot will be drawn on.
  • fig.set_size_inches: This method sets the size of the figure object in inches.
  • plt.xticks and plt.yticks: These functions control the appearance of the x-axis and y-axis tick labels.
  • sns.heatmap: This function from the seaborn library creates a heatmap visualization of the correlations between the columns in the dataframe. The annot parameter specifies whether to annotate the cells with the correlations, and the annot_kws parameter controls the appearance of the annotations. The linewidths parameter controls the width of the lines between cells, and the linecolor parameter sets the color of the lines. The cmap parameter specifies the color map to use for the heatmap.
  • plt.show: This function displays the plot. The block parameter specifies whether to block the script until the plot window is closed.
def grab_col_names(dataframe, cat_th=10, car_th=20):
    """
    Returns the names of the categorical, numerical, and categorical-but-cardinal
    variables in the dataset.
    Note: numerical-looking categorical variables are also included in the
    categorical variables.

    Parameters
    ------
        dataframe: dataframe
            The dataframe from which variable names are to be retrieved
        cat_th: int, optional
            Class threshold for numerical-but-categorical variables
        car_th: int, optional
            Class threshold for categorical-but-cardinal variables

    Returns
    ------
        cat_cols: list
            List of categorical variables
        num_cols: list
            List of numerical variables
        cat_but_car: list
            List of categorical-looking cardinal variables

    Examples
    ------
        import seaborn as sns
        df = sns.load_dataset("iris")
        print(grab_col_names(df))

    Notes
    ------
        num_but_cat is included in cat_cols.
        The three returned lists together cover all variables:
        cat_cols + num_cols + cat_but_car = total number of variables.

    """

    # cat_cols, cat_but_car
    cat_cols = [col for col in dataframe.columns if dataframe[col].dtypes == "O"]
    num_but_cat = [col for col in dataframe.columns if dataframe[col].nunique() < cat_th and
                   dataframe[col].dtypes != "O"]
    cat_but_car = [col for col in dataframe.columns if dataframe[col].nunique() > car_th and
                   dataframe[col].dtypes == "O"]
    cat_cols = cat_cols + num_but_cat
    cat_cols = [col for col in cat_cols if col not in cat_but_car]

    # num_cols
    num_cols = [col for col in dataframe.columns if dataframe[col].dtypes != "O"]
    num_cols = [col for col in num_cols if col not in num_but_cat]

    # print(f"Observations: {dataframe.shape[0]}")
    # print(f"Variables: {dataframe.shape[1]}")
    # print(f'cat_cols: {len(cat_cols)}')
    # print(f'num_cols: {len(num_cols)}')
    # print(f'cat_but_car: {len(cat_but_car)}')
    # print(f'num_but_cat: {len(num_but_cat)}')
    return cat_cols, num_cols, cat_but_car

cat_cols, num_cols, cat_but_car = grab_col_names(df, cat_th=5, car_th=20)
print(cat_cols)
print(num_cols)
print(cat_but_car)
  1. The function first creates a list cat_cols of columns whose data type is "O" (object), i.e., categorical variables.
  2. It then creates a list num_but_cat of columns that are numerical in type but categorical in practice (fewer than cat_th unique values).
  3. It then creates a list cat_but_car of object-type columns that are categorical but cardinal (more than car_th unique values).
  4. It adds the num_but_cat list to the cat_cols list and removes any column names that are also in the cat_but_car list.
  5. Finally, it creates a list num_cols of numerical columns that are not in the num_but_cat list.

The function also has optional parameters cat_th and car_th that control the thresholds for determining which columns are considered numerical but categorical or categorical but cardinal.

cat_cols
['order_channel', 'last_order_channel']
num_cols
['order_num_total_ever_online',
'order_num_total_ever_offline',
'customer_value_total_ever_offline',
'customer_value_total_ever_online',
'tenure',
'recency',
'frequency',
'monetary',
'order_num_total',
'customer_value_total']
# Correlation of numerical variables with each other
correlation_matrix(df, num_cols)

3. Data Preprocessing & Feature Engineering

def outlier_thresholds(dataframe, col_name, q1=0.25, q3=0.75):
    quartile1 = dataframe[col_name].quantile(q1)
    quartile3 = dataframe[col_name].quantile(q3)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit


def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_thresholds(dataframe, variable)
    dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit


def check_outlier(dataframe, col_name, q1=0.25, q3=0.75):
    low_limit, up_limit = outlier_thresholds(dataframe, col_name, q1, q3)
    if dataframe[(dataframe[col_name] > up_limit) | (dataframe[col_name] < low_limit)].any(axis=None):
        return True
    else:
        return False


def one_hot_encoder(dataframe, categorical_cols, drop_first=True):
    dataframe = pd.get_dummies(dataframe, columns=categorical_cols, drop_first=drop_first)
    return dataframe


def label_encoder(dataframe, binary_col):
    labelencoder = LabelEncoder()
    dataframe[binary_col] = labelencoder.fit_transform(dataframe[binary_col])
    return dataframe


binary_cols = [col for col in df.columns if df[col].dtype not in [int, float]
               and df[col].nunique() == 2]
for col in binary_cols:
    df = label_encoder(df, col)

ohe_cols = [col for col in df.columns if 25 >= df[col].nunique() > 2]

df = one_hot_encoder(df, ohe_cols)
cat_cols, num_cols, cat_but_car = grab_col_names(df)
cat_cols, num_cols, cat_but_car = grab_col_names(df)
  • outlier_thresholds: This function takes in a dataframe, a column name, and optional parameters q1 and q3 (default values are 0.25 and 0.75, respectively). It calculates the interquartile range (IQR) of the column and returns the upper and lower limits for detecting outliers based on the IQR. An outlier is defined as a value that is more than 1.5 times the IQR above the upper quartile or below the lower quartile. (A short usage sketch for these outlier helpers follows this list.)
  • replace_with_thresholds: This function takes in a dataframe and a column name. It calls the outlier_thresholds function to get the upper and lower limits for detecting outliers in the column. It then replaces any values in the column that are above the upper limit or below the lower limit with the respective limit.
  • check_outlier: This function takes in a dataframe, a column name, and optional parameters q1 and q3 (default values are 0.25 and 0.75, respectively). It calls the outlier_thresholds function to get the upper and lower limits for detecting outliers in the column. It then checks if there are any values in the column that are above the upper limit or below the lower limit, and returns True if any are found, or False if none are found.
  • one_hot_encoder: This function takes in a dataframe, a list of column names, and an optional parameter drop_first (default value is True). It converts the categorical variables in the specified columns into dummy variables using the pd.get_dummies function from the pandas library. The drop_first parameter specifies whether to drop the first dummy variable in each column to avoid the dummy variable trap.
  • label_encoder: This function takes in a dataframe and a column name. It converts the values in the specified column into numerical values using the LabelEncoder class from the sklearn library.
  • binary_cols: This line of code creates a list of column names for binary variables (i.e., variables with only two unique values) in the dataframe.
  • label_encoder: This line of code calls the label_encoder function on each column in the binary_cols list, converting the values in these columns into numerical values.
  • one_hot_encoder: This line of code calls the one_hot_encoder function on the dataframe, using a list of column names for categorical variables with a relatively small number of unique values.
  • grab_col_names: This line of code calls the grab_col_names function on the dataframe, returning lists of column names for categorical variables, numerical variables, and categorical variables that are also cardinal (have a large number of unique values).
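The outlier helpers above are defined but not invoked later in this walkthrough. The sketch below shows how they would typically be applied to the numerical columns before scaling; treat it as an optional step rather than part of the original pipeline.

# A hedged usage sketch: cap IQR-based outliers in the numeric columns.
for col in num_cols:
    if check_outlier(df, col):  # any values outside the IQR-based limits?
        replace_with_thresholds(df, col)  # cap them at the limits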
df.shape
(19945, 23)

4. Customer Segmentation with K-Means

K-Means is a clustering algorithm used to divide a dataset into a chosen number of clusters. Because it works on the coordinates of the points, it can only be applied to numerical variables. The algorithm maintains a center (centroid) for each cluster, computed as the mean of the coordinates of the points assigned to that cluster, and it assigns each point to the nearest center, alternating these two steps until the assignments stop changing.
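As a quick illustration of these two alternating steps (an added sketch, not part of the original workflow; all names and values here are illustrative), the core of K-Means can be written directly in NumPy:

# A toy NumPy sketch of the assignment and update steps of K-Means.
rng = np.random.default_rng(42)
points = rng.random((100, 2))  # 100 toy customers with 2 numeric features
k = 3
centroids = points[rng.choice(len(points), k, replace=False)]  # random initialization

for _ in range(10):  # a handful of iterations suffices on this toy data
    # Step 1: assign each point to its nearest centroid.
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 2: move each centroid to the mean of the points assigned to it.
    # (A production implementation would also handle empty clusters.)
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print(centroids)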

Figure. K-Means Clustering Algorithm
sc = StandardScaler()
X = sc.fit_transform(df[num_cols])
X = pd.DataFrame(X, columns=num_cols)
X.head()

For the K-Means algorithm to work well, the optimal number of clusters must be determined. The Elbow method can be used for this: the WCSS (within-cluster sum of squares) is computed for a range of cluster counts. WCSS decreases as the number of clusters grows, but beyond some point the decrease slows down sharply, and adding further clusters yields little benefit. That point is called the elbow point.

wcss = []  # list to hold the WCSS value for each k
for k in range(1, 15):  # try cluster counts k = 1, ..., 14
    kmeans = KMeans(n_clusters=k).fit(X)  # fit K-Means with k clusters
    wcss.append(kmeans.inertia_)  # inertia_ is the WCSS

plt.plot(range(1, 15), wcss, 'bx-')  # plot WCSS against k
plt.xlabel('k values')
plt.ylabel('WCSS')
plt.title('The Elbow Method')
plt.show()
kmeans = KMeans()
elbow = KElbowVisualizer(kmeans, k=(2, 20))
elbow.fit(X)
elbow.show(block=True)

elbow.elbow_value_
kmeans = KMeans(n_clusters=elbow.elbow_value_, init='k-means++').fit(X)
kmeans.cluster_centers_ # Indicates the centers of clusters.
kmeans.n_clusters  # Indicates the number of clusters.
8
kmeans.labels_
array([6, 4, 2, ..., 6, 7, 2], dtype=int32)
kmeans.inertia_  # Displays the WCSS value.
65766.97770803791
kmeans.get_params()  # With get_params() we can see the parameters of the kmeans model.
{'algorithm': 'auto',
'copy_x': True,
'init': 'k-means++',
'max_iter': 300,
'n_clusters': 8,
'n_init': 10,
'random_state': None,
'tol': 0.0001,
'verbose': 0}
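Beyond the elbow method, a common sanity check on the chosen k is the silhouette score (not used in the original workflow; added here as a suggestion). Scores closer to 1 indicate denser, better-separated clusters.

# Optional sanity check on the chosen k (assumption, not in the original).
from sklearn.metrics import silhouette_score

score = silhouette_score(X, kmeans.labels_)  # mean silhouette over all points
print(f"Silhouette score for k={kmeans.n_clusters}: {score:.3f}")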

Build your model and segment your customers

clusters_kmeans = kmeans.labels_  # cluster label assigned to each observation
X["cluster"] = clusters_kmeans  # add a variable named "cluster" to X
X.head()
X.groupby('cluster').agg(['mean', 'median', 'count', 'std']).T
X['cluster'] = X['cluster'] + 1  # shift labels so clusters are numbered from 1
X.head()
X["cluster"].value_counts()  # number of observations in each cluster
X["cluster"].value_counts() / len(X) * 100
7    43.474555
3    27.480572
4     9.917272
8     8.578591
1     8.373026
5     2.070694
6     0.070193
2     0.035097
Name: cluster, dtype: float64
sns.countplot(x='cluster', data=X)
plt.show()
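One possible follow-up (an assumption, not shown in the original article) is to attach the K-Means labels to the unscaled dataframe so each segment can be profiled in interpretable units such as days, order counts, and spend; the kmeans_cluster column name below is hypothetical.

# Profile the segments in original units (illustrative addition).
df["kmeans_cluster"] = clusters_kmeans + 1  # match the 1-based labels used above
print(df.groupby("kmeans_cluster").agg({"recency": "mean",
                                        "frequency": "mean",
                                        "monetary": "mean"}))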

5. Customer Segmentation with Hierarchical Clustering

Hierarchical clustering (HC) is a method used to separate data points into groups with similar characteristics by relating points to one another; each of the resulting groups is called a cluster.

HC is often described in terms of a linkage matrix. The method measures the degree of connectivity between data points, records in the linkage matrix the distances at which points and clusters merge, and groups the data points according to these distances.

linkage_matrix = linkage(X, method='ward')

# Create the dendrogram
dend = dendrogram(linkage_matrix)

# Show the dendrogram
plt.show()

In this snippet, the linkage() function from the scipy library builds a linkage matrix that records the distances at which data points merge. A dendrogram is then drawn from this matrix with scipy's dendrogram() function and displayed with matplotlib's show(). The method parameter is set to 'ward' here, so Ward linkage is used; it can also take values such as 'single', 'complete', or 'average', each of which produces a dendrogram from a different linkage rule.

hc_average = linkage(X, "average")    # linkage matrix with average linkage
hc_ward = linkage(X, "ward")          # linkage matrix with Ward linkage
hc_complete = linkage(X, "complete")  # linkage matrix with complete linkage
hc_single = linkage(X, "single")      # linkage matrix with single linkage
hc_centroid = linkage(X, "centroid")  # linkage matrix with centroid linkage

plt.figure(figsize=(7, 5))
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Observation Units")
plt.ylabel("Distances")
dendrogram(hc_average,
           truncate_mode="lastp",  # show only the last p merged clusters
           p=10,
           show_contracted=True,
           leaf_font_size=10)
plt.show()
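As an alternative to refitting with scikit-learn below, flat cluster labels can also be cut directly from a scipy linkage matrix with fcluster (an added aside; the choice of 3 clusters is illustrative):

# Cut the hierarchical tree into a fixed number of flat clusters.
from scipy.cluster.hierarchy import fcluster

labels_from_tree = fcluster(hc_ward, t=3, criterion="maxclust")  # force 3 clusters
print(pd.Series(labels_from_tree).value_counts())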

We are using the AgglomerativeClustering class from scikit-learn to perform hierarchical clustering on a dataset X. We are setting the number of clusters to 3 and using the Euclidean distance and the Ward linkage method for clustering.

The fit_predict method fits the model to the data and returns the cluster labels for each sample in the dataset. The cluster labels are stored in the clusters_hc array.

After running this code, we can use the clusters_hc array to see the cluster labels for each sample in the dataset. For example, if the first sample has a cluster label of 0, it belongs to the first cluster. If the second sample has a cluster label of 1, it belongs to the second cluster, and so on.

hc = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
# Note: in scikit-learn >= 1.2 the 'affinity' parameter is renamed to 'metric'.
# Drop the K-Means label column so only the features are clustered.
clusters_hc = hc.fit_predict(X.drop(columns=["cluster"], errors="ignore"))
X["cluster_hc"] = clusters_hc + 1  # attach the labels, numbered from 1 as before
clusters_hc
X.groupby('cluster_hc').agg(['mean', 'median', 'count', 'std']).T
X["cluster_hc"].value_counts()                 # It shows the number of observations belonging to each cluster.
X["cluster_hc"].value_counts() / len(X) * 100
2    52.053146
3    35.853597
1    12.093256
Name: cluster_hc, dtype: float64
sns.countplot(x='cluster_hc', data=X)
plt.show()

6. Customer Segmentation with DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that is based on the density of data points. It can be used for customer segmentation by finding groups of similar customers based on their characteristics or attributes.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.8, min_samples=15)
# Drop the label columns added earlier so only the scaled features are clustered.
clusters = dbscan.fit_predict(X.drop(columns=["cluster", "cluster_hc"], errors="ignore"))
print(clusters)
X["cluster_dbscan"] = clusters  # attach the labels; noise points, if any, are labeled -1
df['dbscan_cluster'] = clusters
df.head()

Use the DBSCAN class from scikit-learn to fit the model to your data. You’ll need to specify two important parameters: eps and min_samples. eps is the maximum distance between two points for one to count as a neighbor of the other, and min_samples is the minimum number of neighbors a point needs in order to be treated as a core point of a cluster.
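A common heuristic for choosing eps (not part of the original article; added as a suggestion) is the k-distance plot: sort each point's distance to its k-th nearest neighbor, with k equal to min_samples, and look for a knee where the curve bends sharply.

# k-distance plot for eps selection (illustrative addition).
from sklearn.neighbors import NearestNeighbors

k = 15  # same value as min_samples above
features = X.drop(columns=["cluster", "cluster_hc", "cluster_dbscan"], errors="ignore")
nn = NearestNeighbors(n_neighbors=k).fit(features)
distances, _ = nn.kneighbors(features)  # distances to the k nearest neighbors
plt.plot(sorted(distances[:, k - 1]))  # k-th neighbor distance, sorted ascending
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k}th nearest neighbor")
plt.show()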

X["cluster_dbscan"].value_counts()
1    8662
2    5451
0    2215
5    1289
3    1170
4    1148
6      10
Name: cluster_dbscan, dtype: int64
sns.countplot(x='cluster_dbscan', data=X)
plt.show()

Conclusion

Segmenting customers into different groups using their characteristics and behaviors has always been an important topic. Customer segmentation can lead to better customer understanding and targeting, which in turn leads to more effective product tailoring and marketing strategies. Data mining methods are powerful techniques that can be used in customer segmentation to find customers with similar characteristics.

Thanks for reading this article. You can access the full code for this project and my other projects on my GitHub or Kaggle account. Happy coding!

Please feel free to contact me if you need any further information.

