Maximizing Performance in Breast Cancer Prediction through Machine Learning, Hyperparameter Tuning, and PCA Analysis
Breast cancer is the most common cancer among women and one of the most common cancers overall. Early detection and accurate prediction of breast cancer can improve treatment outcomes and save lives. In recent years, machine learning has emerged as a promising approach for predicting breast cancer. However, achieving high performance in breast cancer prediction with machine learning can be challenging because of the complexity of the data and the need to carefully select and tune the appropriate algorithms and parameters. In this article, we discuss three key techniques for maximizing performance in breast cancer prediction: comparing a range of machine learning models, hyperparameter tuning, and principal component analysis (PCA).
One of the key challenges in using machine learning for breast cancer prediction is selecting the right algorithm and setting its hyperparameters, which are the parameters that are not learned from the training data but rather set prior to training. Hyperparameter tuning involves finding the optimal values for these hyperparameters in order to maximize the performance of the machine learning model. There are several approaches to hyperparameter tuning, including manual tuning, grid search, and random search.
Manual tuning involves manually selecting and adjusting the hyperparameters based on expert knowledge and experience. This can be time-consuming and may not always lead to the optimal values.
Grid search involves specifying a grid of hyperparameter values and evaluating the model for each combination of values. This can be computationally expensive, as it requires training and evaluating the model multiple times.
Random search involves sampling random combinations of hyperparameter values and evaluating the model for each combination. This can be more efficient than grid search, as it does not require evaluating the model for every combination of values.
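As a rough illustration of the difference between these two approaches, here is a minimal sketch comparing grid search and random search with scikit-learn. The SVC estimator, the parameter ranges, and the use of scikit-learn's bundled copy of the Wisconsin dataset are illustrative choices for this sketch only, not part of the pipeline built later in this article.
# Minimal sketch: grid search vs. random search (illustrative only)
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
X_demo, y_demo = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), SVC())
# Grid search evaluates every combination of the listed values
grid_search = GridSearchCV(pipe,
                           param_grid={'svc__C': [0.1, 1, 10, 100],
                                       'svc__gamma': ['scale', 0.01, 0.001]},
                           cv=5)
grid_search.fit(X_demo, y_demo)
# Random search samples a fixed number of combinations from distributions
random_search = RandomizedSearchCV(pipe,
                                   param_distributions={'svc__C': loguniform(1e-2, 1e2),
                                                        'svc__gamma': loguniform(1e-4, 1e-1)},
                                   n_iter=20, cv=5, random_state=42)
random_search.fit(X_demo, y_demo)
print(grid_search.best_params_, round(grid_search.best_score_, 3))
print(random_search.best_params_, round(random_search.best_score_, 3))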
Another technique for maximizing performance in breast cancer prediction is using PCA to reduce the dimensionality of the data. Breast cancer data can often have a large number of features, which can make it challenging to train a machine learning model and may lead to overfitting. PCA is a statistical technique that can be used to reduce the dimensionality of the data by projecting the data onto a lower-dimensional space. This can help improve the performance of the machine learning model by reducing the complexity of the data and reducing the risk of overfitting.
Let’s implement our application in Python now.
Introduction
Dataset Description
The Breast Cancer Wisconsin (Diagnostic) DataSet, obtained from Kaggle, contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image.
Number of instances: 569
Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)
Attribute information
- ID number
- Diagnosis (M = malignant, B = benign)
- Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter² / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
import warnings
warnings.filterwarnings('ignore')
Load the Data
We start by loading the breast cancer data into a Pandas dataframe and inspecting its columns and values. We can use the read_csv() function from Pandas to load the data into a dataframe, and the head() method to display the first few rows of the data:
# Load the data
data = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
data.head()
Summary Statistics
To get a quick overview of the data, including its shape, types, missing values, and quantiles of the numerical columns, you can use the following helper function.
def check_df(dataframe, head=5):
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    print(dataframe.quantile([0, 0.05, 0.50, 0.95, 0.99, 1], numeric_only=True).T)
check_df(data)
##################### Shape #####################
(569, 33)
##################### Types #####################
id int64
diagnosis object
radius_mean float64
texture_mean float64
perimeter_mean float64
area_mean float64
smoothness_mean float64
compactness_mean float64
concavity_mean float64
concave points_mean float64
symmetry_mean float64
fractal_dimension_mean float64
radius_se float64
texture_se float64
perimeter_se float64
area_se float64
smoothness_se float64
compactness_se float64
concavity_se float64
concave points_se float64
symmetry_se float64
fractal_dimension_se float64
radius_worst float64
texture_worst float64
perimeter_worst float64
area_worst float64
smoothness_worst float64
compactness_worst float64
concavity_worst float64
concave points_worst float64
symmetry_worst float64
fractal_dimension_worst float64
Unnamed: 32 float64
dtype: object
##################### Head #####################
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst \
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364
fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
##################### Tail #####################
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst \
564 926424 M 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 0.05623 1.1760 1.256 7.673 158.70 0.010300 0.02891 0.05198 0.02454 0.01114 0.004239 25.450 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060
565 926682 M 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 0.05533 0.7655 2.463 5.203 99.04 0.005769 0.02423 0.03950 0.01678 0.01898 0.002498 23.690 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572
566 926954 M 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 0.05648 0.4564 1.075 3.425 48.55 0.005903 0.03731 0.04730 0.01557 0.01318 0.003892 18.980 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218
567 927241 M 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 0.07016 0.7260 1.595 5.772 86.22 0.006522 0.06158 0.07117 0.01664 0.02324 0.006185 25.740 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087
568 92751 B 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 0.05884 0.3857 1.428 2.548 19.15 0.007189 0.00466 0.00000 0.00000 0.02676 0.002783 9.456 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871
fractal_dimension_worst Unnamed: 32
564 0.07115 NaN
565 0.06637 NaN
566 0.07820 NaN
567 0.12400 NaN
568 0.07039 NaN
##################### NA #####################
id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
dtype: int64
##################### Quantiles #####################
0.00 0.05 0.50 0.95 0.99 1.00
id 8670.000000 90267.000000 906024.000000 9.042446e+07 9.010343e+08 9.113205e+08
radius_mean 6.981000 9.529200 13.370000 2.057600e+01 2.437160e+01 2.811000e+01
texture_mean 9.710000 13.088000 18.840000 2.715000e+01 3.065200e+01 3.928000e+01
perimeter_mean 43.790000 60.496000 86.240000 1.358200e+02 1.657240e+02 1.885000e+02
area_mean 143.500000 275.780000 551.100000 1.309800e+03 1.786600e+03 2.501000e+03
smoothness_mean 0.052630 0.075042 0.095870 1.187800e-01 1.328880e-01 1.634000e-01
compactness_mean 0.019380 0.040660 0.092630 2.087000e-01 2.771920e-01 3.454000e-01
concavity_mean 0.000000 0.004983 0.061540 2.430200e-01 3.516880e-01 4.268000e-01
concave points_mean 0.000000 0.005621 0.033500 1.257400e-01 1.642080e-01 2.012000e-01
symmetry_mean 0.106000 0.141500 0.179200 2.307200e-01 2.595640e-01 3.040000e-01
fractal_dimension_mean 0.049960 0.053926 0.061540 7.609000e-02 8.543760e-02 9.744000e-02
radius_se 0.111500 0.160100 0.324200 9.595200e-01 1.291320e+00 2.873000e+00
texture_se 0.360200 0.540140 1.108000 2.212000e+00 2.915440e+00 4.885000e+00
perimeter_se 0.757000 1.132800 2.287000 7.041600e+00 9.690040e+00 2.198000e+01
area_se 6.802000 11.360000 24.530000 1.158000e+02 1.776840e+02 5.422000e+02
smoothness_se 0.001713 0.003690 0.006380 1.264400e-02 1.725800e-02 3.113000e-02
compactness_se 0.002252 0.007892 0.020450 6.057800e-02 8.987200e-02 1.354000e-01
concavity_se 0.000000 0.003253 0.025890 7.893600e-02 1.222920e-01 3.960000e-01
concave points_se 0.000000 0.003831 0.010930 2.288400e-02 3.119360e-02 5.279000e-02
symmetry_se 0.007882 0.011758 0.018730 3.498800e-02 5.220800e-02 7.895000e-02
fractal_dimension_se 0.000895 0.001522 0.003187 7.959800e-03 1.264960e-02 2.984000e-02
radius_worst 7.930000 10.534000 14.970000 2.564000e+01 3.076280e+01 3.604000e+01
texture_worst 12.020000 16.574000 25.410000 3.630000e+01 4.180240e+01 4.954000e+01
perimeter_worst 50.410000 67.856000 97.660000 1.716400e+02 2.083040e+02 2.512000e+02
area_worst 185.200000 331.060000 686.500000 2.009600e+03 2.918160e+03 4.254000e+03
smoothness_worst 0.071170 0.095734 0.131300 1.718400e-01 1.889080e-01 2.226000e-01
compactness_worst 0.027290 0.071196 0.211900 5.641200e-01 7.786440e-01 1.058000e+00
concavity_worst 0.000000 0.018360 0.226700 6.823800e-01 9.023800e-01 1.252000e+00
concave points_worst 0.000000 0.024286 0.099930 2.369200e-01 2.692160e-01 2.910000e-01
symmetry_worst 0.156500 0.212700 0.282200 4.061600e-01 4.869080e-01 6.638000e-01
fractal_dimension_worst 0.055040 0.062558 0.080040 1.195200e-01 1.406280e-01 2.075000e-01
Unnamed: 32 NaN NaN NaN NaN NaN NaN
# The id and Unnamed: 32 columns should be removed, since they are unnecessary.
data.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
# Find missing values
data.isnull().sum()
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64
Visualizing the Data
Visualizations can also be useful for exploring the structure of the data. You can use histograms and box plots to get a sense of the distribution of the data.
To plot histograms of the numerical columns, you can use the hist() method of the dataframe. This will create a histogram for each numerical column, showing the distribution of the data.
# Plot histograms of the numerical columns
data.hist(bins=50, figsize=(20, 15))
plt.show()
# Plot box plots of the columns
data.plot(kind='box', subplots=True, layout=(6, 6), sharex=False, sharey=False, figsize=(15, 15))
plt.show()
plt.figure(figsize=(10,6),dpi=50)
sns.countplot(x='diagnosis', data=data, palette='Set2')
plt.title('Diagnosis')
plt.show()
Correlations
Next, we analyze the correlations between the features and examine how they relate to the diagnosis.
To compute the correlations, the code uses the corr() method of the dataframe, which computes the Pearson correlation coefficient between each pair of columns in the dataframe. The Pearson correlation coefficient is a measure of the linear relationship between two variables, and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
# diagnosis is still a text column at this point, so encode it in a temporary
# copy in order to include it in the correlation matrix
corr_data = data.copy()
corr_data['diagnosis'] = corr_data['diagnosis'].map({'M': 1, 'B': 0})
# Find the correlation between the features
corr = corr_data.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()
# Plot pairwise relationships to check the correlations between the mean features.
sns.pairplot(data, hue='diagnosis', vars=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean'])
plt.show()
corr['diagnosis'].sort_values(ascending=False)
cancerous = corr['diagnosis'].sort_values(ascending=False)
# Set the figure size and resolution
plt.figure(figsize=(10, 6), dpi=150)
# Create a bar plot of the correlations and keep a handle on the axes
ax = sns.barplot(x=cancerous.index, y=cancerous.values)
# Rotate the x-tick labels
plt.xticks(rotation=90)
# Add value labels to the bars
for container in ax.containers:
    ax.bar_label(container, rotation=90)
# Show the plot
plt.show()
Feature Selection
Feature selection is the process of selecting a subset of relevant features from a larger set of features for use in modeling. It is often used in machine learning to improve the accuracy and efficiency of a model by reducing the number of features that the model needs to consider.
# Unique values of "diagnosis" column
data['diagnosis'].unique()
# Encode “diagnosis” to numerical values
data['diagnosis'] = [1 if each == 'M' else 0 for each in data['diagnosis']]
data.head()
We will need to extract the features and labels from the dataframe. The features are the data that will be used to predict the label, which in this case is the diagnosis of breast cancer (i.e. malignant or benign).
It is often useful to combine multiple feature selection techniques to get the best results. You can use the FeatureUnion class from the sklearn.pipeline module to combine different feature selection methods in a pipeline.
# Extract the features and labels for use in the pipeline below
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
# Combine PCA and Feature Selection with FeatureUnion
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Create a SelectKBest object to select the two features with the highest chi-squared scores
kbest = SelectKBest(chi2, k=2)
kbest
SelectKBest(k=2, score_func=<function chi2 at 0x7faa4344eef0>)
# Create a PCA object
pca = PCA(n_components=2)
# Create a FeatureUnion that combines PCA and kbest
combined_features = FeatureUnion([("pca", pca), ("univ_select", kbest)])
# Create a pipeline that combines the FeatureUnion with a classifier
pipeline = Pipeline([("features", combined_features), ("svc", SVC())])
# Create a grid search object
grid = GridSearchCV(pipeline, param_grid={'features__pca__n_components': [1, 2, 3],
'features__univ_select__k': [1, 2]}, cv=5)
# Fit the grid search
grid.fit(X, y)
# Print the best parameters
print(grid.best_params_)
{'features__pca__n_components': 2, 'features__univ_select__k': 2}
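It is also worth checking the cross-validated accuracy achieved by the best combination; best_score_ reports the mean cross-validation score for the best parameters found:
# Mean cross-validated accuracy of the best parameter combination
print(round(grid.best_score_, 3))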
Principal Component Analysis (PCA)
We can use the StandardScaler class from the sklearn.preprocessing module to standardize the features. This is an important step as it ensures that all features are on the same scale, which can improve the performance of many machine learning algorithms:
# Standardize the data for this variance analysis; keep the result in a new
# variable so the original dataframe is not overwritten (note that the encoded
# diagnosis column is included here)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Create a PCA that will retain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)
# Fit the PCA on the scaled data
pca.fit(scaled_data)
# Print the number of components required to retain 99% of the variance
print(pca.n_components_)
18
We can use the explained_variance_ratio_ attribute of the PCA object to see how much variance is explained by each principal component:
# Print the explained variance
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
[0.44896035 0.18472104 0.09183385 0.06446333 0.05351866 0.03895187
0.02208771 0.0156405 0.01344822 0.01131915 0.00983405 0.00938664
0.00841969 0.0068476 0.00479278 0.00284395 0.00257613 0.00190437]
0.9915498949973757
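A quick way to see the trade-off between the number of components and the retained variance is a cumulative explained-variance plot; this is a small sketch based on the pca object fitted above:
# Plot the cumulative explained variance against the number of components
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(y=0.99, color='r', linestyle='--', label='99% variance')
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance')
plt.legend()
plt.show()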
# Project the scaled data onto the retained principal components
reduced_data = pca.transform(scaled_data)
# Print the shape of the reduced data
print(reduced_data.shape)
(569, 18)
The create_pca_df function below applies standard scaling to the input features X and then performs PCA to reduce the dimensions to 2. The resulting principal components are stored in a dataframe pca_df along with the target variable y.
Standard scaling is a preprocessing step that transforms the features so that they have zero mean and unit variance. This is useful for many machine learning algorithms, as it removes the effect of differing feature scales and makes the model more robust to variations in the data.
The PCA class in scikit-learn centers the data and then computes the singular value decomposition (SVD) of the data matrix. The resulting principal components are the eigenvectors of the covariance matrix of the data, ranked in order of decreasing eigenvalue. Setting the n_components parameter to 2 selects the first two principal components.
The resulting dataframe pca_df contains the first two principal components as columns ‘PC1’ and ‘PC2’, and the target variable y as the final column.
# Extract the features and labels
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X.describe()
def create_pca_df(X, y):
    X = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    pca_fit = pca.fit_transform(X)
    pca_df = pd.DataFrame(data=pca_fit, columns=['PC1', 'PC2'])
    final_df = pd.concat([pca_df, pd.DataFrame(y)], axis=1)
    return final_df
pca_df = create_pca_df(X, y) # Create a dataframe with the first two principal components
pca_df.head()
import random
def plot_pca(dataframe, target):
    fig = plt.figure(figsize=(7, 5))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('PC1', fontsize=15)
    ax.set_ylabel('PC2', fontsize=15)
    ax.set_title(f'{target.capitalize()}', fontsize=20)
    targets = list(dataframe[target].unique())
    colors = random.sample(['r', 'b', 'g', 'y'], len(targets))
    for t, color in zip(targets, colors):
        indices = dataframe[target] == t
        ax.scatter(dataframe.loc[indices, 'PC1'], dataframe.loc[indices, 'PC2'], c=color, s=50)
    ax.legend(targets)
    ax.grid()
    plt.show()
plot_pca(pca_df, "diagnosis")
This is a function that visualizes the results of principal component analysis (PCA) on a given dataframe. The function takes in two arguments:
- dataframe: a pandas dataframe that contains the PCA-transformed data. It should have two columns named "PC1" and "PC2" holding the values of the first and second principal components, respectively.
- target: a string that names the target variable column in the dataframe. The function uses this column to color the data points according to the unique values of the target variable.
The function begins by creating a new figure and setting the x-axis and y-axis labels and the title. It then creates a list of the unique values of the target variable and generates a list of random colors for each value.
Next, the function iterates through the unique values of the target variable and the corresponding colors. For each value, it filters the dataframe to select only the rows where the value of the target variable is equal to the current value. It then plots these rows as a scatterplot, using the “PC1” and “PC2” columns as the x- and y-coordinates, respectively, and the current color as the color of the points.
Finally, the function adds a legend to the plot using the list of unique values of the target variable and shows the plot.
Machine Learning
It is important to evaluate the performance of a machine learning model on a separate test dataset, as this can give you a better estimate of how the model will perform on unseen data. Using the training data to evaluate the model can lead to overfitting, where the model performs well on the training data but poorly on new data.
# Splitting the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Logistic Regression
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
# Accuracy
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
from sklearn.metrics import classification_report
# Generate a classification report
print(classification_report(y_test,y_pred))
# Random Forest
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
# Support Vector Machine
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9035087719298246
# K Nearest Neighbours
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9385964912280702
# Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9385964912280702
# XGBoost
from xgboost import XGBClassifier
model = XGBClassifier()
model.fit(X_train, y_train)
# Accuracy
y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.956140350877193
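To compare the classifiers side by side rather than one cell at a time, a small loop over freshly constructed models can collect the test accuracies into a single table. This is a sketch that reuses the same train/test split; the random_state values and the raised max_iter for logistic regression (to avoid a convergence warning on unscaled features) are choices made for the sketch.
# Fit each classifier on the same split and collect test accuracies
models = {
    'Logistic Regression': LogisticRegression(max_iter=5000),
    'Random Forest': RandomForestClassifier(n_estimators=200, random_state=1),
    'SVM': SVC(),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=1),
    'XGBoost': XGBClassifier(),
}
scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
print(pd.Series(scores).sort_values(ascending=False))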
Hyperparameter Tuning
from sklearn.model_selection import RandomizedSearchCV
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' was removed in newer scikit-learn versions, so use 'sqrt' and 'log2' instead)
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)
rf_random.best_params_
# Accuracy
y_pred = rf_random.predict(X_test)
accuracy_score(y_test, y_pred)
# Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm
array([[72, 0],
[ 5, 37]])
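The raw counts are easier to interpret as a heatmap, together with a per-class classification report; here is a short sketch using the cm, y_test, and y_pred objects from above:
# Plot the confusion matrix and print per-class precision/recall
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign', 'Malignant'], yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant']))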
Conclusion
In conclusion, using machine learning, hyperparameter tuning, and principal component analysis (PCA) can help to maximize the performance of breast cancer prediction models. Machine learning algorithms can learn complex relationships between the features and the target variable, and can be trained to make accurate predictions on new data. Hyperparameter tuning can help to optimize the performance of the machine learning algorithms by finding the best combination of hyperparameters. PCA can reduce the dimensionality of the data and help to identify patterns and relationships that might not be apparent in the original dataset, which can improve the accuracy of the predictions. By combining these techniques, it is possible to build highly effective breast cancer prediction models that can accurately classify tumors as benign or malignant.
Thanks for reading this article. You can access the detailed codes of the project and other projects on my Github account or Kaggle account. Happy coding!
Please feel free to contact me if you need any further information.