Maximizing Performance in Breast Cancer Prediction through Machine Learning, Hyperparameter Tuning, and PCA Analysis

Emine Bozkus
15 min read · Dec 24, 2022


Breast cancer is the most common cancer among women and the second most common cancer overall. Early detection and accurate prediction of breast cancer can improve treatment outcomes and save lives. In recent years, machine learning has emerged as a promising approach for predicting breast cancer. However, achieving high performance in breast cancer prediction using machine learning can be challenging due to the complexity of the data and the need to carefully select and tune the appropriate algorithms and parameters. In this article, we will discuss three key techniques for maximizing performance in breast cancer prediction through machine learning: hyperparameter tuning, principal component analysis (PCA), and ensemble learning.

Figure 1. Spotlight on IARC research related to breast cancer

One of the key challenges in using machine learning for breast cancer prediction is selecting the right algorithm and setting its hyperparameters, which are the parameters that are not learned from the training data but rather set prior to training. Hyperparameter tuning involves finding the optimal values for these hyperparameters in order to maximize the performance of the machine learning model. There are several approaches to hyperparameter tuning, including manual tuning, grid search, and random search.

Manual tuning involves manually selecting and adjusting the hyperparameters based on expert knowledge and experience. This can be time-consuming and may not always lead to the optimal values.

Grid search involves specifying a grid of hyperparameter values and evaluating the model for each combination of values. This can be computationally expensive, as it requires training and evaluating the model multiple times.

Random search involves sampling random combinations of hyperparameter values and evaluating the model for each combination. This can be more efficient than grid search, as it does not require evaluating the model for every combination of values.
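The difference between grid search and random search can be sketched in a few lines. This is a minimal illustration, not the tuning used later in this article: it assumes scikit-learn's bundled copy of the Wisconsin Diagnostic data (load_breast_cancer) as a stand-in for the Kaggle CSV, and the C and gamma values are arbitrary choices for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Scale inside the pipeline so each CV fold is scaled on its own training part
svm_pipeline = make_pipeline(StandardScaler(), SVC())

# Illustrative search space: 4 x 4 = 16 combinations
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}

# Grid search evaluates every combination
grid_search = GridSearchCV(svm_pipeline, param_grid, cv=3)
grid_search.fit(X_demo, y_demo)

# Random search evaluates only n_iter sampled combinations
random_search = RandomizedSearchCV(
    svm_pipeline, param_grid, n_iter=5, cv=3, random_state=0
)
random_search.fit(X_demo, y_demo)

n_grid = len(grid_search.cv_results_["params"])
n_random = len(random_search.cv_results_["params"])
print("grid search candidates:", n_grid)
print("random search candidates:", n_random)
print("best grid params:", grid_search.best_params_)
print("best random params:", random_search.best_params_)
```

Random search trains far fewer models here, at the price of possibly missing the best combination in the grid.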

Another technique for maximizing performance in breast cancer prediction is using PCA to reduce the dimensionality of the data. Breast cancer data can often have a large number of features, which can make it challenging to train a machine learning model and may lead to overfitting. PCA is a statistical technique that can be used to reduce the dimensionality of the data by projecting the data onto a lower-dimensional space. This can help improve the performance of the machine learning model by reducing the complexity of the data and reducing the risk of overfitting.

Let’s implement our application in Python now.

Introduction

Dataset Description

The Breast Cancer Wisconsin (Diagnostic) dataset, obtained from Kaggle, contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features describe characteristics of the cell nuclei present in the image.

Number of instances: 569

Number of attributes: 32 (ID, diagnosis, 30 real-valued input features)

Attribute information

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3. Ten real-valued features are computed for each cell nucleus:
  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter² / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
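The way ten base measurements become 30 columns can be sketched as follows. The naming pattern is taken from the column names in the Kaggle CSV (for example radius_mean and concave points_se); the variable names below are only illustrative.

```python
# Ten base measurements, spelled as they appear in the CSV column names
base_features = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]

# Three statistics computed per measurement
statistics = ["mean", "se", "worst"]

# All means first, then all standard errors, then all "worst" values
feature_columns = [f"{feature}_{stat}" for stat in statistics for feature in base_features]

print(len(feature_columns))

# Fields 1 and 2 are the ID and the diagnosis, so field n maps to index n - 3
for field in (3, 13, 23):
    print(field, feature_columns[field - 3])
```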

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 500)
import warnings
warnings.filterwarnings('ignore')

Load the Data

We start by loading the breast cancer data into a Pandas dataframe and inspecting its columns and values. We can use the read_csv() function from Pandas to load the data into a dataframe, and the head() method to display the first few rows of the data:

# Load the data
data = pd.read_csv('/kaggle/input/breast-cancer-wisconsin-data/data.csv')
data.head()

Summary Statistics

To get a summary of the columns in the data, you can use a small helper function such as the following.

def check_df(dataframe, head=5):
    """Print the shape, dtypes, first/last rows, missing-value counts, and quantiles."""
    print("##################### Shape #####################")
    print(dataframe.shape)
    print("##################### Types #####################")
    print(dataframe.dtypes)
    print("##################### Head #####################")
    print(dataframe.head(head))
    print("##################### Tail #####################")
    print(dataframe.tail(head))
    print("##################### NA #####################")
    print(dataframe.isnull().sum())
    print("##################### Quantiles #####################")
    # Quantiles are only defined for numeric columns, so select those explicitly
    print(dataframe.select_dtypes(include="number").quantile([0, 0.05, 0.50, 0.95, 0.99, 1]).T)

check_df(data)
##################### Shape #####################
(569, 33)
##################### Types #####################
id int64
diagnosis object
radius_mean float64
texture_mean float64
perimeter_mean float64
area_mean float64
smoothness_mean float64
compactness_mean float64
concavity_mean float64
concave points_mean float64
symmetry_mean float64
fractal_dimension_mean float64
radius_se float64
texture_se float64
perimeter_se float64
area_se float64
smoothness_se float64
compactness_se float64
concavity_se float64
concave points_se float64
symmetry_se float64
fractal_dimension_se float64
radius_worst float64
texture_worst float64
perimeter_worst float64
area_worst float64
smoothness_worst float64
compactness_worst float64
concavity_worst float64
concave points_worst float64
symmetry_worst float64
fractal_dimension_worst float64
Unnamed: 32 float64
dtype: object
##################### Head #####################
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst \
0 842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601
1 842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750
2 84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613
3 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638
4 84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364

fractal_dimension_worst Unnamed: 32
0 0.11890 NaN
1 0.08902 NaN
2 0.08758 NaN
3 0.17300 NaN
4 0.07678 NaN
##################### Tail #####################
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave points_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concave points_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concave points_worst symmetry_worst \
564 926424 M 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 0.05623 1.1760 1.256 7.673 158.70 0.010300 0.02891 0.05198 0.02454 0.01114 0.004239 25.450 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060
565 926682 M 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 0.05533 0.7655 2.463 5.203 99.04 0.005769 0.02423 0.03950 0.01678 0.01898 0.002498 23.690 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572
566 926954 M 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 0.05648 0.4564 1.075 3.425 48.55 0.005903 0.03731 0.04730 0.01557 0.01318 0.003892 18.980 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218
567 927241 M 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 0.07016 0.7260 1.595 5.772 86.22 0.006522 0.06158 0.07117 0.01664 0.02324 0.006185 25.740 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087
568 92751 B 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 0.05884 0.3857 1.428 2.548 19.15 0.007189 0.00466 0.00000 0.00000 0.02676 0.002783 9.456 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871

fractal_dimension_worst Unnamed: 32
564 0.07115 NaN
565 0.06637 NaN
566 0.07820 NaN
567 0.12400 NaN
568 0.07039 NaN
##################### NA #####################
id 0
diagnosis 0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
Unnamed: 32 569
dtype: int64
##################### Quantiles #####################
0.00 0.05 0.50 0.95 0.99 1.00
id 8670.000000 90267.000000 906024.000000 9.042446e+07 9.010343e+08 9.113205e+08
radius_mean 6.981000 9.529200 13.370000 2.057600e+01 2.437160e+01 2.811000e+01
texture_mean 9.710000 13.088000 18.840000 2.715000e+01 3.065200e+01 3.928000e+01
perimeter_mean 43.790000 60.496000 86.240000 1.358200e+02 1.657240e+02 1.885000e+02
area_mean 143.500000 275.780000 551.100000 1.309800e+03 1.786600e+03 2.501000e+03
smoothness_mean 0.052630 0.075042 0.095870 1.187800e-01 1.328880e-01 1.634000e-01
compactness_mean 0.019380 0.040660 0.092630 2.087000e-01 2.771920e-01 3.454000e-01
concavity_mean 0.000000 0.004983 0.061540 2.430200e-01 3.516880e-01 4.268000e-01
concave points_mean 0.000000 0.005621 0.033500 1.257400e-01 1.642080e-01 2.012000e-01
symmetry_mean 0.106000 0.141500 0.179200 2.307200e-01 2.595640e-01 3.040000e-01
fractal_dimension_mean 0.049960 0.053926 0.061540 7.609000e-02 8.543760e-02 9.744000e-02
radius_se 0.111500 0.160100 0.324200 9.595200e-01 1.291320e+00 2.873000e+00
texture_se 0.360200 0.540140 1.108000 2.212000e+00 2.915440e+00 4.885000e+00
perimeter_se 0.757000 1.132800 2.287000 7.041600e+00 9.690040e+00 2.198000e+01
area_se 6.802000 11.360000 24.530000 1.158000e+02 1.776840e+02 5.422000e+02
smoothness_se 0.001713 0.003690 0.006380 1.264400e-02 1.725800e-02 3.113000e-02
compactness_se 0.002252 0.007892 0.020450 6.057800e-02 8.987200e-02 1.354000e-01
concavity_se 0.000000 0.003253 0.025890 7.893600e-02 1.222920e-01 3.960000e-01
concave points_se 0.000000 0.003831 0.010930 2.288400e-02 3.119360e-02 5.279000e-02
symmetry_se 0.007882 0.011758 0.018730 3.498800e-02 5.220800e-02 7.895000e-02
fractal_dimension_se 0.000895 0.001522 0.003187 7.959800e-03 1.264960e-02 2.984000e-02
radius_worst 7.930000 10.534000 14.970000 2.564000e+01 3.076280e+01 3.604000e+01
texture_worst 12.020000 16.574000 25.410000 3.630000e+01 4.180240e+01 4.954000e+01
perimeter_worst 50.410000 67.856000 97.660000 1.716400e+02 2.083040e+02 2.512000e+02
area_worst 185.200000 331.060000 686.500000 2.009600e+03 2.918160e+03 4.254000e+03
smoothness_worst 0.071170 0.095734 0.131300 1.718400e-01 1.889080e-01 2.226000e-01
compactness_worst 0.027290 0.071196 0.211900 5.641200e-01 7.786440e-01 1.058000e+00
concavity_worst 0.000000 0.018360 0.226700 6.823800e-01 9.023800e-01 1.252000e+00
concave points_worst 0.000000 0.024286 0.099930 2.369200e-01 2.692160e-01 2.910000e-01
symmetry_worst 0.156500 0.212700 0.282200 4.061600e-01 4.869080e-01 6.638000e-01
fractal_dimension_worst 0.055040 0.062558 0.080040 1.195200e-01 1.406280e-01 2.075000e-01
Unnamed: 32 NaN NaN NaN NaN NaN NaN
# The id and Unnamed: 32 columns should be removed, since they are unnecessary.
data.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
# Find missing values
data.isnull().sum()
diagnosis                  0
radius_mean 0
texture_mean 0
perimeter_mean 0
area_mean 0
smoothness_mean 0
compactness_mean 0
concavity_mean 0
concave points_mean 0
symmetry_mean 0
fractal_dimension_mean 0
radius_se 0
texture_se 0
perimeter_se 0
area_se 0
smoothness_se 0
compactness_se 0
concavity_se 0
concave points_se 0
symmetry_se 0
fractal_dimension_se 0
radius_worst 0
texture_worst 0
perimeter_worst 0
area_worst 0
smoothness_worst 0
compactness_worst 0
concavity_worst 0
concave points_worst 0
symmetry_worst 0
fractal_dimension_worst 0
dtype: int64

Visualizing the Data

Visualizations can also be useful for exploring the structure of the data. You can use histograms and box plots to get a sense of the distribution of the data.

To plot histograms of the numerical columns, you can use the hist() method of the dataframe. This will create a histogram for each numerical column, showing the distribution of the data.

# Plot histograms of the numerical columns

data.hist(bins=50, figsize=(20, 15))
plt.show()
# Plot box plots of the columns

data.plot(kind='box', subplots=True, layout=(6, 6), sharex=False, sharey=False, figsize=(15, 15))
plt.show()
plt.figure(figsize=(10,6),dpi=50)
sns.countplot(x='diagnosis', data=data, palette='Set2')
plt.title('Diagnosis')
plt.show()

Correlations

Next, we analyze the correlations between the features and look at how strongly each one is related to the diagnosis.

To compute the correlations, the code uses the corr() method of the dataframe, which computes the Pearson correlation coefficient between each pair of columns in the dataframe. The Pearson correlation coefficient is a measure of the linear relationship between two variables, and ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation).
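As a small worked example of the coefficient itself, the sketch below computes Pearson's r by hand and compares it with NumPy's result. The input values are the radius_mean and perimeter_mean entries of the first five rows shown in the head output above; the function name pearson_r is just an illustrative helper.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator


# First five rows of radius_mean and perimeter_mean from the head output
radius = np.array([17.99, 20.57, 19.69, 11.42, 20.29])
perimeter = np.array([122.80, 132.90, 130.00, 77.58, 135.10])

r_manual = pearson_r(radius, perimeter)
r_numpy = np.corrcoef(radius, perimeter)[0, 1]

print(f"manual: {r_manual:.4f}, numpy: {r_numpy:.4f}")
```

Radius and perimeter are geometrically tied together, so a coefficient close to 1 is expected even on this tiny sample.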

# Find the correlation between the features
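# Only numeric columns can be correlated; diagnosis is still a string at this point
corr = data.select_dtypes(include='number').corr()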
plt.figure(figsize=(20, 20))
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()
# Plot pairwise relationships to check the correlations between the mean features.
sns.pairplot(data, hue='diagnosis', vars=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean'])
plt.show()
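# Correlation of each feature with the diagnosis. The column is still a string
# at this point, so it is encoded on a copy first (M = 1, B = 0).
encoded = data.assign(diagnosis=(data['diagnosis'] == 'M').astype(int))
cancerous = encoded.corr()['diagnosis'].sort_values(ascending=False)
cancerous
# Set the figure size and resolution
plt.figure(figsize=(10, 6), dpi=150)

# Create a bar plot of the correlations and keep the Axes it returns
ax = sns.barplot(x=cancerous.index, y=cancerous.values)

# Rotate the x-tick labels
plt.xticks(rotation=90)

# Add labels to the bars
for container in ax.containers:
    ax.bar_label(container, fmt='%.2f', rotation=90)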

# Show the plot
plt.show()

Feature Selection

Feature selection is the process of selecting a subset of relevant features from a larger set of features for use in modeling. It is often used in machine learning to improve the accuracy and efficiency of a model by reducing the number of features that the model needs to consider.

# Unique values of "diagnosis" column
data['diagnosis'].unique()
# Encode “diagnosis” to numerical values
data['diagnosis'] = [1 if each == 'M' else 0 for each in data['diagnosis']]
data.head()

We will need to extract the features and labels from the dataframe. The features are the data that will be used to predict the label, which in this case is the diagnosis of breast cancer (i.e. malignant or benign).

It is often useful to combine multiple feature selection techniques to get the best results. You can use the FeatureUnion class from the sklearn.pipeline module to combine different feature selection methods in a pipeline.

# Combine PCA and Feature Selection with FeatureUnion
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Create a SelectKBest object that keeps the two features with the highest chi-squared scores
kbest = SelectKBest(chi2, k=2)
kbest
SelectKBest(k=2, score_func=<function chi2 at 0x7faa4344eef0>)
# Create a PCA object
pca = PCA(n_components=2)
# Create a FeatureUnion that combines PCA and kbest
combined_features = FeatureUnion([("pca", pca), ("univ_select", kbest)])
# Create a pipeline that combines the FeatureUnion with a classifier
pipeline = Pipeline([("features", combined_features), ("svc", SVC())])

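# Extract the features and labels used to fit the search
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']

# Create a grid search object
grid = GridSearchCV(
    pipeline,
    param_grid={'features__pca__n_components': [1, 2, 3],
                'features__univ_select__k': [1, 2]},
    cv=5,
)

# Fit the grid search
grid.fit(X, y)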

# Print the best parameters
print(grid.best_params_)
{'features__pca__n_components': 2, 'features__univ_select__k': 2}
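To see which columns a univariate selector actually keeps, its get_support() mask can be applied to the column names. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the dataset (whose columns are named like "mean radius" rather than radius_mean) instead of the Kaggle CSV. Note that chi2 requires non-negative inputs, and its score grows with the magnitude of a column, so it may favor large-valued features.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
cancer = load_breast_cancer(as_frame=True)
features, target = cancer.data, cancer.target

# chi2 is only valid for non-negative features
all_non_negative = bool((features >= 0).all().all())

# Keep the two features with the highest chi-squared scores
selector = SelectKBest(chi2, k=2).fit(features, target)
selected_columns = list(features.columns[selector.get_support()])

print("non-negative inputs:", all_non_negative)
print("selected columns:", selected_columns)
```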

Principal Component Analysis (PCA)

We can use the StandardScaler class from the sklearn.preprocessing module to standardize the features. This is an important step as it ensures that all features are on the same scale, which can improve the performance of many machine learning algorithms:

# Standardize the data. The result is stored in a new variable so that the
# dataframe stays available for the later steps.
# Note: data still contains the encoded diagnosis column at this point.
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Create a PCA that will retain 99% of the variance
pca = PCA(n_components=0.99, whiten=True)

# Fit the PCA on the data
pca.fit(scaled_data)
# Print the number of components required to retain 99% of the variance
print(pca.n_components_)
18

We can use the explained_variance_ratio_ attribute of the PCA object to see how much variance is explained by each principal component:

# Print the explained variance
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
[0.44896035 0.18472104 0.09183385 0.06446333 0.05351866 0.03895187
0.02208771 0.0156405 0.01344822 0.01131915 0.00983405 0.00938664
0.00841969 0.0068476 0.00479278 0.00284395 0.00257613 0.00190437]
0.9915498949973757
# Project the data onto the retained principal components
reduced_data = pca.transform(scaled_data)

# Print the shape of the reduced data
print(reduced_data.shape)
(569, 18)
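The explained variance ratios printed above can also be accumulated to see how many components a given variance threshold needs. The sketch below reuses the printed values directly; the helper name components_needed is only illustrative.

```python
import numpy as np

# Explained variance ratios copied from the output above
explained_ratio = np.array([
    0.44896035, 0.18472104, 0.09183385, 0.06446333, 0.05351866, 0.03895187,
    0.02208771, 0.0156405, 0.01344822, 0.01131915, 0.00983405, 0.00938664,
    0.00841969, 0.0068476, 0.00479278, 0.00284395, 0.00257613, 0.00190437,
])

# Running total of the variance explained by the first k components
cumulative = np.cumsum(explained_ratio)


def components_needed(threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.argmax(cumulative >= threshold) + 1)


for threshold in (0.90, 0.95, 0.99):
    print(f"{threshold:.0%} of the variance: {components_needed(threshold)} components")
```

With these ratios, 7 components reach 90% of the variance, 11 reach 95%, and 18 reach 99%, which matches the n_components_ value printed earlier.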

Next, we apply standard scaling to the input data X and then perform PCA to reduce it to two dimensions. The resulting principal components are stored in a dataframe, pca_df, along with the target variable y.

Standard scaling is a preprocessing step that scales the features of the data such that they have zero mean and unit variance. This is useful for many machine learning algorithms, as it can help to remove the scale of the features and make the model more robust to variations in the data.
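A quick way to check what standard scaling does is to compare StandardScaler with the formula (x - mean) / standard deviation. The sketch below uses the radius_mean and area_mean values of the first five rows from the head output above as a toy input.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# radius_mean and area_mean of the first five rows (very different scales)
raw = np.array([
    [17.99, 1001.0],
    [20.57, 1326.0],
    [19.69, 1203.0],
    [11.42, 386.1],
    [20.29, 1297.0],
])

scaled = StandardScaler().fit_transform(raw)

# Same computation by hand; StandardScaler uses the population std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print("column means:", scaled.mean(axis=0).round(6))
print("column stds: ", scaled.std(axis=0).round(6))
print("matches manual formula:", np.allclose(scaled, manual))
```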

The PCA class in scikit-learn centers the data (it does not scale it, which is why we standardize beforehand) and then computes the singular value decomposition (SVD) of the centered data matrix. The resulting principal components are the eigenvectors of the covariance matrix of the data, ranked in order of decreasing eigenvalue. Setting the n_components parameter to 2 selects the first two principal components.
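The link between PCA and the SVD can be checked numerically. The following is a minimal sketch on synthetic data (an assumption made only to keep the example self-contained): it centers the data, takes the SVD with NumPy, and compares the result with what scikit-learn's PCA reports. Component signs are arbitrary, so absolute values are compared.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (assumption): 100 samples, 4 features
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

# Center the data and take the SVD of the centered matrix
centered = samples - samples.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Eigenvalues of the covariance matrix follow from the singular values
explained_variance = singular_values ** 2 / (len(samples) - 1)

pca_demo = PCA(n_components=2).fit(samples)

same_variance = np.allclose(pca_demo.explained_variance_, explained_variance[:2])
same_directions = np.allclose(np.abs(pca_demo.components_), np.abs(vt[:2]))
print("variances match:", same_variance)
print("directions match (up to sign):", same_directions)
```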

The resulting dataframe pca_df contains the first two principal components as columns ‘PC1’ and ‘PC2’, and the target variable y as the final column.

# Extract the features and labels
X = data.drop('diagnosis', axis=1)
y = data['diagnosis']
X.describe()
def create_pca_df(X, y):
    X = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    pca_fit = pca.fit_transform(X)
    pca_df = pd.DataFrame(data=pca_fit, columns=['PC1', 'PC2'])
    final_df = pd.concat([pca_df, pd.DataFrame(y)], axis=1)
    return final_df


pca_df = create_pca_df(X, y)  # Create a dataframe with the first two principal components
pca_df.head()
import random
def plot_pca(dataframe, target):
    fig = plt.figure(figsize=(7, 5))
    ax = fig.add_subplot(1, 1, 1)
    ax.set_xlabel('PC1', fontsize=15)
    ax.set_ylabel('PC2', fontsize=15)
    ax.set_title(f'{target.capitalize()} ', fontsize=20)

    targets = list(dataframe[target].unique())
    colors = random.sample(['r', 'b', "g", "y"], len(targets))

    for t, color in zip(targets, colors):
        indices = dataframe[target] == t
        ax.scatter(dataframe.loc[indices, 'PC1'], dataframe.loc[indices, 'PC2'], c=color, s=50)
    ax.legend(targets)
    ax.grid()
    plt.show()


plot_pca(pca_df, "diagnosis")

This is a function that visualizes the results of principal component analysis (PCA) on a given dataframe. The function takes in two arguments:

  • dataframe: A pandas dataframe that contains the PCA-transformed data. The dataframe should have two columns named "PC1" and "PC2" that contain the values of the first and second principal components, respectively.
  • target: A string that represents the target variable in the dataframe. The function will use this column to color the data points according to the unique values of the target variable.

The function begins by creating a new figure and setting the x-axis and y-axis labels and the title. It then creates a list of the unique values of the target variable and generates a list of random colors for each value.

Next, the function iterates through the unique values of the target variable and the corresponding colors. For each value, it filters the dataframe to select only the rows where the value of the target variable is equal to the current value. It then plots these rows as a scatterplot, using the “PC1” and “PC2” columns as the x- and y-coordinates, respectively, and the current color as the color of the points.

Finally, the function adds a legend to the plot using the list of unique values of the target variable and shows the plot.

Machine Learning

It is important to evaluate the performance of a machine learning model on a separate test dataset, as this gives a better estimate of how the model will perform on unseen data. Evaluating on the training data gives an overly optimistic estimate and can hide overfitting, where the model performs well on the training data but poorly on new data.
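The gap between training and test performance is easy to see with a model that can memorize its training data. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled copy of the dataset rather than the Kaggle CSV, and an unconstrained decision tree chosen only because it tends to fit the training set almost perfectly.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Hold out 20% of the rows; stratify keeps the class balance in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo
)

# An unconstrained tree can memorize the training rows
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_accuracy = tree.score(X_tr, y_tr)
test_accuracy = tree.score(X_te, y_te)
print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy:  {test_accuracy:.3f}")
```

The training score is typically the higher of the two, which is why only the test score should be reported as an estimate of real-world performance.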

# Splitting the data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Logistic Regression

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy

from sklearn.metrics import accuracy_score

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
from sklearn.metrics import classification_report

# Generate a classification report
print(classification_report(y_test,y_pred))
# Random Forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
# Support Vector Machine

from sklearn.svm import SVC

model = SVC()
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9035087719298246
# K Nearest Neighbours

from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9385964912280702

# Naive Bayes

from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.9473684210526315
# Decision Tree

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)

0.9385964912280702

# XGBoost

from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(X_train, y_train)

# Accuracy

y_pred = model.predict(X_test)
accuracy_score(y_test, y_pred)
0.956140350877193
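With only 114 test samples, the accuracies above differ by one or two predictions (for example, 108 versus 109 correct), so a single split is a fragile basis for ranking models. A hedged alternative is to compare cross-validated scores. The sketch below assumes scikit-learn's bundled copy of the dataset and an arbitrary subset of three models, with scaling added inside a pipeline for the scale-sensitive ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): scikit-learn's bundled breast cancer dataset
X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Illustrative subset of models; scaling happens inside each CV fold
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "naive_bayes": GaussianNB(),
}

# Five accuracy scores per model, one per fold
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5)
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```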

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt')
max_features = ['sqrt', 'log2']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(X_train, y_train)

rf_random.best_params_

# Accuracy

y_pred = rf_random.predict(X_test)
accuracy_score(y_test, y_pred)

# Confusion Matrix

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm
array([[72,  0],
[ 5, 37]])
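The confusion matrix above can be turned into the usual metrics by hand. The sketch below reuses the printed values and assumes scikit-learn's layout (rows are true classes, columns are predicted classes) with malignant encoded as 1, as in this article.

```python
import numpy as np

# Confusion matrix copied from the output above
cm_demo = np.array([[72, 0],
                    [5, 37]])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()
precision = tp / (tp + fp)   # share of predicted malignant cases that are malignant
recall = tp / (tp + fn)      # share of malignant cases that were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```

Here all 72 benign cases are classified correctly, but 5 of the 42 malignant cases are missed. In a screening setting those false negatives matter more than the overall accuracy suggests, so recall is worth reporting alongside accuracy.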

Conclusion

In conclusion, using machine learning, hyperparameter tuning, and principal component analysis (PCA) can help to maximize the performance of breast cancer prediction models. Machine learning algorithms can learn complex relationships between the features and the target variable, and can be trained to make accurate predictions on new data. Hyperparameter tuning can help to optimize the performance of the machine learning algorithms by finding the best combination of hyperparameters. PCA can reduce the dimensionality of the data and help to identify patterns and relationships that might not be apparent in the original dataset, which can improve the accuracy of the predictions. By combining these techniques, it is possible to build highly effective breast cancer prediction models that can accurately classify tumors as benign or malignant.

Thanks for reading this article. You can find the full code for this project, along with my other projects, on my GitHub and Kaggle accounts. Happy coding!

Please feel free to contact me if you need any further information.

References

  1. https://www.iarc.who.int/featured-news/world-cancer-day-2021/
  2. https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data
