Exploring the Iris flower dataset

Emine Bozkus
10 min read · Dec 19, 2022


“Have you ever heard of the Iris flower dataset? It is one of the most well-known datasets in the world of machine learning and data science, and for good reason. It consists of 150 records of Iris flowers, including information about their sepal and petal length and width, as well as the type of Iris flower. In this blog post, we will be exploring the Iris dataset and learning about the different techniques and methods we can use to analyze and understand it. Whether you are a beginner or an experienced data scientist, this post will provide valuable insights and tips for working with this classic dataset.”

Figure 1. Iris dataset species

The Iris flower dataset is a classic dataset in the field of machine learning and statistical analysis. It consists of 150 observations of iris flowers, including the sepal and petal length and width for each flower, as well as the species of the flower. The dataset was introduced by British statistician and biologist Ronald Fisher in his 1936 paper, “The use of multiple measurements in taxonomic problems.”

In this notebook, we will explore the Iris dataset and use various statistical and machine learning techniques to better understand the relationships between the different features and the species of the flowers. We will also use the dataset to build and evaluate a classifier that can predict the species of an iris flower based on its measurements.

The variables are:

  • sepal_length: Sepal length, in centimeters, used as input.
  • sepal_width: Sepal width, in centimeters, used as input.
  • petal_length: Petal length, in centimeters, used as input.
  • petal_width: Petal width, in centimeters, used as input.
  • class: Iris Setosa, Versicolor, or Virginica, used as the target.

Let’s start by importing the necessary libraries and loading the dataset.

1. Import the necessary libraries

# Import necessary libraries
import numpy as np
import pandas as pd
import pandas_profiling
import tensorflow as tf
import sklearn.metrics as metrics

# for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
%matplotlib inline

# Import Warnings
import warnings
warnings.simplefilter(action="ignore")

# Setting Configurations:
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# for data splitting, transforming and model training
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

2. Load the data

Load the Iris dataset into a Pandas DataFrame.

# Load the Iris dataset into a Pandas DataFrame
data = pd.read_csv('/kaggle/input/iris/Iris.csv')
data.head()
data.drop('Id', axis=1, inplace=True)  # dropping the Id column as it is unnecessary
data.info()
from pandas.api.types import is_numeric_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
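If the Kaggle CSV path is not available on your machine, scikit-learn ships its own copy of the dataset. The sketch below is one possible way to load it into a DataFrame that uses the column names from this post. The renaming map and the `Iris-` prefix are my own assumptions to mimic the Kaggle file, and the bundled copy may differ from the Kaggle CSV in a few values.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load scikit-learn's bundled copy of the Iris data as a DataFrame
iris = load_iris(as_frame=True)
data = iris.frame.copy()

# Rename columns to match the Kaggle CSV used in this post (assumed mapping)
data = data.rename(columns={
    "sepal length (cm)": "SepalLengthCm",
    "sepal width (cm)": "SepalWidthCm",
    "petal length (cm)": "PetalLengthCm",
    "petal width (cm)": "PetalWidthCm",
})

# Replace the integer target with species names in the Kaggle style
data["Species"] = ["Iris-" + iris.target_names[t] for t in data["target"]]
data = data.drop(columns="target")

print(data.head())
print(data.shape)
```

This version has no `Id` column, so the `drop('Id', ...)` step is not needed with it.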

3. Preprocess the data

In the preprocessing step, you will typically perform a series of operations on your raw data to get it ready for further analysis or modeling. The specific steps you take will depend on your goals and the characteristics of your data, but some common tasks include:

  1. Cleaning the data: This involves fixing errors or missing values, and removing duplicates or irrelevant information.
  2. Normalizing or scaling the data: You may want to scale the numeric data so that it is on the same scale, or to handle outliers.
  3. Encoding categorical data: If you have categorical data (data that can be divided into a fixed number of categories), you may need to encode it as numerical data so that it can be used in a machine learning model.
  4. Splitting the data: You may want to split your data into a training set, a validation set, and a test set in order to evaluate the performance of your model.
  5. Dimensionality reduction: If you have a large number of features, you may want to reduce the number of dimensions by selecting a subset of the most important features, or by using techniques such as principal component analysis (PCA).

Overall, the goal of the preprocessing step is to get your data into a form that is suitable for further analysis or modeling.
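The encoding step from the list above is not needed for the models in this post, because scikit-learn classifiers accept string labels directly. For libraries that expect numeric targets, a minimal sketch with `LabelEncoder` could look like this. The handful of rows is hypothetical and only there to illustrate the idea.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few hypothetical rows, only to illustrate the encoding step
species = pd.Series(
    ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-setosa"],
    name="Species",
)

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)

print(list(encoder.classes_))  # classes are stored in sorted order
print(list(encoded))           # one integer code per row

# The integer codes can be mapped back to the original labels
decoded = encoder.inverse_transform(encoded)
print(list(decoded))
```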

# Checking for missing values
data.isnull().sum()
duplicates = data[data.duplicated()]
print("Number of duplicates:", len(duplicates))
# Drop duplicates
data = data.drop_duplicates()
# Checking for outliers
data.describe()
# Checking for the data types
data.dtypes
# Checking for the shape of the dataset
data.shape
(147, 5)
# Checking for the correlation
data.corr(numeric_only=True)  # Species is text, so restrict to the numeric columns (pandas >= 1.5)
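`describe()` only summarizes the columns, so it does not flag outliers by itself. One common convention, the same 1.5 × IQR rule that box plots use for their whiskers, can be sketched as follows. I use scikit-learn's bundled copy of the data here so the snippet runs on its own, which means its column names differ from the Kaggle CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Feature columns only, from scikit-learn's bundled copy of the dataset
features = load_iris(as_frame=True).data

# Quartiles and interquartile range per column
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR outside the quartiles
is_outlier = (features < q1 - 1.5 * iqr) | (features > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Whether a flagged value is an error or just an unusual flower is a judgment call, so this is a starting point rather than a rule for deleting rows.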

Skewness measures how asymmetric the distribution of values in a column is. A skewness of 0 means that the distribution is symmetrical. A positive value means that most of the weight sits on the left, with a longer tail stretching to the right, while a negative value means that most of the weight sits on the right, with a longer tail stretching to the left.
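To make the sign convention concrete, here is a small sketch with made-up numbers. The three series are hypothetical toy values chosen only to show a right tail, a left tail, and a symmetric case.

```python
import pandas as pd

# Hypothetical toy values chosen only to show the sign of the skewness
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])   # long tail on the right
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])   # long tail on the left
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tailed.skew())  # positive
print(left_tailed.skew())   # negative
print(symmetric.skew())     # zero (up to rounding)
```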

# Calculate the skewness for all columns
data.skew(numeric_only=True)  # skip the text column Species
# Checking for the unique values
data.nunique()
SepalLengthCm    35
SepalWidthCm     23
PetalLengthCm    43
PetalWidthCm     22
Species           3
dtype: int64
data.columns
Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm', 'Species'], dtype='object')
# Checking for the value counts
data["Species"].value_counts()
Iris-versicolor    50
Iris-virginica     49
Iris-setosa        48
Name: Species, dtype: int64
# Plotting the value counts as a bar chart
data['Species'].value_counts().plot(kind='bar')
plt.show()

4. Data visualization

Data visualization is the process of creating visual representations of data in order to gain insights and better understand the data. This can be done using various tools and techniques, such as graphs, charts, and maps.

There are many reasons to use data visualization, including:

  1. Identifying patterns and trends: Visualizing data can help you spot trends and patterns that may not be apparent in the raw data.
  2. Communicating data: Data visualization can make it easier to present data to others, as it allows you to convey complex information in a more easily digestible format.
  3. Comparing data: Visualizations can help you compare different data sets or variables and see how they relate to each other.
  4. Identifying outliers: Visualizations can help you identify unusual or unexpected data points, which can be useful for identifying errors or anomalies in the data.
# Visualize the whole dataset

sns.pairplot(data,hue="Species")
plt.show()

There is a high correlation between the petal length and width columns in the Iris dataset. The Setosa species has both low petal length and width, while the Versicolor species has both average petal length and width. The Virginica species, on the other hand, has both high petal length and width. In terms of sepal dimensions, the Setosa species has high sepal width and low sepal length, the Versicolor species has average values for both sepal dimensions, and the Virginica species has small sepal width but large sepal length.

plt.figure(figsize=(7,5))
# Plotting the heatmap
sns.heatmap(data.corr(numeric_only=True), annot=True)  # numeric columns only; Species is text
plt.show()

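Sepal length and sepal width are only weakly correlated with each other (the coefficient is slightly negative), while petal length and petal width are strongly positively correlated.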
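The same coefficients can be read off numerically instead of from the heatmap colors. The following sketch uses scikit-learn's bundled copy of the data (so the column names are not the Kaggle ones), and its values may differ slightly from the heatmap because no duplicates are dropped here.

```python
from sklearn.datasets import load_iris

# Feature columns from scikit-learn's bundled copy of the dataset
features = load_iris(as_frame=True).data
corr = features.corr()  # Pearson correlation by default

sepal_r = corr.loc["sepal length (cm)", "sepal width (cm)"]
petal_r = corr.loc["petal length (cm)", "petal width (cm)"]
print(f"sepal length vs sepal width: {sepal_r:.2f}")
print(f"petal length vs petal width: {petal_r:.2f}")
```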

# Plotting the boxplot
sns.boxplot(x='Species', y='SepalLengthCm', data=data)
plt.show()

sns.boxplot(x='Species', y='SepalWidthCm', data=data)
plt.show()

sns.boxplot(x='Species', y='PetalLengthCm', data=data)
plt.show()

sns.boxplot(x='Species', y='PetalWidthCm', data=data)
plt.show()

The Setosa species has smaller and more tightly clustered measurements than the other two species, with sepal width as the exception. The Versicolor species falls in the middle, both in size and in spread. The Virginica species, on the other hand, has the largest values and the widest spread. Each box plot shows the median and the interquartile range of a feature for each species (a box plot does not mark the mean). Together they show that the distributions of these features differ clearly between the three species.
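The numbers behind the box plots can be checked with a `groupby`. This is a sketch on scikit-learn's bundled copy of the data, and the `species` column name is my own choice for the example.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.data.copy()
# Map the integer targets to species names ("species" is a name chosen for this sketch)
df["species"] = iris.target_names[iris.target.to_numpy()]

grouped = df.groupby("species")
medians = grouped.median()                               # the line inside each box
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)    # the height of each box

print(medians)
print(iqr)
```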

# Plotting the violinplot
sns.violinplot(x='Species', y='SepalLengthCm', data=data)
plt.show()

sns.violinplot(x='Species', y='SepalWidthCm', data=data)
plt.show()

sns.violinplot(x='Species', y='PetalLengthCm', data=data)
plt.show()

sns.violinplot(x='Species', y='PetalWidthCm', data=data)
plt.show()
# Plotting the swarmplot
sns.swarmplot(x='Species', y='SepalLengthCm', data=data)
plt.show()
sns.swarmplot(x='Species', y='SepalWidthCm', data=data)
plt.show()
sns.swarmplot(x='Species', y='PetalLengthCm', data=data)
plt.show()
sns.swarmplot(x='Species', y='PetalWidthCm', data=data)
plt.show()

A distplot is a visualization that shows the distribution of a single numerical variable. It combines a histogram with a density plot, which is a continuous line that represents the probability density function of the data.

A distplot can be used to understand the shape and spread of the distribution of a variable. It can help you answer questions such as:

  • What is the central tendency of the data (mean, median, mode)?
  • What is the spread of the data (range, interquartile range, standard deviation)?
  • What is the shape of the distribution (symmetrical, skewed)?
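# Plotting the distributions (sns.distplot is deprecated since seaborn 0.11;
# sns.histplot with kde=True draws a comparable histogram with a density curve)
sns.histplot(data['SepalLengthCm'], kde=True)
plt.show()

sns.histplot(data['SepalWidthCm'], kde=True)
plt.show()

sns.histplot(data['PetalLengthCm'], kde=True)
plt.show()

sns.histplot(data['PetalWidthCm'], kde=True)
plt.show()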

A jointplot is a type of visualization that displays the joint distribution of two variables. It is a combination of a scatter plot and one or more histograms, and can be used to visualize the relationship between the two variables and the distribution of each variable separately.

The scatter plot portion of a jointplot shows the relationship between the two variables, while the histograms show the distribution of each variable individually. This allows you to see both the relationship between the variables and the distribution of each variable at the same time.

You can use a jointplot to answer questions such as:

  • Is there a relationship between the two variables?
  • What is the strength and direction of the relationship?
  • Are there any unusual or unexpected points in the data?
# Plotting the jointplot
sns.jointplot(x='SepalLengthCm', y='SepalWidthCm', data=data)
plt.show()

sns.jointplot(x='SepalLengthCm', y='PetalLengthCm', data=data)
plt.show()

sns.jointplot(x='SepalLengthCm', y='PetalWidthCm', data=data)
plt.show()

sns.jointplot(x='SepalWidthCm', y='PetalLengthCm', data=data)
plt.show()
# Plotting the stripplot
# Create a figure with 4 subplots
fig, ax = plt.subplots(2, 2)

# Plot the first stripplot in the top left subplot
sns.stripplot(x='Species', y='SepalLengthCm', data=data, ax=ax[0, 0])

# Plot the second stripplot in the top right subplot
sns.stripplot(x='Species', y='SepalWidthCm', data=data, ax=ax[0, 1])

# Plot the third stripplot in the bottom left subplot
sns.stripplot(x='Species', y='PetalLengthCm', data=data, ax=ax[1, 0])

# Plot the fourth stripplot in the bottom right subplot
sns.stripplot(x='Species', y='PetalWidthCm', data=data, ax=ax[1, 1])

plt.show()

5. Split the data into training and test sets

It’s important to choose an appropriate split ratio for your data. A common practice is to use an 80/20 or 70/30 split, with the larger portion going into the training set. However, the best split ratio will depend on the size and characteristics of your dataset.

# test_size=0.3 puts 70% of the rows in train and 30% in test.
# No random_state is set, so the split (and the scores below) change from run to run.
train, test = train_test_split(data, test_size=0.3)
print(train.shape)
print(test.shape)
X_train= train[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']]   # taking the training data features
y_train=train.Species # output of our training data
X_test= test[['SepalLengthCm','SepalWidthCm','PetalLengthCm','PetalWidthCm']] # taking test data features
y_test=test.Species #output value of test data
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
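With only about 45 test rows, a purely random split can leave the three species unevenly represented. A possible variant, sketched below on scikit-learn's bundled copy of the data, stratifies the split by species and fixes the seed so that results can be reproduced. The value 42 for `random_state` is an arbitrary choice.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in both subsets;
# random_state (42 is an arbitrary choice) makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
test_counts = Counter(y_test.tolist())
print(test_counts)

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0).round(3))
```

Fitting the scaler on the training rows only, as the code in this section also does, keeps information about the test set out of the preprocessing.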

6. Training the model

Training a model refers to the process of fitting a model to data, so that the model can learn to make predictions based on the data. In machine learning, this is typically done by optimizing the model’s parameters to minimize the error between the model’s predictions and the true values.

Logistic Regression

from sklearn.linear_model import LogisticRegression
# Train a logistic regression model on the training data
model = LogisticRegression()
model.fit(X_train, y_train)
prediction=model.predict(X_test)
print('The accuracy of the Logistic Regression is', metrics.accuracy_score(y_test, prediction))  # arguments are (y_true, y_pred)
The accuracy of the Logistic Regression is 0.9555555555555556
# Evaluate the performance of the model on the test data
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, average='micro')
print("F1 score:", f1)
F1 score: 0.9555555555555556
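The F1 score above is identical to the accuracy, and that is expected: for a problem where every flower has exactly one label, the micro-averaged F1 equals accuracy. A small sketch with hypothetical labels shows this, and also shows that `average='macro'` gives a different view by weighting each species equally.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted labels, only for illustration
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

accuracy = accuracy_score(y_true, y_pred)
micro_f1 = f1_score(y_true, y_pred, average="micro")  # pools all decisions together
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1

print("accuracy:", accuracy)
print("micro F1:", micro_f1)
print("macro F1:", macro_f1)
```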

KNN (K-Nearest Neighbors)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

classifier = KNeighborsClassifier(n_neighbors=2)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

# Summary of the predictions made by the classifier
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Accuracy score
from sklearn.metrics import accuracy_score
print('accuracy is', accuracy_score(y_test, y_pred))  # arguments are (y_true, y_pred)
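The value `n_neighbors=2` is one choice among many, and an even number of neighbors can end in a tied vote. One way to pick the value more systematically is cross-validation. The sketch below is an assumption on my side rather than part of the original notebook: it runs on scikit-learn's bundled copy of the data and tries an arbitrary range of odd values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]  # odd values, an arbitrary illustrative range
mean_scores = {}
for k in candidate_ks:
    # Scaling lives inside the pipeline so each fold is scaled on its own training part
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(pipeline, X, y, cv=5)  # 5-fold cross-validation accuracy
    mean_scores[k] = scores.mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```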

Support Vector Machine (SVM)

from sklearn import svm
model = svm.SVC() #select the algorithm
model.fit(X_train, y_train) # train the algorithm with the training data and the training output
prediction=model.predict(X_test) #pass the testing data to the trained algorithm
#check the accuracy of the algorithm.
print('The accuracy of the SVM is:', metrics.accuracy_score(y_test, prediction))  # arguments are (y_true, y_pred)
The accuracy of the SVM is: 0.9777777777777777
# A detailed classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
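The accuracies above come from a single random split of about 45 test flowers, so a difference of one or two correct predictions can change the ranking of the models. A steadier comparison, sketched here under my own assumptions (scikit-learn's bundled data, 5 shuffled folds, seed 0), averages the accuracy over several folds.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds; random_state=0 is an arbitrary seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(n_neighbors=2),
    "SVM": SVC(),
}
cv_accuracy = {}
for name, model in models.items():
    # Each model gets its own scaler, fitted inside every fold
    pipeline = make_pipeline(StandardScaler(), model)
    cv_accuracy[name] = cross_val_score(pipeline, X, y, cv=cv).mean()
    print(f"{name}: {cv_accuracy[name]:.3f}")
```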

You can find the other methods and a detailed description of the project in the notebook I prepared on Kaggle.

Conclusion

In conclusion, the Iris flower dataset is a valuable resource for understanding and practicing machine learning and data analysis techniques. Through our exploration of the dataset, we were able to uncover insights about the relationship between the different variables and the distribution of each variable. We also built a machine learning model to predict the species of iris based on the sepal and petal measurements, and were able to evaluate the performance of the model on a test set. Overall, the Iris flower dataset provides a great opportunity to learn about and practice data analysis and machine learning.

Thanks for reading this article. You can access the detailed code for this project and other projects on my GitHub account. Happy coding!

Please feel free to contact me if you need any further information.
