Exploratory Data Analysis (EDA) Using Pandas Profiling

Explore pandas profiling and see the benefit of this magical single line of code.

Dec 5, 2022

Pandas profiling is a Python library that performs an automated Exploratory Data Analysis. It automatically generates a dataset profile report that gives valuable insights.

pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding.

For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:

Type inference: detect the types of columns in a DataFrame
Essentials: type, unique values, indication of missing values
Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
Most frequent and extreme values
Histograms: categorical and numerical
Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
Missing values: through counts, matrix and heatmap
Duplicate rows: list of the most common duplicated rows
Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata.
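To make the quantile and descriptive statistics in the list above concrete, here is a minimal sketch that computes several of them by hand with plain pandas. The helper name `summarize_numeric` and the sample values are made up for illustration, and the formulas are the textbook definitions, which may differ in detail from what the report computes internally.

```python
import pandas as pd


def summarize_numeric(values: pd.Series) -> dict:
    """Compute a few of the listed statistics for one numeric column."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    median = values.median()
    return {
        "min": values.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": values.max(),
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "mean": values.mean(),
        "std": values.std(),
        "sum": values.sum(),
        # Median absolute deviation: median distance from the median.
        "mad": (values - median).abs().median(),
        # Coefficient of variation: spread relative to the mean.
        "cv": values.std() / values.mean(),
        "kurtosis": values.kurt(),
        "skewness": values.skew(),
    }


# Made-up values, only for illustration.
values = pd.Series([1, 2, 3, 4, 5])
stats = summarize_numeric(values)
print(stats["range"], stats["iqr"], stats["mad"])
```

The profile report gathers these numbers for every numeric column at once, which is what makes it more convenient than calling each method yourself.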

Let’s get into practice.

1. Installation 🛠️

1.1. Using pip

You can install using the pip package manager by running:

pip install -U pandas-profiling

1.2. Using conda

You can install using the conda package manager by running:

conda install -c conda-forge pandas-profiling

2. Import Libraries 📚

import numpy as np
import pandas as pd

Let’s import the pandas profiling library:

from pandas_profiling import ProfileReport

data = pd.read_csv("datasets/Telco-Customer-Churn.csv")

We have now loaded our data into a DataFrame.

3. Generate report

To generate the standard profiling report, merely run:

profile = ProfileReport(data, title="Pandas Profiling Report")

data.profile_report()
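Newer releases of the package are published under the name ydata-profiling (imported as `ydata_profiling`), so the import used in this article may fail on a fresh install. The sketch below tries both names. The helper `build_profile` is a hypothetical wrapper, not part of either library; the `minimal` flag is a real `ProfileReport` option that switches off the most expensive computations.

```python
import pandas as pd

# Prefer the newer package name, fall back to the one used in this article.
try:
    from ydata_profiling import ProfileReport
except ImportError:
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        ProfileReport = None  # neither package is installed


def build_profile(df: pd.DataFrame, title: str = "Pandas Profiling Report", minimal: bool = False):
    """Hypothetical helper: create a ProfileReport under either package name."""
    if ProfileReport is None:
        raise RuntimeError("Install pandas-profiling or ydata-profiling first.")
    # minimal=True switches off the most expensive computations.
    return ProfileReport(df, title=title, minimal=minimal)
```

With the DataFrame loaded earlier, `profile = build_profile(data)` would create the report object, and in a Jupyter notebook `profile.to_notebook_iframe()` displays it inline.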

4. Exporting the report to a file

To generate an HTML report file, assign the ProfileReport to a variable and call its to_file() method:

profile.to_file("your_report.html")

Alternatively, the report’s data can be obtained as JSON, either as a string or as a file:

# As a JSON string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
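Because the JSON output is an ordinary string, it can be parsed with the standard library and inspected in code. Here is a small sketch; `top_level_sections` is a hypothetical helper, and it only assumes that the object passed in has a `to_json()` method returning a JSON object, as a ProfileReport does. The exact keys depend on the library version, so none are hard-coded here.

```python
import json


def top_level_sections(profile) -> list:
    """Return the sorted top-level keys of a report's JSON output.

    `profile` is assumed to be any object with a to_json() method that
    returns a JSON object as a string, such as a ProfileReport.
    """
    report = json.loads(profile.to_json())
    return sorted(report.keys())
```

Calling `top_level_sections(profile)` on the report created above is a quick way to see which parts of the report are available for further processing.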

The main disadvantage of pandas profiling is its performance on large datasets: as the data grows, the time needed to generate the report increases considerably.

One way to solve this problem is to generate the report from only a part of all the data we have.

profile = ProfileReport(data.sample(n=100))
profile.to_file(output_file='output.html')
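Two details are worth handling when sampling: `sample(n=...)` raises an error if the DataFrame has fewer rows than requested, and without a fixed seed every run profiles different rows. A minimal sketch follows; the helper name `sample_for_profiling`, its defaults, and the demo frame are assumptions for illustration.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 100, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: return at most max_rows randomly chosen rows."""
    if len(df) <= max_rows:
        return df  # small enough already; sample(n=...) would fail if n > len(df)
    # A fixed random_state makes the sample, and so the report, reproducible.
    return df.sample(n=max_rows, random_state=seed)


# Made-up frame standing in for a large dataset.
big = pd.DataFrame({"value": range(1_000)})
subset = sample_for_profiling(big, max_rows=100)
print(len(subset))
```

The result can be passed to ProfileReport in place of the full DataFrame. Keep in mind that a small sample can miss rare categories or outliers, so the report describes the sample rather than the full dataset.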

The report presents the following sections interactively.

Overview: mostly global details about the dataset (number of records, number of variables, overall missingness and duplicates, memory footprint)

Variables: provides a detailed analysis of each variable (column/feature) in the dataset. The statistics shown depend on the variable's type (numeric, string, boolean, etc.).

Interactions: gives more detail through bivariate and multivariate analysis.

Correlations: correlation is a common tool for describing simple relationships without making a statement about cause and effect. The pandas profiling report offers five types of correlation coefficients: Pearson’s r, Spearman’s ρ, Kendall’s τ, Phik (φk), and Cramér’s V (φc).
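The coefficients differ in what kind of relationship they detect. As a small illustration with made-up data, the sketch below uses plain pandas to compare Pearson's r, which measures linear association, with Spearman's ρ, which works on ranks and therefore reaches 1 for any strictly increasing relationship.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]

print(f"Pearson r:  {pearson:.3f}")
print(f"Spearman ρ: {spearman:.3f}")
```

Here Spearman's ρ is 1 while Pearson's r stays below 1, which is why looking at more than one coefficient can reveal non-linear relationships.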

Missing values: the report also gives a detailed analysis of missing values in four types of graph: count, matrix, heatmap, and dendrogram.
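The numbers behind the count view can be reproduced with plain pandas. A minimal sketch on a made-up DataFrame (the column names and values are invented for illustration):

```python
import pandas as pd

# Made-up frame with a few gaps.
df = pd.DataFrame(
    {
        "age": [34, None, 29, 41],
        "city": ["Oslo", "Lima", None, None],
        "score": [0.5, 0.7, 0.9, 0.4],
    }
)

# Missing cells per column, as an absolute count and as a percentage.
missing = pd.DataFrame(
    {
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    }
)
print(missing)
```

The matrix, heatmap, and dendrogram views go further by showing where the gaps occur and whether columns tend to be missing together.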

Sample: displays the first and last 10 rows of the dataset.

Conclusion

Pandas Profiling is an awesome Python package for exploratory data analysis (EDA). It extends pandas with statistical summaries, including correlations, missing values, distributions (quantiles), and descriptive statistics. You can view the EDA I created with the diabetes dataset on Kaggle here. You can also check the Titanic dataset report here.

I definitely recommend trying it!

Thanks for reading this article. You can find the detailed code for this project and my other projects on my GitHub or Kaggle account. Happy coding!

If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.

References