Exploratory Data Analysis (EDA) Using Pandas Profiling
Explore pandas profiling and see the benefit of this magical single line of code.
Pandas profiling is a Python library that performs an automated Exploratory Data Analysis. It automatically generates a dataset profile report that gives valuable insights.
pandas-profiling generates profile reports from a pandas DataFrame. The pandas df.describe() function is handy yet a little basic for exploratory data analysis. pandas-profiling extends the pandas DataFrame with df.profile_report(), which automatically generates a standardized univariate and multivariate report for data understanding.
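As a point of comparison, df.describe() covers only a handful of per-column statistics. A minimal sketch with a toy DataFrame (the values below are made up for illustration):

```python
import pandas as pd

# Toy DataFrame standing in for a real dataset (hypothetical values)
df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 8],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 99.65],
})

# df.describe() gives count, mean, std, min, quartiles, max -- and little else;
# pandas-profiling builds on this with type inference, correlations, and more
summary = df.describe()
print(summary.loc["mean"])
```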
For each column, the following information (whenever relevant for the column type) is presented in an interactive HTML report:
- Type inference: detect the types of columns in a DataFrame
- Essentials: type, unique values, indication of missing values
- Quantile statistics: minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent and extreme values
- Histograms: categorical and numerical
- Correlations: high correlation warnings, based on different correlation metrics (Spearman, Pearson, Kendall, Cramér’s V, Phik)
- Missing values: through counts, matrix and heatmap
- Duplicate rows: list of the most common duplicated rows
- Text analysis: most common categories (uppercase, lowercase, separator), scripts (Latin, Cyrillic) and blocks (ASCII, Cyrillic)
- File and Image analysis: file sizes, creation dates, dimensions, indication of truncated images and existence of EXIF metadata.
Let’s get into practice.
1. Installation 🛠️
1.1. Using pip
You can install using the pip package manager by running:
pip install -U pandas-profiling
1.2. Using conda
You can install using the conda package manager by running:
conda install -c conda-forge pandas-profiling
2. Import Libraries 📚
import numpy as np
import pandas as pd
Let’s import the pandas profiling library:
from pandas_profiling import ProfileReport
data = pd.read_csv("datasets/Telco-Customer-Churn.csv")
We loaded the Telco Customer Churn data into a DataFrame.
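Before generating a report, it helps to glance at what was loaded. A minimal sketch (the frame below is a made-up stand-in for the CSV; the same checks apply to the real data frame):

```python
import pandas as pd

# Tiny synthetic frame standing in for the loaded data (columns and values are assumptions)
sample = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "tenure": [1, 34, 2],
    "Churn": ["No", "No", "Yes"],
})

# Shape, dtypes, and missingness tell you what the profiler will have to work with
print(sample.shape)
print(sample.dtypes)
print(sample.isna().sum())
```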
3. Generate report
To generate the standard profiling report, simply run:
profile = ProfileReport(data, title="Pandas Profiling Report")
Alternatively, you can call the report method directly on the DataFrame:
data.profile_report()
4. Exporting the report to a file
To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, the report’s data can be obtained as a JSON file:
# As a JSON string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
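The JSON output can then be post-processed with the standard library. A sketch using a hand-made fragment in place of a real report string (the keys and values below are assumptions for illustration; a real report is far larger):

```python
import json

# Stand-in for the string returned by profile.to_json()
# (keys and values here are illustrative, not the library's exact schema)
json_data = '{"analysis": {"title": "Pandas Profiling Report"}, "table": {"n": 7043}}'

# Parse the report and pull out individual fields programmatically
report = json.loads(json_data)
print(report["analysis"]["title"])
print(report["table"]["n"])
```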
The main disadvantage of pandas profiling is its performance on large datasets: as the data grows, the time to generate the report increases sharply.
One way around this is to generate the report from only a sample of the data.
profile = ProfileReport(data.sample(n=100))
profile.to_file(output_file='output.html')
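The sampling step can be made reproducible by fixing a seed, so reruns profile the same rows (the library also offers ProfileReport(..., minimal=True), which disables the most expensive computations for large data). A pandas-only sketch of the sampling step, using a synthetic frame in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Synthetic frame standing in for a large real dataset (values are made up)
rng = np.random.default_rng(0)
big = pd.DataFrame({
    "tenure": rng.integers(0, 72, size=10_000),
    "MonthlyCharges": rng.uniform(18, 120, size=10_000).round(2),
})

# Profile only a random subset; random_state makes the sample reproducible
subset = big.sample(n=100, random_state=42)
print(subset.shape)  # (100, 2)
```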
The following sections are included in the interactive report.
- Overview: global details about the dataset (number of records and variables, overall missingness and duplicates, memory footprint)
- Variables: detailed per-column analysis, tailored to each variable's type (numeric, string, Boolean, etc.)
- Interactions: bivariate and multivariate analysis of relationships between pairs of variables
- Correlations: describes simple relationships between variables without implying cause and effect; the report includes five correlation coefficients: Pearson's r, Spearman's ρ, Kendall's τ, Phik (φk), and Cramér's V (φc)
- Missing values: detailed analysis via four plot types: count, matrix, heatmap, and dendrogram
- Sample: displays the first and last 10 rows of the dataset
Conclusion
Pandas Profiling is an awesome Python package for exploratory data analysis (EDA). It extends pandas with statistical summaries including correlations, missing values, quantile distributions, and descriptive statistics. You can view the EDA I created with the diabetes dataset on Kaggle here. You can also check the Titanic dataset report here.
I definitely recommend you try it!
Thanks for reading this article. You can access the detailed codes of the project and other projects on my Github account or Kaggle account. Happy coding!
If you have any feedback, feel free to share it in the comments section or contact me if you need any further information.