The Top Data Cleaning and Feature Engineering Books Every Data Scientist Should Own

Emine Bozkus
6 min read · Dec 27, 2022

Many books on data cleaning and feature engineering can help data scientists. Here are six that I recommend:

1. Data Wrangling with Python: Tips and Tools to Make Your Life Easier, by Jacqueline Kazil and Katharine Jarmul

This book covers a wide range of data cleaning and preprocessing techniques, including handling missing values, working with strings and dates, and dealing with messy data. It also covers how to use Python’s pandas library for data manipulation.
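To make those cleanup tasks concrete, here is a minimal sketch that is not taken from the book. It uses only the Python standard library so it runs anywhere; the records, field names, and the choice of median imputation are assumptions for illustration. In pandas, the rough equivalents would be `fillna`, the `.str` accessor, and `to_datetime`.

```python
from datetime import datetime
from statistics import median

# Hypothetical messy records; field names and values are illustrative only.
raw_rows = [
    {"name": "  Alice ", "signup": "2022-12-27", "age": "34"},
    {"name": "BOB", "signup": "27/12/2022", "age": ""},
    {"name": "carol", "signup": "2022-12-25", "age": "29"},
    {"name": "dave", "signup": "unknown", "age": "41"},
]


def parse_date(text):
    """Try a few known date formats; return None if none of them match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def clean_rows(rows):
    # Median of the ages that are present, used to fill the missing ones.
    known_ages = [int(r["age"]) for r in rows if r["age"].strip()]
    fill_age = median(known_ages)
    cleaned = []
    for r in rows:
        cleaned.append({
            # Strip stray whitespace and normalize capitalization.
            "name": r["name"].strip().title(),
            # Dates that cannot be parsed become None instead of raising.
            "signup": parse_date(r["signup"].strip()),
            "age": int(r["age"]) if r["age"].strip() else fill_age,
        })
    return cleaned


cleaned_rows = clean_rows(raw_rows)
```

Whether to impute a missing age with the median, drop the row, or flag it depends on the analysis; the sketch only shows one option.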

The book begins with an introduction to Python and the fundamental concepts of data wrangling, including how to read and write data, how to explore and summarize data, and how to visualize data using Python libraries such as pandas, NumPy, and Matplotlib.

As the book progresses, it covers more advanced topics such as how to work with large datasets, how to perform statistical analysis, and how to use machine learning algorithms to analyze data. It also includes practical examples and exercises to help readers apply what they have learned.

Overall, “Data Wrangling with Python” is a comprehensive guide to working with data in Python, and is suitable for people who are new to both programming and data analysis. It is a useful resource for anyone who wants to learn how to manipulate, analyze, and visualize data using Python.

2. Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, by Q. Ethan McCallum

The “Bad Data Handbook” provides guidance on how to identify and fix problems with data. Data quality matters in any project that bases analysis or decisions on data: poor-quality data can lead to incorrect conclusions and ineffective decisions, which can have serious consequences for organizations.

The book covers a variety of topics related to data quality, including how to identify bad data, how to clean and transform data, and how to maintain data quality over time. It also includes practical tips and best practices for working with data, as well as case studies and real-world examples to illustrate the concepts.

Some of the key topics covered in the “Bad Data Handbook” include:

  • Identifying common problems with data, such as missing values, incorrect data types, and incorrect formatting
  • Techniques for cleaning and transforming data, including using functions and scripts, using data quality tools, and working with data in a spreadsheet or database
  • Strategies for maintaining data quality, including implementing data governance policies and processes, and using data quality tools and techniques to monitor and improve data quality
  • Case studies and examples of how to apply these concepts in real-world situations
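The first bullet above, identifying missing values, incorrect data types, and incorrect formatting, can be sketched as a small checker. This is an illustrative example, not code from the book; the records, column names, and the deliberately simple email pattern are all assumptions.

```python
import re

# Hypothetical records; column names and values are illustrative only.
records = [
    {"id": "1", "email": "a@example.com", "amount": "19.99"},
    {"id": "2", "email": "", "amount": "12.50"},
    {"id": "3", "email": "not-an-email", "amount": "twelve"},
]

# A deliberately loose pattern: something@something.something
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_number(text):
    """Return True if the text can be parsed as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def find_problems(rows):
    """Return a list of (row id, column, problem) tuples."""
    problems = []
    for row in rows:
        # Missing values: any field that is empty or whitespace only.
        for column, value in row.items():
            if value.strip() == "":
                problems.append((row["id"], column, "missing"))
        # Incorrect type: amount should parse as a number.
        if row["amount"].strip() and not is_number(row["amount"]):
            problems.append((row["id"], "amount", "not numeric"))
        # Incorrect format: email should look like an address.
        if row["email"].strip() and not EMAIL_PATTERN.match(row["email"]):
            problems.append((row["id"], "email", "bad format"))
    return problems


problems = find_problems(records)
```

Collecting problems into a list, rather than fixing them on the spot, keeps detection separate from the decision about how to repair each one.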

Overall, the “Bad Data Handbook” is a useful resource for anyone who works with data and wants to ensure that they are using high-quality data for their analysis and decision-making.

3. Feature Engineering and Selection: A Practical Approach for Predictive Models, by Max Kuhn and Kjell Johnson

“Feature Engineering and Selection: A Practical Approach for Predictive Models” teaches readers how to create and select features for predictive modeling. Feature engineering and selection are important parts of the data science process because they can significantly affect the accuracy and performance of a predictive model.

The book covers a wide range of topics related to feature engineering and selection, including how to create new features from existing data, how to select the most relevant features for a particular problem, and how to evaluate the effectiveness of different features. It also includes practical examples and case studies to illustrate the concepts and show how they can be applied in real-world situations.

Some of the key topics covered in “Feature Engineering and Selection” include:

  • The role of feature engineering and selection in the data science process
  • Techniques for creating and transforming features, including feature extraction, feature selection, and feature construction
  • Strategies for evaluating and comparing the effectiveness of different features
  • Case studies and examples of how to apply these concepts in real-world situations

Overall, “Feature Engineering and Selection” is a useful resource for anyone who is interested in predictive modeling and wants to learn more about how to create and select effective features for their models.

4. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, by Alice Zheng and Amanda Casari

“Feature Engineering for Machine Learning” by Alice Zheng and Amanda Casari is a comprehensive guide to the principles and techniques of feature engineering for machine learning.

The book covers a wide range of topics, including:

  • The role of feature engineering in the machine learning process
  • Techniques for preprocessing and cleaning data
  • Encoding and transforming features
  • Feature selection and dimensionality reduction
  • Creating new features through feature construction and feature transformation
  • Working with different types of data, including numerical, categorical, and text data

The book also includes practical examples and case studies to illustrate the concepts and techniques discussed. It is suitable for both beginner and experienced data scientists who want to learn more about feature engineering and how to effectively preprocess and transform data for machine learning.

5. Python Feature Engineering Cookbook: Over 70 Recipes for Creating, Engineering, and Transforming Features to Build Machine Learning Models, by Soledad Galli

Feature engineering is an important step in the machine learning process, as it involves creating and transforming features to make them more suitable for building effective models. The “Python Feature Engineering Cookbook” by Soledad Galli offers a collection of over 70 recipes for creating, engineering, and transforming features in Python.

Some of the topics covered in the book include:

  • Working with numerical and categorical features
  • Handling missing values
  • Encoding categorical features
  • Extracting features from text
  • Feature selection and dimensionality reduction
  • Creating new features through feature transformation and feature construction
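As an illustration of one topic from the list above, encoding categorical features, here is a minimal one-hot encoding sketch in plain Python. It is not a recipe from the book, and the `colors` column is a made-up example. In practice you would more likely reach for pandas `get_dummies` or scikit-learn's `OneHotEncoder`.

```python
def one_hot_encode(values):
    """Map each value to a list of 0/1 indicators, one per distinct category."""
    # Sort the categories so the column order is deterministic.
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


colors = ["red", "green", "red", "blue"]  # hypothetical categorical column
categories, encoded = one_hot_encode(colors)

for value, row in zip(colors, encoded):
    print(value, row)
```

Each row contains exactly one 1, in the position of its category, so a model sees the categories as separate binary features instead of an arbitrary numeric ordering.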

By following the recipes in the book, you can learn how to effectively preprocess and transform your data to improve the performance of your machine learning models. The book also includes practical tips and best practices for feature engineering, making it a useful resource for both beginner and experienced data scientists.

6. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, by Jason W. Osborne

“Best Practices in Data Cleaning” by Jason W. Osborne is a comprehensive guide to data cleaning, an important step in preparing data for machine learning and data analysis.

The book covers a wide range of topics, including:

  • The importance of data cleaning and the impact of dirty data on the accuracy of machine learning models
  • Techniques for identifying and correcting errors and inconsistencies in data
  • Strategies for handling missing values and outliers
  • Best practices for data cleaning, including planning, documentation, and testing
  • Tips for working with different types of data, including numerical, categorical, and text data
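To illustrate the bullet on missing values and outliers, here is a small sketch using the common 1.5 × IQR rule of thumb. That rule is my choice for the example and is not necessarily the method the book recommends; the measurements are made up.

```python
from statistics import quantiles

# Hypothetical measurements; None marks a missing value.
measurements = [10, 12, None, 11, 13, 12, 11, 95]


def find_outliers(values, k=1.5):
    """Return the values that fall outside the k * IQR fences."""
    # Drop missing values before computing any statistics.
    present = [v for v in values if v is not None]
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(present, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < lower or v > upper]


outliers = find_outliers(measurements)
print(outliers)
```

A flagged value is only a candidate for review: whether it is an entry error or a genuine extreme observation is a judgment the rule cannot make for you.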

By following the best practices and techniques outlined in the book, you can ensure that your data is clean and ready for analysis or machine learning. The book is suitable for both beginner and experienced data scientists who want to learn more about data cleaning and how to effectively prepare data for analysis and machine learning.

🌸Thanks for reading this article. You can access my projects on my GitHub account or Kaggle account. Happy coding!

Please feel free to contact me if you need any further information.🌸
