Strategies and Techniques for Data Cleaning in Excel and Python
Master essential data-cleaning techniques in Excel and Python, and turn messy data into a valuable resource.
- Know how to clean and analyse data in Excel more efficiently using Tidy Data Principles.
- Understand how to use common techniques in Python to speed up your data-cleaning operations.
- Understand how to clean and manipulate data using Pandas, one of the most popular data science libraries.
- Be comfortable working through a complete data-cleaning pipeline from import to investigation, cleaning and export.
Clean data is data that does not require any transformation or updates prior to commencing analysis. If your data requires cleansing work prior to use, then you have messy data.
If you primarily work with Excel, working with messy data means you regularly find yourself carrying out tasks such as:
- Removing non-printable characters such as line breaks.
- Removing leading and trailing spaces or extra spaces.
- Using the Text to Columns wizard to split data into separate columns.
- Populating blank cells or removing blank rows or columns.
- Identifying duplicate values or duplicate data.
- Using conditional formatting to highlight errors.
- Correcting the capitalisation of text.
- Using Paste Special to clear formats.
This list is far from exhaustive and messy data is commonplace. Every messy dataset encountered is messy in its own unique way and requires its own unique processes to clean. However, once all such issues are solved and we have clean data, we can progress to the interesting part of the job, the analysis.
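Several of the Excel chores listed above have close equivalents in pandas. The following is a minimal sketch rather than a definitive recipe: the table, its column names (`name`, `city_country`) and the comma-space delimiter are all invented for illustration.

```python
import pandas as pd

# Hypothetical messy data, invented purely for illustration.
raw = pd.DataFrame(
    {
        "name": ["  alice SMITH ", "Bob\nJones", "  alice SMITH ", None],
        "city_country": ["Leeds, UK", "Lyon, France", "Leeds, UK", None],
    }
)

# Remove rows in which every cell is blank.
df = raw.dropna(how="all")

# Replace line breaks and runs of whitespace with a single space,
# trim leading/trailing spaces, then correct the capitalisation.
df = df.assign(
    name=df["name"]
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
    .str.title()
)

# Rough equivalent of Text to Columns: split one column into two.
df[["city", "country"]] = df["city_country"].str.split(", ", expand=True)

# Remove duplicate rows and renumber the remaining ones.
df = df.drop_duplicates().reset_index(drop=True)

print(df)
```

Because each step is a line of code rather than a sequence of clicks, the same cleaning can be re-run unchanged on next month's version of the file.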
In this course we'll work through the complete data-cleaning pipeline, from importing data, to cleaning and manipulating it, to exporting it for analysis.
We'll do this using both Excel and Python, and we'll see how Python can be used to speed up our data cleaning operations.
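To give a flavour of that pipeline, here is a small sketch of the four stages (import, investigate, clean, export) in pandas. The CSV contents and column names are made up, an in-memory string stands in for a file on disk, and dropping rows with a missing `amount` is just one possible cleaning choice.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; the contents are invented for illustration.
csv_text = """order_id,customer,amount
1, Ann ,10.50
2,Ben,
2,Ben,
3,Cy,7.25
"""

# 1. Import
df = pd.read_csv(io.StringIO(csv_text))

# 2. Investigate: size, missing values per column, duplicate rows
n_rows, n_cols = df.shape
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())

# 3. Clean: drop duplicate rows, trim stray spaces, drop rows with no amount
clean = (
    df.drop_duplicates()
    .assign(customer=lambda d: d["customer"].str.strip())
    .dropna(subset=["amount"])
    .reset_index(drop=True)
)

# 4. Export (to an in-memory buffer here; a file path works the same way)
out = io.StringIO()
clean.to_csv(out, index=False)
```

In practice the two `io.StringIO` objects would be replaced by real file paths, and the investigation step would guide which cleaning steps are actually needed.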
Figure 1. Basic data exploration in Python
We'll start by focusing on the Tidy Data Principles. These principles are a set of rules that help us to structure our data in a way that makes it easy to clean, manipulate and analyse.
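As commonly stated, tidy data puts each variable in its own column and each observation in its own row. The sketch below reshapes a hypothetical "wide" spreadsheet-style table, with one column per year, into that layout using `DataFrame.melt`; the table and the names `region`, `year` and `sales` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, as often seen in spreadsheets.
wide = pd.DataFrame(
    {
        "region": ["North", "South"],
        "2022": [100, 80],
        "2023": [120, 90],
    }
)

# Reshape so that year becomes a variable (column) and each row is one
# region-year observation.
tidy = wide.melt(id_vars="region", var_name="year", value_name="sales")

# The year labels started life as column headers (text), so convert them.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

In the tidy layout, filtering by year or summing sales per region becomes a single operation on one column instead of a formula spread across several.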
Next, we'll look at some of the most common data-cleaning techniques in Excel and see how the same operations can be carried out faster in Python.
Finally, we'll look at how to clean and manipulate data using Pandas, one of the most popular data science libraries.
When you complete this course, you'll have the ability to take real-world data and clean it for further analysis in your own data science projects.
Use AI to help you learn!
All digiLab Academy subscribers have access to an embedded AI tutor! This is great for...
- Helping to clarify concepts and ideas that you don't fully understand after completing a lesson.
- Explaining the code and algorithms covered during a lesson in more detail.
- Generating additional examples of whatever is covered in a lesson.
- Getting immediate feedback and support around the clock, even when your course tutor is asleep!