Bites of Data

Break Up With Bad Data

It’s time to move on from bad data.

Colin Murray
January 17, 2025 • Estimated Reading Time: 3 minutes

LET’S TALK ABOUT:

Moving on from Bad Data

Constantly working with bad data is a recipe for disaster. Not only does it frustrate everyone involved, but it also leads to incorrect results and undesirable outcomes. The time to move on is now, but first, we need to recognize the signs of bad data.

Knowing What Is Good Data

To determine if we have bad data, we must first identify what constitutes good data. Along with asking the five critical questions before any project, it’s essential to understand the expected output. By collaborating with the product owner or business analyst, we can gain a better understanding of the ideal state of the data.

This is an excellent opportunity to work directly with other teams. In my experience, data teams can provide immense value to the business by developing an initial understanding of the data. By grasping the basics of each dataset, they can more effectively facilitate conversations with the rest of the organization. Often, data teams find themselves explaining to the business what is available to them. This context is invaluable when working together to identify what constitutes good data.

…vs. What Is Bad Data

When looking for bad data, we first need to know what to watch for. Bad data can take many forms, including (but not limited to) the following:

Null records
Incorrectly formatted dates
Duplicate records

These are just a few basic examples I’ve encountered that can wreak havoc on analytics efforts. Each can cause unintended consequences, such as incorrect calculations and faulty outputs. Another common issue with bad data is missing data.
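To make the three examples above concrete, here is a minimal sketch in plain Python that counts each of them in a small batch of records. The function name `profile_records`, the field names, and the expected date format (`YYYY-MM-DD`) are all assumptions made for illustration; your own data will have its own fields and its own definition of "correct."

```python
from collections import Counter
from datetime import datetime


def profile_records(records, required_fields, date_field, date_format="%Y-%m-%d"):
    """Count null records, badly formatted dates, and duplicates in a list of dicts."""
    null_records = 0
    bad_dates = 0
    seen = Counter()

    for record in records:
        # A record counts as "null" if any required field is missing, None, or blank.
        if any(record.get(field) in (None, "") for field in required_fields):
            null_records += 1

        # A date counts as badly formatted if it is present but does not parse.
        raw_date = record.get(date_field)
        if raw_date not in (None, ""):
            try:
                datetime.strptime(raw_date, date_format)
            except (ValueError, TypeError):
                bad_dates += 1

        # Two records are duplicates when every field and value matches.
        seen[tuple(sorted(record.items()))] += 1

    # Every repeat beyond the first occurrence counts as one duplicate.
    duplicates = len(records) - len(seen)
    return {"null_records": null_records, "bad_dates": bad_dates, "duplicates": duplicates}


# Hypothetical sample: one null amount, one wrong date format, one exact duplicate.
records = [
    {"order_id": 1, "amount": 20.0, "order_date": "2025-01-15"},
    {"order_id": 2, "amount": None, "order_date": "2025-01-16"},
    {"order_id": 3, "amount": 12.5, "order_date": "16/01/2025"},
    {"order_id": 1, "amount": 20.0, "order_date": "2025-01-15"},
]
report = profile_records(records, ["order_id", "amount", "order_date"], "order_date")
print(report)  # {'null_records': 1, 'bad_dates': 1, 'duplicates': 1}
```

A summary like this will not fix anything on its own, but it gives you numbers to bring to the conversation with the business about what good data should look like.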

The Effects of Missing Data

Missing data can complicate analysis and produce unintended results. For instance, if a subset of data is missing, the output will tell a different story than what is actually happening. This discrepancy erodes trust within the business. Once that trust is lost, it can be incredibly challenging to regain.

Before identifying missing data, it is critical to understand the expected behavior of the data. For example, a retailer might find that transactions are missing for a specific time period. This can happen for several reasons:

The data pipeline did not properly capture transactions during that window.
The system failed to log the transactions, preventing them from appearing.
Validation steps in the pipeline incorrectly filtered out the data.
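Whatever the cause, a gap like the retailer's can often be spotted by comparing the data against its expected behavior. The sketch below assumes the store should have at least one transaction every day, which will not be true for every business. The function name `find_missing_days` and the sample timestamps are hypothetical.

```python
from datetime import date, datetime, timedelta


def find_missing_days(timestamps, start, end):
    """Return every date from start to end (inclusive) that has no transactions."""
    days_with_data = {ts.date() for ts in timestamps}
    missing = []
    current = start
    while current <= end:
        if current not in days_with_data:
            missing.append(current)
        current += timedelta(days=1)
    return missing


# Hypothetical transaction timestamps: nothing was recorded on January 15.
transactions = [
    datetime(2025, 1, 14, 9, 30),
    datetime(2025, 1, 14, 16, 5),
    datetime(2025, 1, 16, 11, 45),
]
gaps = find_missing_days(transactions, date(2025, 1, 14), date(2025, 1, 16))
print(gaps)  # [datetime.date(2025, 1, 15)]
```

A check like this tells you that a window is empty, not why. The reasons listed above still need to be ruled out one by one.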

A crucial step in identifying missing data is understanding how it might have gone missing. While the above scenarios are common, proper logging in your pipeline, such as recording how many records enter and leave each step, will help you pinpoint where the data was lost.
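As a rough illustration of pipeline logging, the sketch below wraps each step so that it logs how many rows went in and how many came out, using Python's standard `logging` module. The helper `run_stage`, the validation rule, and the sample rows are assumptions for illustration only, not a prescribed pipeline design.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger("pipeline")


def run_stage(name, stage, rows):
    """Run one pipeline stage and log how many rows went in and came out."""
    result = stage(rows)
    dropped = len(rows) - len(result)
    logger.info("%s: %d rows in, %d rows out", name, len(rows), len(result))
    if dropped > 0:
        # A warning makes unexpected filtering visible when reviewing logs.
        logger.warning("%s dropped %d rows", name, dropped)
    return result


def keep_positive_amounts(rows):
    # Hypothetical validation rule; it also drops refunds, which may be unintended.
    return [row for row in rows if row["amount"] > 0]


rows = [{"amount": 20.0}, {"amount": -5.0}, {"amount": 12.5}]
valid_rows = run_stage("validate", keep_positive_amounts, rows)
```

With counts logged at every step, a validation rule that filters out too much shows up as a warning at that step, which narrows down where to look.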

What about you?

What are some scenarios where you’ve encountered bad data, and how did you handle it? Send me an email or share an example in the comments!

Share this with someone who is working on an analytics project! As a token of my appreciation, you will receive a free questionnaire to help you refine your next project!
