

Working Towards Cleaner Data

Nov 24, 2020 | 10m

Gain Actionable Insights Into:

  • Why you shouldn’t dive straight into your data to solve issues
  • How to set up processes that make sure your data is more reliable
  • Case studies that bring these concepts to life

Save the Data for Last

Picture this: A data analyst goes to the COO with a recommendation to shut down a particular logistics trade lane because it isn’t profitable. Only later do they discover that the data wasn’t reliable and that they’ve actually shut down a profitable lane. This would be disastrous – not only for the business, but also for the analyst who proposed the idea. 

To prevent such missteps, it’s important to make sure that the data you’re supplying to your Advanced Analytics or Data Science teams is as reliable as possible. How? By adopting a problem-solving approach to understanding your data sources and how the data is generated. When the majority of the effort goes into setting up processes that ensure the data you capture is reliable, analysing it becomes the easiest part. 

Common Misconceptions

If you’re an analyst, you expect your data to be as clean as possible. The goal is to be able to use this data to identify patterns that will drive business outcomes. But what happens when you find patterns of issues in data collection that render it unreliable? You’ll likely need to invest the majority of your efforts into cleaning up the data, which isn’t efficient. 

When approaching data, don’t assume that the data creation process – for instance, someone manually logging the information – followed the guidelines or SOPs every single time. More often than not, people and even systems don’t function as they should. Acknowledging this will help you predict some of the issues that may arise in the data creation process and solve for them preemptively. 

Secondly, don’t just look for anomalies in data. In most cases, anomalies occur when things are going wrong. Yet sometimes, the absence of anomalies should also be a cause for concern. If you’ve launched a marketing activity, for instance, there should be a spike in certain metrics. If you come in assuming that no anomalies equate to smooth sailing, you’ll fail to spot a critical issue: a missing spike points to problems in your data capture process. 
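As a sketch of this idea – checking that an expected anomaly actually showed up – one might compare post-launch values against a baseline. The function name, threshold, and numbers below are illustrative assumptions, not from the episode:

```python
from statistics import mean, stdev

def expected_spike_missing(baseline, post_launch, threshold=2.0):
    """Flag when a metric that *should* have spiked did not.

    baseline: daily values recorded before the campaign
    post_launch: daily values recorded after the campaign
    threshold: how many standard deviations above the baseline
               mean counts as a genuine spike (assumed cutoff)
    """
    mu, sigma = mean(baseline), stdev(baseline)
    spike_level = mu + threshold * sigma
    # If no post-launch day clears the spike level, the absence of an
    # anomaly is itself the anomaly -- check the data capture pipeline.
    return all(v < spike_level for v in post_launch)

# Signups held flat after a campaign launch: worth investigating.
baseline = [100, 98, 103, 101, 99, 102, 100]
post_launch = [101, 99, 100, 102]
print(expected_spike_missing(baseline, post_launch))  # True -> investigate
```

The point is not the specific statistics, but that the check encodes an expectation up front instead of passively scanning the data for oddities.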

Begin With a Hypothesis

My advice is to never dive into the data first, or you might find yourself facing analysis paralysis. Start by coming up with a bunch of hypotheses before you look at your data. This saves a great deal of time, because if you start looking at data you can get mired very quickly without achieving any outcomes. Begin with an end goal in mind, and use the data you have to validate your hypothesis. Not only is this quicker, you’ll also learn how to think consultatively – breaking down a problem into smaller parts and diving deep into each of them. 

The question of how reliable a data set is arises when you’ve set a hypothesis, but the data isn’t reflecting the results you anticipate. Data isn’t inherently unreliable – it is only considered unreliable when it fails to corroborate your hypothesis. Most of the time, data can be used to point to a problem – whether you observe erratic patterns or not. 

For example, at a logistics company, a shipment from Singapore to Indonesia involves a series of chronological status updates. Broadly, the shipment is picked up, put on a flight, cleared through customs, and delivered. Now let’s say you see a customs clearance status come in before the flight the shipment was on has even taken off. According to your understanding of the operations logic – your process hypothesis – the data is unreliable. 
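A check like the one described above can be sketched as follows. The status names, their expected order, and the event structure here are illustrative assumptions about how such milestones might be recorded:

```python
from datetime import datetime

# Assumed operational order of shipment milestones.
EXPECTED_ORDER = ["picked_up", "flight_departed", "customs_cleared", "delivered"]

def out_of_order_statuses(events):
    """Return (earlier_status, later_status) pairs whose timestamps
    contradict the expected chronological order.

    events: dict mapping status name -> datetime of the status update
    """
    violations = []
    for i, earlier in enumerate(EXPECTED_ORDER):
        for later in EXPECTED_ORDER[i + 1:]:
            if earlier in events and later in events \
                    and events[later] < events[earlier]:
                violations.append((earlier, later))
    return violations

# Customs clearance logged before the flight departed -- flag it.
events = {
    "picked_up": datetime(2020, 11, 20, 9, 0),
    "flight_departed": datetime(2020, 11, 21, 8, 0),
    "customs_cleared": datetime(2020, 11, 20, 22, 0),
    "delivered": datetime(2020, 11, 22, 15, 0),
}
print(out_of_order_statuses(events))  # [('flight_departed', 'customs_cleared')]
```

A check like this tests the process hypothesis directly: it flags where the recorded data contradicts the expected operational sequence, rather than asking you to eyeball the raw records.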

But if you dig deeper, you’ll realise that there are several junctures at which errors may have crept in. The system might have updated the information at the wrong time, or it may simply have been human error. A majority of the world’s information is generated by people. Increasingly, IoT devices are also generating massive amounts of data, and this tends to be far more reliable. When working with people, however, you’ll need systems and processes in place to make sure data is being collected effectively. Even so, expecting protocols to be followed all the time is unrealistic. 

With this lens on data – that it is likely going to be unreliable – you can instead focus on the hypothesis you’re testing for. When you don’t get the results you’re testing for, you can work backwards and pinpoint the issue with the data you’ve received. 



Karthik Pitani

Head of Strategy and Data | Former Head of Data

Janio Asia



Data Analytics | Big Data, Demystified