POWER READ

Working Towards Cleaner Data

Gain Actionable Insights Into:

Why you shouldn’t dive straight into your data to solve issues
How to set up processes that make sure your data is more reliable
Case studies that bring these concepts to life

Save the Data for Last

Picture this: A data analyst goes to the COO with a recommendation to shut down a particular logistics trade lane because it isn’t profitable. Only, they later discover that the data wasn’t reliable and they’ve actually shut down a profitable lane. This would be disastrous – not only for the business, but also for the analyst who proposed the idea.

To prevent such missteps, it’s important to make sure that the data you’re supplying your Advanced Analytics or Data Science teams with is as reliable as possible. How? By adopting a problem-solving approach to understanding your data sources and how the data is generated. When the majority of the effort goes into setting up processes to ensure that the data you capture is reliable, analysing it becomes the easiest part.

Common Misconceptions

If you’re an analyst, you expect your data to be as clean as possible. The goal is to be able to use this data to identify patterns that will drive business outcomes. But what happens when you find patterns of issues in data collection that render it unreliable? You’ll likely need to invest the majority of your efforts into cleaning up the data, which isn’t efficient.

When approaching data, don’t assume that the data creation process – for instance, someone manually logging the information – followed the guidelines or SOPs every single time. More often than not, people and even systems don’t function as they should. Acknowledging this will help you predict some of the issues that may arise in the data creation process and solve for them preemptively.

Secondly, don’t just look for anomalies in data. In most cases, anomalies in data occur when things are going wrong. Yet sometimes, not having anomalies in data should also be a cause for concern. If you’ve launched a marketing activity, for instance, there should be a spike in certain metrics. If you assume that no anomalies means smooth sailing, you’ll fail to spot a critical issue that needs to be addressed. An expected spike that never shows up can point to issues with your data capturing processes.
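To make the "missing spike" idea concrete, here is a minimal sketch. The function name, the 20% lift threshold, and the sample numbers are all assumptions made up for illustration; the right threshold depends on the activity you launched.

```python
from statistics import mean


def spike_missing(before, after, min_lift=0.2):
    """Return True when the post-launch average is NOT at least
    `min_lift` (e.g. 0.2 = 20%) above the pre-launch average."""
    baseline = mean(before)
    return mean(after) < baseline * (1 + min_lift)


# Made-up daily sign-ups before and after a campaign launch.
daily_before = [100, 104, 98, 102]
daily_after_flat = [101, 99, 103]
daily_after_spike = [150, 160, 140]

print(spike_missing(daily_before, daily_after_flat))   # flat series: worth investigating
print(spike_missing(daily_before, daily_after_spike))  # visible lift: as expected
```

A flat series after a launch is itself a signal worth chasing, whether the cause turns out to be the campaign or the way the metric is captured.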

Begin With a Hypothesis

My advice is to never dive into the data first, or you might find yourself facing analysis paralysis. Start by coming up with a bunch of hypotheses before you look at your data. This saves a great deal of time, because if you start looking at data you can get mired very quickly without achieving any outcomes. Begin with an end goal in mind, and use the data you have to validate your hypothesis. Not only is this quicker, you’ll also learn how to think consultatively – breaking down a problem into smaller parts and diving deep into each of them.

The question of how reliable a data set is arises when you’ve set a hypothesis, but the data isn’t reflecting the results you anticipate. Data isn’t inherently unreliable – it is only considered to be unreliable when it is unable to corroborate your hypothesis. Most of the time, data can be used to point to a problem – whether you observe erratic patterns or not.

For example, at a logistics company, a shipment from Singapore to Indonesia involves a series of statuses being updated in chronological order. Broadly, the shipment would be picked up, put on a flight, cleared through customs, and delivered. Now let’s say you see a customs clearance status come in before the flight the shipment was on has even taken off. According to your understanding of the operations logic, or process hypothesis, the data is unreliable.
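A process hypothesis like this one can be written down as a simple ordering check. The sketch below is illustrative only: the status names, the data layout, and the timestamps are assumptions, not a description of any real system.

```python
from datetime import datetime

# Hypothetical status names, in the order the process is expected to follow.
EXPECTED_ORDER = ["picked_up", "flight_departed", "customs_cleared", "delivered"]


def out_of_order_statuses(events):
    """Return (earlier, later) status pairs whose timestamps contradict
    the expected order. `events` maps a status name to the time it was
    recorded; statuses missing from `events` are skipped."""
    seen = [(s, events[s]) for s in EXPECTED_ORDER if s in events]
    violations = []
    for (earlier, t_earlier), (later, t_later) in zip(seen, seen[1:]):
        if t_later < t_earlier:
            violations.append((earlier, later))
    return violations


# Made-up shipment: customs clearance is logged before the flight departs.
shipment = {
    "picked_up": datetime(2024, 1, 5, 9, 0),
    "flight_departed": datetime(2024, 1, 5, 18, 0),
    "customs_cleared": datetime(2024, 1, 5, 15, 30),
    "delivered": datetime(2024, 1, 6, 11, 0),
}
print(out_of_order_statuses(shipment))
# [('flight_departed', 'customs_cleared')]
```

The check only tells you that the sequence contradicts the hypothesis. Whether the cause is a system updating at the wrong time or a person logging late still has to be worked out.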

But if you dig deeper, you’ll realise that there are several junctures at which errors may have happened. The system might have updated the information at the wrong time or maybe it was just human error. A majority of the world’s information is generated by people. Increasingly, IoT devices are also generating massive amounts of data, which tends to be more reliable. However, when working with people, you’ll need systems and processes in place to make sure data is being collected effectively. Even so, expecting protocols to be followed all the time is unrealistic.

With this lens on data – that it is likely going to be unreliable – you can instead focus on the hypothesis you’re testing for. When you don’t get the results you’re testing for, you can work backwards and pinpoint the issue with the data you’ve received.

A Problem-Solving Approach to Data

There are two elements to making sure your data is as reliable as possible. The first is developing a deep understanding of the various steps and ways in which data is being created. The second is to use that understanding to look into the data you’ve obtained.

Step 1: Diving Deep Into Processes

To understand how data is created, you’ll need to look at the entire process. Who is creating it – a client, internal teammate, or a partner for instance – and what are the ways in which they’re doing so? Brainstorm with your team on possible corner cases that might occur. Come up with internal hypotheses on where lapses in data creation could occur. The more exhaustive your list is, the better. If you invest in this first step, you become more efficient overall.

For example, if I know that the employees at a warehouse are putting all the packages they receive into a bucket all through the day and only scanning them later in the day to update the statuses, I know that the data I receive isn’t in real time. I don’t need to look at data to know that the process is flawed.

If you’re able to isolate and anticipate potential issues, you’re able to easily identify patterns in the actual data because you already know what you’re looking for. So before you even look at data, look at the different processes involved and hypothesise on where lapses could occur. From there, corrective action becomes more manageable.
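Once you suspect that scans are being done in one batch, the pattern to look for is simple: most scans landing in the same hour. The sketch below assumes you have a list of scan timestamps for one day; the function name and sample times are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def busiest_hour_share(scan_times):
    """Return (hour, share) for the clock hour that holds the most scans."""
    if not scan_times:
        return None, 0.0
    per_hour = Counter(t.hour for t in scan_times)
    hour, count = per_hour.most_common(1)[0]
    return hour, count / len(scan_times)


# Made-up scans for one warehouse: four of the five land between 17:00 and 17:59.
scans = [
    datetime(2024, 3, 1, 10, 15),
    datetime(2024, 3, 1, 17, 2),
    datetime(2024, 3, 1, 17, 5),
    datetime(2024, 3, 1, 17, 9),
    datetime(2024, 3, 1, 17, 40),
]
hour, share = busiest_hour_share(scans)
print(hour, share)  # 17 0.8
```

A high share in a single hour does not prove batching on its own, but it is the signature you would expect if packages are collected all day and scanned later.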

For instance, you could identify that 50% of your information is coming in as anticipated. The remaining 50% of information can be broken down into three or four processes, where one of the issues you’ve hypothesised about is causing errors. With this understanding, you’re able to work with the relevant functions to clean up the processes. For the previous warehouse example, I could approach the Head of Operations to review the process in order to get real-time updates of package information.
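One way to get a breakdown like this is to turn each hypothesis into a small check and label every record with the first check it fails. This is a rough sketch under assumptions: the field names (`scan_hour`, `in_order`), the check names, and the records are all hypothetical.

```python
from collections import Counter

# Hypothetical checks, one per hypothesised lapse. Order matters: the first
# check that matches a record decides its label.
CHECKS = [
    ("missing_scan", lambda r: r["scan_hour"] is None),
    ("late_night_update", lambda r: r["scan_hour"] >= 22 or r["scan_hour"] < 6),
    ("status_out_of_order", lambda r: not r["in_order"]),
]


def classify(record):
    """Return the label of the first matching check, else 'as_anticipated'."""
    for label, check in CHECKS:
        if check(record):
            return label
    return "as_anticipated"


def breakdown(records):
    """Return the share of records under each label."""
    counts = Counter(classify(r) for r in records)
    return {label: n / len(records) for label, n in counts.items()}


# Made-up records: half look fine, the rest fall under one hypothesis each.
records = [
    {"scan_hour": 10, "in_order": True},
    {"scan_hour": 11, "in_order": True},
    {"scan_hour": 15, "in_order": True},
    {"scan_hour": None, "in_order": True},
    {"scan_hour": 23, "in_order": True},
    {"scan_hour": 14, "in_order": False},
]
print(breakdown(records))
```

Each label maps to a process owner you can talk to, which is what makes the corrective action manageable.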

This approach is problem-solving driven, without needing to go into the data to solve issues. It’ll save you significant amounts of time without having to pore over large volumes of data. I go through this exercise every few weeks with my team to make sure we’re anticipating potential issues before they even happen.

Step 2: Looking at the Data

In this step, you’re usually testing the hypothesis you began with. If your hypothesis is that the same number of packages sent from Singapore reaches Indonesia, you’re essentially matching the two metrics. If they don’t match, either something was lost on the way or your data is faulty. When you set metrics for each of your hypotheses on where data could go wrong, you analyse the data against those metrics and eliminate inefficiencies.
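Matching the two metrics can be as simple as comparing the tracking IDs recorded at each end. The sketch below assumes each package carries a unique ID; the function name and the IDs are made up for illustration.

```python
def reconcile(sent_ids, received_ids):
    """Compare IDs recorded at origin with IDs recorded at destination."""
    sent, received = set(sent_ids), set(received_ids)
    return {
        "missing_at_destination": sorted(sent - received),
        "unexpected_at_destination": sorted(received - sent),
    }


# Made-up tracking IDs for one day's shipments.
sent_from_singapore = ["SG001", "SG002", "SG003"]
received_in_indonesia = ["SG001", "SG003", "SG009"]

print(reconcile(sent_from_singapore, received_in_indonesia))
# {'missing_at_destination': ['SG002'], 'unexpected_at_destination': ['SG009']}
```

Comparing IDs rather than just totals matters here: the two counts above are equal, yet one package is missing and another appears from nowhere.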

Let’s look at a case study on how both of these steps can be applied in tandem.

A Case Study in Logistics

At Janio, we work extensively with logistics partners, so the information we receive from them is crucial. When we first launched in the Philippines, we looked closely into the partner onboarding process to anticipate various stages at which things might go awry.

For instance, it isn’t uncommon to find people working significantly extended shifts, or systems being updated late at night. These were some of the hypotheses we set up when we entered the new country.

When we looked into the data, as anticipated, we saw updates for outward delivery in the middle of the night. Imagine how you’d feel as a customer if you received a text at 1.00am that your package is out for delivery! Naturally, the number of customer service tickets we were receiving was also on the rise. Since this was one of our hypotheses, we were able to narrow down on the issue quickly.

To narrow it down, you need to understand why the issue is happening in the first place. Is it a fault in internal systems or that of the network partner? Are all parties involved following SOPs? Or are packages really being delivered at 1.00am? With these questions in mind, we looked at the data to spot where exactly the discrepancies were taking place.

The issue turned out to be that our partners were carrying out the actual tasks on time, but only updated the systems later at night when they got home, and all at one go. Given that certain alerts are sent to consignees based on the status, one issue gave rise to other issues. We were then able to take this data and use it to have an open conversation with our network partners and ask them to address the problem on the ground, which would help prevent such inefficiencies in the long term.
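A check along these lines can be sketched as the share of status updates each partner logs at night. Everything specific here is an assumption for illustration: the night window of 22:00 to 05:59, the partner names, and the timestamps. It is not a description of the actual analysis.

```python
from collections import defaultdict
from datetime import datetime

NIGHT_START, NIGHT_END = 22, 6  # assumed night window: 22:00 to 05:59


def is_night(ts):
    """True when the timestamp falls inside the assumed night window."""
    return ts.hour >= NIGHT_START or ts.hour < NIGHT_END


def night_update_share(updates):
    """`updates` is an iterable of (partner, timestamp) pairs.
    Returns {partner: share of that partner's updates logged at night}."""
    totals = defaultdict(int)
    night = defaultdict(int)
    for partner, ts in updates:
        totals[partner] += 1
        if is_night(ts):
            night[partner] += 1
    return {p: night[p] / totals[p] for p in totals}


# Made-up "out for delivery" updates from two hypothetical partners.
updates = [
    ("partner_a", datetime(2024, 2, 10, 1, 0)),
    ("partner_a", datetime(2024, 2, 10, 1, 2)),
    ("partner_a", datetime(2024, 2, 10, 14, 0)),
    ("partner_b", datetime(2024, 2, 10, 10, 0)),
    ("partner_b", datetime(2024, 2, 10, 15, 0)),
]
shares = night_update_share(updates)
print(shares)
```

Splitting the share by partner is what turns a vague "updates arrive at odd hours" into a specific conversation with the partner concerned.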

When you’re analysing data, make sure you’re not blindly looking at it. Rather, approach data with an idea of what might be wrong with it, and add the relevant buffers before you build any models.

A Case Study in Customer Service

To be able to identify if a data source is reliable, you’ll need to begin with a purpose in mind. Understanding how the problem you’re trying to solve relates to the larger business context allows you to work collaboratively with various functions to course correct.

Suppose a Customer Service team is unable to approach Operations or Product teams with 2-3 actions that will reduce the number of tickets significantly. This suggests that the team is resolving issues without recording them accurately. Before you build a model to understand and categorise Customer Service tickets, you need to make sure that information is being captured in the first place.

The business metric in this case is: the CS team should be able to narrow down on 1-2 issues that – when solved – will reduce the number of tickets by 20%. With a clearly defined business metric, you can revisit the reliability of the data sources. The question now is – are the tickets being tagged accurately?

Every ticket that’s logged has two components: the reason why it was created, and the root cause of the issue. Often, the two are different from each other. For example, a ticket might be created because a customer hasn’t received their package yet or they haven’t received updates on a shipment. The root cause, however, could be that the systems aren’t working, or the package is being held in customs, or even that the network partner has lost the shipment. But when the CS team isn’t capturing any of this information as a root cause of the issue, you cannot take definitive action in time.
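Before trusting any ranking of root causes, it helps to measure how many tickets have one recorded at all. The sketch below assumes each ticket is a dictionary with an optional `root_cause` field; the field name, the cause labels, and the sample tickets are hypothetical.

```python
from collections import Counter


def root_cause_summary(tickets, top_n=2):
    """Return the share of untagged tickets and the top root causes,
    each expressed as a share of all tickets."""
    total = len(tickets)
    tagged = [t["root_cause"] for t in tickets if t.get("root_cause")]
    top = Counter(tagged).most_common(top_n)
    return {
        "untagged_share": (total - len(tagged)) / total if total else 0.0,
        "top_causes": [(cause, count / total) for cause, count in top],
    }


# Made-up tickets: 4 of 10 have no root cause recorded.
tickets = (
    [{"reason": "package not received", "root_cause": "customs_hold"}] * 3
    + [{"reason": "no shipment updates", "root_cause": "system_outage"}] * 2
    + [{"reason": "package not received", "root_cause": "partner_lost"}]
    + [{"reason": "package not received", "root_cause": None}] * 4
)
print(root_cause_summary(tickets))
# {'untagged_share': 0.4, 'top_causes': [('customs_hold', 0.3), ('system_outage', 0.2)]}
```

With 40% of tickets untagged, the top causes here are a lower bound at best, which is the argument for fixing the capture step before building any model on top of it.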

What you can then do is work with the CS department to set up new metrics to capture. This is where human-friendliness is crucial. If you give someone 20 different categories of issues to choose from versus three, you are not solving the problem, but creating more issues. Keep the data entry process as foolproof and simple as possible. If you must have 20 categories to choose from, work on recommendation solutions that suggest a few relevant ones to whoever is entering the data.
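A recommendation solution does not have to be sophisticated to be useful. The sketch below is a deliberately naive keyword match, offered only as an illustration: the categories, the keywords, and the function name are hypothetical, and it splits text on whitespace without handling punctuation.

```python
# Hypothetical categories and keywords; a real list would come from the CS team.
CATEGORY_KEYWORDS = {
    "customs_hold": {"customs", "duty", "clearance"},
    "system_outage": {"tracking", "update", "status", "website"},
    "lost_by_partner": {"lost", "missing", "never"},
    "address_issue": {"address", "wrong", "moved"},
}


def suggest_categories(ticket_text, limit=3):
    """Return up to `limit` categories whose keywords appear in the text,
    ranked by number of matching keywords (ties broken alphabetically)."""
    words = set(ticket_text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return [c for c in ranked if scores[c] > 0][:limit]


text = "No tracking update since my parcel reached customs clearance"
print(suggest_categories(text))
# ['customs_hold', 'system_outage']
```

The person entering the ticket still makes the final call. The point is to shrink 20 options down to the two or three most likely ones.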

All of this contributes to cleaner data, which helps identify key problems that the business should prioritise.

A Checklist for Better Data

Before diving into your data, run through this checklist to gain a more robust context on how it was created.

What is your business problem?
What are your data sources?
How is the data being collected?
What are the SOPs in place for data entry?

Once you’ve done this, zero in on the specific areas in which you can optimise processes. Work collaboratively with the relevant functions to drive the changes you require for cleaner data. When doing so, it helps to highlight why the change will add value to their teams as well as the entire organisation. If this approach isn’t driving the behavioural change you’d hoped for, making the metrics publicly visible during larger leadership meetings might motivate teams to make the change. We do this every Monday and review progress and bottlenecks openly.

Key Insights

1. Frame Your Business Problem

Think about the larger business problem you’re solving, and the various moving parts involved in doing so. This contextual understanding will help you approach your data with a consultative eye, allowing you to identify gaps and work with the relevant functions to address them.

2. Look at How You’re Getting Your Data

When you have a clear view of the entire process of data creation – who is entering it, what the SOPs are, how confident you are that SOPs are being followed, and so on – you can figure out where lapses that affect your data might occur.

3. Use Your Hypothesis to Set Up Metrics

Take a problem-solving approach to data instead of a technical one. Identify where lapses might have happened, form a hypothesis, and look at your data metrics to validate it. This will save you time and allow you to solve the problem more effectively.