Data cleaning usually takes up the largest portion of an analyst’s time. Some articles report that 80% of an analyst’s time is spent on data cleaning, while only 20% is spent on the “fun” analytical work, such as building machine learning models or interactive dashboards and reports.

Totally accurate representation 😅

There are multiple reasons for this. For instance, imagine you want to collect survey data on a question like:

What were the barriers you encountered when trying to obtain care in this hospital?

If we are collecting qualitative data, this can result in a host of data entry errors: large volumes of missing entries, people not entering the information correctly, people misinterpreting the survey questions, and so on. Transforming large swaths of data into actionable information is not easy, and there are many points where the process can break down.

In this YouTube video, I went through some typical data cleansing issues you might run into:


These include first and last names being stored separately and needing to be concatenated into one name field, and how to deal with trailing whitespace. When it comes to reducing these data errors, in my opinion, people focus too much on the end result, the data product they want to build (such as the machine learning algorithm or dashboard), and don’t spend enough time designing the best data entry method. After all, garbage in, garbage out!
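To make the name and whitespace issues concrete, here is a minimal Python sketch. The helper `build_full_name` and the sample field names `first_name` and `last_name` are made up for illustration, and this is just one possible way to handle it, not necessarily the approach shown in the video.

```python
def build_full_name(first, last):
    """Join first and last name into one field, trimming stray whitespace."""
    # split() + join() removes leading/trailing spaces and collapses
    # repeated spaces inside each part; None is treated as an empty string.
    parts = [" ".join((first or "").split()), " ".join((last or "").split())]
    # Skip empty parts so a missing name does not leave a dangling space.
    return " ".join(p for p in parts if p)


# Hypothetical records with the kind of whitespace problems described above.
records = [
    {"first_name": "Ada ", "last_name": " Lovelace"},
    {"first_name": "  Grace", "last_name": "Hopper  "},
    {"first_name": "Alan", "last_name": None},
]

for record in records:
    record["full_name"] = build_full_name(record["first_name"], record["last_name"])
    print(repr(record["full_name"]))
```

Treating a missing name as an empty string is a choice made for this sketch. In a real dataset you may prefer to flag those rows for review rather than fill them in silently.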

One way to reduce these data entry errors is to look at the system design of the software you are using. Is there a way to limit text fields and free-text boxes? These can be rife with data entry errors. How will the dates you enter be used in your analysis? Are you using the correct date format?
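As a rough illustration of the date question, the Python sketch below checks entries against a single expected format at the point of entry. The format `YYYY-MM-DD` and the function name `parse_entry_date` are assumptions for the example. Your system may use a different convention.

```python
from datetime import datetime

# Assumed convention for this example: ISO-style dates such as 2023-04-15.
EXPECTED_FORMAT = "%Y-%m-%d"


def parse_entry_date(raw):
    """Return a date if the entry matches the expected format, else None."""
    try:
        return datetime.strptime(raw.strip(), EXPECTED_FORMAT).date()
    except (ValueError, AttributeError):
        # ValueError: wrong format or an impossible date (e.g. February 30).
        # AttributeError: the entry was not a string (e.g. None).
        return None


# Hypothetical entries: one valid, one in another format, one impossible, one missing.
for entry in ["2023-04-15", "04/15/2023", "2023-02-30", None]:
    print(entry, "->", parse_entry_date(entry))
```

Rejecting or flagging an entry when it is typed is usually cheaper than guessing later whether 03/04/2023 meant March 4 or April 3.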

If you are building a data product for an end-user, it’s really important to gather the correct requirements, both to reduce scope creep and to ensure the product will be of value. Instead of just delivering a finished data product, make sure to involve stakeholders at the beginning of the design process.

Have you considered data governance, meaning how the data product will be accessed and maintained over time? Will data have to be aggregated or de-identified before analysis can be done?

The above tips are all considerations when it comes to analyzing data, especially healthcare data. Analyzing the way the data comes into the system can go a long way toward making the analysis portion more seamless.

In a previous job, I created a dashboard that highlighted data quality issues. This was a great resource because it could be accessed by the people entering the data, and it gave them a reference for understanding where the data entry issues were occurring. We were able to work in tandem to reduce the data quality issues, so I could continue to create high-value dashboards and reports.
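A data quality view like this can start from something very simple, such as the share of missing entries per field. The Python sketch below is a hypothetical illustration, not the dashboard I built. The function names and the sample survey fields are invented for the example.

```python
def is_missing(value):
    """Treat None and blank or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_rate_by_field(rows, fields):
    """Return the fraction of rows in which each field is missing."""
    total = len(rows)
    rates = {}
    for field in fields:
        missing = sum(1 for row in rows if is_missing(row.get(field)))
        # Avoid dividing by zero when there are no rows at all.
        rates[field] = missing / total if total else 0.0
    return rates


# Hypothetical survey rows with gaps of the kind described earlier.
survey_rows = [
    {"visit_date": "2023-04-15", "barrier": "Long wait times"},
    {"visit_date": "", "barrier": "   "},
    {"visit_date": "2023-04-17", "barrier": None},
    {"visit_date": "2023-04-18", "barrier": "Parking"},
]

for field, rate in missing_rate_by_field(survey_rows, ["visit_date", "barrier"]).items():
    print(f"{field}: {rate:.0%} missing")
```

Sharing numbers like these with the people entering the data gives both sides a common reference point for where to focus.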

Comment down below: do you spend 80% of your time cleaning data, or is it less? I hope not more!

