Data cleansing and manipulation is where many analysts spend much of their time. Missing data is prevalent in healthcare and can arise for a variety of reasons. For example, technical issues with medical equipment can leave gaps in heart rate recordings, and survey respondents may skip sensitive healthcare questions. If we want to carry out advanced analytics on healthcare data, we need a sound way to handle missing values.

Although there are various imputation methods for filling in missing data, available in tools such as R, Python or SAS, it is important to understand why the data is missing in the first place before applying them. In some cases, healthcare data is missing because of interoperability issues between systems, such as results being transferred from the lab to the hospital's electronic health record, or because of differences between fields that are pre-populated automatically and fields that a nurse enters manually. These cases might be solved by investigating how the clinical information is transmitted, but sometimes there is no getting around missing data in your clinical dataset.

A deep dive into the root cause of the missing data is key. Is it due to a lack of incentives for those entering data manually? Is your system having trouble uploading correctly formatted clinical data from external sites (e.g. genetic labs)? Missing data can be categorized into different types, and the type can determine which steps you take to deal with it.


Missing Completely at Random (MCAR)

Imagine you have your trusty FitBit that you use religiously every day to reach your 10,000-step goal. You enjoy collecting the data from your FitBit to create a graph that tracks your progress over time. One day you examine your graph and notice gaps in the data. It seems your FitBit stopped working at random and didn’t collect data on those days. This would be an example of missing data that is MCAR. MCAR is defined as data whose absence is independent of both the observed and the unobserved values.

Missing at Random (MAR)

Imagine that you need to charge your FitBit every night so it works during the day, but the cord is so far from your bed that you often forget. On those days your graph shows that your FitBit doesn’t start collecting data until 11 AM, after you have charged it during the morning. This would be an example of missing data that is MAR. MAR occurs when the reason a data point is missing can be explained by the observed values alone and, once those are taken into account, is not related to the unobserved values.

Missing Not at Random (MNAR)

Finally, imagine that you made a bet with your friend that you would only walk with them, but you still want to get your steps in. You go for walks behind their back and leave your FitBit at home so they won’t find out. This would be an example of missing data that is MNAR. MNAR occurs when the reason the data is missing is related to the unobserved values themselves: here, the steps go unrecorded precisely because they come from the walks you want to hide.
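To make the three categories more concrete, here is a minimal simulation sketch in plain Python. Everything in it is an assumption made for illustration: the `charged` flag, the step distribution, and the missingness probabilities are invented, and the MNAR rule (low-step days are more likely to go unrecorded) is just one possible way for missingness to depend on the hidden value.

```python
import random

def simulate_missingness(n_days=1000, seed=0):
    """Mask daily step counts under hypothetical MCAR, MAR and MNAR rules."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_days):
        # Observed variable: was the tracker charged overnight? (assumed 70% of days)
        charged = rng.random() < 0.7
        # True step count for the day, which may end up hidden
        steps = max(0, int(rng.gauss(9000, 2500)))
        rows.append({"charged": charged, "steps": steps})

    # MCAR: every day has the same 20% chance of being lost
    mcar = [None if rng.random() < 0.2 else r["steps"] for r in rows]

    # MAR: the chance of loss depends only on the observed `charged` flag
    mar = [None if rng.random() < (0.05 if r["charged"] else 0.6) else r["steps"]
           for r in rows]

    # MNAR: the chance of loss depends on the step count itself
    mnar = [None if rng.random() < (0.6 if r["steps"] < 8000 else 0.05) else r["steps"]
            for r in rows]
    return rows, mcar, mar, mnar

def mean_observed(values):
    """Average of the values that were actually recorded."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)

rows, mcar, mar, mnar = simulate_missingness()
true_mean = sum(r["steps"] for r in rows) / len(rows)
print("true mean:", round(true_mean))
print("MCAR mean:", round(mean_observed(mcar)))
print("MAR mean: ", round(mean_observed(mar)))
print("MNAR mean:", round(mean_observed(mnar)))
```

Under these assumptions, the average of the recorded MCAR values should stay close to the true average, while the MNAR average drifts upward because low-step days are preferentially hidden. In this toy setup `charged` is independent of the step count, so the MAR average is also roughly unbiased; in real data the observed variable is often related to the outcome, which is why MAR data usually needs to be analysed conditional on it.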

Getting a sense of whether your missing data is MCAR, MAR or MNAR will help guide how you deal with it. Options include deleting the affected rows or columns from your dataset (a common rule of thumb is to consider this when a column or row has more than 30% missing data) or using imputation methods in R such as the “mice” package (which generally works better when your data is MCAR or MAR).
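As a rough sketch of the deletion rule of thumb, the snippet below measures the share of missing values per column and drops columns above a 30% threshold. It uses plain Python rather than R, and the `patients` records, the column names, and the helper functions are hypothetical examples, not part of any library.

```python
def missing_fraction(records, column):
    """Share of records where `column` is absent or None."""
    missing = sum(1 for r in records if r.get(column) is None)
    return missing / len(records)

def drop_sparse_columns(records, threshold=0.30):
    """Keep only columns whose missing share does not exceed `threshold`."""
    columns = sorted({key for r in records for key in r})
    keep = [c for c in columns if missing_fraction(records, c) <= threshold]
    return [{c: r.get(c) for c in keep} for r in records]

# Hypothetical toy dataset: None marks a missing value
patients = [
    {"id": 1, "heart_rate": 72, "smoker": None},
    {"id": 2, "heart_rate": None, "smoker": None},
    {"id": 3, "heart_rate": 80, "smoker": "no"},
    {"id": 4, "heart_rate": 65, "smoker": None},
]

for column in ("heart_rate", "smoker"):
    print(column, missing_fraction(patients, column))

cleaned = drop_sparse_columns(patients)
print(cleaned[0])
```

Here `heart_rate` is missing in one of four records and is kept, while `smoker` is missing in three of four and is dropped. The threshold is a judgment call rather than a fixed rule, and dropping a column is only reasonable once you have considered why its values are missing.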

I hope this overview has given you some things to consider when you encounter missing healthcare data and need to decide how to deal with it effectively. Comment down below: how have you dealt with missing data in the past?

