On the Health Analytic Insights podcast, I often talk about the data challenges that health informatics professionals face on a day-to-day basis when cleaning and wrangling healthcare data. In this blog post, I am going to summarize the challenges I have dived into across individual podcast episodes and blog posts into one mega-list of data challenges that make our job interesting!

Veracity, variety and volume: these are terms you might hear attributed to big data challenges in different industries, and healthcare is no stranger to these concepts.

1. Veracity

When it comes to healthcare data, we want to ensure the data we use originates from a trustworthy source and is entered with a high degree of accuracy. Healthcare data can come from multiple sources, such as a clinician entering the details of a patient’s medical encounter into the EHR (electronic health record), bloodwork labs and surveys. With so many different types of healthcare data, issues with accuracy can arise. For instance, a clinician might find some of the questions in the medical encounter subjective and might prefer to select ‘not applicable’ rather than choose a specific option.

In this research paper, researchers highlight some of the data challenges they identified in EHR systems, such as insufficient data being entered in the system to correctly categorize the healthcare data according to the LOINC (Logical Observation Identifiers Names and Codes) standard. LOINC provides a database of identifiers for medical observations, which is key to ensuring the data is classified correctly and can be understood by a wide audience.

This leads us to another issue with healthcare data: missing data. One example is missing data from surveys. Questions about health can feel intrusive to the respondent. A question such as “How many glasses of alcohol do you have a week?” can cause people not to answer, leading to large amounts of missing data.

There are generally three types of missing data: Missing Completely At Random (MCAR), Missing At Random (MAR) and Missing Not At Random (MNAR). Understanding why your data is missing (e.g., people tend not to fill in sensitive survey questions, or the way a question is designed in the EHR is subjective, resulting in clinicians not entering data into the encounter) is key for carrying out complex analysis.

For instance, many machine learning algorithms, such as artificial neural networks, cannot work with missing values directly, so the gaps have to be handled before the algorithm can be used for predictive analytics. To fill in the data, you might use a package in R called MICE (Multivariate Imputation by Chained Equations), which works best with data that is Missing Completely At Random or Missing At Random.

To improve the accuracy of a dataset with missing data, you could use a statistical package like MICE in R, or you could look at how the questions or survey are designed to improve the chances of data being entered accurately.

For the health informatics professional, this could look like having conversations with clinicians to understand what limits them when entering data. This could stem from a variety of reasons, from a lack of time to a lack of understanding of how entering the data will benefit them.


This is where the power of data visualization can come in handy! When I worked at a non-profit, primarily building dashboards in Power BI, what really made a difference to the audience was showing the ‘missingness’ of the data where people expected to see metrics displayed. This opened up conversations in which I could let them know that the data was not there because it was not being entered correctly. Seeing the gaps in data entry was a powerful motivator for them to start entering the data correctly or to train their staff to do so. It can be difficult to understand the importance of entering data correctly if you can’t see the downstream effects, and a visualization in the form of a graph or chart can be helpful.
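A minimal Python sketch of that kind of ‘missingness’ summary might look like the following. The field names and records are hypothetical (they are not taken from any real EHR), and the text bars simply stand in for what a dashboard would chart.

```python
# Hypothetical encounter records; None marks a field that was never entered.
encounters = [
    {"patient_id": 1, "smoking_status": "never", "alcohol_per_week": 2},
    {"patient_id": 2, "smoking_status": None, "alcohol_per_week": None},
    {"patient_id": 3, "smoking_status": "former", "alcohol_per_week": None},
    {"patient_id": 4, "smoking_status": None, "alcohol_per_week": 0},
]


def missing_rates(records):
    """Return the share of records in which each field is missing.

    A value counts as missing when it is None or an empty string.
    A legitimate 0 (e.g. zero drinks per week) is NOT treated as missing.
    """
    fields = sorted({key for record in records for key in record})
    total = len(records)
    rates = {}
    for field in fields:
        missing = sum(1 for record in records if record.get(field) in (None, ""))
        rates[field] = missing / total if total else 0.0
    return rates


# Print one rough text bar per field, scaled to 20 characters.
for field, rate in missing_rates(encounters).items():
    bar = "#" * round(rate * 20)
    print(f"{field:<18} {rate:>4.0%} {bar}")
```

Keeping a genuine zero separate from a blank matters here, because “no alcohol” and “did not answer” are very different things to show a clinician.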

As health informatics professionals, we want to create a feedback loop where the clinician is involved in the data capture and receives the benefits in the form of actionable information that can help improve clinical practice and patient outcomes.

Finally, a data governance framework can be helpful for the entire organization, but especially so for the individuals who are entering the data. A data governance framework will have extensive documentation on the standards and policies developed by the organization. For instance, there should be a standard definition of how hospital length of stay is calculated, and there should be logic rules in place to help guide people to enter the data accurately (e.g., the age of the patient should not be less than 0). Data governance is especially important when dealing with healthcare data because your organization most likely reports to a higher government body that will require a certain veracity of the data, and this might be tied to your funding. If you would like to read more about data governance, you can check out the post I have written here: Is Data Governance Important in Health Informatics?
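To make the idea of logic rules concrete, here is a small Python sketch. The record layout, the function names and the length-of-stay definition (calendar days, with a same-day stay counted as 1) are all assumptions for illustration; your organization’s data governance documentation would define the real ones.

```python
from datetime import date
from typing import List


def length_of_stay_days(admitted: date, discharged: date) -> int:
    """One possible definition of length of stay (an assumption, not a standard):
    calendar days between admission and discharge, with a same-day stay counted as 1.
    """
    if discharged < admitted:
        raise ValueError("discharge date is before admission date")
    return max((discharged - admitted).days, 1)


def validate_encounter(record: dict) -> List[str]:
    """Return a list of human-readable problems found in a hypothetical encounter record."""
    errors = []

    age = record.get("age")
    if age is None:
        errors.append("age is missing")
    elif age < 0:
        errors.append("age must not be less than 0")

    admitted = record.get("admitted")
    discharged = record.get("discharged")
    if admitted and discharged and discharged < admitted:
        errors.append("discharge date is before admission date")

    return errors


record = {"age": -1, "admitted": date(2024, 1, 1), "discharged": date(2024, 1, 4)}
print(validate_encounter(record))
print(length_of_stay_days(record["admitted"], record["discharged"]))
```

Returning a list of messages, rather than stopping at the first problem, makes it easier to show the person entering the data everything that needs fixing in one go.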

2. Variety

One of the interesting challenges of healthcare data is the sheer variety that exists, spanning both unstructured and structured data. If we think of all the different ways we can collect healthcare data, from Fitbits to mHealth apps to EHRs, the list is endless!

While some might know about the structured healthcare data that exists, such as ICD-10 codes, standardized medication codes and patient date of birth, unstructured data is quite prevalent within healthcare and can consist of free text such as clinicians’ notes, or dictation notes which might exist in audio format. Making sense of this unstructured data can be difficult because it doesn’t have a standardized format like structured health data does, but it is still valuable information that can help tell a rich story of the patient’s journey through the healthcare organization.

Machine learning techniques such as natural language processing can help sort through large volumes of unstructured health data, and sentiment analysis can help categorize the data into positive, negative or neutral statements.

Sentiment analysis could be used when sorting through patient satisfaction responses from a hospital. Instead of collecting only numerical scores (e.g., “Rate your experience from 1-5”), the survey could give patients the ability to describe their experience in detail in a free-text field. From this text, one could build a word cloud to understand patterns in what the majority of patients are reporting and whether the responses are generally positive, negative or neutral.
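As a rough illustration, the Python sketch below counts word frequencies (the raw input a word cloud is built from) and labels each comment with a deliberately naive word-list approach. The comments, stop words and positive/negative word lists are all made up for this example, and a real project would more likely use a trained sentiment model than a hand-written lexicon.

```python
import re
from collections import Counter

# Hypothetical word lists, chosen only for this example.
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "to", "of", "i", "my", "in", "it"}
POSITIVE = {"kind", "helpful", "clean", "friendly", "quick"}
NEGATIVE = {"long", "rude", "dirty", "confusing", "slow"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def word_frequencies(comments):
    """Count how often each word appears across all comments."""
    counts = Counter()
    for comment in comments:
        counts.update(tokenize(comment))
    return counts


def label_sentiment(comment):
    """Label a comment by comparing counts of positive and negative words."""
    words = tokenize(comment)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Made-up free-text survey responses.
comments = [
    "The nurses were kind and helpful.",
    "The wait was long and the front desk was rude.",
    "I parked in the visitor lot.",
]

for comment in comments:
    print(label_sentiment(comment), "-", comment)
print(word_frequencies(comments).most_common(5))
```

A word-list approach like this ignores negation (“not helpful” would still score as positive), which is one reason it should be treated as a starting point only.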

With structured and unstructured data coming from so many different sources, one of the key considerations is how we can combine these data sources to add to the historical record of the patient. This is where interoperability comes into play.

Interoperability can be defined as the ability of multiple computer systems to exchange information across platforms and use this information to drive improved outcomes. Some of the major barriers to interoperability in healthcare include: the financial costs of integrating healthcare systems; privacy and government regulations that ensure healthcare data is protected from major breaches; a lack of involvement of clinical stakeholders when building these integrated systems, which can lead to reduced rates of adoption; and a lack of incentives for healthcare providers.

There are many organizations dedicated to finding solutions to these interoperability challenges in healthcare, such as the healthcare standards organization Health Level Seven (HL7) International. I, for one, am looking forward to a day when we can combine both unstructured and structured data from many different sources into one platform for improved analytical capabilities and greater insights into our healthcare data!

3. Volume

The last challenge when it comes to healthcare data is the volume of data that is collected daily by healthcare organizations. This raises considerations around protecting large amounts of sensitive health information.

In the past few years, there have been several cyberattacks on hospitals. Recently, a large cyberattack on Newfoundland and Labrador health systems resulted in the cancellation of thousands of medical appointments across the province and in downtime during which some hospitals had to revert to entering data into paper charts.


Access to data in hospitals and healthcare organizations can be highly valuable to attackers because hospitals don’t just hold health data; they can also house financial and geographical data on patients.

Cloud storage consists of “renting” storage from external vendors such as Amazon and Microsoft. One of the benefits of moving your data to the cloud is the ability to keep back-ups of your data rather than relying solely on an on-premises solution, where an issue that affects your main server might result in clinicians having to enter data manually into paper charts for a long period of time, potentially leading to more data entry errors and missed records.

We can’t talk about outsourcing your data without talking about the potential privacy and security issues that might come along with it. When moving large volumes of healthcare data to these external vendors, it’s important to ensure that their data retention and disposal policies are in line with the healthcare policies of your organization or governing body.

Another benefit of moving your data to the cloud is the ability to integrate your data with analytical tools such as Power BI or Tableau. These data visualization tools are great for visualizing large quantities of data and transforming raw data into actionable information.

When it comes to displaying and analyzing this data you want to look at the types of data you are exposing to a public or internal audience:

Aggregated data: With aggregated data, only counts or summations are included in your dataset and identifiers are removed. However, sometimes with healthcare data depending on what is being collected, there could be a very small population of individuals who are diagnosed with a particular rare disease and this could be identifying. Therefore, further limits need to be placed on what data will be revealed depending on the population size.
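One common way to apply such limits is small-cell suppression, where any count below a chosen threshold is hidden. The Python sketch below assumes a threshold of 5 and made-up diagnosis labels; the right threshold depends on the policies of your organization or governing body.

```python
from collections import Counter

# Hypothetical threshold: counts below this are considered potentially identifying.
MIN_CELL_SIZE = 5


def suppressed_counts(diagnoses, min_cell_size=MIN_CELL_SIZE):
    """Count records per diagnosis and mask any count below the threshold."""
    counts = Counter(diagnoses)
    return {
        diagnosis: (count if count >= min_cell_size else f"<{min_cell_size}")
        for diagnosis, count in counts.items()
    }


# Made-up data: one diagnosis label per patient.
diagnoses = ["diabetes"] * 12 + ["asthma"] * 7 + ["rare disease X"] * 2

for diagnosis, count in suppressed_counts(diagnoses).items():
    print(f"{diagnosis}: {count}")
```

If a grand total is published alongside these counts, a single masked cell can be worked out by subtraction, so totals may need to be rounded or withheld as well.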

De-identified data: Although identifiers (e.g., date of birth, name) are removed from de-identified data, one has to be cautious of potential privacy breaches. Bad actors might be able to re-identify individuals by linking your de-identified dataset with another dataset that contains information on the same patients.
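One rough way to gauge this risk is to check how many records are unique on the fields an outsider could plausibly match against another dataset. In the Python sketch below, the choice of fields (postal prefix, birth year, sex) and the records themselves are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical quasi-identifiers: fields that are not names or IDs,
# but could still be matched against an outside dataset.
QUASI_IDENTIFIERS = ("postal_prefix", "birth_year", "sex")


def unique_records(records, keys=QUASI_IDENTIFIERS):
    """Return the records whose combination of quasi-identifiers appears only once."""
    groups = Counter(tuple(record[key] for key in keys) for record in records)
    return [
        record for record in records
        if groups[tuple(record[key] for key in keys)] == 1
    ]


# Made-up de-identified records (no names or dates of birth).
records = [
    {"postal_prefix": "A1B", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postal_prefix": "A1B", "birth_year": 1980, "sex": "F", "diagnosis": "diabetes"},
    {"postal_prefix": "A1C", "birth_year": 1943, "sex": "M", "diagnosis": "rare disease X"},
]

at_risk = unique_records(records)
print(f"{len(at_risk)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

Records that stand alone on these fields are the ones most exposed to being matched with an outside dataset, so they are good candidates for coarser values or suppression.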

Anonymous data: For the highest level of privacy and security, anonymized data is data that has had all identifying information removed and cannot be re-identified. One way to help anonymize your data is to use ranges instead of actual numbers. For instance, you can display an age range instead of the actual age of the patient. Understanding the different types of data you are displaying, and what privacy and security risks you might be exposing, is key to managing large volumes of healthcare data.
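The age-range idea can be sketched in a few lines of Python. The 10-year band width and the “90+” top band are assumptions chosen for this example, not a required standard.

```python
def age_band(age, width=10, top_code=90):
    """Convert an exact age into a range label such as "30-39".

    Ages at or above top_code are grouped into a single open-ended band,
    since very old ages are rare and can be identifying on their own.
    """
    if age < 0:
        raise ValueError("age must not be less than 0")
    if age >= top_code:
        return f"{top_code}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


for age in (0, 34, 67, 95):
    print(age, "->", age_band(age))
```

Wider bands give more privacy but less analytical detail, so the width is a trade-off to agree on with your stakeholders.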

These are some of the challenges that health informatics professionals should be aware of when it comes to dealing with the veracity, variety and volume of healthcare data. What are the issues that I missed? Comment down below.
