I often hear a lot of hype about applying machine learning algorithms as decision support tools for clinicians, but there are a few considerations to work through before using these advanced tools. In this blog post I want to explore some key considerations when dealing with healthcare analytics, specifically machine learning algorithms and healthcare data.

When it comes to analyzing data, specifically healthcare data, it’s important to be aware of how data is input into an algorithm. If you are provided with a dataset, it’s important to do your own analysis of where and how this data was sourced before using a machine learning algorithm for predictive or prescriptive analysis. Data science platforms such as RapidMiner and KNIME let you implement machine learning algorithms without writing a single line of code, unlike R or Python, so it can be tempting to forgo all analysis, simply input your data and wait for the results.

However, we have read about situations where the output of algorithms has been applied to policy and has led to adverse results for marginalized groups because of biased datasets. Therefore, analyzing and understanding the data being input into the algorithm is key to moving towards greater health equity.

How the data was collected


The first step is to understand how the data was collected and the different types of data you are dealing with. If this data was manually entered during a medical encounter (an interaction between the clinician and patient, which can be captured in an EHR), are there data entry errors that you need to fix or ask for further clarification on before you proceed with your analysis?

The same applies to missing data, because when you use machine learning algorithms you will often need to fill in missing values. You don’t want to make assumptions about this data and just enter “N/A” or nulls, or use the MICE package in R or similar libraries in Python, without understanding the reasoning behind these missing values.

Did the patient not want to answer a specific question in the encounter? Did the clinician forget to ask the patient the question? There are many reasons behind missing values in healthcare data, and it is key for the analyst to understand why they are missing and not just plug in a package or library and hope for the best!
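One way to start asking these questions is to check whether missing values cluster in a particular clinic, form or patient group before filling anything in. The snippet below is a minimal sketch in plain Python. The records, field names (`clinic`, `smoker`, `weight_kg`) and the helper `missing_rate_by_group` are made up for illustration and are not taken from any real dataset or library.

```python
from collections import Counter

# Hypothetical encounter records; None marks a missing value.
# Field names and values are invented for illustration only.
encounters = [
    {"clinic": "A", "smoker": "no", "weight_kg": 70.5},
    {"clinic": "A", "smoker": None, "weight_kg": 82.0},
    {"clinic": "B", "smoker": None, "weight_kg": None},
    {"clinic": "B", "smoker": None, "weight_kg": 64.2},
    {"clinic": "B", "smoker": "yes", "weight_kg": 91.3},
]


def missing_rate_by_group(records, field, group_field):
    """Return {group: fraction of records in that group where `field` is missing}."""
    totals = Counter()
    missing = Counter()
    for rec in records:
        group = rec[group_field]
        totals[group] += 1
        if rec.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


rates = missing_rate_by_group(encounters, "smoker", "clinic")
for clinic, rate in sorted(rates.items()):
    print(f"clinic {clinic}: {rate:.0%} of 'smoker' values missing")
```

If one group has a much higher missing rate than the others, that is a prompt to go back to the people who collected the data and ask why, rather than a signal to impute right away.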

Source


In addition to this, if you are dealing with survey data, it’s important to understand how many people were involved in the study. Does the number of people polled make sense? Can you draw out significant results, or will the small sample size introduce bias into your data? Luckily, there are several online calculators that take into account a variety of factors, such as the study group design (e.g. is the study looking at one or two independent groups?) and the alpha level (significance level) for your study, which is usually 0.05.

These factors will tell you the minimum number of participants you need to study to determine if your results are significant. You also need to plan for participants who will drop off during the study. With the data you are analyzing, you want to ensure you have a large enough sample to reduce bias and capture the variability of the population you are studying.
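To make the calculator idea concrete, here is a rough sketch of the kind of arithmetic such a tool might do for two independent groups, using the common normal-approximation formula n = 2 × ((z for alpha + z for power) / effect size)². The effect size of 0.5, the 80% power and the 20% dropout rate are assumptions chosen for illustration, and the function names are hypothetical. A real study should confirm the numbers with a statistician or a dedicated tool.

```python
import math
from statistics import NormalDist


def participants_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants needed per group for a two-sided,
    two-independent-group comparison of means (normal approximation).

    effect_size is the standardized difference between groups (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


def inflate_for_dropout(n, dropout_rate):
    """Recruit enough participants that about n remain after dropout."""
    return math.ceil(n / (1 - dropout_rate))


n = participants_per_group(effect_size=0.5)          # assumed medium effect
n_recruit = inflate_for_dropout(n, dropout_rate=0.20)  # assumed 20% dropout
print(f"needed per group: {n}, recruit per group: {n_recruit}")
```

Under these assumptions the sketch asks for 63 participants per group and 79 recruited per group. Calculators that use the t-distribution instead of the normal approximation will give slightly larger numbers.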

In addition, you want to understand which group carried out the analysis. This could be analysis done by the Decision Support department in your hospital, which would like you to apply forecasting algorithms to the data, or it could be data provided by external stakeholders, who might have a specific agenda in mind.

From this, you want to understand the data governance policy these groups have in place, as well as the data governance policy of your own organization. The goal is to ensure that the data collected follows the principles of health equity, so that no specific group is left out of healthcare practices that will affect the greater community being served.

Especially when dealing with healthcare data that could inform policy, we want to ensure that when using algorithms, we are including a wide group of people in our sample. There have been cases in the past, which I talked about on an episode of the Health Analytic Insights podcast, where algorithms have negatively affected the overall care provided to marginalized groups.
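A simple first check on how wide the sample really is would be to compare each group’s share of the sample against its share of the community being served. The sketch below is only an illustration. The group labels, the reference shares, the 5 percentage point tolerance and the helper `representation_gaps` are all assumptions, and a real check would use reference figures from a trusted source such as census or catchment data.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares, tolerance=0.05):
    """Return {group: (observed_share, expected_share)} for every group whose
    share of the sample falls short of its reference share by more than
    `tolerance`."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in reference_shares.items():
        observed = counts.get(group, 0) / total
        if expected - observed > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical sample and reference population shares, for illustration only.
sample = ["urban"] * 85 + ["rural"] * 15
reference = {"urban": 0.70, "rural": 0.30}

gaps = representation_gaps(sample, reference)
for group, (observed, expected) in gaps.items():
    print(f"{group}: {observed:.0%} of sample vs {expected:.0%} of population")
```

A flagged group does not prove the data is biased, but it is a good reason to ask how participants were recruited before any algorithm is trained on the data.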

Don’t be afraid to ask people questions about how the data is sourced, especially if people are relying on your analysis in the form of dashboards and reports to communicate results. If your data tells a story and informs policy, people will ask how you came to these conclusions based on your visualizations, and you won’t be able to put up your hands and say “I’m just the messenger of the data”. Questions will most likely fall to you. Therefore, it’s important that you ask the difficult questions and have the uncomfortable conversations before your work is publicly criticized.

Thus, it’s important we do a deep dive into understanding how the data was collected and by which groups. When people lose trust in the data you present, it is difficult for you to get this trust back and people will look at your future work with an ingrained level of mistrust.





How the data is presented


Once you are comfortable with the source of the data and how it was collected, and you believe that, to the best of your knowledge, it is reasonably free from bias, it’s time to move on to the analysis stage, where you implement machine learning algorithms or create dashboards to communicate insights to care providers.

When you are analyzing this data and showcasing it to a large audience, you want to make sure that the data you are showing is de-identified and that you are aware of the privacy and security access levels of the audience you are showing the report to. You might be creating a report for one specific healthcare department (e.g. orthopedics), and the metrics you are showing are for their eyes only and shouldn’t be accessed by the administrative staff.

Therefore, you need to be aware of who should have access to which parts of the data you show. Power BI has a great way to “lock down” the data with a feature called row-level security (RLS). This allows you to filter the data based on the role you assign to the user, so a user could have full access to the report you built or could see only a filtered view of the data.

Many organizations have a “need to know” policy, where staff only access the information that is relevant to doing their job, to help reduce privacy and security breaches and keep data de-identified. Again, this is especially important when dealing with sensitive healthcare data.
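The idea behind role-based row filtering can be sketched in a few lines of Python. This is not how Power BI implements it (there, roles and their filter rules are defined inside the tool itself), and the role names, departments, metrics and the helper `visible_rows` below are hypothetical. The point is only to show the “need to know” principle: each role sees just the rows it is entitled to, and an unknown role sees nothing.

```python
# Hypothetical report rows; departments, metrics and values are invented.
rows = [
    {"department": "orthopedics", "metric": "avg_wait_days", "value": 12},
    {"department": "cardiology", "metric": "avg_wait_days", "value": 9},
    {"department": "orthopedics", "metric": "readmissions", "value": 4},
]

# Hypothetical mapping of role -> departments that role may see.
# "*" means the role may see every department.
ROLE_ACCESS = {
    "orthopedics_staff": {"orthopedics"},
    "cardiology_staff": {"cardiology"},
    "decision_support": {"*"},
}


def visible_rows(report_rows, role):
    """Return only the rows the given role is allowed to see.

    Roles that are not listed in ROLE_ACCESS get an empty view by default.
    """
    allowed = ROLE_ACCESS.get(role, set())
    if "*" in allowed:
        return list(report_rows)
    return [row for row in report_rows if row["department"] in allowed]


for role in ("orthopedics_staff", "decision_support", "admin_staff"):
    print(role, "sees", len(visible_rows(rows, role)), "rows")
```

Defaulting unknown roles to an empty view is a deliberate choice in this sketch: it is safer for a new user to see too little than too much.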

In addition, the analysis you carry out is a science, and like all science your work needs to be reproducible. Therefore, it is important to document your steps and the analytical process you went through, in case external users have questions or want to peer review your analysis before policies are developed.

This process can be tracked in OneNote or a Word document and can include descriptions of any acronyms that might be unclear to external users, how the data was sourced, and who individuals can reach out to if they have specific questions about the insights or possible assumptions made during the analysis.

When using machine learning algorithms, we don’t want to get into a situation where we are using “black box” models, where neither the clinician nor the analyst understands the underlying statistics behind the algorithm being used. It’s important that the analyst is well versed in the input data and the machine learning algorithm, and can confirm that the output from the model is in line with what is expected.

When I was completing my master’s degree, my thesis was to identify key factors that could be used to predict premature births. There are many factors that impact a mother’s risk of a preterm birth, such as weight, socioeconomic and genetic factors. However, my clinical advisor helped me whittle these down to the non-obvious factors that would be clinically relevant to her group, because they were already aware of certain risk factors.

Therefore, it was key for me to get her perspective as a knowledgeable clinician with years of experience working with patients. Instead of just taking the data, inputting it into a machine learning algorithm and telling her what the algorithm output, it was critical for me to get her perspective and weigh these factors so that the analysis afterwards could be of value.

These are some key considerations you might run into as an analyst when dealing with healthcare analytics, specifically machine learning algorithms and healthcare data. It boils down to ensuring you understand the source of your data, what the underlying assumptions are, and how you will present your results to a wide audience, along with understanding that your analysis might inform policy and have a significant, lasting impact on patients.

