Synthetic data is being used more widely in research, particularly when dealing with healthcare data. In this blog post, I am going to go over what synthetic data is and how it is having a growing impact on the health informatics field.

What is synthetic data?

Synthetic data is a simulated copy of the real data you are trying to mimic. For instance, if you have a dataset that contains patient health information and whether each patient was readmitted to the hospital, you might create a synthetic dataset that replicates the general content but contains none of the individual patients' health information.
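To make this concrete, here is a minimal sketch of the idea in Python. The toy records, the column choices, and the function names `fit_marginals` and `sample_synthetic` are all made up for illustration. It summarises each column separately and then samples new rows from those summaries, so it ignores relationships between columns that a real synthetic data generator would try to preserve.

```python
import random
import statistics

# Hypothetical toy "real" records: (age, length_of_stay_days, readmitted)
real = [
    (71, 5, 1), (64, 3, 0), (58, 2, 0), (80, 9, 1),
    (45, 1, 0), (69, 4, 0), (77, 7, 1), (52, 2, 0),
]

def fit_marginals(records):
    """Summarise each column on its own; no patient row is kept."""
    ages = [r[0] for r in records]
    stays = [r[1] for r in records]
    readmits = [r[2] for r in records]
    return {
        "age_mean": statistics.mean(ages),
        "age_sd": statistics.stdev(ages),
        "stay_mean": statistics.mean(stays),
        "readmit_rate": statistics.mean(readmits),
    }

def sample_synthetic(params, n, seed=0):
    """Draw n fake rows from the fitted summaries (columns sampled independently)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: age is roughly normal, floored at 18
        age = max(18, round(rng.gauss(params["age_mean"], params["age_sd"])))
        # Assumption: length of stay is roughly exponential, at least 1 day
        stay = max(1, round(rng.expovariate(1 / params["stay_mean"])))
        readmitted = 1 if rng.random() < params["readmit_rate"] else 0
        rows.append((age, stay, readmitted))
    return rows

params = fit_marginals(real)
synthetic = sample_synthetic(params, n=1000)
synthetic_readmit_rate = statistics.mean(r[2] for r in synthetic)
```

The synthetic rows share the overall readmission rate and age profile of the toy data, but none of them is a copy of a real record.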

A synthetic dataset is a great option for research and data exploration because it preserves the general statistical properties of the original, so you can develop machine learning algorithms on the fake dataset instead of the real one, which may be at risk of a privacy breach. The trade-off with synthetic datasets is how closely the fake dataset mimics the real one. Some studies have shown promising early results, suggesting that the synthetic data does not differ significantly from the original dataset and that the risk of a privacy breach is relatively low, which points to real value in research settings. Another issue is how well the synthetic dataset can reproduce the outliers and one-off data points found in the real dataset. This matters because outliers often hold key information that should be carried over into the synthetic dataset.
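One way to picture both concerns is a quick check like the sketch below. The numbers are invented, and the standardized mean difference and the "mean plus two standard deviations" outlier rule are simply common choices I am assuming here, not the measures used in any particular study.

```python
import statistics

# Hypothetical length-of-stay values in days; 30 is a rare outlier in the real data
real_stay = [2, 3, 3, 4, 4, 5, 5, 6, 7, 30]
synthetic_stay = [2, 3, 3, 4, 4, 4, 5, 6, 6, 8]

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def count_outliers(values, threshold):
    """Number of values above a fixed threshold."""
    return sum(1 for v in values if v > threshold)

# Define "outlier" using the real data only: mean + 2 standard deviations
threshold = statistics.mean(real_stay) + 2 * statistics.stdev(real_stay)

smd = standardized_mean_difference(real_stay, synthetic_stay)
real_outliers = count_outliers(real_stay, threshold)
synthetic_outliers = count_outliers(synthetic_stay, threshold)

print(f"standardized mean difference: {smd:.2f}")
print(f"outliers in real data: {real_outliers}, in synthetic data: {synthetic_outliers}")
```

In this toy example the synthetic column looks reasonable in the middle of the distribution but misses the one very long stay entirely, which is exactly the kind of gap worth checking for.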


How Is Synthetic Data Impacting the Health Informatics Field?

As alluded to above, a benefit of synthetic data in the health informatics field is the low risk of information being linked back to any individual patient record. With de-identified data, by contrast, bad actors may be able to link other related datasets to the de-identified data and re-identify patients. Patient privacy and security are huge issues in the health informatics field, and we are always looking for ways to reduce this risk.
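The linking risk with de-identified data can be sketched in a few lines of Python. Everything here is hypothetical: the records, the names, the choice of quasi-identifiers (zip code, birth year, sex), and the `link` helper are made up purely to show the mechanism.

```python
# Hypothetical de-identified hospital records: names removed, quasi-identifiers kept
deidentified = [
    {"zip": "02139", "birth_year": 1956, "sex": "F", "diagnosis": "heart failure"},
    {"zip": "02139", "birth_year": 1984, "sex": "M", "diagnosis": "asthma"},
    {"zip": "90210", "birth_year": 1956, "sex": "F", "diagnosis": "diabetes"},
]

# Hypothetical public dataset that includes names alongside the same fields
public = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1956, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1984, "sex": "M"},
    {"name": "Carol Example", "zip": "10001", "birth_year": 1970, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")

def link(records, reference):
    """Match records to named people who share the same quasi-identifiers."""
    index = {}
    for person in reference:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(person["name"])
    matches = []
    for rec in records:
        names = index.get(tuple(rec[q] for q in QUASI_IDENTIFIERS), [])
        if len(names) == 1:  # a unique match re-identifies the patient
            matches.append((names[0], rec["diagnosis"]))
    return matches

reidentified = link(deidentified, public)
for name, diagnosis in reidentified:
    print(f"{name}: {diagnosis}")
```

Two of the three toy patients are re-identified even though their names were removed. With a synthetic dataset, the rows are generated rather than copied from real patients, so a match like this is far less likely to point to a real person's record.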

Another benefit of synthetic data is for research and data exploration. Instead of testing your algorithm on real patient data, you can use simulated data to see whether your idea is feasible at first. Once there is more confidence in the results, you can move on to real patient data, if necessary.

In addition, real patient data can suffer from a limited sample size, for example when you are dealing with a rare disease, or from a large percentage of missing values, which can occur when dealing with sensitive healthcare questions. A synthetic dataset can be created to expand the volume of data that research teams can work with, which might help to reduce the costs associated with collecting survey data from a large population with respect to a rare disease.

Another benefit of synthetic data is the ability to use these fake datasets to teach health informatics professionals and students. There are a few open-source datasets out there that individuals can practice on to start learning more about the analytical side of the health informatics field. By using these open datasets focused on healthcare data, individuals can begin to improve their technical skills with SQL (Structured Query Language), R or Python, and Tableau or Power BI. However, some of the current open-source datasets are limited in the volume of data you have access to and in the types of healthcare problems they address. Giving people access to synthetic datasets will help to develop the technical skills of health informatics professionals.

If you are looking for an example of synthetic data, you can check out the Simulacrum.

