Synthetic healthcare datasets support the development of data analysis and machine learning techniques in healthcare by offering representative data to experiment with and build models from, while mitigating the issues associated with handling highly sensitive data about human subjects. However, the performance and usefulness of the methods applied depend on the quality of these synthetic datasets and on how well they represent the phenomenon being modeled.
The objective of the project is to develop machine learning methods for generating synthetic healthcare datasets that preserve the distribution and the temporality of real administrative healthcare datasets while keeping confidential the sensitive information about the persons in the real dataset. This means obtaining guarantees that identifying real people from the original dataset is impossible or very unlikely, and that attributes of the real records (e.g. personal healthcare history) cannot be inferred from the synthetic dataset. Depending on the confidentiality guarantees that can be obtained for the real open medical data used to generate the synthetic datasets, producing synthetic versions of RAMQ datasets would be considered, and these could even be disclosed more openly for research and analysis purposes if that is deemed acceptable.
Student
Research supervisor(s)
Christian Gagné
Co-researcher
Anne-Sophie Charest
Start date
Title of the research project
Confidentiality-preserving synthetic data generation from administrative healthcare databases
Description