Lessons from Synthetic Health Data Generation: Fidelity, Privacy, Augmentation & Time
- RC Trust

Abstract: Primary healthcare data offers huge value in modelling disease and illness. However, this data holds extremely private information about individuals, and privacy concerns continue to limit the widespread use of such data, both by public research institutions and by the private health-tech sector. One possible solution is the use of synthetic data, which mimics the underlying correlational structure and distributions of real data but avoids many of the privacy concerns. Brunel University London has been working in a long-term collaboration with the Medicines and Healthcare products Regulatory Agency (MHRA) in the UK to construct a high-fidelity synthetic data generator using probabilistic models with complex underlying latent variable structures. This work has led to multiple releases of synthetic data on a number of diseases, including COVID-19 and cardiovascular disease, which are available for state-of-the-art AI research. Two major issues have arisen from our synthetic data work: bias, even when working with comprehensive national data, and concept drift, where subsequent batches of data move away from current models, raising questions about the impact this may have on regulation.
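The core idea of a high-fidelity generator — fit a probabilistic model to the real data, then draw fresh samples so that no real individual's record is released — can be illustrated with a deliberately minimal sketch. The distributions, feature names, and single-Gaussian model below are illustrative assumptions only; the generator described in the talk uses far richer latent variable structures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: 1000 patients, two correlated clinical-style features
# (hypothetical systolic BP and BMI; values and covariance are made up).
real = rng.multivariate_normal([120.0, 27.0], [[110.0, 35.0], [35.0, 18.0]], size=1000)

# Fit a probabilistic model -- here just a multivariate Gaussian, standing in
# for the latent-variable models used in practice.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample synthetic records: new draws from the fitted model, so the released
# rows are not copies of real individuals.
synthetic = rng.multivariate_normal(mu, cov, size=1000)

# Fidelity check: the synthetic data should reproduce the correlational
# structure of the real data.
r_real = np.corrcoef(real, rowvar=False)[0, 1]
r_syn = np.corrcoef(synthetic, rowvar=False)[0, 1]
```

In practice, fidelity is judged on far more than one correlation coefficient, and sampling from a fitted model does not by itself guarantee privacy; both points are exactly where the latent variable structure and the evaluation work discussed in the talk come in.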
In this talk I will discuss some of the key results of the collaboration: our experiences of synthetic data generation, the detection of bias and how to better represent the true underlying UK population, and how to handle concept drift when building models of healthcare data that evolves over time.
Prof. Allan Tucker
Bio: Allan Tucker is Professor of Artificial Intelligence in the Department of Computer Science at Brunel University London, where he heads the Intelligent Data Analysis Group, consisting of 17 academic staff, 15 PhD students and 4 post-docs. He has been researching Artificial Intelligence and Data Analysis for 28 years and has published 120 peer-reviewed journal and conference papers on data modelling and analysis. His research work includes long-term projects with Moorfields Eye Hospital, where he has been developing pseudo-time models of eye disease (EPSRC - £320k), and with DEFRA on modelling fish population dynamics using state space and Bayesian techniques (NERC - £80k). Currently, he has projects with Google, the University of Pavia, Italy, the Royal Free Hospital, UCL, the Zoological Society of London and the Royal Botanic Gardens, Kew. He was academic lead on an Innovate UK Regulators' Pioneer Fund project (£740k) with the Medicines and Healthcare products Regulatory Agency (MHRA) on benchmarking AI apps for the NHS, and another on detecting significant changes in Adaptive AI Models of Healthcare (£195k). He recently acted as academic lead on two Pioneer Funds on Explainability of AI (£168k) and In-Silico Trials (£750k). He is currently CI on a Centre of Excellence for Regulatory Sciences and Innovation. He serves regularly on the PC of the top AI conferences (including IJCAI, AAAI, and ECML) and is on the editorial board of the Journal of Biomedical Informatics. He hosted a special track on "Explainable AI" at the IEEE conference on Computer Based Medical Systems in 2019 and was general chair for AI in Medicine 2021. He has been widely consulted on the ethical and practical implications of AI in health and medical research by the NHS, and on the use of machine learning for modelling fisheries data by numerous government thinktanks and academia.
