Abstract:
Patient data are regarded as highly sensitive and are protected by federal, state, and local
policies that make them available only to those who have been granted access to protected health
information. Synthetic data generation provides one possible solution to the issue of limited
access, but it is also a key challenge in big data benchmarking, which aims to
generate application-specific datasets. In this dissertation, a comprehensive literature
review on synthetic data generation is first presented, which helps readers and practitioners
effectively adopt data generation approaches and provides insight into the state of the art.
Next, a Machine Learning (ML)-based algorithm, Intelligent Patient Data Generator
(IntPDG), is proposed to generate scalable patient claims data. To construct a model
that generates high-quality patient data, two main elements are investigated: the back window
size and the hyperparameters of different ML algorithms. In addition, a data
evaluation measure, Weighted Itemset Error (WIE), is presented and used to evaluate the
quality of the generated data during hyperparameter optimization. To generate claim-level data
from patient-level data, patterns and data structures of actual patient claims data are
gathered and used in probabilistic models. Once the data generator is constructed,
it is tested by simulating Medicare carrier claims data, which consist of three datasets: a patient
demographic table, a patient claim table, and a patient line table. As an additional layer of
validation, summary statistics of the generated datasets are compared
with those of the Medicare data, and the results confirm the consistency and validity of the
simulated claims data. The developed data generator can be used to generate claims data of
any size and type, such as inpatient and outpatient claims data, and can be
extended to generate other medical data such as Electronic Health Records (EHR).