🔬 Research Summary by Rachele Hendricks-Sturrup, Clara Fontaine, and Sara Jordan
Dr. Rachele Hendricks-Sturrup is the Research Director of Real-World Evidence (RWE) at the Duke-Margolis Center for Health Policy in Washington, DC, strategically leading and managing the Center’s RWE Collaborative.
Clara Fontaine is a Ph.D. student in the Centre for Quantum Technologies at the National University of Singapore (NUS).
Dr. Sara R. Jordan was Senior Researcher, Artificial Intelligence and Ethics at the Future of Privacy Forum.
[Original paper by Rachele Hendricks-Sturrup, Clara Fontaine, and Sara Jordan]
Overview: In the 21st century, real-world data privacy is possible using privacy-enhancing technologies (PETs) or privacy engineering strategies. This paper draws on the literature to summarize privacy engineering strategies that have facilitated the use and exchange of health data across various practical use cases.
Introduction
Today, real-world data privacy remains controversial and elusive, driving ongoing debate among privacy researchers, health industry members, policymakers, and others about how best to safeguard patient and consumer health data in modern ways. Researchers and other data management experts have demonstrated how real-world data can be generated, linked, processed, and shared in both privacy-preserving and identity-revealing ways. Few, if any, have broadly explored strategies through which real-world data privacy can be preserved using PETs or privacy engineering.
In this paper, the authors scoped the state of the literature and knowledge on privacy engineering strategies that, to date, have facilitated the use and exchange of health data. Key findings yielded three general categories of PETs for health data: algorithmic, architectural, and augmentation-based PETs.
Often combined, those three general categories of PETs fill privacy, security, or data sovereignty needs across a range of practical use cases involving health data.
Key Insights
PETs Defined and Explained
Defined broadly, a privacy-enhancing technology is a technical means of protecting a user’s privacy through policy, interaction, and/or architecture. To enable more critical analysis and comparison of these technologies’ applicability to health data, we identified and categorized seven PETs.
Algorithmic PETs
These represent data in a privacy-protecting but still useful way, lending mathematical rigor and measurability to privacy (a minimal example follows the list below).
- Homomorphic encryption
- Differential privacy
- Zero-knowledge proofs
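To make the algorithmic category concrete, below is a minimal sketch of the classic Laplace mechanism for differential privacy. The count query, threshold, and epsilon value are illustrative assumptions, not drawn from the paper, and a real deployment would use a vetted DP library rather than hand-rolled noise.

```python
import numpy as np

def dp_count(values, threshold, epsilon, rng=None):
    """Epsilon-DP count of records above a threshold via the Laplace
    mechanism. A counting query has L1 sensitivity 1 (adding or removing
    one individual changes the count by at most 1), so Laplace noise
    with scale 1/epsilon suffices for epsilon-differential privacy."""
    rng = rng or np.random.default_rng()
    true_count = sum(v > threshold for v in values)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical query: how many patients have systolic BP above 140 mmHg?
systolic_bp = [128, 151, 139, 162, 144, 118, 170]  # made-up readings
print(dp_count(systolic_bp, threshold=140, epsilon=0.5))
```

Smaller epsilon values add more noise and give stronger privacy, which is exactly the privacy-utility tradeoff called out in the summary table later in this piece.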
Architectural PETs
These enable confidential information exchange without sharing the underlying data, using a structured computation environment (a minimal example follows the list below).
- Federated learning
- Multi-party computation
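As an architectural illustration, here is a hedged sketch of one federated-averaging round for a simple linear model, assuming NumPy and made-up per-site datasets. Production federated systems involve secure aggregation, stragglers, and non-IID data that this toy omits.

```python
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One gradient step of linear regression on a client's local data."""
    grad = 2 * X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, clients):
    """One FedAvg-style round: each site trains locally on data that
    never leaves it; only model weights are aggregated centrally."""
    updates = [local_step(global_weights, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    # Weighted average of client models, proportional to local data size.
    return np.average(updates, axis=0, weights=sizes)

# Illustrative use: three hypothetical hospitals with private datasets.
rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(50):
    w = federated_round(w, clients)
print(w)
```

Note that only the weight vector crosses the network each round; the communication-stability and non-IID caveats in the summary table below are invisible in a toy this small.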
Augmentation PETs
These generate realistic data to enhance small datasets or to produce fully synthetic datasets (a minimal example follows the list below).
- Synthetic data
- Digital twinning
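For the augmentation category, the sketch below fits a simple multivariate Gaussian to hypothetical real records and samples synthetic ones. This stands in for the far richer generators used in practice (GANs, Bayesian networks) and is only meant to show the release-a-model-not-the-data idea.

```python
import numpy as np

def fit_and_synthesize(real, n_synthetic, rng=None):
    """Fit a multivariate Gaussian to real records and sample synthetic
    ones that preserve the means and pairwise correlations. Any downstream
    release still needs validation that the samples are representative."""
    rng = rng or np.random.default_rng()
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_synthetic)

# Illustrative use: age and cholesterol columns from a made-up cohort.
rng = np.random.default_rng(1)
real = np.column_stack([rng.normal(55, 10, 200), rng.normal(190, 30, 200)])
synthetic = fit_and_synthesize(real, n_synthetic=500, rng=rng)
print(synthetic.mean(axis=0), real.mean(axis=0))  # should roughly agree
```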
State of the Peer-Reviewed Literature
In the United States, real-world uses of health data require that the data be protected to the highest privacy standards set by HIPAA. However, HIPAA does not precisely define these standards beyond the removal of 18 personal identifiers (the Safe Harbor method); any stronger protection must instead be established through expert determination.
An expert could defensibly turn to the peer-reviewed literature on the topic to make their determination. As domain experts in health data privacy, machine learning, and computer science, we sought to experience and assess this process of expert determination ourselves. We examined the state of peer-reviewed literature at the intersection of PETs and health data applications, evaluating the robustness, consistency, transparency, and usefulness of the results presented in relevant papers retrieved from ACM, IEEE, and PubMed.
Challenges for Expert Determination of Usefulness of Peer-Reviewed Literature on PETs and Health Data
Within our team of three experts, we independently evaluated each article against two criteria:
- Applicability to the health data context
- The rigor of testing of the PET (i.e., quality, quantity, diversity of datasets)
Although we worked with the same rubric and consulted one another often for clarification, we evaluated the PET literature differently. Our expert determinations depended heavily on knowledge of health conditions, recognition of standard benchmark datasets, and understanding of machine learning performance metrics. Glaring disparities in the quality of performance characterization across these articles (details on privacy-utility tradeoffs, computational time, and hardware constraints) amplified the differences in our determinations. We achieved only two-thirds agreement on over 80% of the literature, raising two questions: What makes a relevant expert? And what are the limits of expert collaboration?
Key Characteristics and Considerations for Each PET
To move the health data privacy community forward on this topic, we provide the referenced review and a summary table titled “Key Characteristics and Considerations of Each PET.” The table gives an overview of each algorithmic, architectural, and data-augmentation PET described, its specific use cases, the pros and cons of using it in health data contexts, and opportunities for future research.
| PET | Description | Use cases | Pros | Cons | Future research |
| --- | --- | --- | --- | --- | --- |
| Differential privacy | Adds noise to a dataset to reduce an adversary’s ability to tell whether an individual is part of the dataset. Some variations improve data utility at the cost of weaker privacy protection. | Publishing or sharing data to satisfy research needs | Provides measurable privacy guarantees | Privacy-utility tradeoff; inapplicable to time-series data; under-represented or “unique” minority data may not be well-characterized | Comparable and consistent reporting, across DP variations, of the types and granularity of at-risk private information |
| Homomorphic encryption | Encryption scheme that enables private computation over encrypted sensitive data. Comes in partial, somewhat, and fully homomorphic variants. | Third-party computation; data storage and processing | Provides a high level of privacy; compatible with most data types | Inefficient, expensive, and complex; not well-suited to resource-constrained environments | Explore more diverse and lightweight variations of HE, especially for resource-constrained environments; analyze performance-privacy tradeoffs carefully |
| Zero-knowledge proofs | Verification of sensitive data between collaborators without explicitly transferring the data | Identity and attribute verification | No direct transfer of sensitive health data; space-, power-, and computationally efficient | Applications are poorly characterized and infrequently discussed in health data research | Explore practical applications with health data and characterize performance and privacy |
| Federated learning | Collaborative ML modeling while keeping training data local to data owners. Decentralized or centralized for both data and model. | Collaborative ML with theoretically any type of algorithm or data | Enables ML training with more diverse data; reduces computational load for institutions or devices; private data never moves beyond the firewalls of institutions or devices; provides a high level of data sovereignty to owners | No true privacy baseline across the learning system; scalability depends on collaboration and stable communication between otherwise sovereign and asymmetrical devices or institutions; aggregated data is not necessarily independent and identically distributed | Identify when a federated approach is the best choice for the specific purpose of protecting data privacy; address interoperability challenges; consistently characterize the privacy-utility-performance tradeoffs across FL approaches to aid decision-making |
| Multi-party computation | Computation across multiple encrypted data sources while ensuring no party learns the private data of another. Includes secret sharing (sketched below), garbled circuits, and oblivious transfer. | Collaborative inference; third-party model training | Strong privacy protections for all participating parties; no need for a trusted third party; high accuracy and precision | Communication and computational complexity are too high to use reasonably at scale or in resource-constrained environments; privacy-accuracy tradeoff | Develop more practical SMC solutions for resource-constrained environments and computations at scale |
| Synthetic data | Synthesizing data to use instead of, or in addition to, real health data | Supports rapid development and benchmarking of ML algorithms; balances data with uneven representation; augments datasets; measures utility loss of algorithmic PETs | May be the most effective way to maximize privacy; increasingly easy and cost-efficient to implement | Limited methods to generate realistic data; limited types of data that can be synthesized; synthetic data must be validated as representative of real data; should be restricted to secondary uses | Develop diverse methods to generate realistic synthetic data of all data types |
| Digital twinning | Virtual representations of physical systems, originally of what has been manufactured | A virtual counterpart to persons or hospitals for testing tools such as ML models | A real-time simulated environment without risk of exposing private data | Applications in healthcare are primarily theoretical; privacy protections (e.g., risk of re-identification) are not well characterized | Develop practical applications of digital twins in healthcare; characterize their privacy protections |
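To ground the multi-party computation row above, here is a minimal additive secret-sharing sketch in pure Python: three hypothetical hospitals compute a joint total without any party seeing another’s count. The prime modulus, party count, and case counts are illustrative assumptions, not from the paper.

```python
import secrets

PRIME = 2**61 - 1  # field modulus; all arithmetic is done mod this prime

def share(value, n_parties):
    """Split a private value into n additive shares. Each share alone is
    a uniformly random field element and reveals nothing on its own."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def secure_sum(all_shares):
    """Each party sums the shares it holds; combining those partial sums
    reconstructs only the total, never any individual input."""
    partials = [sum(col) % PRIME for col in zip(*all_shares)]
    return sum(partials) % PRIME

# Three hospitals compute a total case count without revealing their own.
counts = [120, 45, 301]  # hypothetical private inputs
all_shares = [share(c, n_parties=3) for c in counts]
print(secure_sum(all_shares))  # 466
```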
Expertly Choosing PETs for Health Data
Experts in health data management, public health, computer science, and privacy engineering may face certain challenges in their collaborative attempts to ascertain the state of the literature on PETs and health data. Because the HIPAA expert determination pathway continually requires experts to rely on their knowledge of state-of-the-art technical methods to balance disclosure risk against data utility, the following themes from the literature are essential to remember:
- Not all types of health data can be protected with each PET; performance, data utility, and/or the required computational resources vary drastically across data types.
- Architectural PETs create opportunities for data sovereignty and privacy in the aggregate, but they do not protect privacy locally.
- The state of the art in PETs research for health data often relies on well-worn benchmark datasets from other domains, or on specific health data types, so results cannot be easily extended to health data in the wild.
- Aside from maximizing the use of fully synthetic data, combining architectural and algorithmic approaches is the emerging best practice in the research literature (see the sketch after this list).
- When to use a specific PET is context-dependent. The choice of PET in a pre-inferential setting will vary from choices made in a post-inferential setting. Continuous use of a single method will likely result in unintended privacy or performance loss.
- None of the algorithmic or architectural PETs can guarantee zero risk of reidentification. Fully synthetic data is the closest to providing such guarantees.
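As a hedged illustration of combining an architectural PET with an algorithmic one, the sketch below clips and noises a client’s federated update before it leaves the institution. The clipping bound and noise scale are illustrative assumptions and are not calibrated to a formal (epsilon, delta) guarantee.

```python
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip a client's model update and add Gaussian noise before it
    leaves the institution, layering an algorithmic PET (DP-style
    noising) on an architectural one (federated aggregation). The clip
    bound and noise scale here are illustrative only; real deployments
    calibrate them to a target privacy budget."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

# The server ever sees only privatized updates, never raw ones.
rng = np.random.default_rng(2)
client_updates = [rng.normal(size=4) for _ in range(5)]
noisy = [privatize_update(u, rng=rng) for u in client_updates]
print(np.mean(noisy, axis=0))
```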
Between the lines
An important takeaway is that this paper offers a critical yet convenient starting point for experts collaborating across health data management, public health, computer science, and privacy engineering practices. We identified three sentinel examples of work with carefully written descriptions of the PETs used and their applications, as well as rigorously reported methods and findings.
Yet a great deal of work remains to intentionally integrate PETs into the day-to-day practices of health data privacy and security experts and to create more robust, standardized guidance for relevant practitioners and stakeholders. Understanding the benefits and limitations of each PET is mission-critical for legal and data experts who manage, or advise on the management of, sensitive health data.
Experts should continue to systematically explore a broad range of literature to develop formalized recommendations and disseminate them to policymakers and healthcare system stakeholders seeking to operationalize the potential, utility, and acceptability of PETs in support of public health research and practice.