Exploring the Subtleties of Privacy Protection in Machine Learning Research in Québec 

May 26, 2025

🔬 Original article by Abigail Buller, Léa Leclerc, Cleo Norris and Elizaveta Sycheva from Encode Canada.


📌 Editor’s Note: This is part of our Recess series, featuring university students from across Canada exploring ethical challenges in AI. Written by members of Encode Canada, a student-led advocacy organization dedicated to including Canadian youth in essential conversations about the future of AI, these pieces aim to spark discussions on AI literacy and ethics.


Introduction

On April 4, 2023, the National Assembly of Québec passed Bill 3 (now Law 5), titled An Act respecting health and social services information. This legislation aims to improve healthcare data access for authorized entities, such as accredited researchers, by establishing a framework for the collection, use, and sharing of health and social services information. However, Law 5 lacks specific guidance on privacy, which could give rise to suboptimal practices, data breaches, and flawed research outcomes. These risks could compromise data security and patient confidentiality, affect legislative decisions, and result in discriminatory treatment of minority groups.

Given the sensitivity of personal health data, we suggest that the framework proposed by Law 5 could be strengthened with additional guidance to enhance privacy protections. Clearer recommendations would strengthen data protection, build public trust, and support ethical healthcare research. We believe that developing guidelines that consider the needs of researchers, policymakers, and individuals could help ensure a fair and comprehensive solution. Our work encourages a broader discussion on the potential gaps in the regulation and their ethical implications, and the examination of similar frameworks implemented globally, which may offer insights to improve privacy measures under Law 5.

Legislative Background

Law 5, adopted in 2023, is Québec’s first comprehensive legislation dedicated to the protection, management, and sharing of health and social services information. It builds upon the foundation laid by Law 25 (formally, An Act to modernize legislative provisions as regards the protection of personal information), adopted in Québec on September 22, 2021, which modernized the province’s rules for protecting personal information and marked a significant milestone for privacy protection. Law 5 extends these efforts by focusing specifically on safeguarding health and social services information. It introduces provisions designed not only to enhance protection but also to regulate access to this information by authorized entities under regulated conditions. Moreover, Law 5 grants individuals and authorized professionals specific access rights and establishes guidelines for researchers seeking to use this information for approved projects. The goal of Law 5 is to protect healthcare data while facilitating efficient information use to improve service delivery and support research initiatives (National Assembly of Québec, 2023).

In this context, health and social services information is defined as “any information that allows a person to be identified, even indirectly”. It involves a wide range of data, including details concerning a person’s physical and mental health, biological samples obtained from them, and specifics regarding the nature and location of services they receive. Researchers granted access to this data could have access to personal information of Québec residents, including their medical histories and family backgrounds. This information is crucial for conducting structured studies or systematic investigations, particularly for innovation purposes (National Assembly of Québec, 2023).

Under Law 5, the Québec government requires organizations to select a data anonymization method that aligns with the specific characteristics of their datasets and meets the law’s criteria for re-identification risk analysis. This legislation draws upon established anonymization practices outlined in the General Data Protection Regulation (GDPR), which has been enforced in Europe since 2018 in order to strengthen data protection measures, minimize privacy risks, and promote responsible data handling. These practices include techniques such as pseudonymization, k-anonymization, l-diversity, and differential privacy (European Union, 2018).
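
For illustration, the sketch below shows what the first of these techniques, pseudonymization, can look like in practice. It is a minimal Python example under assumed field names (ramq_number, postal_code are our inventions, not fields defined in Law 5 or the GDPR):

import hashlib
import secrets

# The salt is generated once and must be stored separately from the released data;
# without it, the tokens cannot easily be linked back to the original identifiers.
SALT = secrets.token_hex(16)

def pseudonymize(record: dict) -> dict:
    """Replace the direct identifier with a salted hash so that records can still be
    linked across tables without exposing the original identifier."""
    out = dict(record)
    token = hashlib.sha256((SALT + out["ramq_number"]).encode()).hexdigest()
    out["patient_token"] = token[:16]
    del out["ramq_number"]
    return out

print(pseudonymize({"ramq_number": "ABCD12345678", "diagnosis": "E11", "postal_code": "H3A"}))

Note that under the GDPR, pseudonymized data of this kind is still considered personal data, since a mapping back to identities continues to exist; this is one reason the stronger techniques discussed below are often preferred.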

Despite referencing these established practices, Law 5 neither prescribes specific anonymization techniques nor establishes a definitive framework for organizations to follow when selecting an appropriate privacy mechanism for their data. This approach gives organizations the flexibility to tailor anonymization measures to their specific datasets, which is crucial for adapting to the diverse types and complexities of data. However, the same flexibility can result in sub-optimal application of privacy measures and increase the risk of re-identification: different interpretations of privacy requirements may lead to inconsistent levels of protection across applications, creating gaps in overall privacy governance. A well-known example is the Massachusetts Group Insurance Commission case, where vague privacy requirements and the absence of clear anonymization standards led to the re-identification of supposedly anonymized health records, including those of the state governor (Sweeney, 2002).
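
To make that risk concrete, here is a toy sketch of the kind of linkage attack Sweeney describes. The data and column names are entirely invented, but the mechanics, joining a “de-identified” health file to a public list on shared quasi-identifiers, mirror the Massachusetts case:

import pandas as pd

# "De-identified" health records: names removed, but quasi-identifiers kept.
health = pd.DataFrame([
    {"postal_code": "H2X 1Y4", "birth_date": "1951-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"postal_code": "H3A 2B2", "birth_date": "1968-03-12", "sex": "M", "diagnosis": "diabetes"},
])

# A public list (e.g., a voter roll) containing the same quasi-identifiers plus names.
public_list = pd.DataFrame([
    {"name": "A. Tremblay", "postal_code": "H2X 1Y4", "birth_date": "1951-07-31", "sex": "F"},
    {"name": "B. Gagnon",   "postal_code": "H3A 2B2", "birth_date": "1968-03-12", "sex": "M"},
])

# Joining on the shared quasi-identifiers re-attaches names to diagnoses.
reidentified = health.merge(public_list, on=["postal_code", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])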

Upon examining the international landscape, numerous countries have implemented laws regarding the protection of personally identifiable information (PII) and protected health information (PHI). One such regulation is the aforementioned General Data Protection Regulation (GDPR), a data protection law that took effect in the European Union (EU) on May 25, 2018 (Zaeem and Barber, 2020). Another well-documented example is the Health Insurance Portability and Accountability Act (HIPAA) in the United States, which sets the standard for protecting sensitive patient data and was implemented on April 14, 2003 (Gostin et al., 2009). The GDPR serves as a strong foundation for Québec’s Law 5, outlining the strengths and weaknesses of key privacy mechanisms. It recognizes that the suitability of these mechanisms varies by application and underscores the importance of determining optimal privacy solutions on a case-by-case basis. However, certain techniques that are recognized and accepted under the GDPR framework have been found to have significant limitations (Begum and Nausheen, 2018). This highlights a need for the continuous improvement and strengthening of GDPR guidelines and practices.

Data and Machine Learning Privacy

Healthcare data is considered uniquely valuable because it contains some of the most intimate and intrinsic details about an individual’s life. Unlike other types of information, such as financial data, where it is relatively straightforward to open a new bank account or obtain a new Social Insurance Number in case of a breach, healthcare data is far more personal and irreplaceable. The deeply personal nature of medical history, genetic information, and treatment records makes it exceptionally sensitive and difficult to protect; once exposed, these details cannot be changed or replaced. Ensuring the privacy of medical records is therefore crucial, as it preserves the sensitive and comprehensive nature of the information they contain and protects the integrity and dignity of individuals (Nass et al., 2009).

In recent years, the healthcare sector has faced numerous privacy-related challenges and frequent data breaches: between 2005 and 2019, 249.9 million healthcare records were compromised (Seh et al., 2020). The literature highlights a critical incident from January 2015, when Anthem disclosed a major security breach that exposed 78.8 million patient records. The breached data included highly sensitive information such as names, Social Security numbers, home addresses, birth dates, ID numbers, and health records (Seh et al., 2020). This breach is one of many examples that showcase the severe vulnerabilities in healthcare data security and underscore the need to address privacy challenges in healthcare.

Another area of particular importance in data privacy is the training, sharing, and deployment of machine learning models. For example, in 2023, researchers revealed that they could extract training data from OpenAI’s ChatGPT using a simple query (Nasr et al., 2023). At a time when machine learning models, including generative AI, are being incorporated into everyday healthcare practices and research, it is particularly important to consider what it means to make a particular dataset or algorithm “private”. 

Privacy definitions vary based on what information disclosures are considered acceptable, and some controlled information leakage is necessary to derive insights, particularly in medical research. Privacy mechanisms can be statistical, adding noise or randomness to data, or deterministic, altering data in fixed ways through generalization, masking, suppression, or aggregation. Both kinds of methods can still be vulnerable to inference attacks. k-anonymity, for example, is a deterministic technique introduced to protect identities by ensuring that each person’s record is indistinguishable from those of at least k−1 other people in the dataset. Although it relies on generalization and suppression to achieve this, k-anonymity remains susceptible to homogeneity and background knowledge attacks, motivating more rigorous privacy guarantees.
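
As a rough illustration, the following sketch checks whether a table satisfies k-anonymity over a chosen set of quasi-identifiers; the column names and records are assumptions made up for the example:

import pandas as pd

def satisfies_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every combination of quasi-identifier values appears in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

records = pd.DataFrame({
    "age_band":    ["30-39", "30-39", "30-39", "40-49", "40-49"],
    "postal_area": ["H2X",   "H2X",   "H2X",   "H3A",   "H3A"],
    "diagnosis":   ["asthma", "diabetes", "asthma", "influenza", "asthma"],
})

print(satisfies_k_anonymity(records, ["age_band", "postal_area"], k=2))  # True: smallest group has 2 rows
print(satisfies_k_anonymity(records, ["age_band", "postal_area"], k=3))  # False: one group has only 2 rows

A table can pass this check and still leak information: if every record in a group shares the same diagnosis, anyone who knows a person belongs to that group learns their diagnosis, which is precisely the homogeneity attack mentioned above.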

The so-called “gold standard” among privacy mechanisms is differential privacy (DP), introduced by Cynthia Dwork in 2006, which can be incorporated into an algorithm or applied during data processing and analysis (Arasteh et al., 2024). DP is designed to protect an individual’s information in a dataset by minimizing the influence of any single person’s data on the output: the inclusion or exclusion of an individual’s record should not significantly change the result. Making an algorithm differentially private involves adding controlled perturbation, or noise, during the training process (Wang & Hegde, 2019).
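
A minimal sketch of this idea is the Laplace mechanism applied to a counting query: because adding or removing one person changes a count by at most one, adding Laplace noise scaled to 1/ε yields ε-differential privacy. The toy cohort and ε value below are illustrative assumptions:

import numpy as np

def dp_count(values, epsilon: float, rng: np.random.Generator) -> float:
    """Epsilon-differentially private count. A counting query has sensitivity 1
    (one person's presence changes the count by at most 1), so Laplace noise
    with scale 1/epsilon suffices."""
    true_count = sum(values)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
has_condition = [True, False, True, True, False, False, True]  # toy cohort, true count = 4
print(dp_count(has_condition, epsilon=0.5, rng=rng))  # noisy answer near 4

Training a full model under DP (for example, with DP-SGD) follows the same principle, except that the clipping and noising are applied to per-example gradients during training rather than to a single count.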

Privacy-preserving training of an algorithm means that noise is introduced, which naturally affects the accuracy or utility of the algorithm. It is important to consider how this noise affects subgroups within the dataset, introducing fairness as another factor in this trade-off. The balance of the privacy, accuracy, and fairness of an algorithm is an area of active research. For example, it is known that the reduction in accuracy due to added privacy affects subgroups within the data differently. Bagdasaryan et al. (2019) show that the accuracy degrades more for underrepresented classes than well-represented classes in a dataset. Furthermore, if the original model is “unfair”, making an algorithm DP exacerbates this unfairness, and “the poor become poorer”. This discussion emphasizes the idea that the selection of a privacy mechanism is context-dependent and requires in-depth technical knowledge.

Opportunities for Strengthening the Current Privacy Framework

The exploration of legislation and privacy mechanisms above leads us to conclude that Québec’s current privacy framework may present challenges in ensuring that effective privacy mechanisms are implemented in the context of ML research. At the same time, imposing more stringent regulations on rapidly evolving technologies like AI and ML carries its own challenges and drawbacks. Given the delicate balance between fostering innovation and ensuring data privacy, several approaches could be explored to address these privacy concerns without requiring any amendments to Law 5.

As mentioned above, different privacy mechanisms each have their own strengths and weaknesses. Given that the nuances of these strengths and weaknesses are still under investigation by researchers around the world (Patel et al., 2024), it is not always reasonable to assume that researchers who work in ML and AI, but outside the specific field of ML privacy, have sufficient knowledge to make optimal privacy decisions. One possible way to address this challenge would be for a governing body to release an official guide outlining best practices for privacy in ML and AI that align with Law 5. The Commission d’accès à l’information du Québec (CAI) has taken similar steps in the past by publishing “guides and information sheets” to help the public better understand the applicable laws (Guides et Fiches D’information, 2025). Official documentation (available in both French and English) outlining the various privacy mechanisms along with their strengths, weaknesses, and caveats could help researchers make more informed privacy decisions more efficiently. It could also give them greater confidence in their choices, knowing they are backed by official guidance, while reassuring the public that their data is being protected through government-reviewed and vetted measures.

How secure training data remains when models developed in research environments are later released publicly is an active topic of exploration. “Trusted Research Environments” (TREs) are defined as “an environment supported by trained staff and agreed processes (principles and standards), providing access to data for research while protecting patient confidentiality” (Kavianpour et al., 2022). Ritchie et al. (2023) investigate how, although TREs have stringent privacy measures in place to keep the work performed and the data used in these environments safe and secure, new risks emerge when ML models developed there are released into the outside world. Current TRE frameworks ensure that all outputs released from the environment are checked to preserve data confidentiality; however, these checks can be costly and time-consuming (SACRO: Semi-Automated Checking of Research Outputs – DARE UK, 2025). While Québec research institutions could consider implementing similar frameworks, doing so may require significant investments of time and resources to refine a system that already has known limitations.

To address some of these challenges, the Data and Analytics Research Environments (DARE) UK initiative is working on a project called Semi-Automated Checking of Research Outputs (SACRO). This initiative aims to reduce the operating costs of TREs while simultaneously reducing the time taken to release research results (SACRO: Semi-Automated Checking of Research Outputs – DARE UK, 2025). By working with a range of TREs across various sectors, as well as members of the public, SACRO aims to produce a framework founded on rigorous statistical methods that provides guidance on quality assurance, together with a semi-automated system for checking multiple types of research outputs, including AI (SACRO: Semi-Automated Checking of Research Outputs – DARE UK, 2025). A tool like SACRO, paired with guidelines requiring Québec researchers to report system-generated assessments, could serve as one possible avenue for strengthening privacy protections while maintaining public trust. Ideally, such a system could not only assess the “privacy rigor” of research outputs but also evaluate potential biases introduced by different privacy mechanisms.

Of course, adopting such an approach would come with its own set of challenges. Developing the necessary technological infrastructure and building the expertise to integrate a tool like SACRO into existing research workflows would require careful consideration of cost, operational feasibility, training requirements, and potential regulatory adjustments. Other approaches could also be explored, such as enhancing manual review processes or establishing standardized privacy assessment frameworks. Ultimately, the goal would be to strike a balance between efficiency, privacy, and accessibility, ensuring that research remains both rigorous and ethical while fostering innovation in Québec’s healthcare sector.

Conclusion

Our research leads us to conclude that clearer privacy guidelines that evolve alongside ML technologies could benefit both the ML research community and the people whose data it uses. In particular, we believe that Law 5 could benefit from additional detail and specificity to better guide machine learning researchers working with healthcare data. While legislating emerging technologies is challenging, without more detailed privacy measures, Law 5 may fall short in protecting sensitive public data. Future work should focus on developing tailored privacy frameworks for ML, such as government-supported initiatives to release periodic, updated guidelines for researchers. There is also a pressing need to address the intersection of privacy with ethical concerns such as fairness and bias in healthcare algorithms. Overall, we advocate for continuous collaboration between researchers, policymakers, and healthcare professionals to ensure that privacy measures remain effective and adaptable. As legislation in Québec evolves, stakeholders must work together to ensure that innovation advances in tandem with strong protections for patient rights.


References

  • Arasteh, S. T., Ziller, A., Kuhl, C., Makowski, M., Nebelung, S., Braren, R., Rueckert, D., Truhn, D., & Kaissis G. (2024). Preserving fairness and diagnostic accuracy in private large-scale AI models for medical imaging. Commun Med 4, 46. https://doi.org/10.1038/s43856-024-00462-6.
  • Bagdasaryan, E., Poursaeed, O., & Shmatikov, V. (2019). Differential privacy has disparate impact on model accuracy. Proceedings of the 33rd International Conference on Neural Information Processing Systems. Curran Associates Inc., Red Hook, NY, USA, Article 1387, 15479–15488.
  • Begum, S. H., & Nausheen, F. (2018). A comparative analysis of differential privacy vs other privacy mechanisms for big data. In 2018 2nd International Conference on Inventive Systems and Control (ICISC) (pp. 512-516). IEEE.
  • European Union. (2018). General Data Protection Regulation (GDPR). https://gdpr-info.eu/
  • Gostin, L. O., Levit, L. A., & Nass, S. J. (Eds.). (2009). Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academies Press (US).
  • Guides et fiches d’information. (2025). Commission d’Accès à l’Information Du Québec. https://www.cai.gouv.qc.ca/commission-acces-information/guide-fiches-information
  • Kavianpour, S., Sutherland, J., Mansouri-Benssassi, E., Coull, N., & Jefferson, E. (2022). A Review of Trusted Research Environments to Support Next Generation Capabilities based on Interview Analysis. Journal of Medical Internet Research. https://doi.org/10.2196/33720
  • Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F., & Lee, K. (2023). Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.
  • Nass, S. J., Levit, L. A., Gostin, L. O., & Institute of Medicine (US) Committee on Health Research and the Privacy of Health Information: The HIPAA Privacy Rule (Eds.). (2009). Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academies Press (US).
  • National Assembly of Québec. (2023). An Act respecting health and social services information and amending various legislative provisions, Bill 3, Chapter 5. Éditeur officiel du Québec.
  • Patel, H., Patel, A., & Patel, A. (2024). A Comprehensive Analysis of Privacy-Preserving Techniques in Machine Learning. 1836–1841. https://doi.org/10.1109/icac2n63387.2024.10895292
  • Ritchie, F., Tilbrook, A., Cole, C., Jefferson, E., Krueger, S., Mansouri-Bensassi, E., Rogers, S., & Smith, J. (2023). Machine learning models in trusted research environments — understanding operational risks. International Journal of Population Data Science, 8(1). https://doi.org/10.23889/ijpds.v8i1.2165
  • SACRO: Semi-Automated Checking of Research Outputs – DARE UK. (2025, January 23). DARE UK. https://dareuk.org.uk/how-we-work/previous-activities/dare-uk-phase-1-driver-projects/sacro-semi-automated-checking-of-research-outputs/
  • Seh, A. H., Zarour, M., Alenezi, M., Sarkar, A. K., Agrawal, A., Kumar, R., & Ahmad Khan, R. (2020). Healthcare Data Breaches: Insights and Implications. Healthcare, 8(2), 133. https://doi.org/10.3390/healthcare8020133.
  • Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), 557–570.
  • Wang, B., & Hegde, N. (2019). Privacy-preserving Q-learning with functional noise in continuous spaces. Proceedings of the 33rd International Conference on Neural Information Processing Systems. USA: Curran Associates Inc. https://dl.acm.org/doi/10.5555/3454287.3455303.
  • Zaeem, R. N., & Barber, K. S. (2020). The effect of the GDPR on privacy policies: Recent progress and future promise. ACM Transactions on Management Information Systems (TMIS), 12(1), 1–20.