Research summary: Maximizing Privacy and Effectiveness in COVID-19 Apps

Top-level summary: This insightful work from the OpenMined team, led by Andrew Trask, provides a great deal of technical detail on building an AI-enabled application to combat COVID-19. Along with articulating user and government needs from such an application, it provides the rationale for considerations to keep in mind when balancing privacy against the usefulness of the data provided to health authorities as they strive to mitigate the spread of the epidemic. Giving details on the concepts of differential privacy, private set intersection, and private identity servers, the article maps these techniques to the needs of privacy preservation in the collection and analysis of historical and current absolute location data, historical and current relative position data, and verified group identity information. Together, these help users achieve the goals of getting proximity alerts, exposure alerts, information for planning trips, symptom analysis, and demonstrating proof of health. They also enable governments and health authorities to meet their goals of fast contact tracing, high-precision self-isolation requests, high-precision self-isolation estimation, high-precision symptomatic citizen estimation, and demonstration of proof of health. All of these steps will help minimize negative economic impacts while accelerating the return of society to normalcy. We encourage people seeking to develop solutions, or those responsible for verifying whether solutions respect the fundamental rights of citizens, to read through our summary and the full article from OpenMined (linked at the end of this summary) to gain a comprehensive understanding of the issues and potential solutions in building applications.

While the insights presented in this piece of work are ongoing and will continue to be updated, we felt it important to highlight the techniques and considerations compiled by the OpenMined team as it is one of the few places that adequately capture, in a single place, most of the technical requirements needed to build a solution that respects fundamental rights while balancing them with public health outcomes as people rush to make AI-enabled apps to combat COVID-19. Most articles and research work coming out elsewhere are very scant and abstract in the technical details that would be needed to meet the ideals of respecting privacy and enabling health authorities to curb the spread of the pandemic.

The four key techniques that will help preserve and respect rights as more and more people develop AI-enabled applications to combat COVID-19 are: on-device data storage and computation, differential privacy, encrypted computation and privacy-preserving identity verification.

From a user perspective, the primary use cases for which apps are being built are: proximity alerts, exposure alerts, information for planning trips, symptom analysis and demonstration of proof of health. From the perspective of governments and health authorities, the use cases are: fast contact tracing, high-precision self-isolation requests, high-precision self-isolation estimation, high-precision symptomatic citizen estimation and demonstration of proof of health.

While public health outcomes are top of mind for everyone, the above use cases are trying to achieve the best possible tradeoff between economic impacts and epidemic spread. Using the techniques highlighted in this work, it is possible to do so without eroding the rights of citizens.

This living body of work is meant to serve as a high-level guide, along with resources, to enable app developers to implement privacy preservation and verifiers to check for it, since privacy has been the primary source of pushback from citizens and civil activists. Evoking a high degree of trust from people will improve adoption of the apps developed and hopefully allow society and the economy to return to normal sooner while mitigating the harmful effects of the epidemic.

There is a fair amount of alignment in the goals of both individuals and the government with the difference being that the government is looking at aggregate outcomes for society. Some of the goals shared by governments across the world include: preventing the spread of the disease, eliminating the disease, protecting the healthcare system, protecting the vulnerable, adequately and appropriately distributing resources, preventing secondary breakouts, minimizing economic impacts and panic.

Digital contact tracing is important because manual interventions are usually highly error-prone and rely on human memory to trace whom a person might have come in contact with. High-precision self-isolation requests avoid the need for geographic quarantines, in which everyone in an area is forced to self-isolate, leading to massive disruptions in the economy and potentially stalling the delivery of essential services like food, electricity and water. An additional benefit of high-precision self-isolation is that it can help create an appropriate balance between economic harms and epidemic spread.

High-precision symptomatic citizen estimation is useful in that it allows for a more fine-grained estimate of the number of people who might be affected, beyond what the test results indicate, which can further strengthen the precision of the other measures that are undertaken. A restoration of normalcy in society is going to be crucial as the epidemic starts to ebb. In this case, proof of health that helps identify the lowest-risk individuals will allow them to participate in public spaces again, further bolstering the supply of essential services and relieving the burden on the small subset of workers who are currently participating.

To serve the needs of both users and governments, we need to be able to collect the following data: historical and current absolute location, historical and current relative position, and verified group identity, where group refers to any demographic that the government might be interested in, for example, age or health status.

To create an application that will meet these needs, we need to collect data from a variety of sources, compute aggregate statistics on that data and then set up some messaging architecture that communicates the results to the target population. The toughest challenges lie in the first and second parts of the process above, especially to do the second part in a privacy-preserving manner.

For historical and current absolute location, one of the first options considered by app developers is to record GPS data in the background. Unfortunately, this doesn’t work on iOS devices, and where it does work it has several limitations, including coarseness in dense urban areas and usefulness only after the app has been running on the user's device for some time, because historical data cannot be sourced otherwise. An alternative would be to use Wi-Fi router information, which can give more accurate information as to whether someone has been self-isolating, based on whether they are connected to their home router. Historical data can be available here, which makes it more useful, though there are concerns about the lack of widespread Wi-Fi connectivity in rural areas and about tracking people when they are outside their homes. Location data could also be obtained from existing apps and services that a user uses, for example, the history of movements on Google Maps, which can be parsed to extract location history. Historical location data could also be pieced together from payments history, cars that record location information and personal cell tower usage data.
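The home-router idea can be illustrated with a minimal on-device sketch. It assumes the app periodically samples the identifier of the connected Wi-Fi router (with None when not connected); the function name, the sample format and the router identifiers are hypothetical and are not taken from the OpenMined article.

```python
from typing import Optional, Sequence

def home_fraction(samples: Sequence[Optional[str]], home_router: str) -> float:
    """Fraction of periodic samples in which the device was on the home router.

    `samples` holds the router identifier seen at each sampling time,
    or None when the device was not connected to any Wi-Fi network.
    The calculation runs locally, so raw samples never leave the device.
    """
    if not samples:
        return 0.0
    at_home = sum(1 for router in samples if router == home_router)
    return at_home / len(samples)

# Hypothetical day of samples: three at home, one on another network, one offline.
samples = ["home-router", "home-router", "cafe-router", None, "home-router"]
print(home_fraction(samples, "home-router"))
```

Only the resulting fraction, rather than the raw connection history, would need to be considered for sharing.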

Historical and current relative position data is even more important for mapping the spread of the epidemic, and some countries, like Singapore, have deployed Bluetooth broadcasting as a means of determining whether people have been in close proximity. Each device broadcasts a random number (which could change frequently) that is recorded by other devices passing close by, and if someone tests positive, these records can be used to alert people who were in close proximity to them. Another potential approach highlighted in the article is to use gyroscope and ambient audio hashes to determine whether two people might have been close together, though Bluetooth will provide more consistent results. The reason to use multiple approaches is that the overall information becomes more accurate, since it is harder to fake multiple signals.
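The broadcast-and-match flow described above can be sketched as a small simulation. This is a simplified illustration, not the protocol used by any particular country's app: the Device class and its method names are hypothetical, and the actual Bluetooth transport is replaced by direct function calls.

```python
import secrets

class Device:
    """Simulated phone that broadcasts random tokens and records those it hears."""

    def __init__(self) -> None:
        self.sent = []      # tokens this device has broadcast
        self.heard = set()  # tokens received from nearby devices

    def broadcast(self) -> str:
        # A fresh random token per broadcast carries no identity information.
        token = secrets.token_hex(16)
        self.sent.append(token)
        return token

    def receive(self, token: str) -> None:
        self.heard.add(token)

    def was_exposed(self, published_tokens) -> bool:
        # Matching happens locally against tokens published for confirmed cases.
        return not self.heard.isdisjoint(published_tokens)

alice, bob, carol = Device(), Device(), Device()
bob.receive(alice.broadcast())   # Alice and Bob pass close to each other
carol.receive(bob.broadcast())   # Carol only ever meets Bob

# Alice tests positive and publishes the tokens she broadcast.
published = alice.sent
print(bob.was_exposed(published), carol.was_exposed(published))
```

Because tokens are random and rotated, a device that hears one learns nothing about who sent it unless that sender later publishes it.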

Group membership is another important aspect, where the information can be used to finely target messaging and calculate aggregate statistics. But for some types of group membership, we might not be able to rely completely on self-reported data. For example, health status related to the epidemic would require verification from an external third party, such as a medical institution or testing facility, to minimize false information.

There are several privacy preserving techniques that could be applied to an application given that you have: confirmed COVID-19 patient data in a cloud, all other user data on each individual’s device, and data on both the patients and the users including historical and current absolute and relative locations and group identifier information.

Private set intersections can be used to calculate whether two people were in proximity to each other based on their relative and absolute location information. Private set intersection operates similarly to normal set intersection to find elements that are common between two sets but does so without disclosing any private information from either of the sets. This is important because performing analysis even on pseudonymized data without using privacy preservation can leak a lot of information.
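To make the idea concrete, here is a toy sketch of one classic construction, a Diffie-Hellman-style private set intersection, simulated in a single process. It is an assumption for illustration only and not necessarily the scheme the OpenMined article recommends: the modulus, the function names and the item format are hypothetical, and a real deployment would rely on a vetted library and a carefully chosen group.

```python
import hashlib
import math
import secrets

P = 2**255 - 19  # a well-known prime, used here purely for illustration

def random_exponent() -> int:
    """Secret exponent coprime to P - 1, so x -> x**e mod P is a bijection."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e

def hash_to_group(item: str) -> int:
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P

def blind(values, secret):
    return [pow(v, secret, P) for v in values]

def private_intersection(user_items, patient_items):
    """Simulate both parties; each side only ever sees blinded values of the other."""
    user_items = list(user_items)
    a = random_exponent()  # user's secret
    b = random_exponent()  # server's secret
    # User -> server: H(x)^a for each of the user's items.
    user_blinded = blind([hash_to_group(x) for x in user_items], a)
    # Server -> user: (H(x)^a)^b in the same order, plus H(y)^b for patient items.
    user_double = blind(user_blinded, b)
    patient_blinded = blind([hash_to_group(y) for y in patient_items], b)
    # User applies its own secret: (H(y)^b)^a equals (H(x)^a)^b exactly when x == y.
    patient_double = set(blind(patient_blinded, a))
    return {x for x, d in zip(user_items, user_double) if d in patient_double}

# Hypothetical items: coarse "place|hour" buckets.
user = {"cafe|10h", "park|12h", "shop|15h"}
patients = {"park|12h", "shop|15h", "gym|18h"}
print(private_intersection(user, patients))
```

The comparison works because exponentiation commutes, so doubly blinded values match only for shared items, while singly blinded values reveal nothing useful without the other party's secret.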

Differential privacy (DP) is another critical technique. It provides mathematical guarantees (even against future data disclosures) that analysis of the data will not reveal whether or not your data was part of the dataset. It asserts that, from the analysis, one cannot learn anything about your data that could not have been learned from other data about you. Google’s battle-tested C++ library is a great resource to start with, along with the Python wrapper created by the OpenMined team.
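As a rough illustration of how such a guarantee is obtained, the sketch below applies the standard Laplace mechanism to a counting query. It is written in plain Python and does not use the API of Google's library or the OpenMined wrapper; the function names and the example data are hypothetical. Naive floating-point noise sampling like this has known weaknesses, so production systems should use a vetted library.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two i.i.d. exponential draws is Laplace-distributed
    # with mean 0 and the given scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(flags, epsilon: float, rng: random.Random) -> float:
    """Count of True flags plus Laplace noise calibrated to epsilon."""
    true_count = sum(1 for flag in flags if flag)
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: 40 of 100 users report symptoms.
reports = [True] * 40 + [False] * 60
rng = random.Random()
print(private_count(reports, epsilon=1.0, rng=rng))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate but less private answer.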

To address the need for verified group identification, one can use the concept of a private identity server. It essentially functions as a trusted intermediary between a user who wants to make a claim and another party that wants to verify the claim. It works by querying a service that can confirm whether the claim is true and then serving that answer to the party wishing to verify the claim, without giving away personal data. While it might be hard to trust a single intermediary, this role can be decentralized, achieving a higher degree of trust by relying on a consensus mechanism.
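The intermediary-plus-consensus idea can be sketched as follows. This is a deliberately simplified model with hypothetical class names and a simple quorum vote standing in for the consensus mechanism; it omits authentication, cryptographic attestations and everything else a real private identity server would need.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    user_id: str
    group: str  # for example "tested_negative"

class RecordSource:
    """Hypothetical service (e.g. a testing facility) holding sensitive records."""

    def __init__(self, records):
        self._records = records  # user_id -> set of verified groups

    def confirms(self, claim: Claim) -> bool:
        return claim.group in self._records.get(claim.user_id, set())

class IdentityServer:
    """Intermediary that answers yes or no and never returns the record itself."""

    def __init__(self, source: RecordSource) -> None:
        self._source = source

    def attest(self, claim: Claim) -> bool:
        return self._source.confirms(claim)

class CompromisedServer:
    """Faulty intermediary that approves every claim."""

    def attest(self, claim: Claim) -> bool:
        return True

def verify_by_consensus(servers, claim: Claim, quorum: int) -> bool:
    votes = sum(1 for server in servers if server.attest(claim))
    return votes >= quorum

source = RecordSource({"alice": {"tested_negative"}})
servers = [IdentityServer(source), IdentityServer(source), CompromisedServer()]
print(verify_by_consensus(servers, Claim("alice", "tested_negative"), quorum=2))
print(verify_by_consensus(servers, Claim("bob", "tested_negative"), quorum=2))
```

Requiring agreement from two of the three servers means a single faulty intermediary cannot get a false claim accepted on its own.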

As the article will be continually updated, we encourage you to keep checking it for more information on how to implement your AI-enabled solution that meets the privacy requirements which will evoke trust in users while serving the needs of getting better public health outcomes.

Original live paper (continually being updated) by OpenMined: https://blog.openmined.org/covid-app-privacy-advice/