The risk of privacy breaches arising from large data sets of personal information is only increasing with advancements in data processing and artificial intelligence (AI). Any large-scale manipulation of personal data by organizations carries the potential for misuse or disclosure of protected personal information. For instance, AI and machine learning systems require vast quantities of data, including personal information, to train algorithms in support of research and development projects. An emerging set of technologies known as “synthetic data” may present a solution to this problem.

What is Synthetic Data?

Synthetic data is a privacy-enhancing technology that allows for the anonymization of otherwise personally identifying information. (IPC Comments on the Ontario Government’s White Paper on Modernizing Privacy in Ontario (September 2021), by Patricia Kosseim, Commissioner.) It consists of algorithmically generated data that has the statistical utility of real data while effectively severing any traceability back to individuals. In broad terms, this works by feeding source data (identifiable information that likely contains personal information) into an algorithm that learns from this data set to create an analogous, albeit fake, counterpart. The OPC describes this process in a recent blog post as “a generative model [that] is able to ‘learn’ the statistical properties of the source data without making strong assumptions about the underlying distributions of variables and correlations among them.” The resulting dataset can be entirely artificial, or it may retain elements of the original source data that are neither identifiable nor sensitive.
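To make the idea concrete, the following is a minimal, illustrative sketch in Python; it is not the OPC’s method or any particular vendor’s product. It shows a toy “generative model” that learns simple summary statistics from a hypothetical source table and then samples entirely artificial records with similar statistical properties. Real synthetic-data systems use far more sophisticated models and add formal re-identification testing; the columns and values below are invented for illustration only.

# Illustrative sketch only: learn summary statistics from hypothetical
# source data, then sample artificial records with similar statistics.
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical source data: each row is one individual (age, income, visits).
source = np.array([
    [34, 62000, 3],
    [29, 48000, 5],
    [51, 91000, 1],
    [42, 75000, 2],
    [38, 67000, 4],
], dtype=float)

# "Learn" the statistical properties of the source data.
mean = source.mean(axis=0)
cov = np.cov(source, rowvar=False)

# Generate synthetic records: statistically similar to the source,
# but no row corresponds to any real individual.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print("source means:   ", np.round(mean, 1))
print("synthetic means:", np.round(synthetic.mean(axis=0), 1))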

The pertinence of synthetic data in light of ongoing privacy concerns is evidenced by recent government initiatives. Canadian legislatures are struggling to balance personal privacy interests with the undeniable utility of these datasets to private and public stakeholders. This balance has taken the form of various attempts to regulate “de-identified” data, which generally refers to digital information that has been stripped of any personally identifying characteristics.

The Rush to Legislate Anonymity

A de-identification scheme was proposed in 2021 by the Government of Ontario’s white paper, Modernizing Privacy in Ontario (the White Paper). One of the stated purposes of this paper was to address gaps in Ontario’s legislative scheme for protecting digital privacy. In this paper, the Ontario government outlines its proposals for prospective privacy legislation, including the use of “de-identified” personal information by organizations. The White Paper defines de-identified information as “information about an individual that no longer allows the individual to be directly or indirectly identified without the use of additional information.”

There is currently no legal framework governing de-identified information, as it does not strictly fit within current privacy laws, which prescribe rules for the use of “personal information.” Nevertheless, inasmuch as information can be “de-identified,” it can also be “re-identified.” It is against this risk that governments are attempting to legislate protections.

It is important to distinguish de-identification from fully anonymous data. Anonymized information is defined in the White Paper as “information [that] has been altered irreversibly, according to generally accepted best practices, in such a way that no individual could be identified from the information, whether directly or indirectly by any means or by any person.” Anonymized information is not presently regulated.

In a lengthy commentary on the White Paper, the Office of the Information and Privacy Commissioner of Ontario (IPC) offered its insights on the proposed expansion of privacy regulations. The IPC strongly supports the explicit inclusion of de-identified information in private sector privacy law, but recommends a more robust definition of this type of information. The IPC’s proposed definition notably creates a clearer distinction from anonymized information:

“De-identified information” means information that does not identify an individual or could not be used in reasonably foreseeable circumstances, alone or in combination with other information, to identify an individual, but still presents a residual risk, however minimal, of re-identifying an individual.

Most recently, in June 2022, the federal government introduced Bill C-27 in Parliament, which (among other things) would enact the Consumer Privacy Protection Act (CPPA) and replace Part 1 of the Personal Information Protection and Electronic Documents Act. The CPPA, if enacted, would adopt a definition of de-identification that mirrors the broader formulation proposed by the Ontario government. The CPPA would also allow organizations to use an individual’s personal information without their consent for research, analysis, and development purposes, or to disclose it in the context of a prospective business transaction. There is, however, no mention of how organizations might go about de-identification, other than a requirement that any technical and administrative measures employed be proportionate to the purpose of de-identification and the sensitivity of the personal information.

Cue synthetic data.

De-Identification via Synthetic Data

As it stands, Bill C-27 leaves an important regulatory gap: how can organizations de-identify vast amounts of personal information in order to lawfully use it within this framework? If the bill comes into force, organizations that collect, use, or manipulate personal information, such as social media companies, may find themselves having to convert large stores of personal data into de-identified information. Synthetic data may be an attractive option for reducing potential exposure under this impending legislation.

Although the technology is still developing, synthetic data initiatives with practical applications have already begun to surface. For example, a project was recently launched by organizations based in Toronto to provide synthetic data sets to municipalities for the improvement of services provided to residents. Synthetic data will likely play an important role in data manipulation and machine learning, and may be a viable option for organizations looking to adhere to regulatory changes going forward.