It is sometimes necessary to create anonymised data sets when working with data that contains sensitive attributes. Often the data were originally collected as part of a separate process with stricter privacy requirements.
Such data may contain (say) medical test results, clinician prescribing patterns, salary or bank details, or sensitive information about private activities. The data are then merged or aggregated (with the appropriate user consents) for some processing activity after collection, such as a research project, analysis, or data mining on a cohort of records.
Maintaining both anonymity and utility is a balancing act. Anonymising data reduces the utility of the anonymised set, while retaining more detail exposes the participant records to a greater risk of re-identification or disclosure.
K-anonymity is one of the most straightforward anonymisation techniques to implement, although it is not suited to all situations and has some significant drawbacks.
Why use k-anonymity?
It is possible to re-identify anonymised data by linking shared attributes between datasets, for example by linking age, postcode, and gender from a medical record to a publicly available dataset such as the electoral roll. By joining the two datasets together, far more detailed knowledge is available to a potential attacker.
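As an illustration, the join below sketches such a linkage attack. The table and column names (MedicalRecords, ElectoralRoll, FullName, Gender) are hypothetical:
-- Hypothetical linkage attack: joining a released medical dataset to a
-- publicly available electoral roll on shared attributes.
SELECT e.FullName, m.Disease
FROM MedicalRecords m
INNER JOIN ElectoralRoll e
    ON  m.Postcode = e.Postcode
    AND m.Age      = e.Age
    AND m.Gender   = e.Gender;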
The concept of k-anonymity was first introduced by Latanya Sweeney in a paper published in 2002 as an attempt to solve the problem: “Given person-specific field-structured data, produce a release of the data with scientific guarantees that the individuals who are the subjects of the data cannot be re-identified while the data remain practically useful.”
A dataset is said to have k-anonymity if the information on an individual cannot be distinguished from that of at least k-1 other individuals whose details also appear in the same dataset. That is, there is at most a 1/k chance that an individual can be uniquely identified in the data.
Some tools and techniques for implementing k-anonymity are protected by patents, so in this post I do not intend to cover any areas that are not already in the public domain.
A k-anonymous dataset can be produced using the following techniques, implemented here in the database (I’m using SQL Server as an example).
Suppression.
By replacing all or some of the field values with an asterisk it is possible to mask the data field values so that the information is hidden. This can be implemented by selecting a literal in place of the column, or by using the SQL REPLACE function.
Generalisation.
By grouping specific attribute values together it is possible to generate broader, grouped attributes. Use a SQL CASE expression to build age groups or salary bands, use the LEFT, RIGHT, or SUBSTRING functions to partially replace information, or use the SQL Server 2016 Dynamic Data Masking functionality (a sketch follows below).
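Dynamic Data Masking is not covered further in this post, but as a minimal sketch, assuming the PatientData table used in the next section with Postcode stored as a character column:
-- Sketch: partial masking with SQL Server 2016 Dynamic Data Masking.
-- partial(3,"**",0) exposes the first three characters to non-privileged
-- users and replaces the remainder with the "**" padding string.
ALTER TABLE PatientData
    ALTER COLUMN Postcode ADD MASKED WITH (FUNCTION = 'partial(3,"**",0)');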
The anonymised fields constitute “quasi-identifiers” which, when k-anonymity is implemented, can only be matched to at least k-1 other records, meaning that an adversary would not be able to identify a specific personal record with certainty.
Generalising and suppressing data groups the rows into a number of classes (equivalence classes) that help to maintain the anonymity of each row.
SQL Server Implementation
Given the example records in the table below (Figure 1), it is possible to use SQL to implement 3-anonymity (k=3) on the output dataset to maintain identity privacy. (A setup script to reproduce the table follows Figure 1.)
PatientID | PatientName | Postcode | Age | Disease |
1 | Alice | 47678 | 29 | Heart Disease |
2 | Bob | 47678 | 22 | Heart Disease |
3 | Caroline | 47678 | 27 | Heart Disease |
4 | David | 47905 | 43 | Flu |
5 | Eleanor | 47909 | 52 | Heart Disease |
6 | Frank | 47906 | 47 | Cancer |
7 | Geri | 47605 | 30 | Heart Disease |
8 | Harry | 47673 | 36 | Cancer |
9 | Ingrid | 47607 | 32 | Cancer |
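To reproduce the example, the Figure 1 table can be created with something like the following; the column types are an assumption:
-- Setup sketch for the Figure 1 data. Column types are assumed.
CREATE TABLE PatientData (
    PatientID   int PRIMARY KEY,
    PatientName varchar(50),
    Postcode    varchar(10),
    Age         int,
    Disease     varchar(50)
);
INSERT INTO PatientData (PatientID, PatientName, Postcode, Age, Disease)
VALUES (1, 'Alice',    '47678', 29, 'Heart Disease'),
       (2, 'Bob',      '47678', 22, 'Heart Disease'),
       (3, 'Caroline', '47678', 27, 'Heart Disease'),
       (4, 'David',    '47905', 43, 'Flu'),
       (5, 'Eleanor',  '47909', 52, 'Heart Disease'),
       (6, 'Frank',    '47906', 47, 'Cancer'),
       (7, 'Geri',     '47605', 30, 'Heart Disease'),
       (8, 'Harry',    '47673', 36, 'Cancer'),
       (9, 'Ingrid',   '47607', 32, 'Cancer');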
Implement 3-anonymity on this simple dataset by running the following SQL:
SELECT
    -- Suppress name data
    '*' AS PatientName,
    -- Generalise postcode and age group
    LEFT(Postcode, 3) + '**' AS Postcode,
    CASE WHEN Age < 30 THEN 'Under 30'
         WHEN Age >= 30 AND Age <= 40 THEN '30 to 40'
         WHEN Age > 40 THEN 'Over 40'
    END AS AgeGroup,
    Disease
INTO anon
FROM PatientData
Selecting from the new anon table gives the following output (Figure 2):
PatientName | Postcode | AgeGroup | Disease |
* | 476** | Under 30 | Heart Disease |
* | 476** | Under 30 | Heart Disease |
* | 476** | Under 30 | Heart Disease |
* | 479** | Over 40 | Flu |
* | 479** | Over 40 | Heart Disease |
* | 479** | Over 40 | Cancer |
* | 476** | 30 to 40 | Heart Disease |
* | 476** | 30 to 40 | Cancer |
* | 476** | 30 to 40 | Cancer |
We can check that 3-anonymity holds by running the following SQL. (You can also take the MIN over the class counts in a subquery to confirm that every class meets the k threshold; a sketch of this follows the output below.)
-- Check for 3-anonymity.
-- Confirmed as 3-anonymous with respect to PatientName, Postcode and AgeGroup,
-- as all groups (classes) contain at least k records sharing the same
-- quasi-identifier values.
-- This defends against identity disclosure, but not attribute disclosure.
SELECT PatientName, Postcode, AgeGroup, COUNT(*) AS KAnonymity
FROM anon
GROUP BY PatientName, Postcode, AgeGroup
Running this SQL returns:
PatientName | Postcode | AgeGroup | KAnonymity |
* | 476** | 30 to 40 | 3 |
* | 476** | Under 30 | 3 |
* | 479** | Over 40 | 3 |
This shows that each anonymised class contains at least 3 records, meaning that the probability of identifying a single record from the set is at most 1/3.
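One way to express the MIN-over-counts check mentioned above (the derived table and column names are illustrative):
-- Smallest equivalence class across the whole table; a value of 3 or more
-- confirms 3-anonymity.
SELECT MIN(ClassSize) AS SmallestClass
FROM (
    SELECT COUNT(*) AS ClassSize
    FROM anon
    GROUP BY PatientName, Postcode, AgeGroup
) AS classes;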
Drawbacks to k-anonymity
K-anonymity can be susceptible to the following attacks, which would allow an outsider to re-identify the anonymised records. Consider the following SQL:
-- If Disease is the sensitive field, anonymity can be compromised by
-- homogeneity or background knowledge.
SELECT PatientName, Postcode, AgeGroup, Disease,
    COUNT(*) AS NumberofRows
FROM anon
GROUP BY PatientName, Postcode, AgeGroup, Disease
PatientName | Postcode | AgeGroup | Disease | NumberofRows |
* | 476** | 30 to 40 | Cancer | 2 |
* | 476** | 30 to 40 | Heart Disease | 1 |
* | 476** | Under 30 | Heart Disease | 3 |
* | 479** | Over 40 | Cancer | 1 |
* | 479** | Over 40 | Flu | 1 |
* | 479** | Over 40 | Heart Disease | 1 |
Homogeneity attack.
As can be seen from the listing, every patient under 30 in the set has the same disease. Because of this lack of diversity, if I know a patient is under 30 and included in the data set, then he or she must have heart disease.
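A quick way to demonstrate this in SQL: the query below returns a single value, Heart Disease, for the Under 30 class.
-- Every member of the Under 30 class shares one sensitive value.
SELECT DISTINCT Disease
FROM anon
WHERE AgeGroup = 'Under 30';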
Background knowledge attack.
If, for instance, I possess the background knowledge that the person I want to identify is in the 30 to 40 age group and has no family history of cancer, then I am able to infer (with reasonable certainty) that the patient I wish to identify has been diagnosed with heart disease.
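Expressed in SQL, the attacker's background knowledge eliminates cancer and leaves a single candidate disease:
-- Ruling out cancer within the 30 to 40 class leaves only heart disease.
SELECT DISTINCT Disease
FROM anon
WHERE AgeGroup = '30 to 40'
  AND Disease <> 'Cancer';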
The key takeaway here for SQL professionals is that implementing suppression and generalisation for k-anonymity can protect against identity disclosure, but cannot provide protection against attribute disclosure.
Summary
As demonstrated above, k-anonymity is a relatively easy method of ensuring that individuals in sensitive data sets cannot be definitively re-identified.
However, although it can be of use to protect identity, it does not automatically protect against attribute disclosure by homogeneous or background knowledge attacks.
Applying k-anonymity to a large data set could produce up to Rows/k distinct anonymised classes, and implementing it properly depends upon understanding the distribution of records within the set.
Not all types of data sets (especially those with high data entropy, like phone numbers) are suited to this method and all anonymisation results in some loss of utility (although in some cases this may not be a problem for the purposes of data mining).
Identifying a good set of classes is one of a number of problems that are NP-hard, meaning there is no known efficient, fully automated way to determine the optimal distribution of data classes; in practice the suppression and generalisation fields are adjusted, by heuristics or trial and error, to achieve the best fit.
It is possible to partially reduce the risks of attribute disclosure by implementing extensions such as l-diversity and t-closeness, or by applying a differential privacy policy. However, these are non-trivial amendments to the data distribution that may involve considerable work to implement and interpret.
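As a minimal first step in that direction, the sketch below counts the distinct sensitive values in each class (so-called distinct l-diversity); any class with a count of 1 is completely homogeneous and fails even 2-diversity.
-- Distinct sensitive values per equivalence class.
SELECT PatientName, Postcode, AgeGroup,
       COUNT(DISTINCT Disease) AS LDiversity
FROM anon
GROUP BY PatientName, Postcode, AgeGroup;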
The forthcoming introduction of the GDPR will make it incumbent upon data processors to consider data minimisation in their operations, holding and using only data that is ‘necessary’ to those operations, and SQL developers should add techniques such as the ones mentioned in this article to their toolkit to help demonstrate compliance. Privacy-preserving data mining techniques are a growing part of the landscape for developers who work with sensitive data.
Suppression and generalisation techniques can go some way to protecting identity and should be considered in cases where applications or decision support systems are being implemented to process potentially sensitive information.
References:
Sweeney, L., 2002. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05), pp.557-570.
Li, N., Li, T. and Venkatasubramanian, S., 2007, April. t-closeness: Privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering (pp. 106-115). IEEE.