Introduction

In a previous post on K-anonymity we looked at how to implement anonymous datasets suitable for sharing whilst preserving the identity of the record subject.

There are problems with K-anonymous datasets, namely the homogeneous pattern attack, and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ -diversity, a way of introducing further entropy/diversity into a dataset.

A sensitive data record is made of the following microdata types: the ID; any Key Attributes; and the confidential outcome attribute(s). ℓ -diversity seeks to extend the equivalence classes that we created using K-anonymity by generalisation and masking of the quasi-identifiers (the QI groups) to the confidential attributes in the record as well. The ℓ -diversity principle demands that, in each QI-group, at most 1/ ℓ of its tuples can have an identical sensitive attribute value.

The new ℓ -diverse equivalence classes require that the anonymous record is ‘well represented’ in the records. As an example, if we have 4 types of disease diagnosis, then for 2-diversity the equivalence records should expose the sensitive record as no more than two of four possible records.

The production of a 2-diverse table will look like this:

Deﬁnition of ℓ-Diversity (The Science bit)

A data set is said to satisfy ℓ -diversity if, for each group of records sharing a combination of key attributes, there are at least ℓ “well represented” values for each conﬁdential attribute. A table is said to have l -diversity if every equivalence class of the table has l-diversity [1, 2].

Well represented can be defined in three separate ways.

1. Distinct ℓ-diversity. There must be at least ℓ distinct values of the sensitive attributes.
2. Entropy ℓ-diversity. Formally, the entropy of a group, G for a particular confidential attribute with domain c can be defined as:

where p(G,c) is the probability of a record having the value of c in group G. A dataset can be said to possess entropy ℓ-diversity if, for each group G the Entropy H(G) ≥  log ℓ.

1. Recursive (c,ℓ) diversity. A compromise way of ensuring that the most frequent values do not appear too frequently and that the least frequent values do not appear too rarely.

Attack Vectors

ℓ -diversity is not immune from disclosure attacks. Two attacks methods are:

Skewness attack. If, as in our example, QI group 1 has the same number of patients with and without heart disease, and in that case it satisﬁes 2-diversity. However, if an intruder can link a speciﬁc patient to that group, that patient can be considered to have 50% probability of having heart disease, instead the considerably smaller probability for the overall data set. In this case, the penalty for ℓ -diversity is information loss, as many of the ‘non-sensitive’ records will have been dropped from the table to satisfy the ℓ -diversity requirement.

Similarity attack. If values of a sensitive attribute within a group are ℓ -diverse but semantically similar, attribute disclosure also takes place. E.g. if patients in a 3-diverse data set where Disease is a conﬁdential attribute all have values in {“lung cancer”, “liver cancer”, “stomach cancer”} an intruder linking a speciﬁc individual to that group can infer that the individual has cancer. Likewise, if the conﬁdential attribute is numerical and values within a group are ℓ -diverse but very similar, the intruder can estimate the conﬁdential attribute value for an individual in that group to a narrow interval.

This leakage of sensitive information occurs because while n-diversity requirement ensures “diversity” of sensitive values in each group, it does not take into account the semantic closeness of these values.

Summary

It has been argued that ℓ -diversity may be difﬁcult and unnecessary to achieve, and even achieving ℓ-diversity may be insufﬁcient to prevent attribute disclosure.

Implementing an ℓ -diverse optimal dataset is an NP-hard computing problem [3]. When there is more than one sensitive field the l-diversity problem becomes more difficult due to added dimensionalities [4]. This becomes a limiting factor because the feasibility of implementing such privacy safeguards is not easily implemented using machine algorithms.

The implementation of an ℓ -diverse data reporting strategy leads to large information loss when the confidential records to be protected are within a much larger group of non- sensitive records. The promotion of privacy enhancement leads to a loss of utility. Conversely, the sensitive records may become more prone to disclosure because they are now part of a reduced set of data observations.

The problems of implementing anonymous shared datasets is one for which there are many different approaches and resolutions. Selectively implementingℓ -diversity may offer an alternative to k-anonymity in areas where the loss of data and utility are balanced by the need to share important but sensitive records between partners.