Introduction

In a previous post on K-anonymity we looked at how to produce anonymised datasets suitable for sharing whilst protecting the identity of the record subject.

There are problems with K-anonymous datasets, namely the homogeneity attack and the background knowledge attack, details of which are in my original post. A slightly different approach to anonymising public datasets comes in the form of ℓ-diversity, a way of introducing further entropy/diversity into a dataset.

A sensitive data record is made up of the following microdata types: the identifier (ID); the key attributes (quasi-identifiers); and the confidential outcome attribute(s). ℓ-diversity seeks to extend the equivalence classes that we created under K-anonymity, by generalisation and masking of the quasi-identifiers (the QI groups), so that they also constrain the confidential attributes in the record. The ℓ-diversity principle demands that, in each QI group, at most 1/ℓ of its tuples can have an identical sensitive attribute value.

The new ℓ-diverse equivalence classes require that the sensitive attribute is ‘well represented’ within each class. As an example, if there are 4 possible disease diagnoses, then under 2-diversity each equivalence class must contain at least two distinct diagnoses, so an intruder who links an individual to a class cannot narrow the diagnosis down to fewer than two of the four possibilities.

The production of a 2-diverse table will look like this:

[Table 2: an example 2-diverse table]
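As a rough illustration of what the grouping and the 1/ℓ rule mean in practice, here is a minimal Python sketch; the records, group labels and function name are invented for this post rather than taken from any library, and it tests each QI group against both the distinct-value and the 1/ℓ frequency interpretations of ℓ-diversity.

```python
from collections import Counter, defaultdict

# Toy records of (QI group, sensitive value). The QI groups are assumed to
# have been formed already, e.g. by the generalisation step from the
# K-anonymity post; the labels and diagnoses here are invented.
records = [
    ("group-1", "heart disease"), ("group-1", "flu"),
    ("group-1", "heart disease"), ("group-1", "flu"),
    ("group-2", "diabetes"),      ("group-2", "asthma"),
    ("group-2", "flu"),           ("group-2", "diabetes"),
]

def is_l_diverse(records, l):
    """Check distinct l-diversity and the 1/l frequency rule for every QI group."""
    groups = defaultdict(list)
    for qi, sensitive in records:
        groups[qi].append(sensitive)

    for values in groups.values():
        counts = Counter(values)
        if len(counts) < l:                         # distinct l-diversity
            return False
        if max(counts.values()) > len(values) / l:  # at most 1/l identical values
            return False
    return True

print(is_l_diverse(records, 2))  # True: both groups hold at least 2 diagnoses
print(is_l_diverse(records, 3))  # False: group-1 has only 2 distinct diagnoses
```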

Definition of ℓ-Diversity (The Science bit)

A data set is said to satisfy ℓ-diversity if, for each group of records sharing a combination of key attributes, there are at least ℓ “well represented” values for each confidential attribute. A table is said to have ℓ-diversity if every equivalence class of the table has ℓ-diversity [1, 2].

“Well represented” can be defined in three separate ways:

  1. Distinct ℓ-diversity. Each QI group must contain at least ℓ distinct values of the sensitive attribute.
  2. Entropy ℓ-diversity. Formally, the entropy of a group G for a particular confidential attribute with domain C can be defined as:

H(G) = -\sum_{c \in C} p(G,c)\,\log p(G,c)

        where p(G,c) is the probability of a record in group G having the value c. A dataset can be said to possess entropy ℓ-diversity if, for each group G, the entropy H(G) ≥ log ℓ (see the sketch after this list).

  3. Recursive (c,ℓ)-diversity. A compromise way of ensuring that the most frequent values do not appear too frequently and that the least frequent values do not appear too rarely.
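A minimal sketch of the entropy check, assuming the natural logarithm (any base works as long as it is used consistently); the toy groups below are invented for illustration.

```python
import math
from collections import Counter

def entropy_l_diverse(group_values, l):
    """Entropy l-diversity: H(G) = -sum over c of p(G,c) * log p(G,c), required to be >= log l."""
    counts = Counter(group_values)
    total = len(group_values)
    h = -sum((n / total) * math.log(n / total) for n in counts.values())
    return h >= math.log(l)

# Four equally frequent diagnoses give H(G) = log 4, comfortably above log 2.
print(entropy_l_diverse(["flu", "asthma", "diabetes", "cancer"], 2))  # True
# A heavily skewed group has 2 distinct values but fails entropy 2-diversity.
print(entropy_l_diverse(["flu"] * 9 + ["cancer"], 2))                 # False
```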

Attack Vectors

ℓ-diversity is not immune to disclosure attacks. Two attack methods are:

Skewness attack. Suppose, as in our example, that QI group 1 has the same number of patients with and without heart disease; the group then satisfies 2-diversity. However, if an intruder can link a specific patient to that group, that patient can be considered to have a 50% probability of having heart disease, instead of the considerably smaller probability across the overall data set. In this case the penalty for ℓ-diversity is information loss, as many of the ‘non-sensitive’ records will have been dropped from the table to satisfy the ℓ-diversity requirement.

Similarity attack. If the values of a sensitive attribute within a group are ℓ-diverse but semantically similar, attribute disclosure also takes place. For example, if patients in a 3-diverse data set, where Disease is a confidential attribute, all have values in {“lung cancer”, “liver cancer”, “stomach cancer”}, an intruder linking a specific individual to that group can infer that the individual has cancer. Likewise, if the confidential attribute is numerical and the values within a group are ℓ-diverse but very similar, the intruder can estimate the confidential attribute value for an individual in that group to within a narrow interval.

This leakage of sensitive information occurs because, while the ℓ-diversity requirement ensures “diversity” of sensitive values in each group, it does not take into account the semantic closeness of those values.
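To make the similarity attack concrete, here is a small illustrative sketch; the value-to-category mapping is an assumption introduced for this example, and it represents exactly the kind of semantic knowledge that ℓ-diversity does not model.

```python
# Illustrative value-to-category mapping: background knowledge an intruder
# might hold, and precisely what the l-diversity model ignores.
semantic_category = {
    "lung cancer": "cancer",
    "liver cancer": "cancer",
    "stomach cancer": "cancer",
    "flu": "infection",
    "asthma": "respiratory",
}

def leaked_category(group_values):
    """Return the single broad category a 'diverse' group collapses to, or None."""
    categories = {semantic_category[v] for v in group_values}
    return categories.pop() if len(categories) == 1 else None

# Distinct 3-diverse on paper, yet every value tells an intruder "cancer".
print(leaked_category(["lung cancer", "liver cancer", "stomach cancer"]))  # cancer
print(leaked_category(["lung cancer", "flu", "asthma"]))                   # None
```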

Summary

It has been argued that ℓ-diversity may be difficult and unnecessary to achieve, and that even where it is achieved it may be insufficient to prevent attribute disclosure.

Computing an optimal ℓ-diverse dataset is an NP-hard problem [3]. When there is more than one sensitive field, the ℓ-diversity problem becomes harder still because of the added dimensionality [4]. This is a limiting factor because such privacy safeguards are not easily implemented by machine algorithms at scale.

The implementation of an ℓ-diverse data reporting strategy leads to large information loss when the confidential records to be protected sit within a much larger group of non-sensitive records: the promotion of privacy comes at a cost in utility. Conversely, the sensitive records may become more prone to disclosure because they now form part of a reduced set of data observations.

The problem of implementing anonymous shared datasets is one for which there are many different approaches and resolutions. Selectively implementing ℓ-diversity may offer an alternative to k-anonymity in areas where the loss of data and utility is balanced by the need to share important but sensitive records between partners.

References / Further Reading

[1]: Machanavajjhala, A., Gehrke, J., Kifer, D. and Venkitasubramaniam, M., 2006, April. l-diversity: Privacy beyond k-anonymity. In 22nd International Conference on Data Engineering (ICDE’06) (pp. 24-24). IEEE.

[2]: Domingo-Ferrer, J. and Torra, V., 2008, March. A critique of k-anonymity and some of its enhancements. In 2008 Third International Conference on Availability, Reliability and Security (pp. 990-993). IEEE.

[3]: Xiao, X., Yi, K. and Tao, Y., 2010, March. The hardness and approximation algorithms for l-diversity. In Proceedings of the 13th International Conference on Extending Database Technology (pp. 135-146). ACM.

[4]: Aggarwal, C.C. and Yu, P.S., 2008. A general survey of privacy-preserving data mining models and algorithms. In Privacy-preserving data mining (pp. 11-52). Springer, Boston, MA.