New legislation dealing with the handling of personal data, most notably but not exclusively the GDPR, emphasises the need to keep customer identity data safe. The recommendations for doing this include encryption, anonymisation, pseudonymisation and data masking (see the ICO GDPR guidance). These concepts share a purpose but differ in how they are implemented, what restrictions they carry and where they can be used.
Balancing protection against use is a particular problem for universities, health providers and public data bodies, which collect individual-level data to provide a service but also need to use these data for the insights and benefits of generalised research and outcome improvement.
This article shows how a relational database implementation can support privacy-aware data mining, using encryption techniques and architecture to provide pseudonymous data sets that can reasonably be shared whilst minimising the risk of re-identification.
Levels of Protection
|Protection Level|Method|Related Articles|Drawbacks|
| |Encryption (symmetric or asymmetric)| |Key matching; processing overhead; key management|
|Strong Identity Protection|k-anonymity| |Not applicable to all data sets; re-identification risk; need to know data shape|
|‘Golden mean’ Protection|Pseudonymisation| |Pseudonym service needed; quality assurance; user management|
| |Dynamic Data Masking| |Key matching; reusability in large/complex sets|
In pseudonymisation, data sets are matched at individual row level using key fields, which are then pseudonymised for consumption. Candidates for key fields include the combinations most often used to match datasets, e.g. DoB/Gender/Postcode, credit card numbers, IP addresses or email identifiers. Persistent pseudonyms are allocated so that profiles can be built up over time, allowing data mining to happen in a privacy-sensitive way.
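As a minimal sketch of how a persistent pseudonym might be derived from a combination of key fields, the Python below uses a keyed hash (HMAC-SHA256). The secret key, the normalisation rules and the function names are assumptions made for illustration; they are not taken from any particular product or from the architectures that follow.

```python
import hashlib
import hmac

# Hypothetical secret held by the pseudonymisation service; in practice it
# would come from a key store, never from source code.
SECRET_KEY = b"example-secret-key"


def normalise(value: str) -> str:
    """Remove whitespace and case differences so equal values match."""
    return "".join(value.split()).upper()


def pseudonymise(*key_fields: str, key: bytes = SECRET_KEY) -> str:
    """Return a persistent pseudonym for a combination of key fields."""
    # The separator keeps ("AB", "C") distinct from ("A", "BC"); this
    # assumes the fields themselves never contain "|".
    message = "|".join(normalise(f) for f in key_fields).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# The same person recorded slightly differently in two data sets.
p1 = pseudonymise("1980-02-29", "F", "ab1 2cd")
p2 = pseudonymise("1980-02-29", "f", "AB1 2CD")
# A different person sharing date of birth and postcode.
p3 = pseudonymise("1980-02-29", "M", "AB1 2CD")

print(p1 == p2)  # same pseudonym, so the rows can be linked
print(p1 == p3)  # different pseudonym
```

Because the same inputs and key always give the same output, the pseudonym persists across extracts. Anyone holding the key could recompute pseudonyms from known identities, so the key needs the same protection as the identity data itself.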
All methods for privacy aware data mining carry additional complexity associated with creating pools of data from which secondary use can be made, without compromising the identity of the individuals who provided the data. Pseudonymisation can act as the best compromise between full anonymisation and identity in many scenarios where it is essential that the identity is preserved, whilst minimising the risks of re-identification beyond reasonable means.
The following semi-technical architectures have been adapted from a medical context, an area in which database professionals have a good deal of experience with the sensitive collection of datasets and their sharing for secondary use.
Architectures for Pseudonymity
The first step is to identify the pseudonymisation type, as this will determine the most suitable approach for your security architecture.
- One-way pseudonyms allow record linkage but cannot be reversed.
- Two-way pseudonyms allow authorised personnel to reconnect the identity in cases where feedback needs to be given to the individual.
The second step is to identify the use or purpose for which the data set is being published.
Single data source with one time secondary use
This is the typical application scenario for anonymisation. It uses one-way masking or anonymisation techniques to hide the identity behind personal information. Identity data may also need to be encrypted to ensure that it is not inadvertently revealed.
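A rough sketch of what one-way masking can look like for a single extract is shown below: direct identifiers are dropped and the remaining quasi-identifiers are coarsened. The column names, the ISO date format and the UK-style postcode are assumptions for illustration only, and coarsening like this does not by itself guarantee that a data set is anonymous.

```python
from typing import Dict

# Hypothetical column names, chosen only for illustration.
DIRECT_IDENTIFIERS = {"name", "email"}


def anonymise(record: Dict[str, str]) -> Dict[str, str]:
    """Drop direct identifiers and coarsen quasi-identifiers."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Keep only the year of birth (ISO date assumed: YYYY-MM-DD).
    out["birth_year"] = out.pop("dob")[:4]
    # Keep only the outward part of a UK-style postcode.
    out["postcode_area"] = out.pop("postcode").split()[0]
    return out


row = {
    "name": "Jane Doe",
    "email": "jane@example.org",
    "dob": "1980-02-29",
    "postcode": "AB1 2CD",
    "diagnosis": "asthma",
}
released = anonymise(row)
print(released)
```

The original row cannot be rebuilt from the released one, which is what makes this approach suitable only where no later linkage or feedback is needed.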
Overlapping data sources with one time secondary use
This is the typical application scenario for using one-way pseudonyms. It needs a unique matching ID across the data sources in order to link them together using a Trusted Third Party (TTP) service architecture.
The secondary user's public encryption key is used to encrypt the information. The data source sends the data and the ID to the TTP, which encrypts both with the secondary user's public key and passes them on to the secondary user. This means that only the secondary user can decrypt the data, using the matching private key.
Cybersecurity measures on sender authentication and authorisation are needed to prevent encryption attacks.
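The linkage part of this scenario can be sketched as follows. The `TrustedThirdParty` class, the record layout and the use of HMAC-SHA256 as the one-way function are illustrative assumptions. The public-key encryption of the data for the secondary user is deliberately left out, since it would need a cryptography library beyond the Python standard library.

```python
import hashlib
import hmac
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, Dict[str, str]]  # (matching ID, payload)


class TrustedThirdParty:
    """Hypothetical TTP that swaps matching IDs for one-way pseudonyms."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret  # known only to the TTP

    def forward(self, records: Iterable[Record]) -> List[Record]:
        # The secondary user receives pseudonyms, never the original IDs.
        return [(self._pseudonym(rid), data) for rid, data in records]

    def _pseudonym(self, record_id: str) -> str:
        digest = hmac.new(self._secret, record_id.encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()


def link(a: List[Record], b: List[Record]) -> Dict[str, Dict[str, str]]:
    """Secondary user: join two pseudonymised sets on the pseudonym."""
    right = dict(b)
    return {p: {**data, **right[p]} for p, data in a if p in right}


ttp = TrustedThirdParty(secret=b"example-ttp-secret")
hospital = ttp.forward([("ID-001", {"diagnosis": "asthma"}),
                        ("ID-002", {"diagnosis": "diabetes"})])
pharmacy = ttp.forward([("ID-001", {"drug": "salbutamol"}),
                        ("ID-003", {"drug": "insulin"})])
linked = link(hospital, pharmacy)
print(list(linked.values()))
```

Only the record present in both sources is linked, and the secondary user never sees the shared matching ID.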
One time secondary use with a need for re-identification at a later date
This architecture is the simplest implementation of two-way pseudonymisation and allows the person to be re-identified, e.g. to feed back research findings. It involves creating and using a refined ID reference list held by a trusted third party (TTP).
Firstly, a project specific ID is obtained from an ID generating TTP service, which stores the identity list and ensures the correct linkage between data sets, with the confidential ID stored at the data source. Secondly, a separate TTP service encrypts the project ID and the data to a pseudonym.
Re-identification is achieved by decrypting the project ID from the pseudonymisation service and passing this back to the data source.
From a cybersecurity angle, the two-step approach protects the data from a single-point attack on the ID list, and the identity data is kept in context at the data source, which matters where data is collected in sensitive areas.
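The two-step flow can be sketched with two small services. The class and method names and the ID formats are hypothetical. A lookup table stands in for the encryption performed by the pseudonymisation service, so this shows the shape of the protocol rather than a secure implementation.

```python
import secrets
from typing import Dict


class IdService:
    """Hypothetical ID-generating TTP: holds the identity list."""

    def __init__(self) -> None:
        self._issued: Dict[str, str] = {}

    def project_id_for(self, source_id: str) -> str:
        # The same source ID always gets the same project ID, so that
        # data sets for one project link correctly.
        if source_id not in self._issued:
            self._issued[source_id] = "P-" + secrets.token_hex(8)
        return self._issued[source_id]


class PseudonymService:
    """Hypothetical second TTP: reversible project ID to pseudonym mapping.

    A lookup table stands in for encryption and decryption here.
    """

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}
        self._reverse: Dict[str, str] = {}

    def pseudonymise(self, project_id: str) -> str:
        if project_id not in self._forward:
            pseudonym = "X-" + secrets.token_hex(8)
            self._forward[project_id] = pseudonym
            self._reverse[pseudonym] = project_id
        return self._forward[project_id]

    def reidentify(self, pseudonym: str) -> str:
        return self._reverse[pseudonym]


id_service = IdService()
pseudonym_service = PseudonymService()

# Data source side: the confidential ID stays here, keyed by project ID.
local_lookup: Dict[str, str] = {}
project_id = id_service.project_id_for("PATIENT-123")
local_lookup[project_id] = "PATIENT-123"

# What the secondary user sees.
pseudonym = pseudonym_service.pseudonymise(project_id)

# Later, authorised re-identification: pseudonym -> project ID -> person.
recovered = local_lookup[pseudonym_service.reidentify(pseudonym)]
print(recovered)
```

Neither service alone can take a pseudonym back to a person: the pseudonymisation service only knows project IDs, and the final step back to the individual happens at the data source.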
Pseudonymous Research Data Pool
This architecture is for many ‘general purpose’ secondary uses, including research and statistical analysis.
Using the same architecture as previously described, a pool of secondary use data is collected for potentially many uses. It permanently stores data and pseudonyms on the secondary use side of the diagram, to enable longitudinal analysis to take place.
Cybersecurity concerns include controlling access to the pool by contract or specified role. This process also demands careful quality management before IDs are pseudonymised, which may be implemented using an additional data management TTP.
Central database with many secondary uses
This architectural paradigm can be used to better support long term observations and feedback and can be used in clinical settings or for research purposes.
The central database is implemented as a TTP service, and the owner of the data is responsible for quality control. It contains no identity data, only the ID, and authorised access is granted via the identity list. Additional data sources can gain authorised access using their own ID once it has been added to the identity list. Where access is required, the dataset is exported in anonymous or pseudonymous form by a TTP using a project-specific identifier key, so that different projects obtain different pseudonyms.
This model allows the greatest flexibility of research use, and allows update and feedback to data subjects for reconnection purposes. This, however, comes with sophisticated quality assurance, communication and access requirements. The drawbacks of running a centralised research database are detailed in reference .
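One way to give each project its own pseudonyms on export is to derive a separate key per project from a master secret. The sketch below does this with HMAC-SHA256; the master secret, the project names and the function names are assumptions for illustration, not part of the architecture as published.

```python
import hashlib
import hmac

# Hypothetical master secret held by the exporting TTP.
MASTER_SECRET = b"example-master-secret"


def project_key(project: str) -> bytes:
    """Derive a separate key for each project from the master secret."""
    return hmac.new(MASTER_SECRET, project.encode("utf-8"), hashlib.sha256).digest()


def export_pseudonym(central_id: str, project: str) -> str:
    """Pseudonym for one central ID as seen by one project."""
    key = project_key(project)
    return hmac.new(key, central_id.encode("utf-8"), hashlib.sha256).hexdigest()


a1 = export_pseudonym("ID-001", "project-a")
a2 = export_pseudonym("ID-001", "project-a")
b1 = export_pseudonym("ID-001", "project-b")

print(a1 == a2)  # stable within a project
print(a1 == b1)  # differs between projects
```

Within a project the pseudonym is stable, which supports long-term observation, while two projects cannot join their exports on the pseudonym alone.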
With pseudonymisation, different uses demand different solutions. These should be evaluated up-front according to the sensitivity of the data and the need for data linkage, re-linkage and updating.
Many commercial companies will have spent most of their existence joining datasets using identifiers that may be considered to be an infringement of privacy practice, and are unsure of how to provide the benefits of data sharing alongside the necessity of data privacy.
Data quality and assurance is a big task for large generalised research projects, and one which many practitioners and professionals are neither familiar with nor equipped to deal with. Master Data Management and Data Quality solutions can be implemented as TTP services, but come with a training overhead.
I always refer to the ICO website guidance on GDPR in the UK, here
Mourby, M., Mackey, E., Elliot, M., Gowans, H., Wallace, S.E., Bell, J., Smith, H., Aidinlis, S. and Kaye, J., 2018. Are ‘pseudonymised’ data always personal data? Implications of the GDPR for administrative data research in the UK. Computer Law & Security Review, 34(2), pp.222-233.
Pommerening, K. and Reng, M., 2004. Secondary use of the EHR via pseudonymisation. Studies in Health Technology and Informatics, pp.441-446.
Quantin, C., Jaquet-Chiffelle, D.O., Coatrieux, G., Benzenine, E. and Allaert, F.A., 2011. Medical record search engines, using pseudonymised patient identity: an alternative to centralised medical records. International Journal of Medical Informatics, 80(2), pp.e6-e11.
Hagger-Johnson, G., Harron, K., Fleming, T., Gilbert, R., Goldstein, H., Landy, R. and Parslow, R.C., 2015. Data linkage errors in hospital administrative data when applying a pseudonymisation algorithm to paediatric intensive care records. BMJ Open, 5(8), p.e008118.
I’m presenting at the COSAC World Congress 30 Sept- 4th Oct 2018, in Ireland. Booking details are here.