The production of massive amounts of data as a result of the ongoing ‘Big Data’ revolution has transformed data analysis. The ready availability of analysis tools and falling storage costs, combined with businesses' drive to enrich these datasets with purchased and publicly available data, promise both insight and a way to monetise this new resource. This has led to an unprecedented amount of data about the personal attributes of individuals being collected, stored, and lost. Such data is valuable for analysing large populations, but there are considerable drawbacks that data scientists and developers need to consider in order to use it ethically.
Here are just a few considerations to take into account before ripping open the predictive toolsets from your cloud provider:
- Contextual Integrity. Data is collected in different contexts, each with its own purpose and permissions for capture. Ensure that the data you capture is used only within the context it was collected for and cannot be misused for other purposes. Mixing public and personal data can have unintended side effects: sharing location data with other parties without consent, for example, has repeatedly allowed stalkers to use otherwise benign applications to track their victims.
- History Aggregation. History is an important part of many profiling efforts, for example basket analysis or financial scoring. A history is essentially a simple list of transactions, but using the raw transactions can pull potentially privacy-compromising detail into the analysis. Consider techniques that aggregate scores, spend and frequency rather than carrying individual transactions; this also cuts the number of records you need to handle and the noise that individual transactions introduce. For complex historical and longitudinal analysis, techniques such as Markov chains can help by converting histories into states that can be computed over (see the aggregation sketch after this list).
- Prevent Discrimination. We all possess prejudices (yes, even me). Evaluate your assumptions before the data science exercise begins so that they do not become generalised as part of the analysis, and bear in mind that incorrect assumptions may already be baked into the original data collection. To this end, do not include sensitive indicators of race, sexuality, social class or other markers of individual difference in your analyses (this may be unavoidable in some circumstances, e.g. medical research). Ensure that the business processes you build do not convert prejudice against individuals into discrimination against groups.
- Feature selection and Over-identification (definition). In machine learning, the set of features needed to identify classes of individuals is chosen by stepwise adding or removing salient factors and measuring how well the model classifies the research domain. Including too many features often leads to an over-identified, over-fitted model and a decrease in its predictive power on new data. Leaving personal data out makes this job easier: it reduces both the chance of over-identification and the number of dimensions the analysis needs to consider (see the feature-selection sketch after this list).
- Generalising Personal Data. Profiling relies in part on identifying a common classifier for individuals, especially in cases like location services and advertising. Fuzzing or grouping the data costs some accuracy but preserves the anonymity of the individual. Examples include using only the postal district rather than the full postcode, and generalising features into relevant groups rather than individual attributes, e.g. age bands instead of date of birth (see the generalisation sketch after this list).
- De-identify and Re-identify. Although the usefulness of a dataset is broadly proportional to the amount of information it contains, it remains useful when sensitive values are replaced by identifiers and classifiers rather than carried verbatim. To allow records to be re-identified later, assign each individual an identifier and keep the mapping in a separate, secured master database. Once the analysis is complete, and only if individuals genuinely need to be identified (e.g. members of a criminal group, or patients who share the same defective gene), this can be done efficiently and securely (see the pseudonymisation sketch after this list).
- The Law of Large Numbers (definition). This law states that as the number of observations grows, the average result of repeated experiments converges on its expected value, so the results become increasingly stable. This stability is the holy grail of predictive analysis, allowing results to generalise across populations. Big Data analytics gives developers the luxury of working at that scale, generalising individual attributes into those of equivalence classes for the purposes of prediction. However, knowing an individual is part of a class removes any need to know the characteristics of the individual, which brings us to the next point (a short simulation after this list illustrates the convergence).
- Prediction does not require the individual. Predictive algorithms work by calculating the likelihood that a person is in, or intends to be in, a certain classification or state. Attitude and behaviour prediction does not require personal data at all: use the derived classes to drive the prediction, not the individual.
- The GDPR. If personal data genuinely has to be processed, a Data Protection Impact Assessment is required (see the ICO DPIA Guidance). Also consider the merits of encrypting personal data to keep it safe from interception or loss (see the encryption sketch after this list). Fines under the GDPR are understandably severe, of up to 4% of annual turnover. Avoiding unnecessary processing and minimising the personal data you hold is the smart approach.
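
To make the aggregation point concrete, here is a minimal pandas sketch that collapses a hypothetical transaction log into spend, frequency and recency features, so the individual transactions never reach the downstream model. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase, keyed by a customer pseudonym.
transactions = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c2", "c2"],
    "timestamp": pd.to_datetime(
        ["2024-01-03", "2024-02-17", "2024-01-09", "2024-01-22", "2024-03-05"]),
    "amount": [12.50, 40.00, 7.99, 15.00, 22.50],
})

# Collapse the raw history into aggregate features (spend, frequency, recency);
# only these summaries are passed on for profiling.
profile = transactions.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    mean_spend=("amount", "mean"),
    frequency=("amount", "count"),
    last_seen=("timestamp", "max"),
)
print(profile)
```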
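
For the feature-selection point, the sketch below uses scikit-learn's recursive feature elimination on synthetic data as a stand-in for a stepwise selection process; the dataset and parameter choices are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a real feature matrix; in practice, drop sensitive
# personal columns before this step.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Recursive feature elimination: stepwise removal of the least useful features,
# keeping only the most salient ones.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit(X, y)
print(selector.support_)   # boolean mask of the retained features
print(selector.ranking_)   # rank 1 = retained; higher ranks were eliminated earlier
```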
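
The generalisation idea can be sketched in a few lines of pandas: keep only the outward postal district and replace dates of birth with age bands. The records below are invented, and the binning thresholds are just one possible choice.

```python
import pandas as pd

# Invented records with UK-style postcodes and dates of birth.
people = pd.DataFrame({
    "postcode": ["SW1A 1AA", "M1 4ET", "LS2 9JT"],
    "date_of_birth": pd.to_datetime(["1985-06-02", "1999-11-23", "1972-03-14"]),
})

# Keep only the outward postal district, not the full postcode.
people["postal_district"] = people["postcode"].str.split().str[0]

# Replace date of birth with a coarse age band.
age_years = (pd.Timestamp("2024-01-01") - people["date_of_birth"]).dt.days // 365
people["age_band"] = pd.cut(age_years, bins=[0, 24, 34, 44, 54, 64, 120],
                            labels=["<25", "25-34", "35-44", "45-54", "55-64", "65+"])

print(people[["postal_district", "age_band"]])
```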
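
For de-identification and re-identification, a common pattern is to replace the direct identifier with a random pseudonym and keep the lookup in a separate, secured store. The sketch below, with made-up names and columns, shows the idea.

```python
import uuid
import pandas as pd

records = pd.DataFrame({
    "name": ["Ada Lovelace", "Alan Turing"],   # direct identifier
    "condition": ["A", "B"],                   # the analytical payload
})

# Replace the direct identifier with a random pseudonym.
records["person_id"] = [str(uuid.uuid4()) for _ in range(len(records))]

# The mapping back to real identities lives in a separate, secured master table;
# the analysis dataset carries only the pseudonym and the analytical attributes.
master_lookup = records[["person_id", "name"]]        # store securely, e.g. encrypted at rest
analysis_data = records[["person_id", "condition"]]   # safe to hand to analysts

# Re-identification, where justified, is a simple join against the master table.
reidentified = analysis_data.merge(master_lookup, on="person_id")
```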
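
The law of large numbers is easy to see in a short simulation: the estimated rate of a class-membership attribute settles down as the number of observations grows. The 30% rate below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a binary attribute (e.g. "belongs to class X") with a true rate of 30%.
true_rate = 0.30
samples = rng.random(100_000) < true_rate

# The estimate stabilises around the true rate as the sample grows.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}: estimated rate = {samples[:n].mean():.3f}")
```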
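
Finally, on keeping personal data safe from interception or loss, the snippet below shows symmetric encryption with the third-party cryptography package (Fernet); it is a sketch of encryption at rest, not key-management advice.

```python
# Requires the third-party 'cryptography' package (pip install cryptography).
from cryptography.fernet import Fernet

# Generate the key once and keep it in a secrets manager, never alongside the data.
key = Fernet.generate_key()
fernet = Fernet(key)

token = fernet.encrypt(b"jane.doe@example.com")   # ciphertext, safe to store or transmit
original = fernet.decrypt(token)                  # recoverable only with the key
```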
Data Science is the application of the scientific method to datasets. Much of this data is socially created and relates to the private and personal lives of individuals, so it must be handled with extreme care. Although the scientific method is a logical approach to analysis, the nature of social situations means that the dimensionality and complexity of using personal data for a given outcome need to be evaluated carefully to avoid disclosure or undue distress.