Introduction
The Big Data community, like the Business Intelligence and Management Information communities before it, is driven to deliver information, insight and value by extracting data from separate systems, formats and locations, and to build business and customer value by providing additional insights and information products.
Conventional wisdom holds that storing data in partitioned stores (Data Silos) is a bad thing, as this fragmentation can lead to duplication, data errors and incomplete problem analysis. When information is locked in discrete structures, the benefits of sharing it cannot easily be realised, and analysts and integration experts must put in considerable effort to make it ready for consumption.
It is also important to remember how and why these silos exist in the first place, and the social, organisational, legal and operational design decisions that led these structures to wall off the information in this way. Although they make integration difficult, data silos are nothing if not persistent, and they exist in every business in every industry sector, so there may be more factors in play than a deliberate attempt to make data hard to access.
Reasons Why Data Silos Exist
- The data may be held in structures that render the data architectures incompatible. This can be due to incompatibility between the data models used, the policies implemented, and the rules and standards employed in the design.
- Systems may be incompatible at the technical level, and the technology that was used to capture the information may not support the desired data structure or format, requiring processing before the data can be consumed.
- Separate systems often underpin separate services or functions, and data may be partitioned for privacy purposes. For example, clinical systems often hold person-specific confidential data that is deliberately kept apart from the associated medical, billing and insurance data.
- The systems in question may have different user groups and be based on the needs of these users. Company mergers and acquisitions often lead to incompatible or duplicated systems. Because of their separate origins, such systems may require additional integration effort.
- The system may serve a specific legislative or regulatory purpose, which means that the data it contains must be physically partitioned from other data sources. Some data may be partitioned for specific auditing and access control purposes.
Using and Integrating Data Silos
- Analyse the primary focal relationship that is underpinned by the information stores in question. Make sure this relationship is not compromised by including other secondary information stores.
- Ensure that the end-to-end security of the new system protects the most highly rated data at least as well as the original silo did, otherwise you may find the most sensitive data is compromised. Security does not only protect consumers, it also protects your company, its reputation, your client relationships and IP, so be sure that you do not expose externally the data that was being kept safely in that silo.
- If you are building new information software that incorporates the characteristics of a cohort of people in order to provide a new product, ensure that the data you use does not compromise the privacy of individuals. Depersonalised or aggregated data can often serve secondary uses, for example calculating and reporting hospital bed occupancy rates, or the average congestion on roads. Such uses do not require the personal data held in the original data sets.
- Joining separate data stores may generate new information and insights. These can come from the inclusion of other external or public data sources, or from the collective power of the data stores themselves (for example, grouped analysis may reveal medical safety issues that no single store shows).
- Maintain adequate audit trails of data provenance, cleansing and loading routines and any transformations carried out on the information. You should be able to explain and show how your big data insights were arrived at.
- Some silos may be easier to defend than an integrated store. Liberating data from these silos may increase the attack surface or encourage data leakage. Ensure that the security mechanisms that exist are reinstated in the new information products.
- Be sure you understand the nature of the possible privacy and security infringements that may occur. Seek advice on how much data sharing is appropriate, and whether additional governance of the newly produced information store is needed. Ethical data processing requires thinking openly about the unintended side effects of using the information.
- Be clear about the reasons for republishing the data and the potential impact this may have.
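The point above about depersonalised or aggregated data can be sketched in a few lines of Python. This is a minimal illustration, not a recommended method: the record layout (patient identifier, ward, bed days), the function name occupancy_by_ward and the threshold MIN_GROUP_SIZE are all assumptions made for the example. It reports counts per ward, drops the patient identifiers, and suppresses any ward with too few patients to hide an individual.

```python
from collections import defaultdict

# Hypothetical source records: (patient_id, ward, bed_days).
# The layout and values are invented for illustration only.
RECORDS = [
    ("p001", "cardiology", 3),
    ("p002", "cardiology", 5),
    ("p003", "cardiology", 2),
    ("p004", "oncology", 7),
]

# Assumed threshold: groups with fewer distinct patients are not reported.
MIN_GROUP_SIZE = 3


def occupancy_by_ward(records, min_group_size=MIN_GROUP_SIZE):
    """Aggregate bed days per ward without exposing patient identifiers."""
    patients = defaultdict(set)   # ward -> distinct patient ids (internal only)
    bed_days = defaultdict(int)   # ward -> total bed days

    for patient_id, ward, days in records:
        patients[ward].add(patient_id)
        bed_days[ward] += days

    report = {}
    for ward, ids in patients.items():
        # Suppress small groups, where a count could point to one person.
        if len(ids) < min_group_size:
            continue
        # Only counts leave this function; identifiers are discarded.
        report[ward] = {"patients": len(ids), "bed_days": bed_days[ward]}
    return report


if __name__ == "__main__":
    print(occupancy_by_ward(RECORDS))
```

Suppressing small groups reduces the risk of re-identification but does not guarantee anonymity, so the appropriate threshold and any further safeguards remain a governance decision.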
Thinking Ahead
Analyse source systems to understand the assumptions that were in place when the silo was set up, as this may give insight into why it is currently not integrated. Information stores always reflect the power and control structures that created them in the first place and analysts should be cognisant of ownership and permissions structures when integrating.
Developers should also be aware of the potential security, privacy and consent risks inherent in repurposing personal data. Personal data does not stop being personal data because it is found in the public domain. Information privacy can be thought of as contextual integrity: if the context changes, the privacy afforded changes with it.