The Problem
There is a problem with NULL that has persisted since the Relational Model was implemented in database products in the 1970’s.
“The simple scientific fact is that an SQL table that contains a null isn’t a relation; thus, relational theory doesn’t apply, and all bets are off ” C.J.Date (2014).
The SQL language is based on Relational Logic, and as such evaluation of a logical SQL expression resolves to ‘TRUE’ or ‘FALSE’. Adding NULL Values to a database breaks the relations implicit in the model and leads to ‘TRUE’, ‘FALSE’ and ‘UNKNOWN’.
At best this leads to increased code complexity by having to use null handling functions, horizontally decomposed WHERE clauses and inference. At worst it leads to incomplete or incorrect information, errors, and interpretation problems. This reduces the reliability, maintainability, integrity, usability, performance and ultimately the confidence in your database.
This problem relates not only to SQL databases but to other data sources that are supposed to rely on TRUE-FALSE logic but include NULL values in relations between entities.
Why NULL Exists
SQL Databases are modelled as domains, and as such the designer needs to be able to define what the domain encompasses. This includes defining the boundaries, identifying and delineating components, and identifying relationships. Almost by definition the designer will have incomplete information about the information that is relevant, especially when implementing new systems. In logic terms Null is not a value, it is not zero, it is unknown.
NULL exists because the following general conditions apply:
Existence –The attribute does not exist in the domain, or domain understanding is wrong. This means there is a missing entity in our domain model or entites are mixed in a table. E.g table contains hair colour for a car entity, Number of pregnancies for male patients.
Missing – The information has not been given at the time a row was created. E.g. A customer may decline to give their age.
Not Yet – Data is contingent upon an unknown event in the future, E.g. Termination date or Date of death.
Does not apply– Is not applicable for this instance of a record. E.g. Hair colour for bald people.
Placeholders – Indicates that we know that a bit of data exists, but we don’t know what it is, in this case keeping a NULL is useful for CUBE or ROLLUP queries.
In the real world applications of data structures NULLs are often unavoidable. However, it confuses users, designers and DBA’s (generally) hate it. It complicates Reporting, ETL, Business Intelligence and Data Science initiatives. As such, users need to be aware of the design and query compromises they need to use.
How to handle NULL in the SELECT Clause
SQL Server provides the following functions to help with handling NULL values in statements:
- NULLIF()
- Syntax: NULLIF (expression1, expression2)
- Returns NULL if both expressions are equal, else returns the first expression.
- ISNULL() to check the state of a field
- Syntax: ISNULL (check_expression, replacement_value)
- Returns replacement value that must be implicitly convertible to check expression data type.
- COALESCE() to use the first non-null field.
- Syntax: COALESCE( exp1, exp2,…expn)
- Can use multiple input expressions.
- Returns the datatype of the expression with highest precedence.
- Slower than ISNULL(), but reduces code clutter.
How to Handle NULL in the WHERE clause
Before execution the whole SQL statement is being evaluated to find out if it is TRUE (in which case it returns rows) or FALSE (in which case it returns no rows). When including rows containing NULL the logical outcome of the BOOLEAN operators (AND, OR, NOT) is UNKNOWN. (Depending on how combinations of Boolean operators are managed in your database system)
To manage the special cases where we wish to evaluate NULL to either a true or false value we can use the IS NULL and IS NOT NULL. We have to use these operators in the WHERE clause, not using the = operator.
- IS NULL
- SELECT * FROM Customers WHERE CustName IS NULL
- IS NOT NULL
- SELECT * FROM Customers WHERE CustName IS NOT NULL
- Use Horizontal Decomposition to add other conditionals
- SELECT * FROM Customers WHERE (CustName IS NOT NULL OR Custname=‘Bob’)
How to Design for NULL values
As a throwback to the data processing origins of SQL data entry it is not unusual to see ‘00000’ or ‘99999’ values inserted in the place of missing data. However, there are drawbacks to this approach which should be handled in other ways. This comes down to the table designers understanding of the problem space and can be managed in the following ways.
- Design Integrity into your tables.
- Use NOT NULL CHECK() constraints where possible.
- Do not use as a Primary key if there is ANY possibility that an incoming value could be NULL.
- Avoid in FOREIGN KEY relationships
- Consider using de-normalised separate tables to get around this.
- Use default field values where appropriate. Bear in mind arithmetic consequences of using 0, -99 as defaults.
- Know where a NULL field is likely to be filled by an event.
- Track missing data using companion codes or record type identifier codes. Where the table contains different types of entities use an indicator to show the entity type or what stage a record is at.
- Avoid passing NULL values into your database from poor application data validation.
- Ensure initialisation of variables by using defaults and appropriate auto filling of variable values.
- Fill data values by using lookups.
- Validate data to prevent audit difficulties. The presence of NULL leads to difficulties for the business, if, for instance a payment authoriser code is left as NULL.
- Use consistent datatypes and nullability across apps.
- Remember that NULL is not “NULL”!
- Determine the impact of missing data. What is the possible downstream impact of your design?
When validating data and multiple fields may contain NULL, consider using a check code field to indicate where records need attention. Check the NULL status of each field in a record using SELECT ISNULL (Field,1) and build a count of the number of fields that fail validation as part of the data cleansing process.
Proactive database management is about catching the information before it enters the database and cleaning up what is already there. Products exist, either as part of the database tools, or as external tools that assist in maintaining the integrity of your data.
Master Data Services is included as part of SQL Server and allows Models, Entities, Attributes, Rules and Versions to be defined and implemented and Data Quality Services can help ensure domain validity and knowledge driven data quality and is good for data correction, enrichment, standardization, and de-duplication.
Performance Impacts
Time spent in designing appropriate data quality controls will reduce the time spent retrieving from and maintaining the database.NULL slows down the working of indexes, decreases query performance and increases search times, as well as increasing your SQL code complexity. As a result it can decrease the confidence in the information gained from the database.
NULL and Data Science
As I always like to think of the upside to any management problems we can use the fact that NULL indicates a value is not known or indicates missing or incomplete data as a catalyst to discover where our domain modelling and understanding is lacking.
NULL may point to missing entities or uncaptured events. However, be aware that it may also skew the results of data tools that disregard NULL values. The presence of NULL values in data sources happens but this degrades the quality of the information that can be built upon it.
The three valued logic points us towards the three states of our knowledge. The data we have captured can be thought of in terms of ‘Known knowns’, ‘Unknown knowns’ and ‘Unknown unknowns’, or TRUE, FALSE and UNKNOWN respectively. NULL values show us where unknown unknowns may be lurking and data discovery and knowledge building begin by examining what it is that is unexplained.
Summary
The presence of NULL in datasets can have an adverse effect on downstream systems, in particular Reporting, BI, Predictive Analytics or Machine Learning that rely on the integrity and the TRUE/FALSE qualities of the data.
You can reduce the impact on your information by managing the quality of data going in; designing tables with integrity constraints; design and use apps to validate the input; using the NULL handling functions in SELECT or WHERE clauses to design queries to ensure correct results are returned.
If you are contemplating a Data Science exercise, use these NULLs as clues to pick up where domain knowledge is lacking.
References:
Date, C.J., 2008. A critique of Claude Rubinson’s paper nulls, three-valued logic, and ambiguity in SQL: critiquing Date’s critique. SIGMOD record, 37(3), pp.20-22.
Rubinson, C., 2007. Nulls, three-valued logic, and ambiguity in SQL: critiquing date’s critique. ACM SIGMOD Record, 36(4), pp.13-17.
https://technet.microsoft.com/en-us/library/ms191504(v=sql.105).aspx How Microsoft SQL Server handles NULL Values
https://www.simple-talk.com/sql/t-sql-programming/how-to-get-nulls-horribly-wrong-in-sql-server/ A fully comprehensive, hands on posting about SQL Server NULLs.