Data is increasingly being recognized as a valuable asset, as opposed to merely something to contend with in day-to-day operations. In order to use data as an asset, you need to ensure that your data is of the best quality possible. But what is data quality, exactly? How do you measure it? Most importantly, how do you ensure you have quality data?
High-quality data is data that can be provided when needed (timely) and accurately, which leads to high trust in the data. These, however, are end results: measurements of success, not a path to high-quality data. To arrive at high-quality data, there are several data quality checks that your company should be performing. Here they are (in no particular order):
- Data Is Known
At first glance, this may seem like a silly check, but too often we think only about the data that is right in front of us or that we are working with, and don't ask the simple but important question, "Do I know about all of the data that exists in my organization?" The first check that every organization should perform is to identify all data being used or produced within the business.
- Data Is Understood
Once we have a list of all of the various data that is being used in the organization, we need to make sure that we, and the broader organization, understand what that data is and represents. There needs to be an explicit definition of the data: what it is, how it is being used, and why it is important to the organization. Understanding of the datasets, however, is not enough. We also need to understand the details of each dataset. What the fields mean, their context, their format, and their use are all important components of this understanding.
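One lightweight way to capture these field-level definitions is a data dictionary. The sketch below is a minimal, hypothetical example; the dataset, field names, and consuming teams are illustrative assumptions, not from any real system.

```python
# A minimal, hypothetical field-level data dictionary.
# Every field records its meaning, format, and known consumers.
customer_dictionary = {
    "customer_id": {
        "meaning": "Unique identifier assigned at account creation",
        "format": "UUID v4 string",
        "used_by": ["billing", "support"],
    },
    "signup_date": {
        "meaning": "Date the customer first registered",
        "format": "ISO 8601 date (YYYY-MM-DD)",
        "used_by": ["marketing"],
    },
}

def describe(field: str) -> str:
    """Return the documented meaning of a field, or flag it as unknown."""
    entry = customer_dictionary.get(field)
    return entry["meaning"] if entry else f"UNDOCUMENTED FIELD: {field}"
```

Even a simple structure like this makes gaps visible: any field that comes back as undocumented is a candidate for the understanding work described above.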
- Data Is Not Duplicated And Stored in Multiple Places (unless absolutely necessary)
Duplication of data will eventually lead to low data quality as data in one location becomes out-of-sync with the same data stored in another location. More mature data organizations have consolidated data so that it is not stored in more than one location and, in many cases, have created mastered data, which serves as the single source of truth for a dataset to be used throughout an organization.
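A basic way to surface duplication is to fingerprint records and compare across stores. This is a hedged sketch under the assumption that records are dictionaries of comparable fields; the two "stores" here are just illustrative lists.

```python
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Stable hash of a record; keys are sorted so field order doesn't matter."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

def find_duplicates(store_a: list, store_b: list) -> list:
    """Return records from store_b that also appear in store_a."""
    seen = {fingerprint(r) for r in store_a}
    return [r for r in store_b if fingerprint(r) in seen]
```

Any record this check flags is a candidate for consolidation into a single mastered source, as described above.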
- Data Is Accurate
Data must be accurate enough to make good business decisions. If it is not, the data has become a liability, not an asset. This is where validating data against other sources (often external) can be extremely valuable.
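In practice, validating against an external source often comes down to comparing an internally computed figure to a trusted reference within some tolerance. This is a sketch; the tolerance and the idea of a single reference value are simplifying assumptions.

```python
def within_tolerance(internal: float, reference: float, pct: float = 0.01) -> bool:
    """True if the internal value is within pct (as a fraction) of the reference.

    The 1% default tolerance is an illustrative assumption, not a standard.
    """
    return abs(internal - reference) <= abs(reference) * pct
```

A check like this might run nightly, comparing, say, an internal revenue total against a figure from an external billing provider, and alerting when they diverge.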
- Data Is Accessible
There are two main aspects of accessibility: availability and durability. How easy is it for users to gain access to the data in a way that is meaningful for them and their purpose? A highly accessible dataset is one that is intuitive, virtually frictionless, always available, does not degrade over time, and requires very little manipulation after it is accessed by the user.
- Data Is Complete
This type of data quality check ensures that records within all datasets are not missing data or data elements. With missing data, even for a handful of records, business decisions and outcomes become inaccurate and unreliable.
- Data Is Consistent
Consistency checks revolve around instances where data is either stored in multiple places (where that duplication was necessary) or is being transported to multiple places. Data quality checks like this one validate that the data in one place is the same as data in another. Is the source data the same as the destination data, taking translations into account?
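A source-vs-destination comparison can be sketched as below. The translation step here, a hypothetical currency-code normalization, stands in for whatever transformations the pipeline applies, per the "taking translations into account" point.

```python
def normalize(record: dict) -> dict:
    """Apply the pipeline's translation (illustrative: uppercase currency codes)."""
    translated = dict(record)
    translated["currency"] = translated["currency"].upper()
    return translated

def is_consistent(source: list, destination: list) -> bool:
    """True if every translated source record matches its destination record."""
    dest_index = {r["id"]: r for r in destination}
    return all(normalize(r) == dest_index.get(r["id"]) for r in source)
```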
- Data Is Reasonable
This is a validation that each record in a dataset isn't a significant outlier within the dataset. If it is, that data should be subjected to further analysis to ensure its accuracy and/or completeness. This type of check needs to be done in context with the broader dataset to have meaning.
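One common way to frame "significant outlier" is distance from the mean in standard deviations. The three-standard-deviation threshold below is a widely used convention, not a universal rule, and a real check would be tuned per dataset.

```python
import statistics

def outliers(values: list, threshold: float = 3.0) -> list:
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mean) / stdev > threshold]
```

Note that the flagged values are not automatically wrong; as the text says, they are candidates for further analysis in the context of the broader dataset.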
- Data Is Reliable
Reliability checks are meant to validate that the data is present whenever it is needed. Data is either consistently available to users or consistently delivered to them. If data cannot be relied on to be there when called upon, you cannot claim high-quality data. Even if data is accessible, it is not necessarily reliable.
- Data Is Valid
Checking whether data is valid is about ensuring that it conforms to the standards set, either in metadata or within business rules. These checks will often run data through a series of rules to verify that the data conforms.
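Running data through a series of rules can be sketched as a rule table mapping fields to predicates. The two rules below, an email-shape check and an age range, are hypothetical business rules for illustration.

```python
import re

# Hypothetical validity rules: each maps a field to a pass/fail predicate.
RULES = {
    "email": lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", str(v)) is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
}

def violations(record: dict) -> list:
    """Return the names of fields in the record that fail their validity rule."""
    return [field for field, rule in RULES.items()
            if field in record and not rule(record[field])]
```

In a real system these rules would be derived from the metadata and business standards the text mentions, so that the rule table and the data dictionary stay in sync.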