Ian Huntly, CEO of Rifle-Shot Performance Holdings, the official representative of SoftExpert in South Africa, and Arkady Maydanchik, a recognised leader and innovator in the fields of data quality and information integration, offer advice on how to steer clear of common pitfalls and build an effective data quality management programme.
The corporate data universe consists of numerous databases linked by countless data interfaces. While the data continuously moves about and changes, the databases and the programs responsible for data exchange are endlessly redesigned and upgraded. Typically, this dynamic results in information systems getting better, while data quality deteriorates.
This is unfortunate, since quality is what determines data’s intrinsic value to businesses and consumers. Information technology magnifies this intrinsic value. Thus, high quality data, combined with effective technology, is a great asset, while poor quality data, combined with effective technology, is an equally great liability.
Yet we tolerate enormous inaccuracies in databases and accept that most of them are riddled with errors, while corporations lose millions of dollars because of flawed data. Even more disheartening is the continuous growth in the magnitude of data quality problems, fostered by the exponential increase in the size of databases and the continued proliferation of information systems.
Inadequate staffing of data quality teams
“Who should be responsible for data quality management?” is a frequently asked question in the data quality profession. Uncertainty exists partly because the profession is still in its infancy; no clearly defined group has the appropriate expertise and responsibility. Even companies that form data quality departments often staff them with employees who have expertise in general IT and data but who have no specific data quality knowledge.
Most people in charge of data quality initiatives lack hands-on data quality experience. As a result, data quality management programs tend to follow one of two flawed scenarios: either the effort is treated as a purely IT discipline and staffed only with technical specialists, or it is handed to business users who lack the necessary IT expertise.
Just as it takes two to tango, a data quality management team must include both IT specialists and business users. In addition, the team needs data quality experts: people with first-hand experience in designing, implementing and fine-tuning data quality rules and monitors.
Hoping data will get better by itself
One of the key misconceptions is that data quality will improve by itself as a result of general IT advancements. Over the years, the onus of data quality improvement has been placed on modern database technologies, better information systems and sophisticated data integration solutions.
In reality, most IT processes affect data quality negatively. New system implementations and system upgrades are a major source of data quality problems. Data integration interfaces create thousands of errors in the blink of an eye.
Even routine data processing is prone to error. Thus, if we do nothing, data quality will continue to deteriorate until the data becomes a huge liability. The only way to address the challenge is a systematic, on-going program that assesses and improves existing data quality levels, continuously monitors data quality and prevents future deterioration as far as possible.
Lack of data quality assessment
Nearly all data quality management programs focus on data quality improvement. A major obstacle on the path to higher data quality, however, is that most organisations, though aware of the importance of data quality in general, are unaware of the extent of the problems with their own data.
Their knowledge of data quality problems is usually anecdotal rather than factual. Typically, organisations either underestimate or overestimate the quality of their data, and they rarely understand the impact of data quality on business processes.
These two pitfalls cause the failure of many BI projects. Furthermore, data quality improvement initiatives, when put in place, often fail because no method is provided for measuring data quality improvements.
Assessment is the cornerstone of any data quality management program. It helps describe the state of the data and advances understanding of how well the data supports various processes. Assessment also helps the business estimate how much the data problems are costing it.
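For illustration, assessment ultimately boils down to executable rules and measurable results. The minimal sketch below computes two common metrics, completeness and validity, for a small hypothetical customer table; the field names, sample records and rules are invented for the example and would be far more extensive in a real assessment.

```python
# A minimal, illustrative sketch of a data quality assessment step: computing
# completeness and validity metrics for a hypothetical customer table.
# Field names, sample records and rules are invented for the example.
import re
from datetime import date

records = [
    {"customer_id": "C001", "birth_date": "1975-04-12", "email": "a@example.com"},
    {"customer_id": "C002", "birth_date": "2030-01-01", "email": None},
    {"customer_id": None,   "birth_date": "1988-11-03", "email": "not-an-email"},
]

def completeness(rows, field):
    """Share of records in which the field is populated."""
    return sum(1 for r in rows if r.get(field) not in (None, "")) / len(rows)

def validity(rows, field, rule):
    """Share of populated values that satisfy a validity rule."""
    values = [r[field] for r in rows if r.get(field) not in (None, "")]
    return sum(1 for v in values if rule(v)) / len(values) if values else 1.0

is_email = lambda v: re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None
is_past_date = lambda v: date.fromisoformat(v) < date.today()

print(f"customer_id completeness: {completeness(records, 'customer_id'):.0%}")
print(f"email validity:           {validity(records, 'email', is_email):.0%}")
print(f"birth_date validity:      {validity(records, 'birth_date', is_past_date):.0%}")
```

Run recurrently, the same measurements also provide the baseline against which later improvements can be demonstrated.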
Narrow focus
Systematic data quality management efforts originated in the 1990s with the analysis, matching, standardisation and de-duplication of customer data. Over the years, great strides have been made in this area. Modern tools and solutions allow businesses to achieve very high rates of success. A good number of organisations have implemented these solutions by now, and it is fair to say that overall corporate customer data quality is at the highest level ever. This progress makes many organisations feel good about their data quality management efforts.
Unfortunately, the same cannot be said about the rest of the data universe. Data quality has continually deteriorated in the areas of human resources, finance, product orders and sales, loans and accounts, patients and students, and myriad other categories. Yet these types of data are far more plentiful and certainly no less important than customer names and addresses.
The main reason we fail to adequately manage quality in these data categories is that their structure is far more complex and does not allow for a “one size fits all” solution. More effort and expertise are required and data quality tools offer less help. Until organisations require data quality management programs to focus equally on all of their data, we cannot expect significant progress.
Bad metadata
The greatest challenge in data quality management is that the actual content and structure of the data are rarely understood. More often, we rely on theoretical data definitions and data models. Since this information is usually incomplete, outdated or incorrect, the actual data looks nothing like what is expected. The solution is to start data quality management programs with extensive data profiling: a collection of experimental techniques for examining the data and understanding its actual structure and dependencies.
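As a rough illustration of what profiling produces, the sketch below examines each column of a small, hypothetical data set and reports its null rate, its most common values and the value patterns actually present, which is often where the documented model and the real data part ways. The columns and sample values are assumptions made for the example.

```python
# An illustrative sketch of column profiling: for each attribute, report the
# null rate, the most common values and the value patterns actually present,
# rather than trusting the documented data model. Sample data is hypothetical.
from collections import Counter

rows = [
    {"status": "A", "phone": "011-555-0199"},
    {"status": "ACT", "phone": "0115550142"},
    {"status": None, "phone": "011-555-0173"},
    {"status": "A", "phone": "unknown"},
]

def value_pattern(value):
    """Reduce a value to a crude pattern: digits become 9, letters become A."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c for c in value)

def profile_column(rows, column):
    values = [r.get(column) for r in rows]
    populated = [v for v in values if v not in (None, "")]
    return {
        "null_rate": 1 - len(populated) / len(values),
        "top_values": Counter(populated).most_common(5),
        "patterns": Counter(value_pattern(v) for v in populated).most_common(5),
    }

for column in ("status", "phone"):
    print(column, profile_column(rows, column))
```

Even a crude pattern report such as this quickly exposes mixed codes, inconsistent formats and placeholder values that never appear in the data model.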
Ignoring data quality during data conversions
Data warehouses begin their life with data conversions from various operational databases, usually a rather violent beginning. Data conversion usually takes the better part of the implementation effort and almost never goes smoothly.
Every system is made up of three layers: database, business rules and user interface. What users see is not what is actually stored in the database, especially in older “legacy” systems. During data conversion, the data structure is usually the centre of attention. The data is mapped between old and new databases. However, since the business rule layers of the source systems are poorly understood, mapping data structures alone inevitably fails.
Another problem is the typical lack of reliable metadata about the source database. The quality of the data after conversion is directly proportional to the amount of time spent analysing and profiling the data and uncovering its true content. Unfortunately, the common practice is to “convert first and deal with data quality later.” The ideal data conversion project begins with data analysis, comprehensive data quality assessment and data cleansing. Only then can we proceed to coding transformation algorithms.
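One modest safeguard consistent with this “analyse first, convert later” approach is a reconciliation step after each conversion run. The sketch below is a simplified, hypothetical example, assuming both sides have already been extracted into key-to-record dictionaries; a real reconciliation would check many more attributes.

```python
# A simplified, hypothetical post-conversion reconciliation check, assuming
# the source and target have already been extracted into key -> record
# dictionaries. The account numbers and balances are invented for the example.

source = {
    "1001": {"balance": 250.00},
    "1002": {"balance": 99.50},
    "1003": {"balance": 10.00},
}
target = {
    "1001": {"balance": 250.00},
    "1002": {"balance": 95.00},   # value drifted during transformation
}

missing_in_target = source.keys() - target.keys()
unexpected_in_target = target.keys() - source.keys()
value_mismatches = [
    key for key in source.keys() & target.keys()
    if abs(source[key]["balance"] - target[key]["balance"]) > 0.01
]

print("missing after conversion:", sorted(missing_in_target))
print("unexpected records:      ", sorted(unexpected_in_target))
print("balance mismatches:      ", value_mismatches)
```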
Winner-loser approach in data consolidation
Most data warehouses draw data from multiple operational systems. The need to consolidate data from multiple sources adds a new dimension of complexity to basic data conversion, as the data in the consolidated systems often overlaps. There are simple duplicates, overlaps in subject populations and data histories, and numerous data conflicts.
The traditional approach is to set up a winner-loser matrix indicating which source's data element is picked up in case of a conflict. The weakness of this approach is that the “winning” source is not always right: a conflict is often itself a symptom of a data quality problem. The better approach is to view data consolidation in the same light as data cleansing, investigating conflicts rather than silently discarding one side.
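To illustrate the difference, the sketch below consolidates two hypothetical records attribute by attribute and, instead of silently applying a fixed winner, logs every conflict as a finding to be investigated. The source names, fields and values are assumptions made for the example.

```python
# An illustrative sketch of attribute-level consolidation that records every
# conflict as a finding instead of silently applying a fixed winner-loser
# rule. The source names, fields and values are hypothetical.

def consolidate(record_a, record_b, source_a="HR system", source_b="Payroll"):
    merged, conflicts = {}, []
    for field in record_a.keys() | record_b.keys():
        a, b = record_a.get(field), record_b.get(field)
        if a == b or b in (None, ""):
            merged[field] = a
        elif a in (None, ""):
            merged[field] = b
        else:
            # Conflict: keep one value provisionally, but log it for investigation
            merged[field] = a
            conflicts.append({"field": field, source_a: a, source_b: b})
    return merged, conflicts

merged, conflicts = consolidate(
    {"employee_id": "E17", "hire_date": "2001-03-01", "dept": "Finance"},
    {"employee_id": "E17", "hire_date": "2001-04-01", "dept": "Finance"},
)
print(merged)
print(conflicts)  # the hire_date disagreement is itself a data quality finding
```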
Inadequate monitoring of data interfaces
It is not uncommon for a data warehouse to receive hundreds of batch feeds and countless real-time messages from multiple data sources every month. These on-going data interfaces can usually be tied to the greatest number of data quality problems. The problems tend to accumulate over time, and there is little opportunity to fix the ever-growing backlog as we strive for faster data propagation and lower data latency.
The solution is to design monitoring programs that operate between the source and target databases and analyse the interface data before it is loaded and processed. Individual data monitors use data quality rules to test data accuracy and integrity.
Their objective is to identify all potential data errors. Advanced monitors that use complex business rules to compare data across batches and against target databases identify more problems. Aggregate monitors search for unexpected changes in batch interfaces. They compare various aggregate attribute characteristics (such as counts of attribute values) from batch to batch. A value outside of the reasonably expected range indicates a potential problem.
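A minimal sketch of such an aggregate monitor appears below: it compares value counts in an incoming batch against the range seen in recent batches and raises an alert when a count falls outside roughly three standard deviations. The attribute names, history and threshold are illustrative assumptions, not a prescription.

```python
# An illustrative sketch of an aggregate interface monitor: compare value
# counts in an incoming batch against the range seen in recent batches and
# alert on outliers before loading. History, names and thresholds are
# assumptions made for the example.
from collections import Counter
from statistics import mean, stdev

recent_batches = [
    {"NEW": 120, "CLOSED": 30},
    {"NEW": 115, "CLOSED": 28},
    {"NEW": 130, "CLOSED": 35},
]
incoming_batch = ["NEW"] * 310 + ["CLOSED"] * 29   # suspicious spike in NEW

counts = Counter(incoming_batch)
for status in ("NEW", "CLOSED"):
    history = [batch[status] for batch in recent_batches]
    mu, sigma = mean(history), stdev(history)
    observed = counts.get(status, 0)
    # Flag the batch if the count falls outside roughly three standard deviations
    if abs(observed - mu) > 3 * sigma:
        print(f"ALERT: {status} count {observed} outside expected range "
              f"({mu:.0f} +/- {3 * sigma:.0f})")
```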
Forgetting about data decay
Data is accurate only if it correctly represents the real-world objects it describes. This assumes perfect data collection processes and, in reality, changes to those objects regularly go unnoticed by our systems. Thus, accurate data can become inaccurate over time without any changes being made to the data itself.
Whether the cause is a faulty data collection procedure or a defective data interface, the situation in which the data gets out of sync with reality is rather common. The solution to the problem is recurrent data quality assessment and sample comparison against trusted sources. This provides information about the rate of decay and shows which categories of data are most prone to rapid decay. Such knowledge can be used to improve data collection procedures and data interfaces.
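As a simple illustration, decay can be estimated by drawing a random sample of stored records and comparing them against a trusted reference, such as a call-back survey or an authoritative registry. The sketch below simulates this with hypothetical address data; the decay rate, field and sample size are invented for the example.

```python
# An illustrative sketch of decay estimation: compare a random sample of
# stored records against a trusted reference source. The data is simulated;
# in practice the reference would be a call-back survey, a registry, etc.
import random

random.seed(7)  # reproducible illustration

stored = {i: {"address": f"{i} Old Street"} for i in range(1000)}
# Simulate a world in which roughly 8% of addresses changed without being captured
trusted = {
    i: {"address": f"{i} New Street" if random.random() < 0.08 else f"{i} Old Street"}
    for i in range(1000)
}

sample_keys = random.sample(sorted(stored), k=100)
mismatches = sum(
    1 for key in sample_keys
    if stored[key]["address"] != trusted[key]["address"]
)
print(f"estimated decay rate for address: {mismatches / len(sample_keys):.0%}")
```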
Poor organisation of data quality metadata
Data quality initiatives produce enormous volumes of valuable metadata. Data quality assessment tells us about existing data problems and their effect on various business processes. When done recurrently, assessment also shows data quality trends. Data cleansing determines causes of errors and possible treatments.
It also creates an audit trail of corrections so that, at a later point, we can discover how a particular data element came to look the way it does. Interface monitoring identifies on-going data problems and provides information about data lineage, as do data conversion and consolidation.
A common problem in data quality management is the inadequate architecture of data quality metadata repositories. Data quality assessment projects routinely generate innumerable unstructured error reports with no effective way of summarising and analysing the findings.
Data cleansing initiatives typically lack audit trail mechanisms and ETL processes often lack data lineage information. As a result, the value of the data quality initiatives is greatly diminished. In the worst-case scenario, the projects are totally abandoned.
The solution is to design a comprehensive data quality metadata warehouse (DQMDW): a collection of tools for organising and analysing all metadata relevant to or produced by data quality initiatives. It is a rather complex solution, combining elements of an object-oriented metadata repository with the analytical functionality of a data warehouse.
However, in the absence of a well-designed DQMDW, data quality metadata will suffer from the very malady it is intended to cure: poor quality.
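As a closing, purely illustrative sketch, the snippet below shows the kind of structures such a repository might hold: rule definitions, the violations found during assessment and an audit trail of corrections. The table and column names are assumptions, not a reference design for a DQMDW.

```python
# A purely illustrative sketch of structures a data quality metadata warehouse
# might hold: rule definitions, violations found during assessment and an
# audit trail of corrections. Table and column names are assumptions, not a
# reference design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dq_rule (
    rule_id      INTEGER PRIMARY KEY,
    description  TEXT,
    target_table TEXT,
    target_field TEXT
);
CREATE TABLE dq_violation (
    violation_id INTEGER PRIMARY KEY,
    rule_id      INTEGER REFERENCES dq_rule(rule_id),
    record_key   TEXT,
    found_at     TEXT   -- assessment run timestamp, enables trend analysis
);
CREATE TABLE dq_correction (
    correction_id INTEGER PRIMARY KEY,
    violation_id  INTEGER REFERENCES dq_violation(violation_id),
    old_value     TEXT,
    new_value     TEXT,
    corrected_at  TEXT  -- audit trail of how a value came to look the way it does
);
""")
conn.execute("INSERT INTO dq_rule VALUES (1, 'birth_date must be in the past', 'employee', 'birth_date')")
conn.execute("INSERT INTO dq_violation VALUES (1, 1, 'E17', '2024-06-01')")
print(conn.execute("SELECT * FROM dq_violation").fetchall())
```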