An Introduction to Dynamic Data Quality Challenges

The Challenge of Test Data Quality in Data Processing

From Content to Context

Research in data and information quality has made significant strides over the last 20 years. It has become a unified body of knowledge incorporating techniques, methods, and applications from a variety of disciplines including information systems, computer science, operations management, organizational behavior, psychology, and statistics. With...

A Probabilistically Integrated System for Crowd-Assisted Text Labeling and Extraction

The amount of text data has been growing exponentially in recent years, giving rise to automatic information extraction methods that store text...


Jan. 2016 -- New book announcement


Carlo Batini and Monica Scannapieco have a new book:

Data and Information Quality: Dimensions, Principles and Techniques 

Springer Series: Data-Centric Systems and Applications, soon available from the Springer shop


Experience and Challenge papers: JDIQ now accepts two new types of papers. Experience papers describe real-world applications, datasets, and other experiences in handling poor-quality data. Challenge papers briefly describe a novel problem or challenge for the IQ community. See the Author Guidelines for details.

An Exploratory Case Study to Understand Primary Care Users and Their Data Quality Tradeoffs

Primary care data is an important piece of the evolving healthcare ecosystem. In addition to supporting the provision of patient care, primary care data can be used for a number of important secondary purposes. Understanding the tradeoffs between timeliness, accuracy, completeness, and usefulness of primary care data is important for designing systems that generate high-quality data. As a case study, data quality measures and metrics were developed with a focus group of managers from a primary care organization. After the data quality measurements were extracted and calculated, each measure was modeled with logit binomial regression to characterize tradeoffs and data quality interactions. Measures for accuracy, completeness, and timeliness were calculated for 196,967 patient encounters. Report generation was measured as a proxy for the usefulness dimension. Based on the analysis, there was a positive relationship between accuracy and completeness, and a negative relationship between timeliness and usefulness. Importantly, the use of data was associated with an increase in completeness and accuracy. There were limitations to the measures and metrics developed with the focus group; however, it was agreed that the measures were reasonable proxies for the data quality dimensions under study. The results provide meaningful insight into user tradeoffs and can be used in the design of systems in primary care.
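To make the idea of per-encounter measures concrete, the following is a minimal sketch of how binary completeness and timeliness flags might be computed for each encounter. The abstract does not give the study's actual definitions, so the field names, the list of required fields, and the 24-hour window are all hypothetical assumptions chosen for illustration. Binary flags of this kind are the sort of outcome a logit binomial model could take as its response variable.

```python
from datetime import datetime, timedelta

# Hypothetical assumptions: the study's real required fields and
# timeliness threshold are not stated in the abstract.
REQUIRED_FIELDS = ("patient_id", "diagnosis", "provider")
TIMELY_WINDOW = timedelta(hours=24)


def is_complete(encounter: dict) -> bool:
    """True if every required field is present and non-empty."""
    return all(encounter.get(f) not in (None, "") for f in REQUIRED_FIELDS)


def is_timely(encounter: dict) -> bool:
    """True if the record was entered within the window after the visit."""
    return encounter["recorded_at"] - encounter["seen_at"] <= TIMELY_WINDOW


def rate(encounters: list, measure) -> float:
    """Share of encounters for which the binary measure holds."""
    flags = [measure(e) for e in encounters]
    return sum(flags) / len(flags) if flags else 0.0


# Two made-up encounters, for illustration only.
encounters = [
    {"patient_id": "p1", "diagnosis": "J06.9", "provider": "dr_a",
     "seen_at": datetime(2016, 1, 4, 9, 0),
     "recorded_at": datetime(2016, 1, 4, 11, 0)},
    {"patient_id": "p2", "diagnosis": "", "provider": "dr_b",
     "seen_at": datetime(2016, 1, 4, 9, 0),
     "recorded_at": datetime(2016, 1, 7, 9, 0)},
]

completeness_rate = rate(encounters, is_complete)
timeliness_rate = rate(encounters, is_timely)
```

Keeping each measure as a separate predicate makes it straightforward to add further dimensions, such as an accuracy check, without changing the aggregation step.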

QDflows: A System Driven by Knowledge Bases for Designing Quality-aware Data Flows

In the big data era, data integration is becoming increasingly important. It is usually handled by data flow processes that extract, transform, and clean data from several sources, and populate the Data Integration System (DIS). Designing data flows poses several challenges. In this paper, we deal with data quality issues such as: (1) expressing a set of quality rules, (2) enforcing them on the DIS to detect violations, and (3) repairing inconsistencies. We propose QDflows, a system for designing quality-aware data flows that takes as input: (1) a high-quality Knowledge Base (KB) as the global schema of integration, (2) a set of data sources and a set of validated user requirements, (3) a set of defined mappings between data sources and the KB, and (4) a set of quality rules specified by the users. QDflows uses an ontology to design the DIS schema. It offers the ability to define the DIS ontology as a module of the knowledge base, based on the validated user requirements. The DIS ontology model is then extended with multiple types of quality rules specified by users. QDflows extracts and transforms data from the sources in order to populate the DIS. It detects violations of the quality rules enforced on the DIS, constructs repair patterns, searches for horizontal and vertical matches in the knowledge base, and performs an automatic repair when possible or generates possible repairs. It interactively involves users to validate the repair process before loading the clean data into the DIS. Using real-life and synthetic datasets and the DBpedia and YAGO knowledge bases, we experimentally evaluate the generality, effectiveness, and efficiency of QDflows. We also showcase an interactive tool implementing our system.
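The three steps named in the abstract (expressing rules, detecting violations, proposing repairs) can be illustrated with a toy sketch. This is not the QDflows implementation or its API: the `QualityRule` class, the dictionary standing in for a knowledge base, and the record fields are all hypothetical, and the sketch omits ontologies, repair patterns, and horizontal or vertical matching. It only shows the general shape of checking records against rules and collecting suggested fixes for a user to validate.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict  # a flat record such as {"city": ..., "country": ...}

# Hypothetical stand-in for a knowledge base: city -> country.
KB = {"Paris": "France", "Rome": "Italy"}


@dataclass
class QualityRule:
    """A named rule with a check and a repair suggestion (illustrative only)."""
    name: str
    check: Callable[[Record], bool]               # True if the record satisfies the rule
    repair: Callable[[Record], Optional[Record]]  # suggested fix, or None if unknown


def country_matches_kb(rec: Record) -> bool:
    """Pass when the KB has no entry for the city or agrees with the record."""
    expected = KB.get(rec.get("city", ""))
    return expected is None or rec.get("country") == expected


def repair_country(rec: Record) -> Optional[Record]:
    """Suggest the KB's country for the city, if the KB knows the city."""
    expected = KB.get(rec.get("city", ""))
    return None if expected is None else {**rec, "country": expected}


def detect_and_propose(records, rules):
    """Split records into clean ones and (rule, record, suggestion) proposals."""
    clean, proposals = [], []
    for rec in records:
        violated = [r for r in rules if not r.check(rec)]
        if not violated:
            clean.append(rec)
            continue
        for rule in violated:
            # Suggestions are not applied here; a user would validate them first.
            proposals.append((rule.name, rec, rule.repair(rec)))
    return clean, proposals


rules = [QualityRule("city_country", country_matches_kb, repair_country)]
records = [
    {"city": "Paris", "country": "France"},
    {"city": "Rome", "country": "Spain"},
    {"city": "Oslo", "country": "Norway"},
]
clean, proposals = detect_and_propose(records, rules)
```

Returning proposals instead of applying them mirrors the abstract's point that users validate repairs before clean data is loaded.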


