Is the answer in the errors? Hans-Jörg Schulz aims to find out!
Errors are almost never considered a subject of analysis in their own right. One reason they are rarely analyzed is the difficulty of pinpointing what exactly an error is. With the “Visual Analytics of Data Errors (VADE)” project, Hans-Jörg Schulz aims to find out. The project is supported with DKK 1.6 million by Aarhus University Research Foundation’s AUFF Nova grant.
A core problem in data-driven science and fact-based decision making is poor data quality. The standard solution is to fight it with automated error detection that rids the data of its errors. The VADE project proposes a different approach: instead of fighting erroneous data, it should be systematically mined and analyzed. The project aims to embrace erroneous data as a source of important information on the underlying problems of the data management pipeline, with errors analyzed and reported as first-class data properties.
The fundamental challenge is to cope with the unspecific nature of errors, which do not follow a given schema or definition and in most cases cannot be found through a database query or standard statistics. This requires new ways of dealing with errors that go beyond what pure statistics and machine learning can provide.
To address this challenge, VADE follows the principle of mixed-initiative analysis, which combines the computational power of modern IT with the knowledge of domain experts. This allows the human user to gauge the erroneousness of data and to parametrize its inclusion in the analysis of errors. The mixed-initiative approach is heavily facilitated by data visualization, which provides the interactive interface between the computer and the analyst: the computer adds computational results to the visualization, while the analyst uses the visualization to trigger, steer, and configure computations.
While all methods, algorithms, and software developed will be generally applicable, the project will be bootstrapped with concrete use-cases in collaboration with Ass. Prof. Søren Drud-Heydary Nielsen from the Department of Food Science. These use-cases will focus on low-quality mass spectrometry data that contain many missing values and suffer from a low signal-to-noise ratio. The work in this domain will serve as a proof-of-concept for the fundamental idea of VADE and have immediate, concrete benefits for the food sciences.
VADE advocates appreciating and handling erroneous data as helpful indicators of problems in the data management practices underlying the data. By showing that this is not only possible but also beneficial, VADE will pave the way for establishing errors as a first-class data property with their own computation and visualization methods.