Automatic Quality Assessment of Datasets for Machine Learning

Issue Date

2025-04-15

Language

en

Abstract

This project explores data quality and its impact on machine learning performance. We introduce an assessment framework to automatically quantify the data quality of a given dataset based on the data quality dimensions completeness, consistency, and accuracy. Datasets for the project were synthesized by systematically introducing quality issues of varying severity. Training a range of machine learning models on these datasets reveals the large impact of the completeness and accuracy dimensions on model performance, and a lower impact associated with the consistency of data. The framework effectively assesses data quality in dataset versions where a single dimension is degraded, but its performance can be improved on versions with multiple degraded dimensions. Future work includes refining quality scores for the dimensions and extending the framework to include more machine learning tasks and models.
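
To make the idea of degrading a dataset along one dimension and then scoring it concrete, the following is a minimal sketch for the completeness dimension only. It is not the framework described in the abstract: the function names `completeness` and `degrade_completeness`, the choice of `None` as the missing-value marker, and the definition of the score as the fraction of non-missing cells are all assumptions made for illustration.

```python
import random


def completeness(rows):
    """Hypothetical completeness score: fraction of cells that are not None."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 1.0  # an empty table has no missing cells
    present = sum(1 for row in rows for value in row if value is not None)
    return present / total


def degrade_completeness(rows, severity, seed=0):
    """Return a copy of `rows` with roughly `severity` of all cells set to None.

    `severity` is a fraction in [0, 1]; the seed makes the degradation repeatable.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    dropped = set(rng.sample(cells, round(severity * len(cells))))
    return [
        [None if (i, j) in dropped else value for j, value in enumerate(row)]
        for i, row in enumerate(rows)
    ]


# A small, fully populated table degraded at increasing severity levels.
clean = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
for severity in (0.0, 0.3, 0.6):
    degraded = degrade_completeness(clean, severity)
    print(severity, completeness(degraded))
```

Scores for consistency and accuracy would need their own definitions (for example, rule violations or deviation from a trusted reference), which the abstract does not specify.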

Faculty

Faculteit der Sociale Wetenschappen