Automatic Quality Assessment of Datasets for Machine Learning
Issue Date
2025-04-15
Language
en
Abstract
This project explores data quality and its impact on machine learning performance. We introduce an assessment framework that automatically quantifies the data quality of a given dataset along three data quality dimensions: completeness, consistency, and accuracy. Datasets for the project were synthesized by systematically introducing quality issues of varying severity. Training a range of machine learning models on these datasets reveals a large impact of the completeness and accuracy dimensions on model performance, and a lower impact of the consistency dimension. The framework effectively assesses data quality in dataset versions where a single dimension is degraded, but its performance leaves room for improvement on versions with multiple degraded dimensions. Future work includes refining the quality scores for the dimensions and extending the framework to more machine learning tasks and models.
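
The abstract does not state how the dimension scores are computed or how the quality issues are injected, so the following is only an illustrative sketch of the general idea for one dimension, completeness. It assumes a dataset represented as a list of dictionaries, a missing value represented as `None`, and a score defined as the share of non-missing cells. The function names `completeness` and `degrade_completeness` and the `severity` parameter are hypothetical and are not taken from the project.

```python
import random


def completeness(rows):
    """Share of cells that are not missing (None). Assumed definition, not the project's."""
    cells = [value for row in rows for value in row.values()]
    if not cells:
        return 1.0  # an empty dataset has no missing cells
    return sum(value is not None for value in cells) / len(cells)


def degrade_completeness(rows, severity, seed=0):
    """Return a copy of `rows` with a `severity` share of cells set to None.

    `severity` is a hypothetical knob in [0, 1]; the input rows are left untouched.
    """
    rng = random.Random(seed)
    degraded = [dict(row) for row in rows]
    positions = [(i, key) for i, row in enumerate(degraded) for key in row]
    n_missing = round(severity * len(positions))
    # Pick distinct cells so the resulting score is exactly controlled.
    for i, key in rng.sample(positions, n_missing):
        degraded[i][key] = None
    return degraded


if __name__ == "__main__":
    clean = [{"age": 20 + i, "income": 1000 * i} for i in range(10)]
    for severity in (0.0, 0.25, 0.5):
        version = degrade_completeness(clean, severity)
        print(severity, completeness(version))
```

Generating several versions of the same dataset at increasing `severity` and training the same model on each is one way to relate a quality score to model performance, in the spirit of the experiment the abstract describes.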
Faculty
Faculteit der Sociale Wetenschappen
