Linguistic Bias in Text Classification
Issue Date
2022-05-05
Language
en
Abstract
A growing concern within the field of machine learning is the presence of (unintended) bias and its implications for fairness. While gender and ethnicity bias have been a focus within the literature, this research is the first systematic study of linguistic bias in text classification. With different languages spoken around the world, linguistic bias in text classification algorithms may result in unfair treatment of groups. Therefore, this paper focuses on the quantification and mitigation of unintended linguistic bias in text classification. The research uses a data set of 35,000 tweets, embedded using the mBERT sentence transformer, to train multiple algorithms. In the first experiment, the presence of linguistic bias is determined on an assumed-Dutch data set. In the second experiment, the effect of code-switching bias in the data on linguistic bias in text classification is examined. In the third experiment, various bias mitigation techniques are implemented and evaluated. The results of the experiments show that linguistic bias can be present in the predictions of text classification algorithms and that code-switching bias in the data affects linguistic bias. A new approach to bias mitigation, a machine translation preprocessing technique, was introduced and found to be a simple method that greatly reduces linguistic bias. With this thesis, opportunities are identified for future research on the quantification and mitigation of linguistic bias, providing concrete tools for the deployment of linguistically fair classification models.
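The machine-translation preprocessing technique mentioned in the abstract can be sketched as follows. This is a minimal illustration of the idea only: `translate_to_english` is a hypothetical stand-in for a real MT system (the abstract does not name one), and the tweets and language codes are invented for demonstration.

```python
def translate_to_english(text: str, lang: str) -> str:
    """Hypothetical MT call; a toy lookup table stands in for a real model."""
    toy_mt = {
        "dit is een voorbeeld": "this is an example",  # Dutch
        "ceci est un exemple": "this is an example",   # French
    }
    if lang == "en":
        return text
    return toy_mt.get(text, text)  # fall back to the original on a miss


def preprocess_corpus(tweets: list[tuple[str, str]]) -> list[str]:
    """Map (text, language) pairs onto a single-language corpus."""
    return [translate_to_english(text, lang) for text, lang in tweets]


corpus = preprocess_corpus([
    ("dit is een voorbeeld", "nl"),
    ("ceci est un exemple", "fr"),
    ("this is an example", "en"),
])
print(corpus)  # all three tweets now share one surface language
```

After this step, every tweet would be embedded (e.g. with a multilingual sentence transformer such as mBERT, as in the abstract) and classified in a single language, which is the mechanism by which translation preprocessing can reduce language-dependent signal reaching the classifier.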
Faculty
Faculteit der Sociale Wetenschappen
