Linguistic Bias in Text Classification

Issue Date

2022-05-05

Language

en

Abstract

A growing concern within the field of machine learning is the presence of (unintended) bias and its implications for fairness. While gender and ethnicity bias have been a focus in the literature, this research is the first systematic study of linguistic bias in text classification. With different languages spoken around the world, linguistic bias in text classification algorithms may result in unfair treatment of groups. This paper therefore focuses on the quantification and mitigation of unintended linguistic bias in text classification. The research uses a data set of 35,000 tweets, embedded using the mBERT sentence transformer, to train multiple algorithms. The first experiment determines whether linguistic bias is present given an assumed-Dutch data set. The second experiment examines the effect of code-switching bias in the data on linguistic bias in text classification. The third experiment implements and evaluates various bias mitigation techniques. The results show that linguistic bias can be present in the predictions of text classification algorithms and that code-switching bias in the data affects linguistic bias. A new approach to bias mitigation, a machine translation preprocessing technique, was introduced and found to be a simple method to greatly reduce linguistic bias. With this thesis, opportunities are identified for future research on the quantification and mitigation of linguistic bias, providing concrete tools for the deployment of linguistically fair classification models.
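The kind of linguistic bias quantification the abstract describes can be illustrated as a gap in positive-prediction rates between language groups, a demographic-parity-style measure. This is a minimal sketch under that assumption; the function names, example predictions, and language labels below are illustrative and not taken from the thesis itself.

```python
# Sketch: quantify linguistic bias as the gap in positive-prediction
# rates between language groups (demographic-parity-style measure).
# All data below is hypothetical, for illustration only.

def positive_rate(predictions):
    """Fraction of examples classified as the positive class."""
    return sum(predictions) / len(predictions)

def linguistic_bias_gap(preds_by_language):
    """Largest difference in positive-prediction rate between any
    two language groups; 0.0 means parity across languages."""
    rates = [positive_rate(p) for p in preds_by_language.values()]
    return max(rates) - min(rates)

# Hypothetical classifier outputs (1 = flagged) per language group.
preds = {
    "nl": [1, 0, 0, 0, 1, 0, 0, 0],  # Dutch tweets: 2/8 flagged
    "en": [1, 1, 0, 1, 1, 0, 1, 0],  # English tweets: 5/8 flagged
}
print(round(linguistic_bias_gap(preds), 3))  # 0.375
```

A mitigation such as the machine translation preprocessing technique mentioned above would aim to shrink this gap by mapping all inputs to a single pivot language before classification, so that both groups are scored on comparable text.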

Faculty

Faculteit der Sociale Wetenschappen