Data augmentations on low-resource abstractive summarization
Issue Date
2022-07-19
Language
en
Abstract
Natural Language Processing (NLP) tasks can struggle to perform well when the language model does not have access to a large dataset. Other fields, such as computer vision and automatic speech recognition, already make regular use of data augmentation, which modifies existing data so that new examples can be derived from it; in this way, the dataset can be artificially enlarged. In NLP, data augmentation tools are less commonly used, and the range of available augmentation types is limited. In this thesis, novel ways to perform data augmentation on textual data are devised, applied in experiments, and evaluated. Automatic abstractive summarization serves as a use case to test whether low-resource datasets can be enlarged without the ROUGE evaluation metric suffering. Augmentations that approach the text from a syntactic angle generally perform better, presumably because the ROUGE score favours closeness in wording to the original text. The results of semantic augmentations varied heavily, depending on how many changes were made to word choice. While one of the natural language generation (NLG) methods was among the standouts, replacing words with synonyms based on word vector similarity was disappointing in its effectiveness, as its results varied considerably. As future directions, syntactic augmentations could be fleshed out further for optimal quality retention when augmenting, and more NLG methods should be explored: although they take longer to apply, they also bring high potential.
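The abstract itself gives no implementation details, but a minimal Python sketch of the word-vector-based synonym replacement mentioned above, followed by a ROUGE comparison of the augmented text against the original, might look as follows. The embedding file name, replacement probability, similarity threshold, and ROUGE variants are illustrative assumptions rather than the thesis's actual configuration; the sketch relies on the gensim and rouge_score libraries.

    # Hypothetical sketch: semantic augmentation via word-vector synonym
    # replacement, checked against the original text with ROUGE.
    import random

    from gensim.models import KeyedVectors
    from rouge_score import rouge_scorer

    # Assumed pretrained word vectors (e.g. word2vec in binary format).
    vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)

    def synonym_augment(text, replace_prob=0.2, min_sim=0.6):
        """Replace some words with their nearest neighbour in embedding space."""
        augmented = []
        for token in text.split():
            if token in vectors.key_to_index and random.random() < replace_prob:
                neighbour, similarity = vectors.most_similar(token, topn=1)[0]
                # Only substitute when the neighbour is close enough to the original.
                if similarity >= min_sim:
                    augmented.append(neighbour)
                    continue
            augmented.append(token)
        return " ".join(augmented)

    original = "the model summarizes long documents into short abstracts"
    augmented = synonym_augment(original)

    # Compare augmented text to the original with ROUGE, the metric used in the thesis.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    print(augmented)
    print(scorer.score(original, augmented))

Because ROUGE rewards word overlap with the reference, aggressive synonym substitution tends to lower the score, which is consistent with the varied effectiveness reported above.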
Faculty
Faculteit der Sociale Wetenschappen