Data augmentations on low-resource abstractive summarization

Issue Date

2022-07-19

Language

en

Abstract

Natural Language Processing (NLP) tasks can struggle to perform well when the language model does not have access to a large dataset. Fields such as computer vision and automatic speech recognition already make regular use of data augmentation, which modifies existing data so that new examples can be derived from it; in this way, a dataset can be artificially enlarged. In NLP, data augmentation tools are less commonly used, and the available types of augmentation are limited. In this thesis, novel ways to perform data augmentation on textual data are devised, experimented with, and evaluated. Automatic abstractive summarization serves as a use case to test whether low-resource datasets can be enlarged without hurting performance on the ROUGE evaluation metric. Augmentations that approach the text from a syntactic angle generally perform better, presumably because the ROUGE score favours closeness in wording to the original text. The effectiveness of semantic augmentations varied heavily, depending on how many changes were made to word choice. While one of the natural language generation (NLG) methods was a standout, replacing words with synonyms based on word-vector similarity was disappointingly inconsistent. As future work, syntactic augmentations could be fleshed out further to retain as much quality as possible when augmenting, and more NLG methods should be explored: although they take longer to augment with, they also carry high potential.
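
To illustrate one of the semantic techniques named above, the sketch below shows word-vector-based synonym replacement in Python. It is a minimal sketch, assuming gensim's pretrained GloVe vectors, a plain whitespace tokenizer, and a hypothetical helper name synonym_replace; the thesis's actual augmentation pipeline (e.g. part-of-speech filtering, similarity thresholds, or replacement rates) may well differ.

# Minimal sketch of word-vector-based synonym replacement (an illustration,
# not the thesis's exact implementation): each token is, with some probability,
# swapped for one of its nearest neighbours in the embedding space.
import random

import gensim.downloader as api

# Pretrained GloVe vectors (an assumption; any gensim KeyedVectors model works).
vectors = api.load("glove-wiki-gigaword-100")


def synonym_replace(text, replace_prob=0.1, topn=5):
    """Return an augmented copy of `text` with some tokens replaced."""
    augmented = []
    for token in text.split():
        key = token.lower()
        if key in vectors and random.random() < replace_prob:
            # Pick one of the top-n most similar words as a stand-in synonym.
            neighbours = [w for w, _ in vectors.most_similar(key, topn=topn)]
            augmented.append(random.choice(neighbours))
        else:
            augmented.append(token)
    return " ".join(augmented)


print(synonym_replace("the model summarizes long documents into short abstracts",
                      replace_prob=0.3))

Because ROUGE rewards lexical overlap with the reference, higher replacement probabilities push the augmented text further from the scored wording, which is consistent with the varying effectiveness of this augmentation reported above.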

Faculty

Faculteit der Sociale Wetenschappen