Exploring the sentiment analysis performance of BERT models on domain specific Twitter data when combined with an intelligent pre-processor
Issue Date
2022-06-19
Language
en
Abstract
Bidirectional Encoder Representations from Transformers (BERT) is a deep learning language model used to understand the meaning of language based on context. BERT models are widely used in Natural Language Processing (NLP) research for tasks such as Sentiment Analysis (SA). Social media platforms such as Twitter offer a large quantity of data to run SA on. However, Twitter data is very noisy due to the extensive use of hashtags, emojis, abbreviations, and slang, and this noise impairs the performance of BERT models on the SA task. There are BERT models that are pre-trained on Twitter data, but the features labeled as noise are not included in the pre-training. A further problem arises when Tweets contain a high count of niche vocabulary words that did not occur in the pre-training of the BERT models. We propose a fine-tuned pre-trained BERT model combined with a pipeline of pre-processing methods, called the "intelligent pre-processor", to overcome these challenges. The intelligent pre-processor translates Twitter noise into a language structure that optimizes the model's performance. Domain knowledge is used to help the intelligent pre-processor detect niche vocabulary and replace it with common-language alternatives. The proposed model outperformed the baseline pre-trained Twitter-based BERT model on a sentiment analysis task and confirmed the findings of earlier research.
Faculty
Faculteit der Sociale Wetenschappen
