Exploring the sentiment analysis performance of BERT models on domain specific Twitter data when combined with an intelligent pre-processor

Issue Date

2022-06-19

Language

en

Abstract

Bidirectional Encoder Representations from Transformers (BERT) models are deep learning language models that interpret the meaning of language based on context. BERT models are widely used in Natural Language Processing (NLP) research for tasks such as Sentiment Analysis (SA). Social media platforms such as Twitter offer large quantities of data on which to run SA. However, Twitter data is very noisy, owing to extensive use of hashtags, emojis, abbreviations, and slang. This noise impairs the performance of BERT models on the SA task. There are BERT models that are pre-trained on Twitter data; however, the features labeled as noise are not included in the pre-training. A further problem arises when tweets contain a high count of niche vocabulary words that did not occur in the pre-training of the BERT models. We propose a fine-tuned, pre-trained BERT model combined with a pipeline of pre-processing methods, called the "intelligent pre-processor", to overcome these challenges. The intelligent pre-processor translates Twitter noise into a language structure that optimizes the model's performance. Domain knowledge helps the intelligent pre-processor detect niche vocabulary and replace it with common-language alternatives. The proposed model outperformed the baseline pre-trained Twitter-based BERT model on a sentiment analysis task, confirming findings of earlier research.
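
The abstract does not spell out the individual pre-processing steps, so the following is only a minimal illustrative sketch in Python of what such an intelligent pre-processor could look like. The emoji map, the hashtag, mention, and URL handling, and the domain-lexicon entries are hypothetical placeholders chosen for illustration; they are not the pipeline actually evaluated in the thesis.

    import re

    # Hypothetical domain lexicon: niche terms mapped to common-language alternatives.
    DOMAIN_LEXICON = {
        "hodl": "hold",
        "bullish": "optimistic",
        "bearish": "pessimistic",
    }

    # Small emoji-to-text map; a fuller pipeline might use a dedicated emoji library.
    EMOJI_MAP = {
        "🙂": " smiling face ",
        "😡": " angry face ",
    }

    def preprocess(tweet: str) -> str:
        """Translate common Twitter noise into plain language before tokenization."""
        text = tweet
        # Replace emojis with textual descriptions so they survive tokenization.
        for emoji, description in EMOJI_MAP.items():
            text = text.replace(emoji, description)
        # Strip the '#' from hashtags but keep the word itself.
        text = re.sub(r"#(\w+)", r"\1", text)
        # Drop user mentions and URLs, which rarely carry sentiment.
        text = re.sub(r"@\w+|https?://\S+", "", text)
        # Replace niche vocabulary with common-language alternatives.
        tokens = [DOMAIN_LEXICON.get(tok.lower(), tok) for tok in text.split()]
        return " ".join(tokens)

    if __name__ == "__main__":
        print(preprocess("Still bullish on the launch, #hodl 🙂 @trader https://t.co/xyz"))

The normalized output of such a sketch would then be fed to the fine-tuned, Twitter-based BERT model for sentiment classification.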

Faculty

Faculteit der Sociale Wetenschappen