Author Identification in Short Texts

Fissette, M.V.M.

Files

Fissette,M.BaThesis10.pdf (1.31 MB)

Authors

Fissette, M.V.M.

Issue Date

2010-08-11

Language

en

URI

http://theses.ubn.ru.nl/handle/123456789/73

Abstract

Most research on author identification considers large texts. Little research has been done on author identification for short texts, even though short texts have become common since the rise of digital media. The anonymous nature of internet applications makes it possible to use the internet for illegitimate purposes. In such cases, it can be very useful to be able to predict who the author of a message is. Van der Knaap and Grootjen [28] showed that authors of short texts can be identified using single words (word unigrams) with Formal Concept Analysis. In theory, grammatical information can also indicate the author of a text. Grammatical information can be captured by word bigrams. Word bigrams are pairs of successive words, so they reveal some information about the sentence structure the author used. For this thesis I performed experiments using word bigrams as features for author identification to determine whether performance increases compared to using word unigrams as features. In most languages, many grammatical relations within a sentence hold between words that are not successive. The DUPIRA parser, a natural language parser for Dutch, produces dependency triplets that represent relations between non-successive words, based on Dutch grammar. I used these triplets as features, either alone or in combination with unigrams or bigrams. People often use smileys when communicating through digital media. Therefore, I also examined the influence of smileys on author identification. The messages used for the experiments were obtained from the subsection 'Eurovision Songfestival 2010' of the fok.nl message board. From these messages, the data files for seven feature sets were constructed: word unigrams excluding smileys, word unigrams including smileys, word bigrams excluding smileys, word bigrams including smileys, only dependency triplets, triplets+word unigrams, triplets+word bigrams.
A support vector machine (SVM) algorithm was used as the classification method. This is a commonly used algorithm for author identification. There are different implementations of SVM; in this thesis SMO, LibSVM and LibLINEAR are compared. The LibLINEAR algorithm gave the best results. The results revealed that performance is above chance level in all conditions, so every feature set reveals some information about the author. Word unigrams including smileys gave the best performance, while the dependency triplets gave the lowest. The results also revealed that performance increases when smileys are considered, so smileys provide additional information about the author.
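As an illustration of the word unigram and word bigram feature sets described above, with and without smileys, the following is a minimal Python sketch. It is not the thesis's own code: the tokenizer, the smiley pattern, and the function names `tokenize`, `unigrams` and `bigrams` are assumptions made for this example, and the dependency triplets from the DUPIRA parser are not covered.

```python
import re
from collections import Counter

# Hypothetical smiley pattern covering a few common forms such as :) ;-) :D =P.
# The abstract does not say how smileys were detected; this is an assumption.
SMILEY = re.compile(r"[:;=][-']?[)(DPpOo]")
# Simple word pattern (assumption): runs of word characters, optional apostrophe part.
WORD = re.compile(r"\w+(?:'\w+)?")
# Try smileys first so that ":D" is not split into punctuation plus a word.
TOKEN = re.compile(SMILEY.pattern + "|" + WORD.pattern)


def tokenize(message, include_smileys=True):
    """Split a message into lowercased words and, optionally, smileys."""
    tokens = []
    for match in TOKEN.finditer(message):
        text = match.group(0)
        if SMILEY.fullmatch(text):
            if include_smileys:
                tokens.append(text)  # keep smileys case-sensitive
        else:
            tokens.append(text.lower())
    return tokens


def unigrams(tokens):
    """Count single tokens (word unigrams)."""
    return Counter(tokens)


def bigrams(tokens):
    """Count pairs of successive tokens (word bigrams)."""
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    message = "Great song :) great show :D"
    print(unigrams(tokenize(message, include_smileys=True)))
    print(bigrams(tokenize(message, include_smileys=False)))
```

Counts like these could then be turned into per-message feature vectors and passed to an SVM implementation; that step is left out here because the abstract does not describe the data file format.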

Supervisor

Grootjen, F.A.

Faculty

Faculteit der Sociale Wetenschappen

Programme

Artificial Intelligence

Specialisation

Bachelor Artificial Intelligence

Collections

Faculteit der Sociale Wetenschappen
