Author Identification in Short Texts

Most research on author identification considers large texts. Little research has been done on author identification for short texts, even though short texts have become common since the rise of digital media. The anonymous nature of internet applications makes it possible to use the internet for illegitimate purposes. In such cases, it can be very useful to be able to predict who the author of a message is.
Van der Knaap and Grootjen [28] showed that authors of short texts can be identified using single words (word unigrams) with Formal Concept Analysis. In theory, grammatical information can also indicate who wrote a text. Some of this information is captured by word bigrams: pairs of successive words, which reveal part of the sentence structure the author used. For this thesis I performed experiments using word bigrams as features for author identification, to determine whether performance increases compared to using word unigrams. In most languages, however, many grammatical relations within a sentence hold between words that are not successive. The DUPIRA parser, a natural language parser for Dutch, produces dependency triplets that represent relations between non-successive words, based on Dutch grammar. I used these triplets as features, either alone or in combination with unigrams or bigrams. People often use smileys when communicating through digital media, so I also examined the influence of smileys on author identification.
The messages used for the experiments were obtained from the subsection 'Eurovision Songfestival 2010' of the message board. From these messages the data files for 7 feature sets were constructed: word unigrams excluding smileys, word unigrams including smileys, word bigrams excluding smileys, word bigrams including smileys, dependency triplets only, triplets + word unigrams, and triplets + word bigrams.
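The word-level feature sets described above can be illustrated with a minimal sketch. It is not the code used in the thesis: the smiley pattern, the lowercasing step and the function names `tokenize` and `bigrams` are assumptions made for illustration, and the pattern covers only a few common ASCII smileys.

```python
import re

# Hypothetical smiley pattern (an assumption): eyes, optional nose, mouth.
SMILEY = re.compile(r"[:;=][-^]?[)(DPpOo]")
# A token is either a smiley or a run of word characters.
TOKEN = re.compile(r"[:;=][-^]?[)(DPpOo]|\w+")


def tokenize(message, keep_smileys=True):
    """Split a message into word unigrams, optionally keeping smileys."""
    tokens = []
    for tok in TOKEN.findall(message):
        if SMILEY.fullmatch(tok):
            if keep_smileys:
                tokens.append(tok)  # keep the smiley unchanged
        else:
            tokens.append(tok.lower())  # lowercase ordinary words
    return tokens


def bigrams(tokens):
    """Return pairs of successive tokens (word bigrams)."""
    return list(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    message = "Great song :) I loved it ;-)"
    print(tokenize(message))                               # unigrams including smileys
    print(tokenize(message, keep_smileys=False))           # unigrams excluding smileys
    print(bigrams(tokenize(message)))                      # bigrams including smileys
    print(bigrams(tokenize(message, keep_smileys=False)))  # bigrams excluding smileys
```

In this sketch, dropping a smiley before building bigrams joins the words on either side of it into one pair; whether that matches the preprocessing of the original experiments is not stated in the abstract.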
A support vector machine (SVM) algorithm was used as the classification method, as is common in author identification. Several implementations of SVM exist; in this thesis SMO, LibSVM and LibLINEAR are compared, and LibLINEAR gave the best results. In all conditions performance is above chance level, so every feature set reveals some information about the author. Word unigrams including smileys gave the best performance, while the dependency triplets gave the lowest. Performance also increases when smileys are included, so smileys provide additional information about the author.
Faculteit der Sociale Wetenschappen