Classifying texts using MScan

Swanenberg, M.R.H.

Files

Swanenberg, M.-s4331095.pdf (230.21 KB)

Authors

Swanenberg, M.R.H.

Issue Date

2019-06-20

Language

en

URI

https://theses.ubn.ru.nl/handle/123456789/12568

Abstract

This thesis describes the creation of MScan, a Dutch natural language processor, and investigates whether it can be used to classify texts written by children. A Dutch natural language processor called Tscan already exists. Tscan computes a large number of features, sometimes more than necessary, and this takes considerable time. MScan was created to address this: it returns fewer features but is much faster than Tscan. For classification problems that do not require as much information, MScan could be the perfect solution. The input data for MScan came from the BasiScript corpus, which contains over 86,000 written texts. MScan takes input files with the .txt extension and produces output files with the .csv extension. A logistic and a linear classifier were trained in WEKA; they correctly predicted 44.05% and 27.27% of the instances, respectively. The logistic classifier seems to fit this problem best, since the grades the children are in are discrete rather than continuous. The chance level of a random prediction in this situation is about 21.90%, so the logistic classifier correctly predicts more than twice as many instances as random classification would. The accuracy of both classifiers can still be improved by using only texts longer than a certain threshold. If new, better models are trained for the NameFinder class and the POSTagger class in MScan, MScan's output will be even more precise, which might also lead to better classification in WEKA.
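The comparison against chance in the abstract can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `lift_over_chance` is hypothetical and not part of MScan or WEKA, and the three percentages are the ones quoted in the abstract.

```python
def lift_over_chance(accuracy: float, chance: float) -> float:
    """Return how many times higher an accuracy is than the chance level.

    Hypothetical helper for illustration; both arguments are percentages.
    """
    if chance <= 0:
        raise ValueError("chance level must be positive")
    return accuracy / chance


# Figures quoted in the abstract (percent of correctly classified instances).
CHANCE_LEVEL = 21.90
logistic_lift = lift_over_chance(44.05, CHANCE_LEVEL)
linear_lift = lift_over_chance(27.27, CHANCE_LEVEL)

print(f"logistic classifier: {logistic_lift:.2f}x chance level")
print(f"linear classifier:   {linear_lift:.2f}x chance level")
```

Under these figures the logistic classifier comes out slightly above two times the chance level, while the linear classifier is only modestly above it.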

Supervisor

Grootjen, F.A.

Faculty

Faculteit der Sociale Wetenschappen