Classifying texts using MScan
Issue Date
2019-06-20
Language
en
Abstract
This thesis describes the creation of MScan, a Dutch natural language processor,
and investigates whether it can be used to classify texts written by children.
A Dutch natural language processor called Tscan already exists. Tscan
computes a large number of features, sometimes more than necessary, which takes
considerable time. MScan was created for this reason. MScan returns fewer
features but is much faster than Tscan. For classification problems that do not
require that much information, MScan could be a good fit. The
input data for MScan came from the BasiScript corpus,
which contains over 86,000 written texts. MScan takes input files with the .txt
extension and produces output files with the .csv extension. A logistic and a
linear classifier were trained in WEKA; they correctly predicted 44.05% and
27.27% of the instances, respectively. The logistic classifier seems to fit this
problem best, since the grades the children are in are discrete rather than continuous.
The chance level of a random prediction in this situation is about 21.90%. This
means that the logistic classifier correctly predicts more than twice as many
instances as random classification would. The accuracy of both classifiers
can still be improved by using only texts that are longer than a certain
threshold. If new, better models are trained for the NameFinder class and the
POSTagger class in MScan, MScan's output will become more precise, which
might also lead to better classification in WEKA.
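
The comparison with the chance level can be checked with a few lines of arithmetic. The sketch below uses only the three percentages quoted in the abstract. The names `ACCURACY`, `CHANCE_LEVEL` and `ratio_to_chance` are illustrative and do not come from MScan or WEKA.

```python
# Percentages quoted in the abstract (share of instances predicted correctly).
ACCURACY = {"logistic": 44.05, "linear": 27.27}
CHANCE_LEVEL = 21.90  # approximate chance level quoted in the abstract


def ratio_to_chance(accuracy: float, chance: float = CHANCE_LEVEL) -> float:
    """Return how many times higher than the chance level an accuracy is."""
    return accuracy / chance


# Ratio of each classifier's accuracy to the chance level, rounded to 2 decimals.
ratios = {name: round(ratio_to_chance(acc), 2) for name, acc in ACCURACY.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} times the chance level")
```

On these figures the logistic classifier lands just above twice the chance level, while the linear classifier is only modestly above it.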
Faculteit der Sociale Wetenschappen