HIERARCHICAL ECONOMIC ACTIVITY CLASSIFICATION
Keywords
Authors
Issue Date
2020-11-04
Language
en
Abstract
Gaining insight into which economic activities businesses with a
domain in the .nl zone participate in can help to increase online
security. This research project focuses on classifying the economic
activity of business and e-commerce related domains in the .nl zone
based on the text found on those domains.
Economic activities differ in how many domains participate in them,
just as they differ in the number of businesses they represent. The
economic activities to consider were selected based on the statistics of
the Dutch National Statistics Office (CBS). For several economic
activities, the number of businesses is low enough to label the
corresponding domains manually. Leaving those activities out of the
final classification model results in higher classification performance.
The main experiments in this research project consisted of three
parts: analyzing the influence of the textual feature extraction method,
the classification method and the classification approach. The results
indicate that, to achieve optimal classification performance, a term
frequency-inverse document frequency (tf-idf) feature extraction
method should be combined with a linear classifier trained using
Stochastic Gradient Descent (SGD) on the Modified Huber (MH) loss
function. These findings show that more complicated feature extraction
methods or more complicated classifiers do not guarantee higher
classification performance.
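The combination named above can be illustrated with a minimal sketch in plain Python. It is not the project's implementation: the toy documents, the binary labels, the smoothed idf variant, the learning rate and every function name are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

def tfidf_fit(docs):
    """Learn a {term: idf} mapping (smoothed idf; an assumed variant)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc.lower().split()))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_transform(doc, idf):
    """Map a document to an L2-normalised sparse tf-idf vector."""
    counts = Counter(t for t in doc.lower().split() if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def modified_huber_grad(y, score):
    """Derivative of the modified Huber loss w.r.t. the score, y in {-1, +1}."""
    z = y * score
    if z >= 1.0:
        return 0.0                      # correct with enough margin: no loss
    if z >= -1.0:
        return -2.0 * (1.0 - z) * y     # quadratic region: max(0, 1 - z)^2
    return -4.0 * y                     # linear region: -4z

def train_sgd(vectors, labels, epochs=50, lr=0.1, seed=0):
    """Train a linear model with per-example gradient steps."""
    rng = random.Random(seed)
    w, b = {}, 0.0
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = vectors[i], labels[i]
            score = b + sum(w.get(t, 0.0) * v for t, v in x.items())
            g = modified_huber_grad(y, score)
            if g != 0.0:
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) - lr * g * v
                b -= lr * g
    return w, b

def predict(doc, idf, w, b):
    """Return +1 or -1 for a raw document."""
    x = tfidf_transform(doc, idf)
    score = b + sum(w.get(t, 0.0) * v for t, v in x.items())
    return 1 if score >= 0 else -1

# Hypothetical toy data: +1 = retail-like text, -1 = health-care-like text.
docs = [
    "buy shoes online shop",
    "shoes sale webshop discount",
    "dentist appointment clinic",
    "clinic doctor health care",
]
labels = [1, 1, -1, -1]

idf = tfidf_fit(docs)
vectors = [tfidf_transform(d, idf) for d in docs]
w, b = train_sgd(vectors, labels)
print([predict(d, idf, w, b) for d in docs])
```

The sketch handles two classes only; a multi-class setting such as economic activity classification would need, for example, one such model per class.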
Hierarchical classification can be employed to perform classification
on the second level of the economic activity taxonomy. The final and
most important experiment in this research project was designed to
analyze whether hierarchical second-level economic activity
classification has an advantage over regular “flat” classification. The
results show that the hierarchical classification approach achieves a
higher classification performance.
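One common way to set up hierarchical classification is a top-down scheme: a first classifier picks the first-level class, and a separate classifier per first-level class then picks the second-level class. The sketch below assumes that scheme, which the abstract does not confirm as the one used in the project. The class names, the two-level labels and the token-counting base classifier are hypothetical stand-ins for a real text classifier.

```python
from collections import Counter, defaultdict

class TokenOverlapClassifier:
    """Toy base classifier: scores each class by how often it saw the tokens."""

    def fit(self, docs, labels):
        self.counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.lower().split())
        return self

    def predict(self, doc):
        tokens = doc.lower().split()
        # sorted() makes tie-breaking deterministic (first class alphabetically).
        return max(sorted(self.counts),
                   key=lambda c: sum(self.counts[c][t] for t in tokens))

class TopDownClassifier:
    """One classifier for level 1, plus one level-2 classifier per level-1 class."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, docs, labels):
        # labels are (level1, level2) tuples
        self.root = self.make_classifier().fit(docs, [l1 for l1, _ in labels])
        groups = defaultdict(lambda: ([], []))
        for doc, (l1, l2) in zip(docs, labels):
            groups[l1][0].append(doc)
            groups[l1][1].append(l2)
        self.children = {
            l1: self.make_classifier().fit(sub_docs, sub_labels)
            for l1, (sub_docs, sub_labels) in groups.items()
        }
        return self

    def predict(self, doc):
        l1 = self.root.predict(doc)                  # route on the first level
        return l1, self.children[l1].predict(doc)    # refine on the second level

# Hypothetical two-level taxonomy and toy training texts.
docs = [
    "shoes sneakers boots webshop",
    "fresh vegetables groceries webshop",
    "dentist teeth clinic",
    "physiotherapy back pain clinic",
]
labels = [
    ("retail", "footwear"),
    ("retail", "groceries"),
    ("health", "dentist"),
    ("health", "physiotherapy"),
]

model = TopDownClassifier(TokenOverlapClassifier).fit(docs, labels)
print(model.predict("webshop for leather boots"))
print(model.predict("clinic for teeth whitening"))
```

A flat classifier would instead choose among all second-level classes in a single step. The toy data here only shows the routing mechanics and says nothing about which approach performs better.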
From these results it can be concluded that the use of more complicated
methods for both feature extraction and classification does not
guarantee increased classification performance. Classification performance
can, however, be increased by exploiting an existing hierarchical
structure in the data. Using the example of economic activity classification,
we show that this performance increase generalizes from
benchmark datasets to a non-benchmark problem.
The research project itself was deemed a success: Stichting Internet
Domeinregistratie Nederland (SIDN), the administrator of the .nl
zone, decided to take the second level economic activity classifier into
production. Every month, the hierarchical classifier is used to generate
economic activity classifications for all business and e-commerce
related domains in the .nl zone.
Faculty
Faculteit der Sociale Wetenschappen