HIERARCHICAL ECONOMIC ACTIVITY CLASSIFICATION
Gaining insight into the economic activities of businesses with a domain in the .nl zone can help to increase online security. This research project focuses on classifying the economic activity of business and e-commerce related domains in the .nl zone based on the text found on those domains. Economic activities are not equally represented in the number of domains that participate in them. The economic activities to consider were therefore selected based on the statistics of the Dutch National Statistics Office (CBS). For several economic activities, the number of businesses is low enough to label the corresponding domains manually, and leaving those activities out of the final classification model results in higher classification performance. The main experiments in this research project consisted of three parts: analyzing the influence of the textual feature extraction method, of the classification method, and of the classification approach. The results indicate that, to achieve optimal classification performance, a term frequency-inverse document frequency (tf-idf) feature extraction method should be combined with a linear classifier trained using Stochastic Gradient Descent (SGD) on the Modified Huber (MH) loss function. These findings show that more complicated feature extraction methods or more complicated classifiers do not guarantee higher classification performance. Hierarchical classification can be employed to perform classification on the second level of the economic activity taxonomy. The final and most important experiment in this research project is designed to analyze whether hierarchical second-level economic activity classification has an advantage over regular “flat” classification.
The results show that a higher classification performance can be achieved when the hierarchical classification approach is used. From these results it can be concluded that the use of more complicated methods for both feature extraction and classification does not guarantee increased classification performance. Classification performance can, however, be increased by exploiting an existing hierarchical structure in the data. Using the example of economic activity classification, we show that this performance increase generalizes from benchmark datasets to a non-benchmark problem. The research project itself was deemed a success: Stichting Internet Domeinregistratie Nederland (SIDN), the administrator of the .nl zone, decided to take the second-level economic activity classifier into production. Every month, the hierarchical classifier is used to generate economic activity classifications for all business and e-commerce related domains in the .nl zone.
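The combination described in the abstract (tf-idf features, a linear classifier trained with SGD on the Modified Huber loss, and a top-down hierarchical approach) could look roughly like the sketch below. It is an illustration only and not the thesis implementation: it assumes scikit-learn, and the class name `HierarchicalClassifier`, the helper `make_model`, the toy texts, and the labels such as `retail/clothing` are all invented for this example and are not actual CBS activity codes.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def make_model():
    # tf-idf features feeding a linear model trained with SGD
    # on the modified Huber loss.
    return make_pipeline(
        TfidfVectorizer(),
        SGDClassifier(loss="modified_huber", random_state=0),
    )


class HierarchicalClassifier:
    """Hypothetical top-down classifier: level 1 first, then level 2 within that branch."""

    def fit(self, texts, level1, level2):
        # One model decides the first-level activity.
        self.top = make_model().fit(texts, level1)
        # Group the training documents by their first-level label.
        by_parent = defaultdict(list)
        for text, parent, child in zip(texts, level1, level2):
            by_parent[parent].append((text, child))
        # One second-level model per first-level label.
        self.children = {}
        for parent, rows in by_parent.items():
            docs, labels = zip(*rows)
            if len(set(labels)) == 1:
                # Only one possible child: store the label, no model needed.
                self.children[parent] = labels[0]
            else:
                self.children[parent] = make_model().fit(list(docs), list(labels))
        return self

    def predict(self, texts):
        results = []
        for text, parent in zip(texts, self.top.predict(texts)):
            parent = str(parent)
            child = self.children[parent]
            if not isinstance(child, str):
                # Route the document to the model of the predicted branch.
                child = str(child.predict([text])[0])
            results.append((parent, child))
        return results


# Toy data with invented two-level labels (assumption, for illustration only).
texts = [
    "shop for shirts jeans and dresses online",
    "summer fashion sale on jackets and shoes",
    "buy laptops phones and tablets",
    "television and headphone deals with free shipping",
    "book a table for dinner and view our menu",
    "pizza pasta and wine served daily",
    "book a room with breakfast and spa access",
    "hotel rooms near the city centre",
]
level1 = ["retail"] * 4 + ["hospitality"] * 4
level2 = (
    ["retail/clothing"] * 2
    + ["retail/electronics"] * 2
    + ["hospitality/restaurant"] * 2
    + ["hospitality/hotel"] * 2
)

clf = HierarchicalClassifier().fit(texts, level1, level2)
predictions = clf.predict(["cheap phones and laptops", "menu and dinner reservations"])
print(predictions)
```

A "flat" baseline for comparison would be a single `make_model().fit(texts, level2)` over all second-level labels at once. In the hierarchical variant, each second-level model only has to separate the activities within one first-level branch.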
Faculteit der Sociale Wetenschappen