Leveraging Text Classification Techniques to Find Events on the Web

Böhm, H.M.

Files

Bohm, H._BSc_Thesis_2016.pdf (1.3 MB)

Authors

Böhm, H.M.

Issue Date

2016-07-13

Language

en

URI

http://hdl.handle.net/123456789/2615

Abstract

Ugenda is an organization that aggregates event information from other websites. A lot of work goes into selecting web pages that contain events, as there is no fixed structure shared between different websites. Ugenda is searching for ways to automate this event page selection. One approach is to look at the text content of each web page and automatically determine from it whether the page contains an event. The text content of a web page is extracted by collecting the text inside select HTML tags. The resulting text is represented by counting the different words in the text and placing those counts in a vector. A dataset is created by crawling (following links on web pages) a select number of websites and manually classifying the resulting pages into the classes event and other. These pages are then transformed into a format that can be read by the Weka data mining toolkit. Classification is performed with three different classifiers to achieve the best performance possible. Three different weighting schemes are also used in order to enhance performance. The results are in line with established literature: classifiers can distinguish reasonably well between pages with events and other pages. However, the performance is not yet good enough for use by Ugenda. Additionally, a similar case was investigated by assembling a random sample of non-event web pages (not restricted to the websites selected by Ugenda) and a number of event web pages from multiple websites, where each website provides only a single event page. Pre-processing was done analogously to the previously mentioned process. Classification of this dataset is, on average, more difficult and thus yields worse performance. Possible improvements are discussed. The document representation could be changed to include phrases or concepts. The classification algorithms can possibly be tuned further, and the collected datasets are too small to draw solid conclusions.
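
The pre-processing steps summarized in the abstract (collecting text inside select HTML tags, counting words into a vector, and writing a Weka-readable file) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tag set, the tokenization rule, the `w_` attribute prefix, and all function names are assumptions made for illustration, and the sketch uses raw counts only, without any of the weighting schemes the abstract mentions. The output follows Weka's ARFF text format.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: the abstract does not name the tags; this set is illustrative.
SELECTED_TAGS = {"title", "h1", "h2", "h3", "p", "li"}


class SelectedTagText(HTMLParser):
    """Collects text that appears inside any of SELECTED_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of selected tags currently open
        self.chunks = []  # text fragments found inside selected tags

    def handle_starttag(self, tag, attrs):
        if tag in SELECTED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in SELECTED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the text found inside the selected tags of one page."""
    parser = SelectedTagText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def word_counts(text):
    """Count lowercase alphabetic words (a hypothetical tokenization rule)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vectors(pages):
    """Map each page to a vector of word counts over a shared vocabulary."""
    counts = [word_counts(extract_text(html)) for html in pages]
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors


def to_arff(vocabulary, vectors, labels):
    """Serialize count vectors and class labels as ARFF text."""
    lines = ["@relation pages"]
    # The w_ prefix keeps a word such as "class" from clashing with the label.
    lines += [f"@attribute w_{w} numeric" for w in vocabulary]
    lines.append("@attribute class {event,other}")
    lines.append("@data")
    for vector, label in zip(vectors, labels):
        lines.append(",".join(map(str, vector)) + "," + label)
    return "\n".join(lines) + "\n"
```

Calling `to_vectors` on a list of HTML strings and passing the result to `to_arff` together with the manually assigned labels (`"event"` or `"other"`) yields text that could be saved as an `.arff` file and loaded into Weka.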