Leveraging Text Classification Techniques to Find Events on the Web
Issue Date
2016-07-13
Language
en
Abstract
Ugenda is an organization which aggregates event information from other
websites. A lot of work goes into selecting web pages which contain events,
as there is no fixed structure between different websites. Ugenda is searching
for ways of automating the process of event page selection. One approach is
to look at the text content of all webpages and automatically determine if the
pages contain an event based on the text content.
The text content of web pages is extracted by collecting the text inside
select HTML tags. The resulting text is represented by counting the different
words in the text and placing those counts in a vector. A dataset is created
by crawling (following links on web pages) a select number of websites and
manually classifying the resulting pages into the classes event and other.
These pages are then transformed into a format which can be read
by the Weka data mining toolkit. Classification is performed using three
different classifiers to achieve the best performance possible. Three different
weighting schemes are also used in order to enhance performance.
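
The extraction and counting steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the set of tags in TEXT_TAGS, the helper names, the tokenization rule, and the three-word vocabulary are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumption: the abstract does not name the selected tags; these are illustrative.
TEXT_TAGS = {"title", "h1", "h2", "h3", "p", "li"}


class TagTextExtractor(HTMLParser):
    """Collect text that appears inside any of the selected tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = tags
        self.depth = 0    # number of currently open selected tags
        self.chunks = []  # text fragments found inside selected tags

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tags=TEXT_TAGS):
    """Return the concatenated text found inside the selected tags."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[word] for word in vocabulary]


# Hypothetical page and vocabulary, used only to demonstrate the helpers.
page = (
    "<html><head><title>Jazz concert</title></head><body>"
    "<script>var concert = 1;</script>"
    "<p>Concert tonight at the jazz club.</p></body></html>"
)
vocabulary = ["concert", "jazz", "ticket"]
text = extract_text(page)
vector = count_vector(text, vocabulary)
```

Text inside the script element is ignored because script is not among the selected tags. The sketch assumes that selected tags are properly closed; real pages with unclosed tags would need a more forgiving parser.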
The results are in line with established literature: classifiers can
distinguish reasonably well between pages with events and other pages. However,
the performance is not yet good enough for use by Ugenda.
Additionally, a similar case was investigated by assembling a random sample
of non-event web pages (not restricted to the websites selected by Ugenda)
and a number of event web pages from multiple websites, where each website
provides only a single event page. Pre-processing was done analogously to
the previously mentioned process. Classification of this dataset is, on average,
more difficult and thus yields worse performance.
Possible improvements are discussed. The document representation could
be changed to include phrases or concepts. The classification algorithms can
possibly be tuned further. In addition, the collected datasets are too small
to draw solid conclusions.
Faculty
Faculteit der Sociale Wetenschappen