Developing Eventscraper for Ugenda: How to keep a web scraper functional after a DOM change

Issue Date

2016-08-25

Language

en

Abstract

This thesis explores techniques for developing a web scraper that can still scrape web pages after their DOM has been altered. It discusses modern applications of web scraping and reviews the literature on existing web scraping approaches. A prototype web scraper, Eventscraper, was developed to evaluate the performance of several web scraping techniques. The research proposes a new technique for handling DOM changes: path distance search. Conducting an experiment to compare the performance of path distance search with existing techniques turned out to be infeasible. However, a hypothesis about its performance was formed, based on a detailed analysis of its behaviour. The thesis concludes with several suggestions for future research.

Faculty

Faculteit der Sociale Wetenschappen