Multi-Label Classification of Movie Genres using Text-based Features and WordNet Hypernyms
Text categorization techniques have become increasingly important in the past decade. Whereas many approaches rely on video or audio features for classifying digital media, text-based features provide a considerable amount of information and are computationally inexpensive to process. In this thesis we present a large database of movie subtitles in natural language, which is used to predict genre labels in a multi-label classification problem. We provide methods to extract text-based features and to reduce attribute dimensionality effectively. We also demonstrate the generation of a second dataset using WordNet, in which all words from the original subtitles are replaced by their direct hypernyms. Each dataset is further prepared in two variants, with and without a TF-IDF transformation, giving four experimental conditions. We hypothesize that the dataset containing hypernyms will outperform the original dataset of text-based features. Furthermore, we hypothesize that the TF-IDF transformation has a positive effect on classification accuracy. A selection of multi-label classification techniques was tested on the four conditions. Results show high classification performance but no significant difference between the four experimental conditions.
Faculteit der Sociale Wetenschappen
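As a rough illustration of the four conditions described in the abstract (original or hypernym tokens, each with or without TF-IDF weighting), the following minimal sketch may help. It is not the thesis's implementation: `TOY_HYPERNYMS` is a hypothetical hand-written dictionary standing in for a WordNet lookup, the two subtitle snippets are invented, and the weighting used here (relative term frequency times log(N / document frequency)) is only one common TF-IDF formulation, assumed for the example.

```python
import math
from collections import Counter

# Hypothetical stand-in for WordNet: maps a word to its direct hypernym.
# A real pipeline would query WordNet instead of this toy dictionary.
TOY_HYPERNYMS = {
    "dog": "canine",
    "cat": "feline",
    "gun": "weapon",
    "knife": "weapon",
}


def replace_with_hypernyms(tokens, hypernyms):
    """Replace each token by its hypernym; keep tokens without an entry."""
    return [hypernyms.get(token, token) for token in tokens]


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented subtitle fragments, already tokenized.
subtitles = [
    "the dog chased the cat".split(),
    "he drew a gun and a knife".split(),
]

# Conditions 1 and 2: original tokens, without and with TF-IDF.
raw_counts = [Counter(doc) for doc in subtitles]
raw_weights = tf_idf(subtitles)

# Conditions 3 and 4: hypernym tokens, without and with TF-IDF.
hyper_docs = [replace_with_hypernyms(doc, TOY_HYPERNYMS) for doc in subtitles]
hyper_counts = [Counter(doc) for doc in hyper_docs]
hyper_weights = tf_idf(hyper_docs)
```

Note how "gun" and "knife" collapse into the single attribute "weapon" in the hypernym variant, which is the kind of generalization (and dimensionality reduction) the hypernym dataset is meant to provide.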