Graph-based semi-supervised learning of semantic text clusters

The bag-of-words model is a common approach to representing documents for all kinds of text mining tasks. However, its assumption that words are independent does not reflect the complexity and context of natural language. We propose a graph-based representation of document collections that includes documents and features together with their syntactic, semantic and frequency-based relations. Using semi-supervised learning, an approach that incorporates the structure of unlabeled data in addition to labeled data when training a classifier, we investigate the influence of different graph properties on text categorization. The results show that, although bag-of-words is a powerful approach, adding word relations significantly improves classification performance. Whether syntactic or semantic feature relations are used has, however, no significant influence. Although graph-based semi-supervised learning outperforms bag-of-words-based supervised and semi-supervised approaches as the number of labeled documents is varied, it is not able to exploit the full potential of the unlabeled data. The main advantage of graph-based methods is the flexibility to adapt the document representation to a specific text mining task.
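To make the idea of a document and feature graph concrete, the following is a minimal sketch and not the method evaluated in this work. It assumes a simple clamped label propagation scheme as a stand-in for the semi-supervised learner, term counts as document-word edge weights, and a hand-written list of word-word relations. All function names, weights and the toy data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_graph(doc_tokens, word_relations):
    """Build an undirected weighted graph (illustrative assumption).

    Nodes are ("doc", id) and ("word", token). Document-word edges are
    weighted by term count; word-word edges carry the given relation weight.
    """
    graph = defaultdict(dict)
    for doc, tokens in doc_tokens.items():
        d = ("doc", doc)
        for tok in tokens:
            w = ("word", tok)
            graph[d][w] = graph[d].get(w, 0.0) + 1.0
            graph[w][d] = graph[d][w]
    for a, b, weight in word_relations:
        graph[("word", a)][("word", b)] = weight
        graph[("word", b)][("word", a)] = weight
    return graph


def propagate_labels(graph, seeds, classes, iterations=50):
    """Clamped label propagation: labeled nodes keep their label, every
    other node takes the weighted average of its neighbours' scores."""
    scores = {node: {c: 0.0 for c in classes} for node in graph}
    for node, label in seeds.items():
        scores[node][label] = 1.0
    for _ in range(iterations):
        updated = {}
        for node, nbrs in graph.items():
            if node in seeds:
                updated[node] = scores[node]  # keep labeled nodes fixed
                continue
            total = sum(nbrs.values())
            updated[node] = {
                c: sum(w * scores[n][c] for n, w in nbrs.items()) / total
                for c in classes
            }
        scores = updated
    return scores


# Toy collection: d0 and d1 are labeled, d2 and d3 are unlabeled.
doc_tokens = {
    "d0": ["goal", "match", "team"],
    "d1": ["stock", "market", "bank"],
    "d2": ["striker", "team"],
    "d3": ["shares", "investor"],  # shares no word with any labeled document
}
# Hypothetical word-word relations (e.g. semantic relatedness).
word_relations = [("shares", "stock", 1.0), ("striker", "goal", 1.0)]
seeds = {("doc", "d0"): "sports", ("doc", "d1"): "finance"}
classes = ["sports", "finance"]

graph = build_graph(doc_tokens, word_relations)
scores = propagate_labels(graph, seeds, classes)
predictions = {
    doc: max(classes, key=lambda c: scores[("doc", doc)][c])
    for doc in doc_tokens
    if ("doc", doc) not in seeds
}
print(predictions)
```

In this toy setting, document d3 has no word in common with either labeled document, so a pure bag-of-words comparison would give it no evidence for any class. It is connected to the labeled finance document only through the assumed relation between "shares" and "stock", which is the kind of word relation the graph representation is meant to capture.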
Faculteit der Sociale Wetenschappen