Fast Representatives Search as an Initialization for Scalable Sparse Subspace Clustering
High-dimensional data clustering is a difficult task due to the sparsity, correlated features and specific subspace structures of high-dimensional data. However, the self-expressiveness property states that data points can be most efficiently represented as data points from their own subspace. This property is successfully used by Elhamifar and Vidal (2013) to cluster high-dimensional data with the sparse subspace clustering (SSC) algorithm. However, the computational complexity of SSC is too high to be applied on datasets with large number of data points. Scalable sparse subspace clustering (SSSC) uses an in-sample out-of-sample approach to speed SSC up. Two steps were taken in this research to improve the random initialization of the in-sample set of SSSC. First, the computational complexity of an algorithm for representative selection called sparse representatives modeling selection (SMRS) was improved using a divide-and-conquer strategy. This new algorithm was called hierarchical sparse representatives (HSR). Secondly, the representatives from SMRS (or HSR) were used to initialize the in-sample set smartly. Theoretical and empirical results indicated that SMRS and HSR had similar results. The representatives from both algorithms overlapped and the importance given to them correlated. However, using representatives or non-representatives as an initialization of the in-sample set of SSSC did not significantly change its performance.
Faculteit der Sociale Wetenschappen