Abstract:
Clustering is a popular approach to exploratory data analysis and mining. How-
ever, clustering faces difficult challenges due to its ill-posed nature. First, it is well
known that off-the-shelf clustering methods may discover different patterns in a given
set of data, because each clustering algorithm has its own bias resulting from the optimization of different criteria. Second, there is no ground truth against which the
clustering result can be validated. High dimensional data also pose a difficult challenge to the clustering process. Various clustering algorithms can handle data with
low dimensionality, but as the dimensionality of the data increases, these algorithms
tend to break down. In this dissertation, we introduce novel clustering ensemble
techniques and novel semi-supervised approaches to address these problems.
Clustering ensembles offer a solution to challenges inherent to clustering arising
from its ill-posed nature: they can provide more robust and stable solutions by making use of the consensus across multiple clustering results, and they can average out
the emergent spurious structures which arise due to the various biases of each participating algorithm, and due to the variance induced by different data samples. We
introduce and analyze three new consensus functions for ensembles of subspace clusterings. The ultimate goal of our consensus functions is to provide hard partitions
of the data, and weight vectors which convey information regarding the subspaces
within which the individual clusters exist. We demonstrate the effectiveness of our
three techniques by running experiments with several real datasets, including high
dimensional text data, and investigate the issue of diversity and accuracy in our
ensemble techniques.
We also study scenarios in which limited knowledge on the data (in terms of pair-
wise constraints) is available from the user. We develop a methodology to embed such
constraints into the ensemble components, so that the desired structure emerges via
the consensus clustering. We introduce a mechanism which leverages the ensemble
framework to bootstrap informative constraints directly from the data and from the
various clusterings, without intervention from the user. We demonstrate the effectiveness of our proposed techniques with experiments using real datasets and other
state-of-the-art semi-supervised techniques.