Abstract:
Unsupervised learning is an important topic in machine learning. In particular, clustering
is an unsupervised learning problem that arises in a variety of applications for data analysis
and mining. Unfortunately, clustering is an ill-posed problem and, as such, a challenging
one: no ground-truth that can be used to validate clustering results is available. Two issues
arise as a consequence. First, different clustering algorithms embed different biases, resulting
from their respective optimization criteria, so each algorithm may discover different patterns
in a given dataset. The second issue concerns parameter setting: in clustering,
parameters control the characterization of individual clusters, and the total number
of clusters in the data.
Clustering ensembles have been proposed to address the issue of the different biases induced
by various algorithms. Clustering ensembles combine different clustering results, and can
provide solutions that are robust against spurious elements in the data. Although clustering
ensembles represent a significant advance, they do not satisfactorily address the model selection
and parameter tuning problems.
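One common way to combine base clusterings is a co-association matrix: two points are linked if they are co-clustered in a large enough fraction of the base results, and connected components form the consensus clusters. The sketch below illustrates this baseline combiner only; the `threshold` value and helper names are illustrative and are not part of the NBCE model described in this dissertation.

```python
def consensus(labelings, threshold=0.5):
    """Combine several clusterings of the same n points via co-association.

    `labelings` is a list of label lists, one per base clustering.
    Points co-clustered in at least `threshold` of the base results are
    linked; connected components become the consensus clusters.
    (Illustrative baseline, not the NBCE model.)
    """
    n = len(labelings[0])
    m = len(labelings)
    # co[i][j]: fraction of base clusterings that put i and j together
    co = [[sum(L[i] == L[j] for L in labelings) / m for j in range(n)]
          for i in range(n)]
    # union-find over strongly co-associated pairs
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] >= threshold:
                parent[find(i)] = find(j)
    # consensus labels: the root id of each point's component
    return [find(i) for i in range(n)]
```

For example, three base clusterings that mostly agree on two groups of two points yield those two groups as the consensus, even though one base clustering misplaces a point.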
Bayesian approaches have been applied to clustering to address the parameter tuning
and model selection issues. Bayesian methods provide a principled way to address these
problems by assuming prior distributions on model parameters. Prior distributions assign
low probability to unlikely parameter values; they therefore serve as regularizers
for model parameters, and can help avoid overfitting. In addition, the marginal likelihood
is used by Bayesian approaches as the criterion for model selection. Although Bayesian
methods provide a principled way to perform parameter tuning and model selection, the
key question "How many clusters?" remains open. This is a fundamental question for model
selection. A special class of Bayesian methods, nonparametric Bayesian approaches, has
been proposed to address this important model selection issue. Unlike parametric Bayesian
models, for which the number of parameters is finite and fixed, nonparametric Bayesian
models allow the number of parameters to grow with the number of observations. After
observing the data, a nonparametric Bayesian model fits the data with a finite number of
parameters.
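The way model size grows with the data can be illustrated with the Chinese Restaurant Process, the partition distribution induced by a Dirichlet Process: each new observation joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter. This is a minimal sketch under standard definitions; `alpha` is the usual concentration parameter, not a value from this dissertation.

```python
import random

def sample_crp(n, alpha):
    """Sample a partition of n points from the Chinese Restaurant Process.

    Customer i joins existing table k with probability n_k / (i + alpha),
    or a new table with probability alpha / (i + alpha), so the number of
    tables (clusters) grows, slowly, with the number of observations.
    """
    counts = []       # number of customers at each table
    assignments = []  # table index for each customer
    for i in range(n):
        weights = counts + [alpha]  # existing tables, then a new one
        r = random.uniform(0, i + alpha)
        acc = 0.0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[k] += 1     # join table k
        assignments.append(k)
    return assignments
```

After observing the data, only finitely many tables are occupied, which is the sense in which a nonparametric model fits any finite dataset with finitely many parameters.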
An additional issue with clustering is high dimensionality. High-dimensional data pose
a difficult challenge to the clustering process. A common scenario with high-dimensional
data is that clusters may exist in different subspaces comprised of different combinations of
features (dimensions). In other words, data points in a cluster may be similar to each other
along a subset of the dimensions, but not along all of them. Subspace clustering techniques,
also known as co-clustering or bi-clustering, have been proposed to address the dimensionality
issue (here, I use the term co-clustering). Like clustering, co-clustering suffers from an
ill-posed nature and the lack of ground truth to validate its results.
Although attempts have been made in the literature to address individually the major
issues related to clustering, no previous work has addressed them jointly. In my dissertation
I propose a unified framework that addresses all three issues at the same time. I designed a
nonparametric Bayesian clustering ensemble (NBCE) approach, which assumes that multiple
observed clustering results are generated from an unknown consensus clustering. The under-
lying distribution is assumed to be a mixture distribution with a nonparametric Bayesian
prior, i.e., a Dirichlet Process. The number of mixture components, a.k.a. the number
of consensus clusters, is learned automatically. By combining the ensemble methodology
and nonparametric Bayesian modeling, NBCE addresses both the ill-posed nature and the
parameter setting/model selection issues of clustering. Furthermore, NBCE outperforms
individual clustering methods, since it can escape local optima by combining multiple
clustering results.
I also designed a nonparametric Bayesian co-clustering ensemble (NBCCE) technique.
NBCCE inherits the advantages of NBCE and, in addition, is effective with high-dimensional
data. As such, NBCCE provides a unified framework to address all three aforementioned
issues. NBCCE assumes that multiple observed co-clustering results are generated from an
unknown consensus co-clustering. The underlying distribution is assumed to be a mixture
with a nonparametric Bayesian prior. I developed two models to generate co-clusters in
terms of row- and column-clusters. In one case row- and column-clusters are assumed to be
independent, and NBCCE assumes two independent Dirichlet Process priors on the hidden
consensus co-clustering, one for rows and one for columns. The second model captures the
dependence between row- and column-clusters by assuming a Mondrian Process prior on the
hidden consensus co-clustering. Combined with Mondrian priors, NBCCE provides more
flexibility to fit the data.
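A Mondrian Process over the unit square can be sketched as a recursive sequence of axis-aligned cuts: each block is cut after an exponentially distributed cost depending on its half-perimeter, until a budget is exhausted, and the resulting leaves form a partition in which row and column cuts depend on each other. The sampler below is a generic sketch of this generative process under its standard definition, not the inference procedure of NBCCE; the `budget` value is illustrative.

```python
import random

def sample_mondrian(budget, x0=0.0, x1=1.0, y0=0.0, y1=1.0):
    """Sample a Mondrian partition of the rectangle [x0,x1] x [y0,y1].

    The cost of cutting a block is exponential with rate equal to its
    half-perimeter; if the cost exceeds the remaining budget, the block
    becomes a leaf. Otherwise the cut dimension is chosen proportionally
    to side length and the cut position uniformly along that side.
    Returns the leaf blocks as (x0, x1, y0, y1) tuples.
    """
    half_perim = (x1 - x0) + (y1 - y0)
    cost = random.expovariate(half_perim)
    if cost > budget:
        return [(x0, x1, y0, y1)]  # leaf block: one region of the co-clustering
    budget -= cost
    if random.random() < (x1 - x0) / half_perim:
        cut = random.uniform(x0, x1)  # vertical cut (splits columns)
        return (sample_mondrian(budget, x0, cut, y0, y1)
                + sample_mondrian(budget, cut, x1, y0, y1))
    else:
        cut = random.uniform(y0, y1)  # horizontal cut (splits rows)
        return (sample_mondrian(budget, x0, x1, y0, cut)
                + sample_mondrian(budget, x0, x1, cut, y1))
```

Because each cut is made within a single block rather than across the whole square, the resulting row and column partitions are coupled, which is what lets a Mondrian prior capture dependence between row- and column-clusters.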
I have performed extensive evaluation on relational data and protein-molecule interaction
data. The empirical evaluation demonstrates the effectiveness of NBCE and NBCCE and
their advantages over traditional clustering and co-clustering methods.