Preface
According to Wikipedia,
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.
In this post I do not discuss clustering algorithms but rather the ways to evaluate a clustering result. My current problem is building a codebook for visual categorization, i.e., clustering a huge dataset (~6.525 million feature vectors) into clusters (visual words). After that, the codebook serves as the reference for voting on samples. In other words, this is exactly the BoW (bag-of-words) method. The question is how to know whether a clustering result is discriminative enough. Here I note some ideas from Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
K-means clustering
It is natural to start a discussion of clustering by reviewing the K-means algorithm:
- A special case of a general procedure known as EM (Expectation Maximization)
- Termination conditions:
  - A fixed number of iterations
  - Sample partition unchanged
  - Centroid positions unchanged (does the 2nd condition hold?)
- Time complexity: $O(IKNM)$, where
  - $I$: number of iterations
  - $K$: number of clusters
  - $N$: number of samples
  - $M$: sample dimension
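As a rough illustration of the termination conditions and the cost per iteration, here is a minimal pure-Python sketch of K-means. It is not the implementation used for the codebook: the function name `kmeans`, the random seeding of centroids, the squared Euclidean distance and the handling of empty clusters are all assumptions made for this example.

```python
import random


def kmeans(samples, k, max_iter=100, seed=0):
    """Lloyd-style K-means on a list of equal-length tuples of floats.

    Returns (centroids, assignment, iterations_used).
    Assumptions of this sketch: centroids are seeded with k random
    samples, and distance is squared Euclidean.
    """
    if max_iter < 1 or not 1 <= k <= len(samples):
        raise ValueError("need max_iter >= 1 and 1 <= k <= len(samples)")
    rng = random.Random(seed)
    centroids = rng.sample(samples, k)
    assignment = None
    iteration = 0
    # Termination condition 1: a fixed budget of iterations.
    for iteration in range(1, max_iter + 1):
        # Assignment step: every sample is compared with every centroid
        # over every dimension, i.e. roughly N * K * M operations.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(s, centroids[c])))
            for s in samples
        ]
        # Termination condition 2: the sample partition is unchanged.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [s for s, a in zip(samples, assignment) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[c])
        # Termination condition 3: centroid positions are unchanged.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignment, iteration
```

In this sketch an unchanged partition always yields unchanged centroids, so the third check mostly matters on the very first pass. Each pass costs on the order of (number of samples) × (number of clusters) × (sample dimension), which is multiplied by the number of iterations to get the total.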
Evaluation of clustering
Internal evaluation
- High intra-cluster similarity
- Low inter-cluster similarity
- Measured quality of a clustering depends on
- sample representation (i.e., how to represent descriptors efficiently from raw data)
- similarity measure (i.e., this post)
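The two criteria can be made concrete with a small sketch. Here squared Euclidean distance stands in for (dis)similarity, which is an assumption, and the helper names `sq_dist`, `centroid` and `internal_scores` are made up for this example. A low intra-cluster value and a high inter-cluster value correspond to high intra-cluster similarity and low inter-cluster similarity.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def internal_scores(samples, assignment):
    """Return (intra, inter) for a clustering with at least two clusters.

    intra: mean squared distance from each sample to its own centroid.
    inter: smallest squared distance between two cluster centroids.
    """
    groups = {}
    for s, a in zip(samples, assignment):
        groups.setdefault(a, []).append(s)
    if len(groups) < 2:
        raise ValueError("need at least two clusters")
    centroids = {a: centroid(pts) for a, pts in groups.items()}
    intra = sum(sq_dist(s, centroids[a])
                for s, a in zip(samples, assignment)) / len(samples)
    inter = min(sq_dist(centroids[a], centroids[b])
                for a, b in combinations(centroids, 2))
    return intra, inter
```

Both numbers change if the representation or the distance changes, which is exactly the dependence listed above.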
Comment: This kind of evaluation does not seem very meaningful to me. Instead of using it, I plug the clustering result into another application and measure that application's performance to decide whether the clustering is good or not.
External evaluation
Although clustering is unsupervised learning, it can benefit from benchmark or labeled data, if available. Assume that I have such benchmark data and want to know whether a clustering method and its accompanying parameters are good. The following measures can be used:
Purity
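$$\text{purity}(\Omega, C) = \frac{1}{N} \sum_{k} \max_{j} |\omega_k \cap c_j|$$
in which $\Omega = \{\omega_1, \ldots, \omega_K\}$ is the set of clusters,
$C = \{c_1, \ldots, c_J\}$ is the set of classes, and $N$ is the number of samples.
Purity measures how strongly each cluster is dominated by a single class. The more classes are mixed within a cluster, the lower the purity. However, high purity is trivially obtained in the case $K = N$, i.e., when each sample gets its own cluster.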
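A minimal sketch of purity, assuming the clustering and the benchmark are given as two equal-length label sequences: each cluster is credited with its majority class, and the credited counts are summed and divided by the number of samples. The function name `purity` and the toy labels are made up for illustration.

```python
from collections import Counter


def purity(clusters, classes):
    """Fraction of samples that belong to the majority class of their cluster."""
    if len(clusters) != len(classes) or not clusters:
        raise ValueError("need two non-empty label sequences of equal length")
    per_cluster = {}
    for k, c in zip(clusters, classes):
        per_cluster.setdefault(k, Counter())[c] += 1
    majority_total = sum(max(counts.values())
                         for counts in per_cluster.values())
    return majority_total / len(clusters)


# Toy data: 17 samples in three clusters, three classes "x", "o", "d".
toy_clusters = [1] * 6 + [2] * 6 + [3] * 5
toy_classes = (["x"] * 5 + ["o"]
               + ["x"] + ["o"] * 4 + ["d"]
               + ["x"] * 2 + ["d"] * 3)
toy_purity = purity(toy_clusters, toy_classes)  # majority counts 5 + 4 + 3
```

On the toy data the majority counts are 5, 4 and 3, so the purity is 12/17. Calling `purity(list(range(17)), toy_classes)`, i.e. one cluster per sample, returns 1.0, which shows why purity alone cannot be used to compare clusterings with different numbers of clusters.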
Normalized Mutual Information
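$$\text{NMI}(\Omega, C) = \frac{I(\Omega; C)}{[H(\Omega) + H(C)]/2}$$
in which $I$ is the mutual information, expressed as follows:
$$I(\Omega; C) = \sum_{k} \sum_{j} P(\omega_k \cap c_j) \log \frac{P(\omega_k \cap c_j)}{P(\omega_k) P(c_j)}$$
and $H$ is the entropy, expressed as follows:
$$H(\Omega) = -\sum_{k} P(\omega_k) \log P(\omega_k)$$
As before, $\Omega$ is the set of clusters $\omega_k$ and $C$ is the set of classes $c_j$.
Comment: Mutual information measures how much we learn about the classes when we are told what the clusters are (in the ideal case, the clusters are exactly the classes). However, MI has the same problem as purity: it reaches its maximum when each cluster contains just one sample. To avoid this, MI is divided by the denominator $[H(\Omega) + H(C)]/2$. Entropy increases with the number of clusters. In the case $K = N$, $H(\Omega)$ reaches its maximum $\log N$, and therefore NMI is low. Interesting?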
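The normalization can be checked numerically with a short sketch. It assumes that probabilities are estimated by relative frequencies (counts divided by the number of samples) and that NMI is MI divided by the mean of the cluster entropy and the class entropy. The function names and the toy labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy (natural log) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(clusters, classes):
    """Mutual information between cluster labels and class labels."""
    n = len(clusters)
    joint = Counter(zip(clusters, classes))
    size_k = Counter(clusters)
    size_j = Counter(classes)
    # P(k, j) = n_kj / n, P(k) = size_k / n, P(j) = size_j / n.
    return sum((n_kj / n) * math.log(n * n_kj / (size_k[k] * size_j[j]))
               for (k, j), n_kj in joint.items())


def nmi(clusters, classes):
    """MI divided by the mean of the two entropies."""
    denom = (entropy(clusters) + entropy(classes)) / 2
    if denom == 0:
        # Assumption of this sketch: a single cluster and a single
        # class give 0.0 rather than an undefined value.
        return 0.0
    return mutual_information(clusters, classes) / denom


# Toy data: 17 samples in three clusters, three classes "x", "o", "d".
toy_clusters = [1] * 6 + [2] * 6 + [3] * 5
toy_classes = (["x"] * 5 + ["o"]
               + ["x"] + ["o"] * 4 + ["d"]
               + ["x"] * 2 + ["d"] * 3)
toy_nmi = nmi(toy_clusters, toy_classes)
singleton_clusters = list(range(len(toy_classes)))  # one cluster per sample
```

On the toy data NMI comes out at roughly 0.36. With `singleton_clusters`, MI equals the class entropy, which is the largest value it can take, while NMI stays strictly below 1 because the cluster entropy in the denominator has grown to log N.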
Accuracy criterion (or Rand Index)
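One can apply the notion of accuracy to a clustering result by viewing it as a series of decisions, one for each pair of samples. A true positive (TP) decision assigns two similar samples to the same cluster, and a true negative (TN) decision assigns two dissimilar samples to different clusters. A false positive (FP) decision assigns two dissimilar samples to the same cluster, and a false negative (FN) decision assigns two similar samples to different clusters. The formula is quite simple:
$$\text{RI} = \frac{TP + TN}{TP + FP + FN + TN}$$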
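A direct sketch of the pairwise counting, assuming "similar" means "same benchmark class": every pair of samples is one decision, and the Rand index is the fraction of correct decisions, (TP + TN) divided by the number of pairs. The function name `rand_index` and the toy labels are made up for illustration.

```python
from itertools import combinations


def rand_index(clusters, classes):
    """Return (RI, (tp, fp, fn, tn)) counted over all pairs of samples."""
    if len(clusters) != len(classes) or len(clusters) < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1      # similar pair, same cluster
        elif same_cluster:
            fp += 1      # dissimilar pair, same cluster
        elif same_class:
            fn += 1      # similar pair, different clusters
        else:
            tn += 1      # dissimilar pair, different clusters
    return (tp + tn) / (tp + fp + fn + tn), (tp, fp, fn, tn)


# Toy data: 17 samples in three clusters, three classes "x", "o", "d".
toy_clusters = [1] * 6 + [2] * 6 + [3] * 5
toy_classes = (["x"] * 5 + ["o"]
               + ["x"] + ["o"] * 4 + ["d"]
               + ["x"] * 2 + ["d"] * 3)
toy_ri, toy_counts = rand_index(toy_clusters, toy_classes)
```

On the toy data the counts are TP = 20, FP = 20, FN = 24, TN = 72, so RI = 92/136, about 0.68. The loop visits every pair, so it is quadratic in the number of samples. For millions of feature vectors the same counts would have to be derived from the cluster-by-class contingency table instead.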