vodinhphong

Archive for June, 2009

How to evaluate a clustering result?

In Research on June 28, 2009 at 8:13 am

Preface

According to Wikipedia,

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics.

In this post I do not discuss about clustering algorithms but they way to evaluate a clustering result. My current problem is relevant to forming a codebook for visual categorization,  i.e to cluster a huge dataset (~ 6.525 million feature vectors) into clusters (visual words). After that, this codebook is used as a reference to vote into samples. In other word, this is exact the BoW method. The problem here is, how to know a clustering result is discriminative enough or not. Here I noted some idea from  Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008.

K-means clustering

It is natural to talk about clustering by review the K-means algorithm:

  • A special case of a general procedure known as EM (Expectation Maximization)
  • Termination conditions:
    • A fixed number of iterations
    • Sample partition unchanged
    • Centroid positions unchanged (does the 2nd condition hold?)
  • Time complexity \Theta(lKmn)
    • l: iteration
    • K: number of clusters
    • n: number of samples
    • n: sample dimension

Evaluation of clustering

Internal evaluation

  • High intra-cluster similarity
  • Low inter-cluster similarity
  • Measured quality of a clustering depends on
    • sample representation (i.e how to represent descriptor efficiently from raw data)
    • similarity measure (i.e this post)

Comment: It seems that this kind of evaluation is not very meaningful. Instead of using it, I take the clustering result to use for another application and measure the application’s performance to decide the clustering is good or not.

External evaluation

In spite of unsupervised learning, clustering can benefits from some kinds of benchmark data/labeled data (if available).  Assumed that I have this benchmark data and I want to know whether clustering method and accompanied parameters is good. Following measures can be used:

Purity

purity(\Omega,C)=\frac{1}{N}\sum{max_j\left|\omega_k\cap c_j\right|}

in which, \Omega={\omega_1,\omega_2,...,\omega_K} is the set of clusters, C={c_1,c_2,...,c_J} is the set of classes.

Purity demonstrates how much a cluster contains different classes. The more classes a cluster has in itself, the less purity is. However, purity can be easily obtained in the case K=N.

Normalized Mutual Information

NMI(\Omega,C)=\frac{I(\Omega,C)}{(H(\Omega)+H(C))/2}

in which, I – mutual information, expressed as follows:

I(\Omega,C)=\sum_k{\sum_j{P(\omega_k\cap c_j)log(\frac{P(\omega_k\cap c_j)}{P(\omega_k)P(c_j)})}}=\sum_k{\sum_j{\frac{|\omega_k\cap c_j|}{N}log(\frac{N|\omega_k\cap c_j|}{|\omega_k||c_j|})}}

and H – entropy, expressed as follows:

H(\Omega)=-\sum_kP(\omega_k)logP(\omega_k)=-\sum_k\frac{|\omega_k|}{N}log\frac{\omega_k}{N}

Comment:  Mutual information allows us to gain information about the classes when given what the clusters are ( in the ideal case, clusters are exact classes). However, MI has the similar case as Purity when each cluster contains just one sample. Avoiding it, MI is divided by the denominator (H(\Omega)+H(C))/2. Entropy increases with the number of clusters. In case K=N, H(\Omega)=logN and therefore NMI is low. Interesting?

Accuracy  criterion (or Rand Index)

RI=\frac{TP+TN}{TP+FP+FN+TN}

Ones can use accuracy concept to apply for a clustering result: A true positive TP decision assigns two similar samples to the same cluster, a false positive FP decision assigns two similar samples to different clusters. The formula is quite simple:

For a more comformtable C/C++

In Programming on June 25, 2009 at 2:36 pm

C/C++ coding is often a nightmare with me because of its allocation/deallocation mechanism is delegated to programmer. But it is exact the way I like.  Flexibility is a two-blade knife, it can help you much and it can kill  you at once! Since I was a IT student, I rarely used C/C++ to write programs, and used Java instead. At that time, I adored Java and hated C/C++ because of Visual Studio. I had not known that there are so much cool IDEs that I can use free.

By now, I am on the way to become a PhD student and I like C/C++. However, using C/C++ in a right way is the problem.

The first thing I think, it is what kind of OS I should use? Linux!

The second thing is which IDE do I use? Eclipse. With Eclipse, I can manipulate makefile with supports from Eclipse, or write and maintain by myself. This point is quite important because when you deploy an application/your code on another computer/system, the most portable way is to use Makefile.

The next thing is how to use C/C++ in the most comformtable way? Use good external libraries to reduce time to develop or fix bugs.

The most important library is STL (sure!)

If you want to write programs with rich command line options (and you should write that), use the Boost library with Program_options package. Writing a GUI application is a tedious task and I personally think it is the bad way. Within several minutes, you can add numerous of functionalities into your program without doubt about where to place these buttons, these textboxes, etc. If your OS is Windows, please install Boost from this site. Do not waste time to install from Boost homepage, it is just good for Linux users and tutorials.

In fact, the more you depend on external libraries, the more problems you get when deploying on other computers. So, be thoughtful before decide to use them. On the next post, I might introduce about Blitz++.