As I mentioned in the previous post, I am working on clustering, and the task is to build an efficient codebook for event recognition. There is already a good deal of work applying codebooks to visual categorization, object recognition, and action recognition. A codebook-based method is a shortcut that attacks the recognition problem directly, without much concern for preprocessing. But nothing is perfect: codebook-based methods leave plenty of open questions. For instance: how many visual words should a codebook have? Which is better, a tailored codebook built from explicitly annotated data or a data-driven codebook built with a clustering algorithm? How should one compromise among different object classes? Which information should be included, frequently occurring visual words or rare ones? Is there any way to preserve geometric information other than LDA, pLSA, and pyramid matching? Is a unified codebook for all classes better than class-specific codebooks? And so on.
Since they were first applied to texture classification [citation], codebook-based methods have been studied extensively for object recognition. Two popular classification approaches are: i) Bayesian networks, ii) Support Vector Machines. In other words, we can use either a generative or a discriminative model for learning. Among generative models, Naive Bayes is the simplest and most straightforward [citation]. More complicated models include LDA and pLSA, which introduce hidden topic variables [review]. The difficulty with a generative model is that it has to compute the full likelihood; a discriminative model, on the other hand, offers a direct way to separate the classes. Another line of work is concerned not with incorporating geometric information but with the discriminative power of the codebook itself, i.e. how to obtain a compact but high-‘quality’ codebook. In particular, clustering algorithms are analyzed under various circumstances and assumptions. The following aspects are receiving attention:
- Sampling techniques: interest-point-based versus dense sampling
- Proposing an efficient clustering algorithm
- The behavior of data points in very high-dimensional spaces
- Representing a class instance with the codebook
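The last item, representing an instance by the codebook, can be sketched as follows. This is only a minimal NumPy illustration, not the method of any of the cited papers: plain k-means stands in for whatever clustering algorithm is actually used, the descriptors are random toy data, and the names `build_codebook`, `assign_words`, and `bow_histogram` are made up for this post.

```python
import numpy as np


def assign_words(descriptors, codebook):
    """Index of the nearest visual word for every descriptor."""
    # Squared Euclidean distances, shape (n_descriptors, n_words).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster descriptors of shape (n, d) into k visual words with plain k-means."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct descriptors chosen at random.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = assign_words(descriptors, centres)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                centres[j] = members.mean(axis=0)
    return centres


def bow_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word occurrences for one instance."""
    labels = assign_words(descriptors, codebook)
    counts = np.bincount(labels, minlength=len(codebook)).astype(float)
    return counts / counts.sum()


# Toy data: random 16-D vectors standing in for real local descriptors.
rng = np.random.default_rng(1)
train = rng.normal(size=(300, 16))
codebook = build_codebook(train, k=8)
hist = bow_histogram(rng.normal(size=(40, 16)), codebook)
```

Whatever the number of descriptors extracted from an image or video, `hist` always has one entry per visual word, which is what lets a standard classifier take it as input.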
[citation] carried out a thorough analysis and found that dense sampling outperforms interest-point-based sampling. Furthermore, when properly clustered, it can perform even better. The conclusion is consistent with the observation that interest points do not actually capture the most informative regions. For example, I want to recognize cars in street scenes, but I always get a high density of visual words from building corners and trees. The problem of irrelevant features is even more severe with dense sampling; on the other hand, dense sampling treats every point in the image equally, with no privilege. So we have the following causes and results:
- Sparse sampling
  result: lack of object-relevant features
  remedy: combined codebooks, i.e. universal + specific codebooks (one for each class), or multi-scale codebooks
- Dense sampling
  result: redundant and trivial features
  remedy: modified clustering algorithms, i.e. kernel codebooks, or a subsampling technique + mean shift
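To make the kernel codebook idea a bit more concrete, here is a rough sketch of soft assignment, where each descriptor votes for every visual word with a Gaussian weight instead of only for its nearest word. It is written in the spirit of kernel codebooks rather than as a faithful reimplementation of the cited paper; the function name `soft_histogram`, the per-descriptor normalisation, and the value of `sigma` are my own assumptions.

```python
import numpy as np


def soft_histogram(descriptors, codebook, sigma=1.0):
    """Soft-assignment histogram: each descriptor votes for every word."""
    # Squared distances between descriptors and words, shape (n, k).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Subtract each row's minimum before exponentiating for numerical
    # stability; the shift cancels in the per-descriptor normalisation.
    shifted = d2 - d2.min(axis=1, keepdims=True)
    weights = np.exp(-shifted / (2.0 * sigma ** 2))
    weights /= weights.sum(axis=1, keepdims=True)  # each descriptor sums to 1
    return weights.mean(axis=0)  # average vote per visual word


# Two words on a line, and three descriptors: one on each word, one halfway.
codebook = np.array([[0.0, 0.0], [4.0, 0.0]])
descriptors = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
hist = soft_histogram(descriptors, codebook, sigma=1.0)
```

The descriptor halfway between the two words splits its vote evenly, whereas hard assignment would have to give it entirely to one of them.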
Although these conclusions are reliable, the fact is that we cannot apply dense sampling to every problem. For example, is it feasible to densely sample a video sequence? It’s huge, right? Therefore, in my opinion, sparse sampling techniques still have room for improvement. The key point is that we have to compensate with good codebook formation.
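A back-of-the-envelope count shows how quickly dense sampling of video grows. All the numbers here (resolution, frame rate, grid step, 128-D float descriptors) are assumptions picked for illustration, and `dense_sample_count` is a hypothetical helper, not something taken from the cited work.

```python
def dense_sample_count(width, height, n_frames, step, n_scales=1):
    """Number of grid points when every `step`-th pixel of every frame is sampled."""
    per_frame = len(range(0, width, step)) * len(range(0, height, step))
    return per_frame * n_frames * n_scales


# Assumed example: one minute of 320x240 video at 25 fps, grid step of 8 pixels.
n_points = dense_sample_count(width=320, height=240, n_frames=60 * 25, step=8)

# Storage if each point yields a 128-D float32 descriptor (assumed size).
n_bytes = n_points * 128 * 4

print(n_points, "descriptors,", n_bytes, "bytes")
```

Under these assumptions a single minute of low-resolution video already yields 1.8 million descriptors at a single scale, which is close to a gigabyte of raw features before any clustering starts.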
Returning to my problem, it is defined as follows:
- Input: video sequences and a set of predefined human action concepts that vary in articulation as well as duration
- Output: recognize as many as possible of the actions that occur in the movie
After a bunch of experiments, it looks like different action types prefer different codebook sizes (numbers of visual words). This is quite natural. The handy solution is to build a separate codebook for each action type, so that each video sequence is tested against every action-specific classifier. Why not use a multi-class classifier? From my perspective, the occurrence rate of event-specific interest points is quite low (or very low), so their discriminative power is low as well, and a multi-class classifier is not as effective as a two-class classifier trained one-vs-all. However, this is just a guess, and it is also possible to concatenate the representations of one instance under each codebook into one long vector and learn from that.
We can easily infer that there will be duplicated values in either case. This is where the universal-specific codebook comes in. But let’s ask the next question: how much damage does the duplication cause? There are two possibilities: i) recognition is slowed down, ii) the curse of dimensionality. Say we have 10 action classes and a codebook of size 1000 for each class. Early fusion then gives us a vector with 1000×10=10000 units! In practice, the curse of dimensionality is not observed directly but only through the overall performance of the classifier.
An elegant solution was proposed by [citation], in which all the samples are merged to generate a set of the most common visual words. These words have a high probability of occurring in instances of various classes. Then every class-specific codebook is generated independently. An adaptation process can optionally be applied between the two kinds of codebook for a better constraint.
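The early-fusion arithmetic can be checked with a small sketch. The codebooks and descriptors below are random stand-ins, and `hard_histogram` and `early_fusion` are names I made up for illustration; only the sizes (10 classes, 1000 words each) come from the example in the text.

```python
import numpy as np


def hard_histogram(descriptors, codebook):
    """L1-normalised histogram of nearest-word assignments."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return counts.astype(float) / counts.sum()


def early_fusion(descriptors, codebooks):
    """Concatenate one instance's histograms over all class-specific codebooks."""
    return np.concatenate([hard_histogram(descriptors, cb) for cb in codebooks])


rng = np.random.default_rng(0)
n_classes, words_per_class, dim = 10, 1000, 8

# Random stand-ins for ten class-specific codebooks of 1000 words each.
codebooks = [rng.normal(size=(words_per_class, dim)) for _ in range(n_classes)]
video_descriptors = rng.normal(size=(50, dim))  # one toy video instance

fused = early_fusion(video_descriptors, codebooks)
print(fused.shape)  # (10000,)
```

With only 50 descriptors per instance, at most 50 of the 1000 bins in each block can be non-zero, so the fused vector is both very long and very sparse.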
Recently, Yang and Jurie (2008) proposed a way to unify the codebook formation stage and the classification stage. They claimed that these two processes are disconnected from each other. An iterative procedure is run to create the codebook and
- Battiato, S., Farinella, G., Gallo, G. & Ravi, D. Scene categorization using bag of Textons on spatial hierarchy. ICIP, 2008, pp. 2536-2539.
- Dance, C., Willamowski, J., Fan, L., Bray, C. & Csurka, G. Visual categorization with bags of keypoints. ECCV International Workshop on Statistical Learning in Computer Vision, 2004.
- Gemert, J.C. van, Geusebroek, J.M., Veenman, C.J. & Smeulders, A.W.M. Kernel Codebooks for Scene Categorization. European Conference on Computer Vision, 2008.
- Jurie, F. & Triggs, B. Creating Efficient Codebooks for Visual Recognition. ICCV, 2005, pp. 604-610.
- Metzler, D. Beyond bags of words: effectively modeling dependence and features in information retrieval. SIGIR Forum, 2008, Vol. 42(1), p. 77.
- Monay, F., Quelhas, P., Odobez, J.-M. & Gatica-Perez, D. Integrating Co-Occurrence and Spatial Contexts on Patch-Based Scene Segmentation. CVPRW ‘06: Proceedings of the 2006 Conference on Computer Vision and Pattern Recognition Workshop, IEEE Computer Society, 2006, p. 14.
- Nowak, E., Jurie, F. & Triggs, B. Sampling strategies for bag-of-features image classification. In Proc. ECCV, Springer, 2006, pp. 490-503.
- Winn, J.M., Criminisi, A. & Minka, T.P. Object Categorization by Learned Universal Visual Dictionary. ICCV, 2005, pp. 1800-1807.
- Zhang, W., Surve, A., Fern, X. & Dietterich, T.G. Learning non-redundant codebooks for classifying complex objects. ICML, 2009, p. 156.