Introduction to Sparse Coding

May 23, 2011

Recent literatures in computer vision and machine learning have observed the emergence of sparse coding as an effcient approach for feature selection. The basic idea of sparse coding [Olshausen 1997] is to represent a feature vector as linear combination of few bases from a predefined dictionary, hence induce the concept of sparsity. In other word, sparse coding provides low-dimensional approximation of a given signal in a given basis set. Apart from principal component analysis, sparse coding does not constraint the orthogonality of bases and it turns out that more flexibility is given to adapt the nonlinear representation of data.

From model fitting point of view, sparse coding provides means to avoid overfitting by eliminating insignificant variables in estimating the fitting function. Equivalently speaking, sparse coding finds a equilibrium solution between minimizing loss function of data-fitting versus sparse regularization of model coefficients, i.e.

\displaystyle \underset{\mathbf{S}}{\text{min }}\overset{\text{loss function}}{\overbrace{\mathcal{L}\left(\mathbf{X},\mathbf{B},\mathbf{S}\right)}}+\lambda\underset{\text{sparse regularization}}{\underbrace{\psi\left(\mathbf{S}\right)}}\ \ \ \ \ (1)

 in which {\mathbf{X}\in\mathbb{R}^{D\times N}} is the column matrix of input signals with their corresponding sparse coefficients {\mathbf{S}\in\mathbb{R}^{P\times N}} encoded by dictionary of bases {\mathbf{B}\in\mathbb{R}^{D\times P}}. The most popular choice for loss function is loss-square and the regularization term is {L_{1}}norm so that (1) is a {\ell_{1}}-regularized least square optimization problem. 

\displaystyle \begin{array}{c} \underset{\mathbf{B},\mathbf{S}}{\text{argmin }}\frac{1}{2}\left\Vert \mathbf{X}-\mathbf{B}\mathbf{S}\right\Vert _{Fro}^{2}+\beta\left\Vert \mathbf{S}\right\Vert _{1}\\ \text{subject to }\sum_{i}\mathbf{B}_{i,j}^{2}\leq c,\forall j=1,\ldots,N.\end{array}\ \ \ \ \ (2)

 Since the first term in (1) is a quadratic form given the bases fixed, it is convex with global minimum. Although the second term {\ell_{1}}-norm is convex also, it is not differentiable at the origin; hence, it requires some specific techniques to deal with. Instead of keeping bases fixed, ones can learn the dictionary {\mathbf{B}} for better adaptation to data domain. The formulation (1) is not convex in both {\mathbf{B}} and {\mathbf{S}} simultaneously. The solution, nevertheless, can be obtained by iteratively optimize the above objective by alternatingly optimizing with respect to {\mathbf{B}} and {\mathbf{S}} while holding the other fixed.


Follow

Get every new post delivered to your Inbox.