Introduction to Sparse Coding

May 23, 2011

Recent literatures in computer vision and machine learning have observed the emergence of sparse coding as an effcient approach for feature selection. The basic idea of sparse coding [Olshausen 1997] is to represent a feature vector as linear combination of few bases from a predefined dictionary, hence induce the concept of sparsity. In other word, sparse coding provides low-dimensional approximation of a given signal in a given basis set. Apart from principal component analysis, sparse coding does not constraint the orthogonality of bases and it turns out that more flexibility is given to adapt the nonlinear representation of data.

From model fitting point of view, sparse coding provides means to avoid overfitting by eliminating insignificant variables in estimating the fitting function. Equivalently speaking, sparse coding finds a equilibrium solution between minimizing loss function of data-fitting versus sparse regularization of model coefficients, i.e.

\displaystyle \underset{\mathbf{S}}{\text{min }}\overset{\text{loss function}}{\overbrace{\mathcal{L}\left(\mathbf{X},\mathbf{B},\mathbf{S}\right)}}+\lambda\underset{\text{sparse regularization}}{\underbrace{\psi\left(\mathbf{S}\right)}}\ \ \ \ \ (1)

 in which {\mathbf{X}\in\mathbb{R}^{D\times N}} is the column matrix of input signals with their corresponding sparse coefficients {\mathbf{S}\in\mathbb{R}^{P\times N}} encoded by dictionary of bases {\mathbf{B}\in\mathbb{R}^{D\times P}}. The most popular choice for loss function is loss-square and the regularization term is {L_{1}}norm so that (1) is a {\ell_{1}}-regularized least square optimization problem. 

\displaystyle \begin{array}{c} \underset{\mathbf{B},\mathbf{S}}{\text{argmin }}\frac{1}{2}\left\Vert \mathbf{X}-\mathbf{B}\mathbf{S}\right\Vert _{Fro}^{2}+\beta\left\Vert \mathbf{S}\right\Vert _{1}\\ \text{subject to }\sum_{i}\mathbf{B}_{i,j}^{2}\leq c,\forall j=1,\ldots,N.\end{array}\ \ \ \ \ (2)

 Since the first term in (1) is a quadratic form given the bases fixed, it is convex with global minimum. Although the second term {\ell_{1}}-norm is convex also, it is not differentiable at the origin; hence, it requires some specific techniques to deal with. Instead of keeping bases fixed, ones can learn the dictionary {\mathbf{B}} for better adaptation to data domain. The formulation (1) is not convex in both {\mathbf{B}} and {\mathbf{S}} simultaneously. The solution, nevertheless, can be obtained by iteratively optimize the above objective by alternatingly optimizing with respect to {\mathbf{B}} and {\mathbf{S}} while holding the other fixed.


Dictionary Learning with CVX Toolbox

May 22, 2011

In this post we do an exercise that help you implement sparse coding and dictionary learning without using algorithms dedicated to optimizing a \ell_1-regularized least squares problem. We will use a generic convex optimization CVX toolbox for sparse encoding and adopt simple updating rule for dictionary \mathbf{D}.

For simplicity, a single image “lena” is used to learn the dictionary. Firstly, patches are extracted and normalized:


clear all;
I=double(imread('lena.png'))/255;
% extract 8 x 8 patches
X=im2col(I,[8 8],'sliding');
X=X-repmat(mean(X),[size(X,1) 1]);
X=X ./ repmat(sqrt(sum(X.^2)),[size(X,1) 1]);
X=X(:,1:100:end);

Since the number of patches is so huge, we just sub-sample 1%. This is enough for us to check the correctness of our algorithm.

Next, the number of iterations, dictionary size, and regularization coefficient are set:


nIter = 25;
gamma = .15;
p = 200;

% init dictionary
sel = randperm(size(X,2));
sel = sel(1:p);
D = X(:,sel);
n = size(X,2);
%D = D ./ repmat( sqrt(sum(D.^2)), [w^2, 1] );
S = zeros(p,n);

We individually encode each of input sample by optimizing the following objective function:

\underset{\mathbf{s}}{\text{argmin }} \frac{1}{2}\left\Vert \mathbf{y} - \mathbf{D}\mathbf{s}\right\Vert_2^2 + \lambda*\left\Vert\mathbf{s}\right\Vert_1

cvx_begin quiet
variable s(p);
minimize (.5*norm(X(:,j)-D*s) + gamma*norm(s,1));
cvx_end

In order to update \mathbf{D}, we just need to solve a least squares problem:

\underset{\mathbf{D}}{\text{argmin }} \frac{1}{2}\left\Vert \mathbf{X} - \mathbf{D}\mathbf{S}\right\Vert_F^2

This objective function can be solved easily by taking the first-order derivative respect to \mathbf{D}:

\begin{array}{cl}  & \nabla_{\mathbf{D}}\text{Tr}\left(\frac{1}{2}\left(\mathbf{X}-\mathbf{D}\mathbf{S}\right)^{T}\left(\mathbf{X}-\mathbf{D}\mathbf{S}\right)\right)\\  = & \frac{1}{2}\nabla_{\mathbf{D}}\text{Tr}\left(\mathbf{X}^{T}\mathbf{X}-\mathbf{X}^{T}\mathbf{D}\mathbf{S}-\mathbf{S}^{T}\mathbf{D}^{T}\mathbf{X}+\mathbf{S}^{T}\mathbf{D}^{T}\mathbf{D}\mathbf{S}\right)\\  = & 0-\frac{1}{2}\mathbf{X}\mathbf{S}^{T}-\frac{1}{2}\mathbf{X}\mathbf{S}^{T}+\frac{1}{2}\mathbf{D}\left(\mathbf{S}\mathbf{S}^{T}+\mathbf{S}\mathbf{S}^{T}\right)\end{array}

If the Hessian \mathbf{D}^T\mathbf{D} is positive semidefinite, the above quadratic form reachs its global minimum value with the minimizer:

\mathbf{D}=\mathbf{X}\mathbf{S}^T\left(\mathbf{S}\mathbf{S}^T\right)^{-1}

There is the case that \mathbf{D} will grow big and \mathbf{S} shrinks small to fit \mathbf{X}. In order to prevent it from happenning, a \ell_2-norm applied for dictionary atoms is neccessary:

\Vert\mathbf{d}_{i}\Vert_{2}^{2}=1,i=1\ldots P

The Matlab code is very simple:

D = X * pinv(S);
D=D ./ repmat(sqrt(sum(D.^2)),[size(D,1) 1]);

After N iterations, we can visualize the content of dictionary atoms by calling from SPAMS Toolbox:

ImD=displayPatches(D);
figure, imagesc(ImD); colormap('gray');

The result after 6 hours training (!) is displayed below:

Learned dictionary from Lena image with 100 atoms


Visual Object Dictionary Learning based on Sparse Coding (part II)

May 21, 2011

Now, let’s take the traditional approach of spliting images into patches and learn them. Finishing this experiment, we can understand a little about the behaviour of dictionary learning in recognizing visual objects.

Image used for training

I took patches of 32\times 32 pixels with sliding fashion just on one image. If you do with more images, the total number of patches is very huge! So, my training set is ~250k samples.

Dictionary Visualization

You can observe that the dictionary atoms look quite similar to the experiments done in papers. This is the good news since our code work correctly.

\ell_1-norm magnitude over \lambda

Comments: It is quite strange when \ell_1-norm monotically increases when \lambda changes from very small to bigger. Normally, \lambda is the regularization term on the magnitude of coefficients \mathbf{S}. When \lambda increases, the magnitude of penalized term should decrease. But the above figure does not show such the thing.

Reconstruction error of the fidelity term

Comments: This figure also mentions strange things. The regularization term is used to prevent overfitting, in which the result regression function fits perfectly with the data. Therefore, the residual should increase when the regularization term is put more weights. In contrast, residual decreases when \lambda decreases!

Percentage of non-zeroes coefficients over \lambda

Comments: The sparsity of the encoded vectors also decreases! What happening?

My correction: if you cross out line 38 and 48 in the code snippet below, it will produce correct figures with the  increase of cost-function and decrease of non-zeroes when \lambda grows up. Since I am lazy, you correct for yourself, in Part I and Part II too. :) In fact, I think the documentation of SPAMS Toolbox has some sort of confusion in describing running modes. Anyway, it works well if we does not explicitly declare any mode.

% load images and extract patches
imglist = dir('dat/smallawa/');
if length(imglist) < 3
disp 'empty directory';
end

% just take first 5 images, otherwise it's huge!
imglist = imglist(3:4);
% imgsize = [32 32];
bag = cell(1,length(imglist));

wndsize = [32 32];

for j=1:length(imglist)
I = imread(['dat/smallawa/' imglist(j).name]);
I = rgb2gray(I);
I = im2double(I);
%I = imresize(I,[32 32]);
%X(:,j) = reshape(I,1, []);
X=im2col(I,wndsize,'sliding');
bag{j} = X;
end

X = cell2mat(bag);

disp (['Total patches: ' num2str(size(X,2))]);
X = X - repmat(mean(X),[size(X,1) 1]);
X = X./ repmat(sqrt(sum(X.^2)),[size(X,1) 1]);

lambda = logspace(-2,1,10);
R = zeros(length(lambda),1);
NR = zeros(length(lambda),1);

for i=1:length(lambda)
param.K = 100;
param.lambda = lambda(i);
param.iter = 100;
param.mode = 2;
param.modeD = 0;

tic;

D = mexTrainDL_Memory(X,param);
t=toc;
fprintf('time of computation for Dictionary Learning: %f\n',t);

fprintf('Evaluating cost function...\n');
param.mode = 0;
alpha=mexLasso(X,D,param);
alpha(isnan(alpha)) = 0;
R(i)=mean(0.5*sum((X-D*alpha).^2)+param.lambda*sum(abs(alpha)));
ImD=displayPatches(D);
%   title (['lambda=' num2str(lambda(i))]); axis square; axis off;
imwrite(ImD,sprintf('%02i.png',i));
NR(i) = sum(sum(alpha>0))/size(X,2);
%disp (['Non-zeros ' num2str(NR(i)*100), '%']);
%disp (['Residual:' num2str(R(i))]);
end

figure,semilogx(lambda,R); xlabel('lambda'); ylabel('residual'); grid on;
figure,semilogx(lambda,NR); xlabel('lambda'); ylabel('non-zeroes'); grid on;


Visual Object Dictionary Learning based on Sparse Coding (Part I)

May 21, 2011

In sparse coding for image denoising or inpainting, dictionary is learned over the patch level, i.e. images are divided into rectangular patches and the unsupervised dictionary learning take those to induce a dictionary for sparse coding. The famous example of coding natural images shows that the learned dictionary contains “edges” patches, a higher visual level compared to pixel. I wonder if patches are the whole images, how the dictionary looks like.

I did a funny experiment with Corel dataset. A small set of training images were taken from “tiger” category. There are ~500 images and I used all for dictionary training.

Firstly, I prepared training data by doing some preprocessing: convert to grayscale colormap, rescale it to [0..1], center the data by their mean, normalize with L2 norm.

Secondly, I set parameters for the algorithm dictionary provided by SPAMS Toolbox and wait for minutes. Because the training set is small, it was not take much time to finish the training.

Thirdly, I recorded some useful informations such as the number of non-zeroes and the residuals (between original signal and the reconstructed one).

Note that I train the dictionary over couples of values of \lambda, says from 10^{-2}\rightarrow 10^1. It is interesting to observe the correlations between \lambda, non-zeroes, and dictionary visualization.

Dictionary Visualization

Comments:

Residuals over lambda

Comments:

Average number of non-zeroes per vector

Comments:

Here is the Matlab code for my experiment


% load images and extract patches
imglist = dir('dat/smallawa/');
if length(imglist) < 3
disp 'empty directory';
end imgsize = [32 32];
X = zeros(prod(imgsize),length(imglist)-2);
for j=1:length(imglist)-2
I = imread(['dat/smallawa/' imglist(j+2).name]);
I = rgb2gray(I); I = im2double(I);
I = imresize(I,[32 32]);
X(:,j) = reshape(I,1, []);
end
disp (['Total images = ' num2str(size(X,2))]);
X = X - repmat(mean(X),[size(X,1) 1]);
X = X./ repmat(sqrt(sum(X.^2)),[size(X,1) 1]);
lambda = logspace(-2,1,20);
R = zeros(length(lambda),1);
NR = zeros(length(lambda),1);
for i=1:length(lambda)
param.K = 64;
param.lambda = lambda(i);
param.iter = 500;
param.mode = 2;
param.modeD = 0;
tic;
D = mexTrainDL_Memory(X,param);
t=toc;
fprintf('time of computation for Dictionary Learning: %f\n',t);
fprintf('Evaluating cost function...\n');
param.mode = 0;
alpha=mexLasso(X,D,param);
R(i)=mean(0.5*sum((X-D*alpha).^2)+param.lambda*sum(abs(alpha)));
ImD=displayPatches(D);
%   title (['lambda=' num2str(lambda(i))]); axis square; axis off;
imwrite(ImD,sprintf('%02i.png',i));
NR(i) = sum(sum(alpha>0))/param.K;
%disp (['Non-zeros ' num2str(NR(i)*100), '%']);
%disp (['Residual:' num2str(R(i))]);
end

figure,semilogx(lambda,R); xlabel('lambda'); ylabel('residual'); grid on;
figure,semilogx(lambda,NR); xlabel('lambda'); ylabel('non-zeroes'); grid on;


Follow

Get every new post delivered to your Inbox.