Thursday, 9 April 2015

Running time analysis

Measuring time

The main purpose of this project is to speed up the architecture selection process. To make my experiments representative, I ran all the models on the same GPU with the same optimization method (RMSProp) and the same hyperparameters.
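To keep the comparison fair, every model's epochs can be timed the same way. A minimal sketch (the `train_one_epoch` callable here is a stand-in for the real training loop, not my actual code):

```python
import time

def time_epochs(train_one_epoch, n_epochs=3):
    """Run n_epochs and return the wall-clock duration of each one in seconds."""
    times = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        train_one_epoch()
        times.append(time.perf_counter() - start)
    return times

# Example with a dummy "epoch" that just burns some CPU:
durations = time_epochs(lambda: sum(i * i for i in range(10000)), n_epochs=2)
```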


The model I used for this task is quite small, and the majority of the parameters belong to the fully connected layer. Nevertheless, keeping the weights of the first layers fixed at their random initialization helps to speed up the training procedure.

I conducted the experiments for 2-5 fixed layers first, and only about a week later ran the reference model and the model with 1 fixed layer. It seems something changed within that week, because the two later models ran much faster. I concluded that for such small models the running time also depends on other processes on the machine. The table below lists the number of trained parameters and the time per epoch for each model.
Fixed layers    Trained parameters    One epoch time, s
0               486,000               237
1               484,000               204
2               482,000               347
3               470,000               238
4               434,000               218
5               360,000               183
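The table shows that fixing layers removes relatively few parameters, because most of them sit in the fully connected layer. A quick sketch of the counting (the layer shapes below are illustrative, not the exact ones from my model):

```python
def conv_params(in_channels, n_filters, filter_size):
    """Conv layer: n_filters x in_channels x k x k weights plus one bias per filter."""
    return n_filters * in_channels * filter_size ** 2 + n_filters

def fc_params(n_in, n_out):
    """Fully connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out

# A small conv layer contributes little compared to a fully connected layer:
conv = conv_params(10, 20, 3)   # 10 -> 20 feature maps, 3x3 filters
fc = fc_params(1152, 256)       # hypothetical flattened input of 1152 units
```

This is why fixing even several convolutional layers changes the trainable parameter count only modestly.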


  • Fixed random features help to speed up learning
  • The impact is not very big for small models
  • Optimization may be more difficult for models with fixed weights

Wednesday, 8 April 2015


Experimental setup

As a reference model I used a comparatively small network:

Layer   Structure         Dimension
1       Convolution       10 filters 3x3
2       Convolution       20 filters 3x3, pooling 2x2
3       Convolution       64 filters 3x3, pooling 2x2
4       Convolution       64 filters 3x3, pooling 2x2
5       Convolution       128 filters 3x3, pooling 2x2
6       Fully connected   256
7       Fully connected   256

Then I trained six models: the reference one, then one with the first layer fixed, one with the first and second fixed, and so forth. The idea is illustrated in the picture (grey colour means that the weights are fixed):
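Freezing the first layers amounts to excluding their parameters from the gradient update. A minimal NumPy sketch of the idea with plain SGD (the layer shapes and data are toy values, not my actual model):

```python
import numpy as np

def sgd_step(params, grads, lr, n_fixed):
    """Update only layers with index >= n_fixed; earlier layers stay frozen."""
    for i, (p, g) in enumerate(zip(params, grads)):
        if i >= n_fixed:
            p -= lr * g  # in-place update

rng = np.random.RandomState(0)
params = [rng.randn(3, 3) for _ in range(4)]   # four toy "layers"
before = [p.copy() for p in params]
grads = [np.ones((3, 3)) for _ in range(4)]
sgd_step(params, grads, lr=0.1, n_fixed=2)     # freeze the first two layers
```

In Theano the same effect is achieved by simply not including the fixed layers' shared variables in the list of parameters passed to the update rule.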


I plotted the training and validation error during training.
We can see that the models with fixed parameters produce reasonable results. It is also interesting that fixing weights sometimes has a regularizing effect (as on the blue, magenta and yellow lines). This is not so surprising, since we are decreasing the capacity of the model.

Tuesday, 7 April 2015

Work on random features started

Previous work

Several works mention that fixed convolutional weights perform not much worse than training the whole model; one example is Jarrett et al., 2009. Later, Saxe et al. investigated this phenomenon from a theoretical point of view and concluded that fine-tuning the fully connected layers has the biggest effect on training.


It is very tempting to use fixed weights for hyperparameter search, since training a model with fixed weights should be easier and faster. I'm going to run several experiments to find the advantages and drawbacks of this approach and to analyze the behaviour of the training procedure in this setup.
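A toy illustration of why fixed random features can work at all (my own sketch, not taken from the cited papers): a fixed random ReLU layer lifts the XOR problem, which is not linearly separable in the input space, into a space where a trained linear readout can fit it.

```python
import numpy as np

rng = np.random.RandomState(42)

# XOR-style problem: not linearly separable in the input space.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Fixed random ReLU features -- these weights are never trained.
W = rng.randn(2, 50)
b = rng.randn(50)
H = np.maximum(0, X @ W + b)

# Only the linear readout is fit (plain least squares for simplicity).
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
preds = H @ w_out
```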

Thursday, 26 February 2015

Results for bigger model

The model described in my previous post gave quite nice results after only 46 epochs:

Train error: 0.1104

Validation error:  0.1020

Test error: 0.1072

I didn't use any regularization except early stopping.

Future plans

I would like to add small random rotations to the dataset.

Tuesday, 24 February 2015

The bigger the better

I tried to train a bigger network with the following configuration:

Layer   Structure         Dimension
1       Convolution       32 filters 3x3, pooling 2x2
2       Convolution       40 filters 3x3, pooling 2x2
3       Convolution       50 filters 3x3, pooling 2x2
4       Convolution       70 filters 3x3, pooling 2x2
5       Convolution       120 filters 3x3, pooling 2x2
6       Fully connected   500
7       Fully connected   500
After only 25 epochs it reached about 14% misclassification error.

Friday, 20 February 2015

Slight decrease of error

After one more day of training, the network I described in the previous post slightly decreased its error.

Interestingly, it did not enter the overfitting regime, although I used no regularization.

So the final result for this architecture:

Test error: 0.1824

Validation error: 0.1451

Train error: 0.1724

Cross-entropy during training:

Error rate during training:

Tuesday, 17 February 2015

Hit 80% accuracy!

Influenced by the work of Iulian, Guillaume, and Alexandre, I managed to get less than a 20% error rate.


  1. Convolution 4x4, 32 feature maps
  2. Convolution 4x4, 32 feature maps
  3. Convolution 4x4, 64 feature maps
  4. Convolution 4x4, 64 feature maps
  5. Convolution 4x4, 128 feature maps
  6. Fully connected 500 hidden units
  7. Fully connected 500 hidden units
  8. Fully connected 250 hidden units
All the convolution layers were followed by 4x4 pooling.

I hoped that a deeper stack of fully connected layers would give better results.


I decided to use RMSProp. The learning speed was better than with standard SGD.
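RMSProp divides each gradient step by a running average of recent gradient magnitudes, which keeps the step size well scaled per parameter. A minimal single-parameter sketch (the decay and epsilon below are typical default choices, not necessarily the values I used):

```python
import numpy as np

def rmsprop_step(w, g, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp update: maintain a moving average of squared gradients
    and scale the step by its square root."""
    cache = decay * cache + (1 - decay) * g ** 2
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache

w, cache = 1.0, 0.0
w, cache = rmsprop_step(w, g=2.0, cache=cache)
```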


Error rate:

Test error: 0.1992

Valid error: 0.1828

Train error: 0.1694


Future work

I'm going to continue training in order to try to overfit. I used no regularization, and I wonder whether it is necessary for this model.

Sunday, 8 February 2015

First results

It seems that I got some results.


The data was prepared as in vdumoulin's code: the smallest image side was resized to 256 pixels and then a random crop of size 221x221 was taken. I also scaled the input data to [0, 1], since otherwise the gradients were too big.
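The preprocessing can be sketched with NumPy alone (nearest-neighbour resizing for brevity; vdumoulin's code uses a proper image library, and the example image shape is just illustrative):

```python
import numpy as np

def resize_shortest_side(img, target=256):
    """Nearest-neighbour resize so that the shortest side equals `target`."""
    h, w = img.shape[:2]
    scale = target / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows][:, cols]

def random_crop(img, size=221, rng=np.random):
    """Take a random size x size crop and scale pixel values to [0, 1]."""
    h, w = img.shape[:2]
    top = rng.randint(0, h - size + 1)
    left = rng.randint(0, w - size + 1)
    return img[top:top + size, left:left + size] / 255.0

img = np.random.RandomState(0).randint(0, 256, (350, 500, 3)).astype(np.uint8)
crop = random_crop(resize_shortest_side(img))
```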

I used a 5-layer network with 3 convolutional layers and 2 fully connected ones. The structure was the following: the first layer is a 7x7 convolution with 25 feature maps, the second a 7x7 convolution with 56 feature maps, the third a 3x3 convolution with 104 feature maps; all the convolutional layers were followed by 3x3 non-overlapping max pooling. The two fully connected layers have 250 hidden units each. I used rectified linear units for all activations.
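It is handy to check the feature-map sizes such a stack produces. A sketch assuming valid convolutions and non-overlapping (stride = window) pooling, starting from the 221x221 crops:

```python
def conv_out(size, k):
    return size - k + 1   # valid convolution

def pool_out(size, k):
    return size // k      # non-overlapping pooling, stride = k

size = 221
for conv_k, pool_k in [(7, 3), (7, 3), (3, 3)]:
    size = pool_out(conv_out(size, conv_k), pool_k)

flat = 104 * size * size  # flattened input to the fully connected layers
```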


I used simple mini-batch stochastic gradient descent with learning rate 1e-5. I tried to use newer optimization methods (the Adam optimizer), but ran into a computational issue that I need to investigate later.

I used a small model because I wanted to fit it into a GTX580 GPU.


I decided to use a new library for neural networks called Blocks. The library lets you easily create complicated Theano models by building them from 'bricks' available in Blocks or by implementing them yourself. The documentation for the library is available here.

I implemented bricks for convolution, max pooling, and other auxiliary purposes; they are available in my repository and are going to be included into Blocks soon.

Intermediate results

I plotted the cost (categorical cross-entropy).
Good news: it goes down, which means that the model learns. We can also look at the misclassification rate:
Currently the error rate is around 40%, which means that the model is able to learn something. I'll continue training and we'll see what the final result is.

Sunday, 25 January 2015

Creating a dataset

Original data

A zipped file from Kaggle contains a directory with JPEG images. They are named in a pattern 'cat/'. There are 12,500 images each of dogs and cats, 25,000 in total.


I used the `scipy` package to read the images; currently I read each image and save it with its label into a pickle file, though reading this data back is quite slow. I suppose I should use a better container in the future.
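The storage step is just a pickle round-trip of an (image array, label) pair; a minimal sketch (the sample values are made up):

```python
import pickle

# Hypothetical example: an image stored as a nested list plus its label.
sample = ([[0, 1], [2, 3]], 'cat')

blob = pickle.dumps(sample)       # what gets written to the pickle file
restored = pickle.loads(blob)     # what reading it back returns
```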

After pickling I made a list of the file names, shuffled it, and divided it into training, validation, and testing sets in a 60/20/20 proportion. So the training set has 15,000 examples, and the validation and testing sets have 5,000 examples each. The total size on disk is about 28 GB.
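The split can be sketched as follows (the file names below are hypothetical placeholders):

```python
import random

def split_dataset(filenames, seed=0):
    """Shuffle the names and split them 60/20/20 into train/valid/test."""
    names = list(filenames)
    random.Random(seed).shuffle(names)
    n = len(names)
    n_train, n_valid = int(0.6 * n), int(0.2 * n)
    return (names[:n_train],
            names[n_train:n_train + n_valid],
            names[n_train + n_valid:])

files = ['img_%d.jpg' % i for i in range(25000)]  # hypothetical names
train, valid, test = split_dataset(files)
```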

The script for aggregation is available here.

Data analysis

I plotted several distributions: of image height, width, area, and "non-squarity" -- the difference between height and width.
It seems that we mostly have 350x500 pixel images.

After a while I realized that one can significantly decrease the size of the dataset on disk by carefully choosing the dtype. After I switched the dtype of the image arrays to 'uint8', the size decreased to 11 GB. Then I discovered that vdumoulin used the same trick.
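The saving is easy to estimate in memory: uint8 takes one byte per pixel value versus eight for float64 (the exact on-disk factor also depends on how the arrays are serialized). A quick sketch:

```python
import numpy as np

img_f64 = np.zeros((256, 256, 3), dtype=np.float64)
img_u8 = img_f64.astype(np.uint8)

ratio = img_f64.nbytes / img_u8.nbytes  # bytes per pixel value: 8 vs 1
```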

Monday, 12 January 2015

I will be using this blog as a journal for my course project on representation learning.

The task is to develop a model for Cats vs Dogs dataset.

I will be hosting my source code here.