Sunday, 25 January 2015

Creating a dataset

Original data

A zipped file from Kaggle contains a directory with JPEG images, named following the pattern 'cat.<number>.jpg' or 'dog.<number>.jpg'. There are 12500 images of each class, 25000 in total.


I used the `scipy` package to read the images. Currently I read each image and save it together with its label into a pickle file; reading this data back is quite slow, though. I suppose I should use a better container in the future.
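A minimal sketch of this save/load scheme. The helper names are hypothetical, and a random array stands in for an image that would, in the 2015-era API, be loaded with `scipy.misc.imread(path)`:

```python
import os
import pickle
import tempfile

import numpy as np

def save_example(image, label, path):
    """Pickle a single (image, label) pair to `path`."""
    with open(path, "wb") as f:
        pickle.dump((image, label), f)

def load_example(path):
    """Read one (image, label) pair back from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo with a synthetic 350x500 RGB image labelled 0 (say, "cat").
image = np.random.randint(0, 256, size=(350, 500, 3)).astype(np.uint8)
path = os.path.join(tempfile.mkdtemp(), "cat.0.pkl")
save_example(image, 0, path)
restored, label = load_example(path)
```

One pickle per example is what makes reading slow: each load pays file-open and unpickling overhead, which is why a single packed container tends to be faster.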

After pickling I made a list of file names, shuffled it, and divided it into training, validation and testing sets in a 60/20/20 proportion. So the training set has 15000 examples, while the validation and testing sets have 5000 examples each. The total size on disk is about 28 GB.
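The shuffle-and-split step can be sketched as follows (a hypothetical helper, with a fixed seed so the split is reproducible):

```python
import random

def split_dataset(filenames, seed=0):
    """Shuffle file names and split them 60/20/20 into train/valid/test."""
    names = list(filenames)
    random.Random(seed).shuffle(names)  # shuffle a copy, deterministically
    n = len(names)
    n_train = int(0.6 * n)
    n_valid = int(0.2 * n)
    train = names[:n_train]
    valid = names[n_train:n_train + n_valid]
    test = names[n_train + n_valid:]
    return train, valid, test

files = (["cat.%d.jpg" % i for i in range(12500)]
         + ["dog.%d.jpg" % i for i in range(12500)])
train, valid, test = split_dataset(files)
```

With 25000 file names this yields exactly 15000/5000/5000 examples, and shuffling before splitting keeps the class balance roughly even across the three sets.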

The script for aggregation is available here.

Data analysis

I plotted several distributions: of image height, width, area, and "non-squarity" (the difference between height and width).
It seems that we mostly have 350x500 pixel images.
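As a sketch of the quantities behind those plots (the shape values below are made up; the real ones would come from each loaded image's `.shape`, and `matplotlib.pyplot.hist` would draw the distributions):

```python
import numpy as np

# Hypothetical (height, width) pairs standing in for real image shapes.
shapes = np.array([(374, 500), (350, 467), (499, 375), (350, 500)])

heights = shapes[:, 0]
widths = shapes[:, 1]
areas = heights * widths          # pixel count per image
non_squarity = heights - widths   # "non-squarity": height minus width
```

Each of the four arrays would then be passed to a histogram call; a non-squarity peak away from zero shows the images are mostly landscape (or portrait) rather than square.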

After a while I realized that one can significantly decrease the dataset's size on disk by carefully choosing the dtype. After I switched the dtype of the image array to 'uint8', the size decreased to 11 GB. Then I discovered that vdumoulin had used the same trick.
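The saving comes from bytes per element: a default NumPy float array spends 8 bytes per pixel channel, while 'uint8' (which is all a JPEG's 0-255 values need) spends 1. A small sketch of the comparison (the exact on-disk factor depends on what dtype the arrays had before):

```python
import numpy as np

img_float = np.random.rand(350, 500, 3)          # float64 by default
img_uint8 = (img_float * 255).astype(np.uint8)   # 1 byte per value

bytes_float = img_float.nbytes  # 350 * 500 * 3 * 8 bytes
bytes_uint8 = img_uint8.nbytes  # 350 * 500 * 3 * 1 byte
```

So an in-memory array shrinks 8x going from float64 to uint8; one only has to remember to cast back to float (and usually rescale) before feeding the data to a model.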

Monday, 12 January 2015

I will be using this blog as a journal for my course project on representation learning.

The task is to develop a model for Cats vs Dogs dataset.

I will be hosting my source code here.