Sunday 25 January 2015

Creating a dataset

Original data

A zipped file from Kaggle contains a directory with JPEG images. They are named following the pattern 'cat.id.jpg' or 'dog.id.jpg'. There are 12,500 images each of dogs and cats, 25,000 in total.

Aggregation

I used the `scipy` package to read the images. Currently I read each image and save it together with its label into a pickle file; reading this data back is quite slow, though. I suppose I should use a better container in the future.
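A rough sketch of that step (the directory and output file names are just placeholders, and `scipy.misc.imread` is one way to decode a JPEG into a numpy array):

```python
import os
import pickle
from scipy import misc  # scipy.misc.imread requires PIL/Pillow

DATA_DIR = 'train'        # placeholder: directory with the unzipped Kaggle images
OUT_FILE = 'dataset.pkl'  # placeholder: output pickle file

examples = []
for fname in os.listdir(DATA_DIR):
    if not fname.endswith('.jpg'):
        continue
    # File names look like 'cat.123.jpg' or 'dog.456.jpg';
    # use the prefix as the label (0 = cat, 1 = dog).
    label = 0 if fname.startswith('cat') else 1
    image = misc.imread(os.path.join(DATA_DIR, fname))  # HxWx3 numpy array
    examples.append((image, label))

with open(OUT_FILE, 'wb') as f:
    pickle.dump(examples, f, protocol=pickle.HIGHEST_PROTOCOL)
```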

After pickling I made a list of file names, shuffled it, and divided it into training, validation and testing sets in a 60/20/20 proportion. So the training set contains 15,000 examples, and the validation and testing sets 5,000 examples each. The total size on disk is about 28 GB.
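A minimal sketch of the shuffle-and-split step, assuming the same placeholder directory as above and a fixed random seed for reproducibility:

```python
import os
import random

DATA_DIR = 'train'   # placeholder: directory with the unzipped Kaggle images

filenames = sorted(os.listdir(DATA_DIR))
random.seed(0)       # fixed seed so the split is reproducible (my assumption)
random.shuffle(filenames)

n = len(filenames)       # 25000
n_train = int(0.6 * n)   # 15000
n_valid = int(0.2 * n)   # 5000

train_files = filenames[:n_train]
valid_files = filenames[n_train:n_train + n_valid]
test_files = filenames[n_train + n_valid:]   # remaining 5000
```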

The script for aggregation is available here.

Data analysis

I plotted several distributions: image height, width, area, and "non-squarity" (the difference between height and width):
It seems that most images are around 350x500 pixels.
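Roughly, the statistics behind those plots can be collected like this (a sketch, not the exact plotting code; it reads the pickle file written in the sketch above):

```python
import pickle

import matplotlib.pyplot as plt
import numpy as np

with open('dataset.pkl', 'rb') as f:   # the pickle written earlier
    examples = pickle.load(f)

heights = np.array([img.shape[0] for img, _ in examples])
widths = np.array([img.shape[1] for img, _ in examples])
areas = heights * widths
non_squarity = heights - widths

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, data, title in zip(axes.flat,
                           [heights, widths, areas, non_squarity],
                           ['height', 'width', 'area', 'non-squarity']):
    ax.hist(data, bins=50)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```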

UPD: After a while I realized that one can significantly decrease the size of the dataset on disk by carefully choosing the dtype. After I switched the dtype of the image arrays to 'uint8', the size decreased to 11 GB. Then I discovered that vdumoulin used the same trick.
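To illustrate why the dtype matters so much, here is a toy comparison for a single 350x500 RGB image (not the whole dataset):

```python
import numpy as np

# A toy stand-in for one decoded 350x500 RGB image.
image = np.random.randint(0, 256, size=(350, 500, 3))  # default integer dtype (int64 on most 64-bit systems)
print(image.dtype, image.nbytes)        # e.g. int64, 4200000 bytes

image_u8 = image.astype(np.uint8)       # JPEG pixel values fit in 0..255, so uint8 loses nothing
print(image_u8.dtype, image_u8.nbytes)  # uint8, 525000 bytes -- 8x smaller
```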

Monday 12 January 2015

I will be using this blog as a journal for my course project on representation learning (ift6266h15.wordpress.com).

The task is to develop a model for the Cats vs Dogs dataset.

I will be hosting my source code here.