Original data
A zipped file from Kaggle contains a directory with JPEG images, named in the pattern 'cat.id.jpg' or 'dog.id.jpg'. There are 12,500 images of dogs and 12,500 of cats, 25,000 in total.
Aggregation
I used the `scipy` package to read the images. Currently I read each image and save it together with its label into a pickle file; reading this data back is quite slow, though, so I will probably switch to a better container in the future.
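A minimal sketch of this aggregation step, assuming an older SciPy where `scipy.misc.imread` is still available (it was removed in later releases; `imageio.imread` is a drop-in replacement). The directory layout, label encoding and file names here are illustrative, not taken from the original script:

```python
import os
import pickle

from scipy import misc  # misc.imread returns a numpy array of shape (H, W, 3)


def aggregate(image_dir, out_path):
    """Read every JPEG in image_dir and pickle (image, label) pairs."""
    examples = []
    for fname in os.listdir(image_dir):
        if not fname.endswith('.jpg'):
            continue
        # File names look like 'dog.id.jpg' or 'cat.id.jpg';
        # encoding dogs as 1 and cats as 0 is an assumption.
        label = 1 if fname.startswith('dog') else 0
        image = misc.imread(os.path.join(image_dir, fname))
        examples.append((image, label))
    with open(out_path, 'wb') as f:
        pickle.dump(examples, f, protocol=pickle.HIGHEST_PROTOCOL)


# aggregate('train', 'catsdogs.pkl')
```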
After pickling, I made a list of file names, shuffled it and divided it into training, validation and testing sets in a 60/20/20 proportion. So the training set contains 15,000 examples, and the validation and testing sets 5,000 examples each. The total size on disk is about 28 GB.
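A sketch of the 60/20/20 split over a list of file names; the fixed seed is my own addition to make the split reproducible:

```python
import random


def split_filenames(filenames, seed=0):
    """Shuffle file names and split them 60/20/20 into train/valid/test."""
    names = list(filenames)
    random.Random(seed).shuffle(names)
    n = len(names)
    n_train = int(0.6 * n)
    n_valid = int(0.2 * n)
    train = names[:n_train]
    valid = names[n_train:n_train + n_valid]
    test = names[n_train + n_valid:]
    return train, valid, test


# With 25,000 file names this yields 15,000 / 5,000 / 5,000 examples.
```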
The script for aggregation is available here.
Data analysis
I plotted several distributions: of image height, width, area and "non-squarity" -- the difference between height and width:
It seems that most of the images are around 350x500 pixels.
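The plots could be produced along these lines, assuming the image shapes are already collected; the original post does not say which plotting library was used, so matplotlib here is an assumption:

```python
import matplotlib.pyplot as plt


def plot_shape_distributions(shapes):
    """shapes: list of (height, width) tuples, one per image."""
    heights = [h for h, w in shapes]
    widths = [w for h, w in shapes]
    areas = [h * w for h, w in shapes]
    non_squarity = [h - w for h, w in shapes]  # "non-squarity" as defined above

    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    for ax, data, title in zip(
            axes.ravel(),
            [heights, widths, areas, non_squarity],
            ['height', 'width', 'area', 'non-squarity (height - width)']):
        ax.hist(data, bins=50)
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```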
UPD:
After a while I realized that one can significantly decrease the size of the dataset on disk by carefully choosing the dtype. After I switched the dtype of the image arrays to 'uint8', the size decreased to 11 GB. Then I discovered that vdumoulin had used the same trick.
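A small sketch of why the dtype matters: JPEG pixels are 8-bit values to begin with, so storing them as `uint8` instead of a floating-point dtype shrinks each array by a factor that depends on the original dtype (4x for float32, 8x for float64):

```python
import numpy as np

# Hypothetical image with the typical 350x500 shape noted above,
# stored as float64 (e.g. pixels scaled to [0, 1]).
image_float = np.random.rand(350, 500, 3)
image_uint8 = (image_float * 255).astype(np.uint8)  # 1 byte per channel value

print(image_float.nbytes)  # 4200000 bytes (float64)
print(image_uint8.nbytes)  # 525000 bytes
```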