The dataset consists of 90 000 grayscale videos that show two objects of equal shape and size, one of which approaches the other. The object's speed during the approach is modeled by a proportional-derivative controller. Overall, three different shapes (Rectangle, Triangle and Circle) are provided. The initial configuration of the objects, such as position and color, is randomly sampled. In contrast to the moving MNIST dataset, the samples comprise a goal-oriented task: one object has to fully cover the other object rather than moving randomly, which makes the dataset better suited for testing the prediction capabilities of an ML model.
For instance, it can serve as a toy dataset for investigating the capacity and output behavior of a deep neural network before testing it on real-world data. We have done this for a deep auto-encoder network.
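To give an idea of the approach behavior, the sketch below shows how a proportional-derivative controller can drive one object toward the other: the commanded velocity is proportional to the remaining offset plus its rate of change. The gains, time step and positions are illustrative assumptions, not the parameters used to generate the dataset.

```python
import numpy as np

# Minimal PD-controller sketch of the approach behavior described above.
# Gains, time step and positions are illustrative, not the dataset's parameters.
def pd_step(position, target, prev_error, kp=0.6, kd=0.2, dt=1.0 / 30):
    error = target - position               # proportional term: remaining offset
    d_error = (error - prev_error) / dt     # derivative term: change of the offset
    velocity = kp * error + kd * d_error    # commanded velocity of the moving object
    return position + velocity * dt, error

position = np.array([20.0, 20.0])    # moving object (illustrative start)
target = np.array([100.0, 100.0])    # static object to be covered
error = target - position
for _ in range(90):                  # 3 seconds at 30 fps
    position, error = pd_step(position, target, error)
```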
To access a similar dataset, see the Simulated Planar Manipulator Dataset.
Automatically download the data by using the included scripts:

python download_videos.py

for the videos (.tar files ~1.3 GB, uncompressed ~23 GB) or

python download_tfrecords.py

for the tfrecords (~89 GB), or manually download them by invoking the download links in the files flyingshapes_videos.txt and flyingshapes_tfrecords.txt.
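If the scripts cannot be used, the links in the two .txt files can also be fetched programmatically. Below is a minimal sketch, assuming the files list one download URL per line; the output directory name is arbitrary.

```python
import os
import urllib.request

# Fetch every link listed in flyingshapes_videos.txt / flyingshapes_tfrecords.txt.
# Assumes one URL per line; skips files that were already downloaded.
def download_from_list(link_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(link_file) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        target = os.path.join(out_dir, os.path.basename(url))
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)

download_from_list("flyingshapes_videos.txt", "flyingshapes_videos")
```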
If you use the dataset in your research, you should cite it as follows:
@misc{npde2018,
author = {Fabio Ferreira and Jonas Rothfuss and Eren E. Aksoy and You Zhou and Tamim Asfour},
title = {Introducing the Simulated Flying Shapes and Simulated Planar Manipulator Datasets},
year = {2018},
publisher = {arXiv},
publication = {eprint arXiv:1807.00703},
howpublished = {\url{https://arxiv.org/abs/1807.00703v1}},
}
We provide the videos both as .avi files and as TensorFlow tfrecord files. Each sample in the tfrecord files contains 10 frames of the original video, taken equally distributed over the entire playtime. Here are some more details:
- video resolution: 128x128
- fps: 30
- color depth: 24bpp (3 channels, grayscale)
- video codec: ffmjpeg
- compression format: mjpeg, color encoding: yuvj420p
- the samples follow the naming scheme:
id_shape_startLocation_endLocation_motionDirection_euclideanDistance
where:
  - id: a unique identifier
  - shape: one of the three shapes, e.g. circle
  - startLocation: starting position of the object, e.g. righttop
  - endLocation: destination position of the object, e.g. leftbottom
  - motionDirection: e.g. left
  - euclideanDistance: Euclidean distance between the two objects, e.g. 7.765617
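To make the naming scheme concrete, the sketch below splits a file name into these fields and loads the clip as 10 equally spaced frames with OpenCV, mirroring how the tfrecord samples were extracted. The example file name, the use of OpenCV, and the assumption that no field contains an underscore are illustrative choices, not part of the dataset tooling.

```python
import cv2
import numpy as np

def parse_sample_name(filename):
    # id_shape_startLocation_endLocation_motionDirection_euclideanDistance
    # (assumes none of the fields contains an underscore)
    stem = filename.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    sample_id, shape, start, end, direction, distance = stem.split("_")
    return {
        "id": sample_id,
        "shape": shape,
        "start_location": start,
        "end_location": end,
        "motion_direction": direction,
        "eucl_distance": float(distance),
    }

def load_equally_spaced_frames(path, num_frames=10):
    # Read all frames of the .avi and keep num_frames equally spaced ones.
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.stack([frames[i] for i in indices])   # shape: (10, 128, 128, 3)

meta = parse_sample_name("0_circle_righttop_leftbottom_left_7.765617.avi")  # illustrative name
clip = load_equally_spaced_frames("path/to/video.avi")                      # placeholder path
```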
The tfrecord files have been created with the pip package `video2tfrecord`, and each file contains 1000 videos.
Due to the high computational cost of processing all original video frames in deep neural networks, we decided to reduce the number of extracted frames for the tfrecord files. As a result, every tfrecord entry consists of 10 RGB frames taken equally distributed over the video playtime. Assuming no prior knowledge about the video and its inherent scene dynamics, choosing equally spaced frames maximizes the chances of capturing most of the spatio-temporal dynamics. The files store both the videos themselves and the meta information from the file name (start location, Euclidean distance etc.). The video data is stored in a feature dict (which is serialized as a tf.train.Example) that contains the following keys:
- feature[path] (path = 'blob' + '/' + str(imageCount), one key per frame)
- feature['height']
- feature['width']
- feature['depth']
- feature['id']
Additional information is stored in a meta_dict (which is also serialized within the feature dict via the key 'metadata') with the following keys:
- meta_dict['shape']
- meta_dict['start_location']
- meta_dict['end_location']
- meta_dict['motion_location']
- meta_dict['eucl_distance']
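As an illustration of this layout, here is a minimal parsing sketch with TensorFlow. The feature types (int64 dimensions, id and metadata as byte strings, each frame stored as raw uint8 bytes under 'blob/<i>') and the example file name are assumptions inferred from the key list above, so they may need adjusting against the actual files.

```python
import tensorflow as tf

NUM_FRAMES = 10  # frames per tfrecord sample, see above

# Assumed feature spec: dimensions as int64, id/metadata as byte strings,
# and each of the 10 frames as raw uint8 bytes under 'blob/<i>'.
feature_spec = {
    "height": tf.io.FixedLenFeature([], tf.int64),
    "width": tf.io.FixedLenFeature([], tf.int64),
    "depth": tf.io.FixedLenFeature([], tf.int64),
    "id": tf.io.FixedLenFeature([], tf.string),
    "metadata": tf.io.FixedLenFeature([], tf.string),
}
for i in range(NUM_FRAMES):
    feature_spec["blob/" + str(i)] = tf.io.FixedLenFeature([], tf.string)

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, feature_spec)
    frames = [
        tf.reshape(
            tf.io.decode_raw(example["blob/" + str(i)], tf.uint8),
            (example["height"], example["width"], example["depth"]),
        )
        for i in range(NUM_FRAMES)
    ]
    video = tf.stack(frames)  # shape: (10, 128, 128, 3)
    return video, example["metadata"]

dataset = tf.data.TFRecordDataset(["path/to/file.tfrecords"])  # placeholder path
dataset = dataset.map(parse_example)
```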