Format

The target format is version 5 of the Hierarchical Data Format (HDF5), a well-established data format with good reading routines for Python, Matlab and IDL.

Parsing

The first step is a straightforward parsing of the CSV output of the Mongo database dump. While parsing, values of 'null' are replaced by numpy.NaN. I made the conscious decision NOT to replace None in the marking column by NaN, because that detail is in itself usable data.

Both the acquisition_date and the created_at columns are currently parsed into Python datetime objects. This can be switched off by calling the reduction routine with the --raw_times option.
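
In pandas terms, the time parsing can be sketched like this (a rough sketch only; `df` stands for the DataFrame holding the parsed CSV dump, and the column names are the ones mentioned above — the actual code in planet4_reduction.py may differ):

    import pandas as pd

    # df is assumed to hold the parsed CSV dump.
    # Parse both time columns into datetime objects (skipped with --raw_times).
    for col in ['acquisition_date', 'created_at']:
        df[col] = pd.to_datetime(df[col])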

Filtering / Cleaning

Some fan and blotch markings have some of their required data fields empty. By default we remove these from the HDF5 database files. This is done as follows (a code sketch follows the list):

  1. Define the required columns for both fan and blotch markings. These are:

    blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
    fan_data_cols = 'x y image_x image_y distance angle spread'.split()
  2. For each marking in ['fan', 'blotch'] do:

    1. Split the data into the rows for this marking and the rest.
    2. Filter the marking data for the respective required columns.
    3. Filter out any rows that have any of these fields empty.
    4. Combine the reduced marking data with the rest.
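
A minimal sketch of this cleaning step, assuming the dump already sits in a pandas DataFrame `df` with a `marking` column (the actual implementation in planet4_reduction.py may differ in detail):

    import pandas as pd

    blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
    fan_data_cols = 'x y image_x image_y distance angle spread'.split()

    def remove_incomplete_markings(df):
        required = {'blotch': blotch_data_cols, 'fan': fan_data_cols}
        parts = []
        for marking, cols in required.items():
            subset = df[df.marking == marking]
            # drop rows where any of the required fields is empty (NaN)
            parts.append(subset.dropna(subset=cols, how='any'))
        # all other markings are kept untouched
        parts.append(df[~df.marking.isin(list(required))])
        return pd.concat(parts)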

Note that these incomplete data do not only stem from the first days during the TV event but are, albeit at lower frequency, scattered throughout the following year.

Additionally, the data are scanned for empty lines, because the current Mongo dump file contains an empty line at the end. So far this scan does not take much time; in the future one might only check the last line for emptiness to speed things up.
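
With pandas, dropping such fully empty rows can be sketched in one line (again assuming the DataFrame `df`):

    # drop rows in which every field is empty (e.g. the trailing empty line)
    df = df.dropna(how='all')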

Application

The application is called planet4_reduction.py and when called with -h for help, it provides the following output:

    usage: planet4_reduction.py [-h] [--raw_times] [--keep_dirt] csv_fname

    positional arguments:
      csv_fname    Provide the filename of the database dump csv-file here.

    optional arguments:
      -h, --help   show this help message and exit
      --raw_times  Do not parse the times into a Python datetime object. For the
                   stone-age. ;) Default: parse into datetime object.
      --keep_dirt  Do not filter for dirty data. Keep everything. Default: Do the
                   filtering.
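
The interface above could be defined with argparse roughly as follows (a sketch reconstructed from the help output, not necessarily the exact code in planet4_reduction.py):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument('csv_fname',
                        help='Provide the filename of the database dump csv-file here.')
    parser.add_argument('--raw_times', action='store_true',
                        help='Do not parse the times into a Python datetime object.')
    parser.add_argument('--keep_dirt', action='store_true',
                        help='Do not filter for dirty data. Keep everything.')
    args = parser.parse_args()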

Reduction levels

I produce different versions of the reduced dataset with increasing levels of reduction, resulting in smaller and faster-to-read files.

For all file names, the date part indicates the date of the database dump that is delivered by Stuart.

Level Fast_Read

This file stores all cleaned data in a fixed table format, in case one needs to read everything into memory in the fastest way.

The above-mentioned filtering was applied, so tutorials and incomplete data rows have been removed.

Product file name is yyyy-mm-dd_planet_four_classifications_fast_all_read.h5
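
Reading this file back in full could look like this (assuming the HDF store key is 'df', as used for the queryable file below):

    import pandas as pd

    # read the complete cleaned dataset into memory in one go
    data = pd.read_hdf('yyyy-mm-dd_planet_four_classifications_fast_all_read.h5', 'df')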

Level Queryable

This file contains the same data as the Fast_Read level, but combined with a multi-column index, so that the database file can be queried. The data columns that can be filtered on are:

    data_columns = ['classification_id', 'image_id',
                    'image_name', 'user_name', 'marking',
                    'acquisition_date', 'local_mars_time']
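
A sketch of how such a queryable store can be written with pandas (`df` and `h5_fname` are placeholders; the store key 'df' is the one used in the query example below):

    # write a queryable 'table' store with the above columns indexed for querying
    df.to_hdf(h5_fname, 'df', format='table', data_columns=data_columns)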

Querying works as follows (and is amazingly fast, btw); for example, to get all data for one image_id:

    data = pd.read_hdf(database_fname, 'df', where='image_id=<image_id>')

Product file name is yyyy-mm-dd_planet_four_classifications_queryable.h5

Level Retired (not yet implemented in reduction.py)

This product is reduced to only include image_ids that have been retired (> 30 analyses done).

Product file name is yyyy-mm-dd_planet_four_classifications_retired.h5
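
Since this level is not yet implemented, the following is only one possible way the selection might be done, using the classification_id and image_id columns mentioned above:

    # count unique classifications per image and keep only retired images (> 30)
    counts = df.groupby('image_id').classification_id.nunique()
    retired_ids = counts[counts > 30].index
    retired = df[df.image_id.isin(retired_ids)]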
