K.-Michael Aye edited this page Apr 15, 2015 · 28 revisions

tl;dr (for the impatient)

python reduction.py path_to_csv_file

will create an HDF file with indexes, so it is queryable for values, for example like so:

df = pd.read_hdf(path_to_hdf_file, 'df', where='image_id="PSP_003092_0985"')

These are the options for the reduction module:

$ python reduction.py --help
usage: reduction.py [-h] [--raw_times] [--keep_dirt] [--do_fastread] csv_fname

positional arguments:
  csv_fname      Provide the filename of the database dump csv-file here.

optional arguments:
  -h, --help     show this help message and exit
  --raw_times    Do not parse the times into a Python datetime object. For the
                 stone-age. ;) Default: parse into datetime object.
  --keep_dirt    Do not filter for dirty data. Keep everything. Default: Do
                 the filtering.
  --do_fastread  Produce the fast-read database file for complete read into
                 memory.

Intro

The reduction module offers a standard data reduction pipeline that converts the weekly CSV database dump into a fast, on-disk, queryable HDF5 data file.

Format

The target format is the Hierarchical Data Format (HDF) in version 5, a well-established data format with good reading routines for Python, Matlab, and IDL.

Parsing

The first step is a straightforward parsing of the CSV output of the Mongo database dump. While parsing, values of 'null' are replaced by numpy.NaN, so your analysis code needs to be NaN-aware. I made the conscious decision to NOT replace None in the marking column by NaN, because that detail is in itself usable data.
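This parsing behavior can be sketched with pandas (an illustration only, not the actual reduction.py code; the toy columns and values here are invented):

```python
import io
import pandas as pd

# Toy stand-in for the Mongo CSV dump (columns are illustrative).
csv = io.StringIO("marking,x\nfan,1.0\nNone,null\n")

# 'null' becomes NaN; keep_default_na=False stops pandas from also
# converting the string 'None' in the marking column into NaN.
df = pd.read_csv(csv, na_values=['null'], keep_default_na=False)

print(df.marking.tolist())   # ['fan', 'None'] -- 'None' survives as data
print(df.x.isna().tolist())  # [False, True]  -- 'null' became NaN
```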

Also, I am not including the user_agent column in the HDF file, as this would blow up the size of the resulting file immensely. This column was included for debugging purposes only anyway.

Date conversion

Both the acquisition_date and the created_at column are currently parsed into a Python datetime data type. This data type has the great advantage that existing tools handle time-related plots, groupings, and sortings automatically. In case this is not what the user wants, the conversion can be omitted by using the option --raw_times when performing the data reduction.
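The conversion itself can be sketched with pandas like this (a minimal illustration; the timestamp values here are invented):

```python
import pandas as pd

df = pd.DataFrame({'acquisition_date': ['2008-03-01 14:02:00'],
                   'created_at': ['2015-04-15 12:00:00']})

# parse both timestamp columns into datetime64, enabling .dt accessors,
# resampling, and time-based grouping and sorting
for col in ['acquisition_date', 'created_at']:
    df[col] = pd.to_datetime(df[col])

print(df.created_at.dt.year.iloc[0])  # 2015
```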

Tutorials split-off

Next, I split off the tutorials into their own HDF file. This is useful if one wants to study the tutorial performance per se, without confusing the science results with the tutorial data. The file pattern for the tutorial file is {}_tutorials.h5, with {} standing for the basename of the CSV filename.

Filtering / Cleaning

Empty lines

Some earlier CSV dumps had empty and/or NaN-only lines in them, or the CSV parser created them for something unreadable in the CSV file. This no longer seems to happen, but nevertheless I am dropping completely empty and NaN-ed lines like this:

df = df.dropna(how='all')

Incomplete data-sets

Some markings for fans and blotches have some of their required data fields empty. By default, we remove these from the HDF5 database files. The way this is done in scan_for_incomplete is:

  1. Define the required columns for both fan and blotch markings. These are:

    blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
    fan_data_cols = 'x y image_x image_y distance angle spread'.split()
  2. For each marking ['fan', 'blotch'] do:

    1. Split the data in this marking data and the rest.
    2. Filter the marking data for the respective required columns.
    3. Filter out any rows that have any of these fields empty.
    4. Combine the reduced marking data with the rest of the data file.
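The steps above can be sketched as follows (a simplified reimplementation, not the original scan_for_incomplete; the toy data are invented):

```python
import pandas as pd

blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
fan_data_cols = 'x y image_x image_y distance angle spread'.split()

def drop_incomplete(df, marking, required_cols):
    """Remove rows of one marking type that lack any required field."""
    is_marking = df.marking == marking
    data, rest = df[is_marking], df[~is_marking]
    # keep only rows where every required column is filled
    data = data.dropna(subset=required_cols, how='any')
    return pd.concat([data, rest], ignore_index=True)

# toy example: one complete and one incomplete blotch
df = pd.DataFrame({'marking': ['blotch', 'blotch'],
                   'x': [1.0, 2.0], 'y': [1.0, 2.0],
                   'image_x': [10.0, 20.0], 'image_y': [10.0, 20.0],
                   'radius_1': [5.0, 5.0], 'radius_2': [3.0, None]})
df = drop_incomplete(df, 'blotch', blotch_data_cols)
print(len(df))  # 1 -- the incomplete blotch was dropped
```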

Note: these incomplete data are not only from the first days during the TV event, but are scattered, albeit at lower frequency, throughout the dataset.

Fast Read

By default this is skipped, but if called with --do_fastread, the reduction pipeline also creates an unindexed, and therefore smaller and faster-to-read, HDF5 file. This is useful if you have a machine with a huge amount of RAM and want to play with the whole dataset at once.

Convert blotch angles

The blotch angles suffer from an anti-symmetry depending on how the drawing was done. A blotch drawn parallel to the x-axis from right to left has an angle of 180 degrees, while one drawn from left to right has an angle of 0, even though the resulting blotch looks exactly the same. This skews the mean-value computation after clustering and can create the amusing display of the mean value being orthogonal to the original markings. I apply the following algorithm to unify the blotch angles:

def convert_ellipse_angles(df):
    def func(angle):
        if angle < 0:
            return angle + 180
        elif angle > 180:
            return angle - 180
        else:
            return angle
    # assign the mapped values back; calling .map alone would discard them
    blotches = df.marking == 'blotch'
    df.loc[blotches, 'angle'] = df.loc[blotches, 'angle'].map(func)
    return
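A quick self-contained check of the normalization on invented angles (restating the same mapping inline):

```python
import pandas as pd

def normalize_angle(angle):
    # same mapping as convert_ellipse_angles: fold angles into [0, 180]
    if angle < 0:
        return angle + 180
    if angle > 180:
        return angle - 180
    return angle

df = pd.DataFrame({'marking': ['blotch', 'blotch', 'fan'],
                   'angle': [-10.0, 190.0, 350.0]})
blotches = df.marking == 'blotch'
df.loc[blotches, 'angle'] = df.loc[blotches, 'angle'].map(normalize_angle)
print(df.angle.tolist())  # [170.0, 10.0, 350.0] -- fan angles untouched
```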

Reduction levels

Different versions of the reduced dataset can be produced. For all file names, the date part indicates the date of the database dump, which is delivered every week by Stuart.

Level Fast_Read

This file is a fixed table format for all cleaned data, in case one needs to read everything into memory the fastest way.

The above-mentioned filtering was applied, so tutorials and incomplete data rows are removed and blotch angles are normalized.

Product file name is yyyy-mm-dd_planet_four_classifications_fast_all_read.h5

Level Queryable

This file is the data of Level Fast_Read, but combined with a multi-column index, to be able to query the database file. The data columns that can be filtered for are:

data_columns = ['classification_id', 'image_id',
                'image_name', 'user_name', 'marking',
                'acquisition_date', 'local_mars_time']

Querying works like this (amazingly fast, by the way); for example, to get all data for one image_id:

data = pd.read_hdf(database_fname, 'df', where='image_id="<image_id>"')

where df is the HDF internal handle for the table. This is required because HDF files can contain more than one table structure.

Product file name is yyyy-mm-dd_planet_four_classifications_queryable.h5

Level Retired (not yet implemented in reduction.py)

This might be a convenient data product to have, but I have not gotten around to adding it to the standard reduction yet.

This product is reduced to only include image_ids that have been retired (> 30 classifications performed).

Product file name is yyyy-mm-dd_planet_four_classifications_retired.h5
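Such a retirement filter could be sketched like this (my own hedged sketch, not the planned implementation; the column names follow the data_columns list above, the threshold of 30 comes from the text, and the toy data are invented):

```python
import pandas as pd

df = pd.DataFrame({'image_id': ['A'] * 31 + ['B'] * 5,
                   'classification_id': range(36)})

# count classifications per image and keep only retired images (> 30)
counts = df.groupby('image_id').classification_id.nunique()
retired_ids = counts[counts > 30].index
retired = df[df.image_id.isin(retired_ids)]

print(retired.image_id.unique().tolist())  # ['A']
```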