Data reduction
The target format is the Hierarchical Data Format version 5 (HDF5), a well-established data format with good reading routines for Python, Matlab and IDL.
The first step is a straightforward parse of the CSV output of the Mongo database dump.
While parsing, values of 'null' are replaced by `numpy.NaN`.
I made the conscious decision NOT to replace `None` in the `marking` column by NaN, because that detail is itself usable data.
Both the `acquisition_date` and the `created_at` columns are currently parsed into Python datetime objects. This can be switched off by calling the reduction routine with the option `--raw_times`.
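A minimal sketch of this parsing step using pandas; the exact read options and the function name are assumptions, not the project's actual implementation:

```python
import pandas as pd

def parse_dump(csv_fname, raw_times=False):
    """Sketch of the CSV parsing step described above."""
    # Treat only the literal string 'null' as missing, so it becomes numpy.NaN;
    # keep_default_na=False stops pandas from also converting 'None' (and other
    # default tokens), which keeps the marking column's 'None' entries as data.
    df = pd.read_csv(csv_fname, na_values=['null'], keep_default_na=False)

    if not raw_times:
        # Parse the two time columns into datetime objects.
        for col in ('acquisition_date', 'created_at'):
            df[col] = pd.to_datetime(df[col])
    return df
```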
Some markings for fans and blotches have some of their required data fields empty. By default, these are removed from the HDF5 database files. This is done as follows:
- Define the required columns for both fan and blotch markings. These are:

  ```python
  blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
  fan_data_cols = 'x y image_x image_y distance angle spread'.split()
  ```
- Filter the data for these columns and then drop any rows that have any of these fields empty (sketched below).
Additionally, the data is checked for completely empty lines, because the current Mongo dump file contains an empty line at the end. So far this check does not take much time; in the future it may be enough to only check the last line for emptiness, to speed things up.
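The following sketch combines both filters; the `marking` column name and the use of pandas are assumptions about the implementation:

```python
import numpy as np
import pandas as pd

blotch_data_cols = 'x y image_x image_y radius_1 radius_2'.split()
fan_data_cols = 'x y image_x image_y distance angle spread'.split()

def remove_dirt(df):
    """Drop markings whose required data fields are empty (sketch only)."""
    # Drop the completely empty line(s) at the end of the current dump.
    df = df.dropna(how='all')

    # Treat empty strings as missing so dropna can catch them below.
    df = df.replace('', np.nan)

    # Keep a blotch/fan row only if all of its required fields are present;
    # 'marking' is the assumed name of the column holding the marking type.
    blotches = df[df.marking == 'blotch'].dropna(subset=blotch_data_cols)
    fans = df[df.marking == 'fan'].dropna(subset=fan_data_cols)
    others = df[~df.marking.isin(['blotch', 'fan'])]
    return pd.concat([blotches, fans, others])
```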
The application is called `planet4_reduction.py` and, when called with `-h` for help, it provides the following output:
```
usage: planet4_reduction.py [-h] [--raw_times] [--keep_dirt] csv_fname

positional arguments:
  csv_fname    Provide the filename of the database dump csv-file here.

optional arguments:
  -h, --help   show this help message and exit
  --raw_times  Do not parse the times into a Python datetime object. For the
               stone-age. ;) Default: parse into datetime object.
  --keep_dirt  Do not filter for dirty data. Keep everything. Default: Do the
               filtering.
```
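This interface maps directly onto Python's argparse module; a sketch of how such a parser could be set up (not necessarily the actual source) looks like this:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('csv_fname',
                    help='Provide the filename of the database dump csv-file here.')
parser.add_argument('--raw_times', action='store_true',
                    help='Do not parse the times into a Python datetime object. '
                         'Default: parse into datetime object.')
parser.add_argument('--keep_dirt', action='store_true',
                    help='Do not filter for dirty data. Keep everything. '
                         'Default: Do the filtering.')
args = parser.parse_args()
```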
I produce several versions of the reduced dataset, with increasing levels of reduction, resulting in smaller files that are faster to read.
For all file names, the date part indicates the date of the database dump that is delivered regularly by Stuart.
All data from the CSV Mongo dump is included, but converted to a fast-loading, fixed-format HDF file. No filtering is done.
Product file name is `20xx-xx-xx_planet_four_classifications_L0.h5`.
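Writing such a fixed-format HDF5 file with pandas could look like the sketch below; the key name is a placeholder and not necessarily what the tool uses:

```python
import pandas as pd

def write_level0(df: pd.DataFrame, fname: str) -> None:
    # format='fixed' produces the fast-loading (but non-queryable) HDF layout.
    df.to_hdf(fname, key='data', format='fixed')

# Reading the product back later is a single call:
# df = pd.read_hdf('20xx-xx-xx_planet_four_classifications_L0.h5')
```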
This product splits the data up into separate fan and blotch marking tables.
Product file name is `planet_four_level_1_20xx-xx-xx.h5`.
This product is reduced to the data records that are finished in Planet4 terms, which is currently defined as having 30 individual analyses performed on a specific Planet4 subframe.
Product file name is `planet_four_level_2_20xx-xx-xx.h5`.
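A sketch of how this "finished" criterion could be applied with pandas; the column names `image_id` (for the subframe) and `classification_id` (for an individual analysis) are assumptions:

```python
import pandas as pd

def reduce_to_level2(df: pd.DataFrame, required: int = 30) -> pd.DataFrame:
    """Keep only subframes that received the required number of analyses (sketch)."""
    # Count the distinct classifications each subframe has received.
    counts = df.groupby('image_id')['classification_id'].nunique()

    # Keep only the subframes that reached the 'finished' threshold.
    finished = counts[counts >= required].index
    return df[df.image_id.isin(finished)]
```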
This product is reduced further from Level 2 by only including data records with `marking != 'None'`.
In other words, each data record of this data product has marking data in it.
Product file name is `planet_four_level_3_20xx-xx-xx.h5`.
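Correspondingly, Level 3 is just the Level 2 data with the marking-less records dropped; a short sketch (file name shown only as a placeholder):

```python
import pandas as pd

def reduce_to_level3(level2_fname: str) -> pd.DataFrame:
    """Drop all records without marking data (sketch only)."""
    df = pd.read_hdf(level2_fname)
    # Keep only records that carry actual marking data, i.e. whose
    # marking entry is not the string 'None'.
    return df[df.marking != 'None']

# Example usage (placeholder file name):
# level3 = reduce_to_level3('planet_four_level_2_20xx-xx-xx.h5')
```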