Duplicate entries in the 0.5 cut catalog #53

Open
mschwamb opened this issue May 26, 2016 · 5 comments

@mschwamb
Collaborator

It appears that there are duplicate entries, at least in some of the 0.5 cut CSV files.

For example I have:

488.92527796427413,283.42631325721743,2277.2586112976073,6649.35964659055,209.19709872152092,44.129834465313714,128.01835043922821,391.3928304362869,271.22236959730355,427.41745758605964,206.75622395261806,APF0000q7s,1

listed twice in ESP_021494_0945_fans.csv under applied_cut_0.5

178.03871848366478,291.06212165138936,918.0387184836648,11251.062121651392,51.881296109111894,13.286515110774953,13.777289639625213,186.2403877064726,304.335176078663,169.83704926085696,277.78906722411574,167.19966246744863,299.56674242432524,188.87777449988093,282.5575008784535,APF0000kou,1

is listed twice for APF0000kou in ESP_021526_0985_blotches.csv

It doesn't look like every entry is duplicated, so I'm not sure what exactly happened here.

mschwamb added the bug label May 26, 2016
michaelaye added this to the first_paper milestone May 27, 2016
@michaelaye
Owner

Wow. I started to look at this yesterday, thanks for the find; something is seriously wrong with how I apply the cut. Fortunately, the fnotching itself is okay, as evidenced by the fact that the files outside the applied cut don't seem to show any duplicates (I ran a check on all of them).
But here are the stats on the cut-applied files:
[screenshot: per-file duplicate-count statistics for the cut-applied files]
Meaning, while most have no dupes, there are more than 10 files that have between 1000 and 2000 dupes!
Looking into this now.
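
For anyone wanting to reproduce the check, here is a minimal sketch of such a duplicate scan. The directory name applied_cut_0.5 comes from the report above; everything else is illustrative, not the actual check that produced the stats:

```python
from pathlib import Path

import pandas as pd

# Count fully identical rows in every cut-applied catalog file.
dupe_stats = {}
for csv_path in Path("applied_cut_0.5").glob("*.csv"):
    df = pd.read_csv(csv_path)
    # duplicated() flags each repeat of an already-seen row
    dupe_stats[csv_path.name] = int(df.duplicated().sum())

stats = pd.Series(dupe_stats).sort_values(ascending=False)
print(stats[stats > 0])
```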

@mschwamb
Collaborator Author

I think this explains why I see significantly more fans and blotches in the catalog than were marked by the science team for the gold standard data, and it might explain the variation Anya saw for two images taken at very close temporal separation.

@michaelaye
Owner

michaelaye commented May 31, 2016

So, this led down a rabbit hole, but I'm seeing the end of it:

  • First, I found a not insignificant bug in the pandas library with on-disk filtering of data columns stored as pandas Categoricals (reported as "categories in HDFStore don't filter correctly", pandas-dev/pandas#13322). I am using Categoricals instead of plain strings for image_id and image_name because the predetermined string length currently saves around 2 GB of disk space for the reduced classification database. Not a biggie, except that I want to copy the database file around often. For now, worked around by using plain strings again for image_id and image_name (see the first sketch after this list).
  • Then, importantly, I found a lingering bug in my fnotching code. It was introduced when we realized that it is helpful to keep both the Planet Four tile coordinates of markings and the HiRISE image coordinates alive throughout the clustering process, so that one can refer from one to the other later for plotting. Previously I brutally dropped what I did not need and renamed image_x and image_y to x and y when clustering at HiRISE scope. This meant that once I removed the renaming and kept both image_x/y and x/y, the fnotching would use the Planet Four x/y coordinates even though the pipeline was working at HiRISE scope. That produced the sometimes thousands of duplicates: when presented with all the data for a whole HiRISE image, the fnotching code of course found many overlapping clusters while looking only at P4 tile coordinates, since there are many P4 tiles in a HiRISE image. Fixed by now always requiring a scope argument that states at all times which scope I'm working in (planet4 or hirise) and then uses the appropriate data columns, without losing any (see the second sketch after this list).
  • Finally, it became clear that even after the previous fix, some duplicates remain. I am reproducing a catalog now to assess the status after the bigger bugfix.
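
A minimal sketch of the string workaround from the first bullet. The column names image_id and image_name come from the comment; the toy DataFrame, file name, and store key are hypothetical:

```python
import pandas as pd

# Toy stand-in for the reduced classification database (real data has
# many more columns and rows).
df = pd.DataFrame({
    "image_id": pd.Categorical(["APF0000q7s", "APF0000kou"]),
    "image_name": pd.Categorical(["ESP_021494_0945", "ESP_021526_0985"]),
    "x": [488.9, 178.0],
})

# Workaround for pandas-dev/pandas#13322: cast the Categorical columns
# back to plain strings before writing, so on-disk `where` filtering
# in HDFStore returns correct results (at the cost of more disk space).
df["image_id"] = df["image_id"].astype(str)
df["image_name"] = df["image_name"].astype(str)

with pd.HDFStore("classifications.h5", mode="w") as store:
    store.append("db", df, data_columns=["image_id", "image_name"])
    subset = store.select("db", where="image_id == 'APF0000q7s'")
print(subset)
```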

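And a minimal sketch of the scope-argument pattern from the second bullet. The scope values and column names follow the comment, but the function is an illustration, not the actual fnotching code:

```python
import pandas as pd

def coords_for_scope(data: pd.DataFrame, scope: str) -> pd.DataFrame:
    """Return the coordinate columns appropriate for the given scope.

    Both coordinate systems stay in `data`; we only *select* the right
    pair, instead of renaming image_x/image_y to x/y and losing one set.
    """
    if scope == "planet4":
        cols = ["x", "y"]              # Planet Four tile coordinates
    elif scope == "hirise":
        cols = ["image_x", "image_y"]  # full HiRISE image coordinates
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return data[cols]

# e.g. cluster/fnotch on HiRISE-scope coordinates:
# xy = coords_for_scope(markings, scope="hirise")
```
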
@michaelaye
Owner

So, here are the remaining dupes after fixing the above-mentioned bug: out of 439 blotch and fan files for seasons 2 and 3, 172 show duplicates, with the ones having more than 20 looking like this:
[screenshot 2016-06-02: per-file duplicate counts for the files with more than 20 duplicates]

Anya will run this new catalog through her scripts today to see if it has any influence on the variability of the early-in-the-season data points.

@michaelaye
Owner

Within that highest obsid, the distribution of the top duplicate containers is like this:
[screenshot 2016-06-02: distribution of duplicate counts per container within the highest obsid]
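
For reference, such per-container counts can be produced along these lines. This is a sketch, assuming the containers are the Planet Four tile ids (the image_id column visible in the CSV rows above) and using an example file name from the report:

```python
import pandas as pd

# Count duplicated rows per Planet Four tile (image_id) within one
# obsid file; keep=False marks *all* members of each duplicated group.
df = pd.read_csv("ESP_021494_0945_fans.csv")
dupes = df[df.duplicated(keep=False)]
print(dupes["image_id"].value_counts().head(10))
```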
