Duplicate entries in the 0.5 cut catalog #53

Open
mschwamb opened this issue May 26, 2016 · 5 comments

@mschwamb
Collaborator

It appears that there are duplicate entries, at least in some of the 0.5 cut CSV files.

For example I have:

488.92527796427413,283.42631325721743,2277.2586112976073,6649.35964659055,209.19709872152092,44.129834465313714,128.01835043922821,391.3928304362869,271.22236959730355,427.41745758605964,206.75622395261806,APF0000q7s,1

listed twice in ESP_021494_0945_fans.csv under applied_cut_0.5

178.03871848366478,291.06212165138936,918.0387184836648,11251.062121651392,51.881296109111894,13.286515110774953,13.777289639625213,186.2403877064726,304.335176078663,169.83704926085696,277.78906722411574,167.19966246744863,299.56674242432524,188.87777449988093,282.5575008784535,APF0000kou,1

is listed twice for APF0000kou in ESP_021526_0985_blotches.csv

It doesn't look like every entry is duplicated, so I'm not sure what exactly happened here.

mschwamb added the bug label May 26, 2016
michaelaye added this to the first_paper milestone May 27, 2016
@michaelaye
Owner

Wow. I started to look at this yesterday, thanks for the find; something is seriously wrong with how I apply the cut. Fortunately, the fnotching itself is okay, as evidenced by the fact that the files outside the applied cut don't seem to show any duplicates (I ran a check on all of them).
But here are the stats on the cut-applied files:
[screenshot: per-file duplicate-count statistics for the cut-applied files]
Meaning, while most have no dupes, there are more than 10 files that have between 1000 and 2000 dupes!
Looking into this now.
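
For anyone wanting to reproduce the check, here is a minimal sketch of such a duplicate scan. The directory name applied_cut_0.5 comes from the report above; everything else is illustrative, not the actual check that produced the stats:

```python
from pathlib import Path

import pandas as pd

# Count fully identical rows in every cut-applied catalog file.
dupe_stats = {}
for csv_path in Path("applied_cut_0.5").glob("*.csv"):
    df = pd.read_csv(csv_path)
    # duplicated() flags each repeat of an already-seen row
    dupe_stats[csv_path.name] = int(df.duplicated().sum())

stats = pd.Series(dupe_stats).sort_values(ascending=False)
print(stats[stats > 0])
```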

@mschwamb
Collaborator Author

I think this explains why I see significantly more fans and blotches in the catalog than were marked by the science team for the gold standard data, and it might explain the variation Anya saw for two images taken at very close temporal separation.

@michaelaye
Owner

michaelaye commented May 31, 2016

So, this led down a rabbit hole, but I'm seeing the end of it:

  • First, I found a not insignificant bug in the pandas library with on-disk filtering of data columns stored as pandas Categoricals (reported as "categories in HDFStore don't filter correctly", pandas-dev/pandas#13322). I am using Categoricals instead of plain strings for image_id and image_name because the predetermined string length currently saves around 2 GB of disk space for the reduced classification database. Not a biggie, except that I want to copy the database file around often. For now, worked around by using plain strings again for image_id and image_name (see the first sketch after this list).
  • Then, importantly, I found a lingering bug in my fnotching code. It was introduced when we realized that it is helpful to keep both the Planet Four tile coordinates of markings and the HiRISE image coordinates alive throughout the clustering process, so that one can refer from one to the other later for plotting. Previously I brutally dropped what I did not need and renamed image_x and image_y to x and y when clustering at HiRISE scope. This meant that once I removed the renaming and kept both image_x/y and x/y, the fnotching would use the Planet Four x/y coordinates even though the pipeline was working at HiRISE scope. That produced the sometimes thousands of duplicates: when presented with all the data for a whole HiRISE image, the fnotching code of course found many overlapping clusters while looking only at P4 tile coordinates, since there are many P4 tiles in a HiRISE image. Fixed by now always requiring a scope argument that states at all times which scope I'm working in (planet4 or hirise) and then uses the appropriate data columns, without losing any (see the second sketch after this list).
  • Finally, it became clear that even after the previous fix, some duplicates remain. I am reproducing a catalog now to assess the status after the bigger bugfix.
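
A minimal sketch of the string workaround from the first bullet. The column names image_id and image_name come from the comment; the toy DataFrame, file name, and store key are hypothetical:

```python
import pandas as pd

# Toy stand-in for the reduced classification database (real data has
# many more columns and rows).
df = pd.DataFrame({
    "image_id": pd.Categorical(["APF0000q7s", "APF0000kou"]),
    "image_name": pd.Categorical(["ESP_021494_0945", "ESP_021526_0985"]),
    "x": [488.9, 178.0],
})

# Workaround for pandas-dev/pandas#13322: cast the Categorical columns
# back to plain strings before writing, so on-disk `where` filtering
# in HDFStore returns correct results (at the cost of more disk space).
df["image_id"] = df["image_id"].astype(str)
df["image_name"] = df["image_name"].astype(str)

with pd.HDFStore("classifications.h5", mode="w") as store:
    store.append("db", df, data_columns=["image_id", "image_name"])
    subset = store.select("db", where="image_id == 'APF0000q7s'")
print(subset)
```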

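And a minimal sketch of the scope-argument pattern from the second bullet. The scope values and column names follow the comment, but the function is an illustration, not the actual fnotching code:

```python
import pandas as pd

def coords_for_scope(data: pd.DataFrame, scope: str) -> pd.DataFrame:
    """Return the coordinate columns appropriate for the given scope.

    Both coordinate systems stay in `data`; we only *select* the right
    pair, instead of renaming image_x/image_y to x/y and losing one set.
    """
    if scope == "planet4":
        cols = ["x", "y"]              # Planet Four tile coordinates
    elif scope == "hirise":
        cols = ["image_x", "image_y"]  # full HiRISE image coordinates
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return data[cols]

# e.g. cluster/fnotch on HiRISE-scope coordinates:
# xy = coords_for_scope(markings, scope="hirise")
```
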
@michaelaye
Owner

So, here are the remaining dupes after fixing the above-mentioned bug: out of 439 blotch and fan files for seasons 2 and 3, 172 show duplicates, with the ones having more than 20 looking like this:
[screenshot 2016-06-02: per-file duplicate counts for the files with more than 20 duplicates]

Anya will run this new catalog through her scripts today to see if it has any influence on the variability of the early-in-the-season data points.

@michaelaye
Owner

Within that highest obsid, the distribution of the top duplicate containers is like this:
[screenshot 2016-06-02: distribution of duplicate counts per container within the highest obsid]
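
For reference, such per-container counts can be produced along these lines. This is a sketch, assuming the containers are the Planet Four tile ids (the image_id column visible in the CSV rows above) and using an example file name from the report:

```python
import pandas as pd

# Count duplicated rows per Planet Four tile (image_id) within one
# obsid file; keep=False marks *all* members of each duplicated group.
df = pd.read_csv("ESP_021494_0945_fans.csv")
dupes = df[df.duplicated(keep=False)]
print(dupes["image_id"].value_counts().head(10))
```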
