Super mandatory columns #38
Just a heads-up: there's a new combined table (with the new source_mat_ids) for all sampling-event (incl. measured) data for the 227 samples sent for sequencing in batches 1 and 2. It's pretty easy to see which fields the stations consistently fill in... Combined metadata for batch 1 and 2. Though having said that, not all stations have had samples sent to sequencing...
Why would a sample not have mandatory information? I can only think of the blanks that are missing mandatory information (such as depth, which is actually not applicable). But other than that, I think we should push for getting such information from the providers, rather than try to find a way to work around the missing mandatory information issue.
Isn't it more confusing to have "super mandatory", "mandatory" and "optional" columns?
What do you mean?
I will have to dig them out - or check out Cymon's summary spreadsheet - but I seem to recall that a few stations did not, and never will, measure some mandatory values, due to circumstances. When I am online again (on the phone now) I can see how many these are; it is not many. But you also have some stations that will only give measurements much later, and for those, what do we do: put "value will be provided later" (not good), "not measured" (not great), or nothing at all? I thought nothing at all was the cleanest approach, but I don't think I asked you guys what you thought.
FYI, in the "Updated definitions":

Length of wc_fields: 136

Water column mandatory fields: ['arr_date_hq', 'arr_date_seq', 'chlorophyll', 'chlorophyll_method', 'collection_date', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'membr_cut', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'project_name', 'replicate', 'samp_collect_device', 'samp_description', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_vol', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'scientific_name', 'sea_subsurf_salinity', 'sea_subsurf_salinity_method', 'sea_subsurf_temp', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_surf_salinity_method', 'sea_surf_temp', 'sea_surf_temp_method', 'ship_date', 'ship_date_seq', 'size_frac', 'size_frac_low', 'size_frac_up', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tax_id', 'time_fi', 'tot_depth_water_col', 'wa_id']

Soft sediment mandatory fields: ['arr_date_hq', 'arr_date_seq', 'collection_date', 'comm_samp', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'ph', 'ph_method', 'project_name', 'redox_potential', 'redox_potential_method', 'replicate', 'samp_collect_device', 'samp_store_date', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_mass', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'sea_surf_temp', 'sea_surf_temp_method', 'sediment_temp', 'sediment_temp_method', 'ship_date', 'ship_date_seq', 'size_frac_low', 'size_frac_up', 'so_id', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tot_depth_water_col', 'tot_depth_water_col', 'wa_id']

Common MANDATORY fields in both WC and SS: {'longitude', 'contact_email', 'loc_regional_mrgid', 'contact_name', 'project_name', 'organization_country', 'sea_surf_temp_method', 'ship_date_seq', 'wa_id', 'latitude', 'env_package', 'ENA_accession_number_project', 'loc_loc', 'ENA_accession_number_sample', 'long_store', 'replicate', 'sampling_event', 'failure_comment', 'extra_site_info', 'ship_date', 'failure', 'size_frac_low', 'loc_regional', 'geo_loc_name', 'source_mat_id_orig', 'store_temp_hq', 'source_mat_id', 'store_person', 'samp_store_date', 'investigation_type', 'loc_broad_ocean_mrgid', 'env_broad_biome', 'ENA_accession_number_umbrella', 'sea_surf_temp', 'loc_broad_ocean', 'loc_loc_mrgid', 'sampl_person', 'samp_collect_device', 'obs_id', 'arr_date_seq', 'env_local', 'env_material', 'depth', 'organization_edmoid', 'samp_store_temp', 'size_frac_up', 'samp_mat_process', 'tot_depth_water_col', 'samp_store_loc', 'organization', 'arr_date_hq', 'samp_mat_process_dev', 'collection_date'}

MANDATORY fields only in WC: {'samp_size_vol', 'chlorophyll_method', 'size_frac', 'scientific_name', 'sea_subsurf_temp', 'samp_description', 'sea_subsurf_salinity', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_subsurf_salinity_method', 'chlorophyll', 'time_fi', 'sea_surf_salinity_method', 'membr_cut', 'tax_id'}

MANDATORY fields only in SS: {'sediment_temp_method', 'sediment_temp', 'redox_potential_method', 'comm_samp', 'ph', 'ph_method', 'samp_size_mass', 'so_id', 'redox_potential'}
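For reference, a minimal sketch of how these overlaps can be derived from the two field lists with plain set operations (the variable names `wc_mandatory` and `ss_mandatory` are just placeholders for the lists printed above, not the actual script):

```python
# Minimal sketch: compare the WC and SS mandatory field lists.
# wc_mandatory / ss_mandatory stand in for the full lists above (truncated here).
wc_mandatory = ['arr_date_hq', 'chlorophyll', 'depth', 'membr_cut', 'tax_id']
ss_mandatory = ['arr_date_hq', 'comm_samp', 'depth', 'ph', 'redox_potential']

wc_set = set(wc_mandatory)
ss_set = set(ss_mandatory)

print("Common MANDATORY fields in both WC and SS:", wc_set & ss_set)
print("MANDATORY fields only in WC:", wc_set - ss_set)
print("MANDATORY fields only in SS:", ss_set - wc_set)
```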
So I made some new validators just for the mandatory fields of the water_column and soft_sediment sampling sheets (not the measured ones yet). The validators don't check the sanity of the value entered, only that a mandatory field has a value. All stations passed except Bergen, and that was because they had renamed the arr_date_hq field to arr_hq_date (go figure), so they have 150 missing entries: link to issue. The bigger problem is going to be the measured mandatory fields in WC and SS respectively, because some stations have taken no measurements at all. So those might just as well all be made Optional.
I really don't think they should be optional. I would personally push for keeping them mandatory. And if the stations don't follow the protocol, then there is nothing we can do: their samples will not be included in the EMO BON release batches.
OK, so if there are any missing mandatory fields, we DO still carry those rows to the cleaned-up logsheets, but we do NOT carry them forward into triples. |
If they are missing mandatory columns, then they don't get to ENA. I think Nicolas is pretty ready to mention something like this to the stations. |
:-D Grand, we can discuss this in Dec, to see if we need to implement automatic checks, or whether they have to be manual.
So I updated the validators to flag any mandatory value that was None (i.e. no value), NaN (i.e. no-value float), "" (empty string), or "N/A" (or variations thereof). The results for just the "sampling" sheets of water_column and soft_sediment are here: It's clearly not very useful, as recorders, and HQ, add mandatory fields at a later date, so not all rows with a source_mat_id can be expected to have all mandatory fields filled. Also, some fields are only mandatory if another field is positive: failure and failure_comment are both flagged mandatory, but I would guess failure_comment is only mandatory if failure is True. Hence failure_comment should be Optional. I guess this would be more useful for checking that a batch of sampling events is ready before the final publication - would that be before sending to ENA?
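To make the flagging rules concrete, here is a minimal sketch (assuming pandas; the function names, the NA_VARIANTS set, and the return format are illustrative, not the actual validator code):

```python
import pandas as pd

# Strings treated as "no value" for a mandatory field (assumed variants of "N/A").
NA_VARIANTS = {"", "n/a", "na", "not applicable"}


def is_missing(value) -> bool:
    """True if a cell holds no usable value: None, NaN, empty string, or an N/A variant."""
    if value is None:
        return True
    if isinstance(value, float) and pd.isna(value):
        return True
    return str(value).strip().lower() in NA_VARIANTS


def missing_mandatory(df: pd.DataFrame, mandatory_fields: list) -> list:
    """Return (row index, field) pairs for mandatory cells with no value."""
    flagged = []
    for field in mandatory_fields:
        if field not in df.columns:
            # e.g. a renamed column such as Bergen's arr_date_hq -> arr_hq_date
            flagged.extend((idx, field) for idx in df.index)
            continue
        for idx, value in df[field].items():
            if is_missing(value):
                flagged.append((idx, field))
    return flagged
```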
Maybe we should discuss the validators (and their rules) in our meeting? |
I will make a list of GH issues to discuss, and this will be one of them |
At present, all data read from the logsheets up to a certain date (as written in a governance file) get into the transformed logsheets (cleaned and fixed) and the ttl files -> the triple store.
Where no information is provided in the logsheet, we do not carry that information into the ttl files (@laurianvm can you confirm this is what I asked you to do?).
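As a sketch of that rule (assuming rdflib; the namespace, predicate names, and example source_mat_id are placeholders, not the real EMO BON vocabulary):

```python
from rdflib import Graph, Literal, Namespace

# Placeholder namespace, not the actual EMO BON one.
EMOBON = Namespace("https://example.org/emobon/")


def add_sample_triples(graph: Graph, source_mat_id: str, record: dict) -> None:
    """Add one logsheet row to the graph, skipping fields with no value."""
    subject = EMOBON[source_mat_id]
    for field, value in record.items():
        if value is None or str(value).strip().lower() in {"", "na", "n/a"}:
            continue  # missing information is simply not carried into the ttl
        graph.add((subject, EMOBON[field], Literal(value)))


g = Graph()
add_sample_triples(g, "sample_001", {"depth": 2.0, "chlorophyll": None})
print(g.serialize(format="turtle"))
```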
I have 2 questions here
Please, people, comment here; we should decide before the end of 2024.