Super mandatory columns #38

kmexter · 2024-11-07T10:29:20Z

At the present, all data read from the logsheets up to a certain date (as written in a governance file) get into the transformed logsheets (cleaned and fixed) and the ttl files -> the triple store
Where no information is provided in the logsheet, we do not carry that information into the ttls files (@laurianvm can you confirm this is what I asked you to do?).

I have 2 questions here

do we actually want to have it that all the mandatory columns ARE copied to the ttl files and the triple store - in other words, if a sample does not have some mandatory information, we put a "not provided" in the ttl file there, rather than not put anything at all?
do we want to make a set of "super mandatory columns" that HAVE to be in the logsheet in order for that row to get into the ttl files? we should not use the currently-defined mandatory columns because sometimes a station does not measure that mandatory value, and if we throw those out then those samples (which, by the way, will be sequenced) will never be published.

Please people comment here and before end 2024 we should decide

cymon · 2024-11-07T12:04:02Z

Just a heads-up: there's a new combined table (with the new source_mat_id's) for all sampling events (incl. measured) data for the 227 samples sent for sequencing in batches 1 and 2. Pretty easy to see which fields the stations consistently fill in...

Combined metadata for batch 1 and 2

Though having said that, not all stations have had samples sent to sequencing...

cpavloud · 2024-11-10T17:37:00Z

do we actually want to have it that all the mandatory columns ARE copied to the ttl files and the triple store - in other words, if a sample does not have some mandatory information, we put a "not provided" in the ttl file there, rather than not put anything at all?

Why would a sample not have mandatory information? I cannot only think of the blanks that are missing mandatory information (such as depth, which is actually non applicable). But other than that I think we should push for having such information from the providers, rather than try to find a way to work around the missing mandatory information issue.
I am not a fun of "not provided"...

do we want to make a set of "super mandatory columns" that HAVE to be in the logsheet in order for that row to get into the ttl files? we should not use the currently-defined mandatory columns because sometimes a station does not measure that mandatory value, and if we throw those out then those samples (which, by the way, will be sequenced) will never be published.

Isn't it more confusing to have "super mandatory", "mandatory" and "optional" columns?
Why can't we just have "mandatory" and "optional" columns?
If a station doesn't measure that mandatory value then it's on them and their samples would not be treated as the rest. I mean, we need to draw a line somewhere and also train people to collect the mandatory information.

not all stations have had samples sent to sequencing...

What do you mean?

kmexter · 2024-11-11T10:05:59Z

I will have to dig them out - or check out cymon’s summary spreadsheet - but i seem to recall that a few stations did not, and never will, measure some mandatory values. Due to circumstances. When i am online again (on phone now) i can see how many these are. It is not many. But you also have some stations will only give measurements much later and for those, what do we do: put “value will be provided later” (not good) “not measured” (noy great) or nothing at all? I thought nothing at all is th cleanest approach, but i dont think i asked u guys what u thought

cymon · 2024-11-13T17:56:28Z

FYI in the "Updated definitions"

Length of wc_fields: 136
Length of ss_fields: 136
Duplicate mandatory fields in soft sediment...
Field samp_store_date appears 2 times
Field tot_depth_water_col appears 2 times
Duplicate optional fields in soft sediment...
Field water_current appears 2 times
Field water_current_method appears 2 times

Water column mandatory fields: ['arr_date_hq', 'arr_date_seq', 'chlorophyll', 'chlorophyll_method', 'collection_date', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'membr_cut', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'project_name', 'replicate', 'samp_collect_device', 'samp_description', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_vol', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'scientific_name', 'sea_subsurf_salinity', 'sea_subsurf_salinity_method', 'sea_subsurf_temp', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_surf_salinity_method', 'sea_surf_temp', 'sea_surf_temp_method', 'ship_date', 'ship_date_seq', 'size_frac', 'size_frac_low', 'size_frac_up', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tax_id', 'time_fi', 'tot_depth_water_col', 'wa_id']
Water column mandatory fields: 68

Soft sediment mandatory fields: ['arr_date_hq', 'arr_date_seq', 'collection_date', 'comm_samp', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'ph', 'ph_method', 'project_name', 'redox_potential', 'redox_potential_method', 'replicate', 'samp_collect_device', 'samp_store_date', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_mass', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'sea_surf_temp', 'sea_surf_temp_method', 'sediment_temp', 'sediment_temp_method', 'ship_date', 'ship_date_seq', 'size_frac_low', 'size_frac_up', 'so_id', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tot_depth_water_col', 'tot_depth_water_col', 'wa_id']
Soft sediment mandatory fields: 64

Common MANDATORY fields in both WC and SS: {'longitude', 'contact_email', 'loc_regional_mrgid', 'contact_name', 'project_name', 'organization_country', 'sea_surf_temp_method', 'ship_date_seq', 'wa_id', 'latitude', 'env_package', 'ENA_accession_number_project', 'loc_loc', 'ENA_accession_number_sample', 'long_store', 'replicate', 'sampling_event', 'failure_comment', 'extra_site_info', 'ship_date', 'failure', 'size_frac_low', 'loc_regional', 'geo_loc_name', 'source_mat_id_orig', 'store_temp_hq', 'source_mat_id', 'store_person', 'samp_store_date', 'investigation_type', 'loc_broad_ocean_mrgid', 'env_broad_biome', 'ENA_accession_number_umbrella', 'sea_surf_temp', 'loc_broad_ocean', 'loc_loc_mrgid', 'sampl_person', 'samp_collect_device', 'obs_id', 'arr_date_seq', 'env_local', 'env_material', 'depth', 'organization_edmoid', 'samp_store_temp', 'size_frac_up', 'samp_mat_process', 'tot_depth_water_col', 'samp_store_loc', 'organization', 'arr_date_hq', 'samp_mat_process_dev', 'collection_date'}
Common MANDATORY fields in both WC and SS: 53

MANDATORY fields only in WC: {'samp_size_vol', 'chlorophyll_method', 'size_frac', 'scientific_name', 'sea_subsurf_temp', 'samp_description', 'sea_subsurf_salinity', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_subsurf_salinity_method', 'chlorophyll', 'time_fi', 'sea_surf_salinity_method', 'membr_cut', 'tax_id'}
MANDATORY fields only in WC: 15

MANDATORY fields only in SS: {'sediment_temp_method', 'sediment_temp', 'redox_potential_method', 'comm_samp', 'ph', 'ph_method', 'samp_size_mass', 'so_id', 'redox_potential'}
MANDATORY fields only in SS: 9

cymon · 2024-11-14T16:56:58Z

So I made some new validators just for the mandatory fields of the water_column and soft_sediment sampling sheets (not the measured yet). The validators don't check for the sanity of the value entered only that a mandatory field has a value.

All stations passed except Bergen: and that was because they had renamed the arr_date_hq field to arr_hq_date (go figure) so they have 150 missing entries: link to issue

The bigger problem is going to be the measured mandatory fields in wc and ss respectively, because some stations have taken no measurements at all. So they might just as well all be made Optional.

cpavloud · 2024-11-14T20:12:49Z

I really don't think they should be optional. I would personally push for keeping them mandatory. And if the stations don't follow the protocol, then there is nothing we can do. Their samples will be not included in the EMO BON release batches.

kmexter · 2024-11-15T09:30:59Z

OK, so if there are any missing mandatory fields, we DO still carry those rows to the cleaned-up logsheets, but we do NOT carry them forward into triples.
BUT, how do we make sure then that those samples really do not get sequenced? I mean, if they are sequenced and get MetaGoflowed and put in the queue to be added to ENA, then we will have omics data with no metadata. I mean, of course we can fix that manually, but we want to avoid that. It is not a big deal for now, but perhaps something to discuss in Dec, to see what we think we should do
Also, perhaps telling the stations this specifically would be useful -> I am sure they are not really aware of the fact that without mandatory measurements, their samples may as well end up in the bin?

cpavloud · 2024-11-15T09:36:16Z

If they are missing mandatory columns, then they don't get to ENA.

I think Nicolas is pretty ready to mention something like this to the stations.

kmexter · 2024-11-15T09:42:33Z

:-D Grand, we can discuss this in Dec, to see if we need to implement automatic checks, or or they have to be manual.
So I think the conculsions is that we dont need super-mandatory columns - all those that are currently mandatory are also super-mandatory.
And moreover that if any mandatory values are missing then these data do not get turned into triples, and are not sent to genoscope for sequencing (tho that may be difficult as some mandatory values do come late, potentially after the batch has been sent off)

cymon · 2024-11-15T10:28:14Z

So I updated the validators to flag any mandatory value that was None (ie no value), NaN (ie no value float), "" (empty string), "N/A" (or variations thereof).

The results for just the "sampling" sheets of water_column and soft_sediment are here:
link

It's clearly not very useful, as recorders, and HQ, add mandatory fields at a later date, so not all row's with a source_mat_id can expect to have all mandatory fields filled. Also some fields are only mandatory if another field is positive: so failure and failure_comment are both flagged mandatory but I would guess failure_comment is only mandatory if failure is True. Hence failure_comment should be Optional.

I guess this would be more useful to check a batch of sampling events were ready before the final publication - would that be before sending to ENA?

cpavloud · 2024-11-15T11:15:40Z

Maybe we should discuss the validators (and their rules) in our meeting?

kmexter · 2024-11-15T16:15:30Z

I will make a list of GH issues to discuss, and this will be one of them

kmexter assigned cymon, isanti, cpavloud, kmexter and melanthia Nov 7, 2024

kmexter mentioned this issue Nov 7, 2024

VB_IMEV: water column sampling sheets has valid source_mat_ids without any valid data #37

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Super mandatory columns #38

Super mandatory columns #38

kmexter commented Nov 7, 2024

cymon commented Nov 7, 2024

cpavloud commented Nov 10, 2024 •

edited

Loading

kmexter commented Nov 11, 2024

cymon commented Nov 13, 2024

cymon commented Nov 14, 2024

cpavloud commented Nov 14, 2024

kmexter commented Nov 15, 2024

cpavloud commented Nov 15, 2024

kmexter commented Nov 15, 2024

cymon commented Nov 15, 2024

cpavloud commented Nov 15, 2024

kmexter commented Nov 15, 2024

Super mandatory columns #38

Super mandatory columns #38

Comments

kmexter commented Nov 7, 2024

cymon commented Nov 7, 2024

cpavloud commented Nov 10, 2024 • edited Loading

kmexter commented Nov 11, 2024

cymon commented Nov 13, 2024

cymon commented Nov 14, 2024

cpavloud commented Nov 14, 2024

kmexter commented Nov 15, 2024

cpavloud commented Nov 15, 2024

kmexter commented Nov 15, 2024

cymon commented Nov 15, 2024

cpavloud commented Nov 15, 2024

kmexter commented Nov 15, 2024

cpavloud commented Nov 10, 2024 •

edited

Loading