Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Super mandatory columns #38

Open
kmexter opened this issue Nov 7, 2024 · 12 comments
Open

Super mandatory columns #38

kmexter opened this issue Nov 7, 2024 · 12 comments
Assignees

Comments

@kmexter
Copy link
Contributor

kmexter commented Nov 7, 2024

At the present, all data read from the logsheets up to a certain date (as written in a governance file) get into the transformed logsheets (cleaned and fixed) and the ttl files -> the triple store
Where no information is provided in the logsheet, we do not carry that information into the ttls files (@laurianvm can you confirm this is what I asked you to do?).

I have 2 questions here

  1. do we actually want to have it that all the mandatory columns ARE copied to the ttl files and the triple store - in other words, if a sample does not have some mandatory information, we put a "not provided" in the ttl file there, rather than not put anything at all?
  2. do we want to make a set of "super mandatory columns" that HAVE to be in the logsheet in order for that row to get into the ttl files? we should not use the currently-defined mandatory columns because sometimes a station does not measure that mandatory value, and if we throw those out then those samples (which, by the way, will be sequenced) will never be published.

Please people comment here and before end 2024 we should decide

@cymon
Copy link

cymon commented Nov 7, 2024

Just a heads-up: there's a new combined table (with the new source_mat_id's) for all sampling events (incl. measured) data for the 227 samples sent for sequencing in batches 1 and 2. Pretty easy to see which fields the stations consistently fill in...

Combined metadata for batch 1 and 2

Though having said that, not all stations have had samples sent to sequencing...

@cpavloud
Copy link

cpavloud commented Nov 10, 2024

do we actually want to have it that all the mandatory columns ARE copied to the ttl files and the triple store - in other words, if a sample does not have some mandatory information, we put a "not provided" in the ttl file there, rather than not put anything at all?

Why would a sample not have mandatory information? I cannot only think of the blanks that are missing mandatory information (such as depth, which is actually non applicable). But other than that I think we should push for having such information from the providers, rather than try to find a way to work around the missing mandatory information issue.
I am not a fun of "not provided"...

do we want to make a set of "super mandatory columns" that HAVE to be in the logsheet in order for that row to get into the ttl files? we should not use the currently-defined mandatory columns because sometimes a station does not measure that mandatory value, and if we throw those out then those samples (which, by the way, will be sequenced) will never be published.

Isn't it more confusing to have "super mandatory", "mandatory" and "optional" columns?
Why can't we just have "mandatory" and "optional" columns?
If a station doesn't measure that mandatory value then it's on them and their samples would not be treated as the rest. I mean, we need to draw a line somewhere and also train people to collect the mandatory information.

not all stations have had samples sent to sequencing...

What do you mean?

@kmexter
Copy link
Contributor Author

kmexter commented Nov 11, 2024

I will have to dig them out - or check out cymon’s summary spreadsheet - but i seem to recall that a few stations did not, and never will, measure some mandatory values. Due to circumstances. When i am online again (on phone now) i can see how many these are. It is not many. But you also have some stations will only give measurements much later and for those, what do we do: put “value will be provided later” (not good) “not measured” (noy great) or nothing at all? I thought nothing at all is th cleanest approach, but i dont think i asked u guys what u thought

@cymon
Copy link

cymon commented Nov 13, 2024

FYI in the "Updated definitions"

Length of wc_fields: 136
Length of ss_fields: 136
Duplicate mandatory fields in soft sediment...
Field samp_store_date appears 2 times
Field tot_depth_water_col appears 2 times
Duplicate optional fields in soft sediment...
Field water_current appears 2 times
Field water_current_method appears 2 times

Water column mandatory fields: ['arr_date_hq', 'arr_date_seq', 'chlorophyll', 'chlorophyll_method', 'collection_date', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'membr_cut', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'project_name', 'replicate', 'samp_collect_device', 'samp_description', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_vol', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'scientific_name', 'sea_subsurf_salinity', 'sea_subsurf_salinity_method', 'sea_subsurf_temp', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_surf_salinity_method', 'sea_surf_temp', 'sea_surf_temp_method', 'ship_date', 'ship_date_seq', 'size_frac', 'size_frac_low', 'size_frac_up', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tax_id', 'time_fi', 'tot_depth_water_col', 'wa_id']
Water column mandatory fields: 68

Soft sediment mandatory fields: ['arr_date_hq', 'arr_date_seq', 'collection_date', 'comm_samp', 'contact_email', 'contact_name', 'depth', 'ENA_accession_number_project', 'ENA_accession_number_sample', 'ENA_accession_number_umbrella', 'env_broad_biome', 'env_local', 'env_material', 'env_package', 'extra_site_info', 'failure', 'failure_comment', 'geo_loc_name', 'investigation_type', 'latitude', 'loc_broad_ocean', 'loc_broad_ocean_mrgid', 'loc_loc', 'loc_loc_mrgid', 'loc_regional', 'loc_regional_mrgid', 'long_store', 'longitude', 'obs_id', 'organization', 'organization_country', 'organization_edmoid', 'ph', 'ph_method', 'project_name', 'redox_potential', 'redox_potential_method', 'replicate', 'samp_collect_device', 'samp_store_date', 'samp_mat_process', 'samp_mat_process_dev', 'samp_size_mass', 'samp_store_date', 'samp_store_loc', 'samp_store_temp', 'sampl_person', 'sampling_event', 'sea_surf_temp', 'sea_surf_temp_method', 'sediment_temp', 'sediment_temp_method', 'ship_date', 'ship_date_seq', 'size_frac_low', 'size_frac_up', 'so_id', 'source_mat_id_orig', 'source_mat_id', 'store_person', 'store_temp_hq', 'tot_depth_water_col', 'tot_depth_water_col', 'wa_id']
Soft sediment mandatory fields: 64

Common MANDATORY fields in both WC and SS: {'longitude', 'contact_email', 'loc_regional_mrgid', 'contact_name', 'project_name', 'organization_country', 'sea_surf_temp_method', 'ship_date_seq', 'wa_id', 'latitude', 'env_package', 'ENA_accession_number_project', 'loc_loc', 'ENA_accession_number_sample', 'long_store', 'replicate', 'sampling_event', 'failure_comment', 'extra_site_info', 'ship_date', 'failure', 'size_frac_low', 'loc_regional', 'geo_loc_name', 'source_mat_id_orig', 'store_temp_hq', 'source_mat_id', 'store_person', 'samp_store_date', 'investigation_type', 'loc_broad_ocean_mrgid', 'env_broad_biome', 'ENA_accession_number_umbrella', 'sea_surf_temp', 'loc_broad_ocean', 'loc_loc_mrgid', 'sampl_person', 'samp_collect_device', 'obs_id', 'arr_date_seq', 'env_local', 'env_material', 'depth', 'organization_edmoid', 'samp_store_temp', 'size_frac_up', 'samp_mat_process', 'tot_depth_water_col', 'samp_store_loc', 'organization', 'arr_date_hq', 'samp_mat_process_dev', 'collection_date'}
Common MANDATORY fields in both WC and SS: 53

MANDATORY fields only in WC: {'samp_size_vol', 'chlorophyll_method', 'size_frac', 'scientific_name', 'sea_subsurf_temp', 'samp_description', 'sea_subsurf_salinity', 'sea_subsurf_temp_method', 'sea_surf_salinity', 'sea_subsurf_salinity_method', 'chlorophyll', 'time_fi', 'sea_surf_salinity_method', 'membr_cut', 'tax_id'}
MANDATORY fields only in WC: 15

MANDATORY fields only in SS: {'sediment_temp_method', 'sediment_temp', 'redox_potential_method', 'comm_samp', 'ph', 'ph_method', 'samp_size_mass', 'so_id', 'redox_potential'}
MANDATORY fields only in SS: 9

@cymon
Copy link

cymon commented Nov 14, 2024

So I made some new validators just for the mandatory fields of the water_column and soft_sediment sampling sheets (not the measured yet). The validators don't check for the sanity of the value entered only that a mandatory field has a value.

All stations passed except Bergen: and that was because they had renamed the arr_date_hq field to arr_hq_date (go figure) so they have 150 missing entries: link to issue

The bigger problem is going to be the measured mandatory fields in wc and ss respectively, because some stations have taken no measurements at all. So they might just as well all be made Optional.

@cpavloud
Copy link

I really don't think they should be optional. I would personally push for keeping them mandatory. And if the stations don't follow the protocol, then there is nothing we can do. Their samples will be not included in the EMO BON release batches.

@kmexter
Copy link
Contributor Author

kmexter commented Nov 15, 2024

OK, so if there are any missing mandatory fields, we DO still carry those rows to the cleaned-up logsheets, but we do NOT carry them forward into triples.
BUT, how do we make sure then that those samples really do not get sequenced? I mean, if they are sequenced and get MetaGoflowed and put in the queue to be added to ENA, then we will have omics data with no metadata. I mean, of course we can fix that manually, but we want to avoid that. It is not a big deal for now, but perhaps something to discuss in Dec, to see what we think we should do
Also, perhaps telling the stations this specifically would be useful -> I am sure they are not really aware of the fact that without mandatory measurements, their samples may as well end up in the bin?

@cpavloud
Copy link

If they are missing mandatory columns, then they don't get to ENA.

I think Nicolas is pretty ready to mention something like this to the stations.

@kmexter
Copy link
Contributor Author

kmexter commented Nov 15, 2024

:-D Grand, we can discuss this in Dec, to see if we need to implement automatic checks, or or they have to be manual.
So I think the conculsions is that we dont need super-mandatory columns - all those that are currently mandatory are also super-mandatory.
And moreover that if any mandatory values are missing then these data do not get turned into triples, and are not sent to genoscope for sequencing (tho that may be difficult as some mandatory values do come late, potentially after the batch has been sent off)

@cymon
Copy link

cymon commented Nov 15, 2024

So I updated the validators to flag any mandatory value that was None (ie no value), NaN (ie no value float), "" (empty string), "N/A" (or variations thereof).

The results for just the "sampling" sheets of water_column and soft_sediment are here:
link

It's clearly not very useful, as recorders, and HQ, add mandatory fields at a later date, so not all row's with a source_mat_id can expect to have all mandatory fields filled. Also some fields are only mandatory if another field is positive: so failure and failure_comment are both flagged mandatory but I would guess failure_comment is only mandatory if failure is True. Hence failure_comment should be Optional.

I guess this would be more useful to check a batch of sampling events were ready before the final publication - would that be before sending to ENA?

@cpavloud
Copy link

Maybe we should discuss the validators (and their rules) in our meeting?

@kmexter
Copy link
Contributor Author

kmexter commented Nov 15, 2024

I will make a list of GH issues to discuss, and this will be one of them

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants