Observatory logsheet errors in the googlesheets to be fixed by HQ #13

kmexter · 2024-08-23T06:57:33Z

I was going over all the googlesheets (manually!) to see which ones we could manually harvest and turn into triples (bypassing Bram's QC code until it is fixed, and perhaps using Cymon's output for those identified observatories to create the triples from).

As a result of that, I have comments about QC errors for all stations, and rather than raising separate issues for each observatory, I am putting them here. These errors REALLY need to be fixed by the stations themselves, but EMO BON HQ (whoever that is) will need to coordinate that. Tell them this is urgent! It took me a whole #$%@#$ day to investigate this, the least they can do is fix their logsheets (yes, I am VERY annoyed!).

(Christina and Cymon - assigning you more so that you see this issue rather than because I expect you to do anything, but please do read all of this text, because there are some questions in there that I would like you to think about/answer.)

In some cases, where I say there are missing values (for mandatory columns), they should ensure that they either put in a value, type in "expected MM-YY", or NA if there will never be a value.
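A minimal sketch of how the three accepted forms for a mandatory cell could be checked automatically. The function names are hypothetical, and the exact spelling of the "expected MM-YY" placeholder is an assumption based on the wording above.

```python
import re

# Assumption: the placeholder is written exactly as "expected MM-YY",
# e.g. "expected 03-25".
EXPECTED_RE = re.compile(r"^expected (0[1-9]|1[0-2])-\d{2}$")

def classify_mandatory_cell(cell):
    """Return 'missing', 'na', 'expected' or 'value' for one mandatory cell."""
    text = "" if cell is None else str(cell).strip()
    if text == "":
        return "missing"
    if text == "NA":
        return "na"
    if EXPECTED_RE.match(text):
        return "expected"
    return "value"

def missing_rows(column_values, first_row=2):
    """List the spreadsheet row numbers whose mandatory cell is empty.

    first_row=2 assumes one header row above the data.
    """
    return [first_row + i for i, cell in enumerate(column_values)
            if classify_mandatory_cell(cell) == "missing"]
```

Only the rows reported by `missing_rows` would need to go back to the station; cells holding NA or an "expected" placeholder are treated as deliberately filled in.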

Note: for tidal stages it is not possible to select "expected" or "NA" as this column is created as a drop down. I strongly suggest that HCMR go to all logsheets and add 2 more options to all of them: "yet to be measured" and "not available", as there are a few stations that do seem to need these options for this column. Please, can EMOBON HQ confirm here when you have done this?

Sediment
UMF UmU
Sampling tab

missing values in col AE
are the various colours for themselves only or for us to understand - it is ok to use colour if only for yourself
no size frac up and low for many rows > if there is to be no value here (as there was no sequential filtering) please enter NA
no tidal stages

Measured tab

missing redox (should be NA?)

ROSKOGO SBR
Sampling tab

are the various colours for themselves only or for us to understand - it is ok to use colour if only for yourself
missing depths
missing size frac up and low -> if there is to be no value here (as there was no sequential filtering) please enter NA
the names of the sampling and storing people are not complete - we cannot acknowledge anyone without a first name and surname
some missing tides and depths
some illegal env_material values - need ENVO IDs, so find one for "swab" please

Measured tab

illegal values in col B - replace irrelevant with NA
missing ph, soil temp, and redox

STHVN MSS
Sampling

missing size fractions
illegal date format in col X for some rows
missing NA where there are no values for mandatory columns (O and W)
no investigation type values

Measured tab

no mandatory values entered

OOB Banylus
Sampling

some missing values on col R, T, U, V

Measured

1 missing value and I put in a comment in the googlesheet there

EMT21 UVIGO
Sampling

illegal source_mat_id -> this is something we may need to discuss. They have 2 different types of samples, with different filter sizes, for the same events, and this is something no other station has. As a result, using the source mat id equation as for the other stations, you end up with sample IDs that are repeated, and this is a big nono. So someone changed the equation to add on the filter size. In and of itself, this is OK, but this sample id will then not match those of the other stations and this may well cause lots of problems later in the automatic processing of emo bon data. Can you all look at this please - if we include the size_frac_up as part of the source_mat_id for these samples, then we need to do that for all the samples. It is easy to fix, but we need to decide if we will do that.

Water

EMT21 UVIGO
Sampling

A bunch of source_mat_id in the final column will not form because the value in col R is NA - rows 187-191. Please fill in the correct values in R. See comment below for VB IMEC also.
HQ then have to copy the entire sampling sheet (rename the current one to "bla" and then rename the new one to "sampling" and delete the "bla" one). This is the only way that you will have permission to create the new source_mat_ids for rows below 189 (Ioulia protected this column so only she can do it). Please do this for ALL stations (water and sediment) as we really need these IDs before we can do anything with the data

VB IMEC
Sampling

There are also many NAs in column R which will make the source_mat_id funny, and one lot of 9999 as for EMT21. It actually is not illegal, however, do we want this? If HQ are OK with this formulation of these sieve fractions, I am OK with it also. UPDATE: now that col R has been changed to col Q this is not a problem any more.
depth is given as a range, and that range is quite wide - 0-75 for example. Is this OK for us? I guess it is what it is and we just deal with it (these values will be carried thru to the transformed CSV files as a range, and in the triples will be a max and min, while those depths that are a single value will be recorded in the triples as a single value). This is something that the FE VRE code will have to deal with, if we think depth is a value people will want to sort on (@cymon, this comment is for you)
There are samples with the same ID. The sample in row 361 has the same id as the sample in row 365. This is because they differ on depth, and depth is not a value in the sample ID. BIG PROBLEM that we have to solve. Note: if we want to include depth in the source_mat_id, we have to deal with the fact that some depths are ranges, and that will absolutely not work in the ID.
Also issue with drag-n-drop of the source_mat_id final column: another reason why for all stations we need this sheet recreated so we can add values to this column
There are completely blank rows. PLEASE REMOVE
Row 803 has an illegal value in col A
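To illustrate the depth point above (a range becomes a max and min in the triples, a single value stays a single value), here is a minimal sketch. `parse_depth` and `is_single_depth` are hypothetical names, not part of any existing EMO BON code, and the sketch assumes depths are non-negative and that ranges are written as `low-high`.

```python
def parse_depth(cell):
    """Return (min_depth, max_depth) as floats; both are equal for a single value.

    Raises ValueError for anything that is neither a number nor 'low-high'
    (for example 'NA' or an empty cell).
    """
    text = str(cell).strip()
    if "-" in text:
        # Assumption: depths are non-negative, so '-' only ever separates a range.
        low_text, high_text = text.split("-", 1)
        low, high = float(low_text), float(high_text)
    else:
        low = high = float(text)
    if low > high:
        raise ValueError(f"depth range is reversed: {text!r}")
    return low, high

def is_single_depth(cell):
    """True when the cell holds one depth rather than a range."""
    low, high = parse_depth(cell)
    return low == high
```

With this, `parse_depth("0-75")` gives a separate minimum and maximum, while anything that sorts on depth would still have to decide which of the two to use for the range rows.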

Measured

no measurements at all?

UMF_UmU
Sampling

some missing depths

STHVN MSS
No value at all....is this station "dead"? I see that we have not harvested this into GH, so I assume this station never happened in the end. In which case, perhaps remove the googlesheet?

ROSKOGO SBR
Sampling

missing sample_store_location
missing size_frac_up in row 42 - perhaps the value in col Q should actually be in col R? (I think so and if SBR do not respond, I suggest HQ just make the change themselves). Also for row 52, 62, and possibly more. Please identify all of these and correct them
Question: but as this is a blank, should these size_frac_up and low not all be NA?
xxx is not a valid orcid (col J). If there is none, put NA or just leave blank.
missing tidal_stages

Measured

no measurements at all ?

PiEGetzo UPV/EHU
Sampling

missing size fractions from row 25 - this affects the source_mat_id as the size_frac_up should be in there. As a result there are rows with the same source_mat_id - e.g. 25 and 27. This is a NONO. Need a value in column R. The size_frac_low is given as >20, which is maybe why size_frac_up is blank? But how can a lower limit be ">20"? That is impossible - the lower limit can be 20 and there can be an upper limit of NA or 9999...I think that @cpavloud should understand this better than I do, because you pointed out that the low and up should be swapped. Perhaps in doing that (if you have done that already - I don't know) the source_mat_id got messed up?
there are two blanks for the same sample - row 29 and 30 for example. These need to be indicated as blank1 and blank2 in column L. Please identify all of these and fix them
NOTE: I know that @cpavloud wanted to change the way that repeated blanks were recorded - not blank1 and blank2 but something else. I think I agreed with her BUT someone has to then make the necessary changes in the logsheets. @cpavloud can you please raise the necessary issue(s) on HQ as I don't have the time to do this work

NRMCB SZN
Sampling

NAs in size_frac_up which affect the way the source_mat_id looks AND is illegal as it means that the id in row 54 is the same as the id in row 59 for example. Thought for all (including @cymon and @cpavloud): maybe we need to rethink the way the source_mat_ids are created????

MBAL4 MBA
Observatory

I don't think the lat and long are specific enough?

Sampling
again missing size_frac_up leading to odd-looking source_mat_id - see e.g. row 10. Can the value be added to col R for row 10, and for all blanks in this sheet? BUT: this is a blank, note, and as I asked above, I wondered if these needed size_fracs at all?

Measured

most rows are missing mandatory values in col B. Are these expected to come or should they be NA (i.e. will never come)?

LMO LnU
Sampling

again some NAs in the size_frac_up for blank samples making the source mat id look odd, but no repeated IDs as a result.
note that the original ID (col A) has 3 in the name while the actual one (col AM) has 0.2 in the name, I think because these size fraction cols were swapped. Again, not illegal, just potentially confusing. This is for all rows

Measured

some missing values in col B

IUIEilat1 UIU
Observatory

not sure the lat is specific enough. If it is supposed to be 29.5000 exactly, then it should be entered as '29.50000 (otherwise googlesheets stupidly remove the trailing 0s)

FYI I think the UIU in the spreadsheet name should be IUI

HCMR-1-UBPC HCMR
Measured

some missing col A values - use NA or "expected MM-YY"

ESC68N UiT
Measured

for the rows with missing values, please do not write a comment in the parameter columns - only in the method columns. Put NA in there if there will be no value - cols D,F,H,J

DOORS BlackSea
No data - should this logsheet be removed from the drive?

BPNS Belgium
Sampling
again NAs in size_frac_up but here it is a problem as it makes the source_mat_ids the same - row 13 and 18 for example. This is a NONO. Perhaps we should use size_frac_low in the sample id?

BERGEN UiB
Sampling

looks good, but here we again need the tab to be re-created so we can drag-drop in column AM in order to create the new source_mat_ids, or else we will lose those data in the triples

AAOT CNR ISMAR
Sampling

see comments in col R, potentially wrong values in there - check the entire columns Q and R - for example for row 172 the ID in col A says that the filter size is 10mu, but the values in Q and R are 20 and 2000, which do not match 10 at all, and apparently 2000 is an unexpectedly high value (according to the comment in cell 12R).

Measured

missing values in column O, AN, AP, BD, BL, BR

kmexter · 2024-08-23T09:19:48Z

In the end
The stations that I think have logsheets that we can turn into triples (semi-manually) for sediment are
CCMAR, OOB, HCMR, Belgium, NRMCB; and for water are UMF-umU, ROSKOGO SBR if HQ fix issue 3 in sampling (which I think they can do without asking SBR), RFormosa CCMAR, Plenzia UPV/EHU, OSD74 CIIMAR, possibly MBAL4 MBA, probably LMO LnU, IUIEilat1 UIU, HCMR-1-UBPC, ESC68N UiT if someone fixes the NA in the rows with comments, BERGEN UiB except the latest values which have no id. While some of these do have missing values, they do not appear to have wrong values (at least, as far as creating triples is concerned).

There is clearly something bad about having col R in the water logsheets, sampling tab, being part of the source mat id, as often that column (size_frac_up) is missing values or has NA in there. Sometimes this is not a problem (the ids look odd but are unique) but for many logsheets it is a problem as it creates duplicated IDs - and each row MUST have a unique ID. So that needs looking into.
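The uniqueness requirement can be checked mechanically rather than by eye. Below is a minimal sketch, assuming each sampling tab has been read into (row number, record) pairs with a `source_mat_id` key; the function name and the input shape are hypothetical.

```python
from collections import defaultdict

def find_duplicate_ids(rows, id_field="source_mat_id"):
    """Map each repeated ID to the list of spreadsheet rows that carry it.

    `rows` is a list of (row_number, record) pairs, where record is a dict
    holding one logsheet row. Rows with an empty ID are ignored here; they
    are a separate problem (the ID could not be formed at all).
    """
    seen = defaultdict(list)
    for row_number, record in rows:
        sample_id = str(record.get(id_field, "")).strip()
        if sample_id:
            seen[sample_id].append(row_number)
    return {sid: nums for sid, nums in seen.items() if len(nums) > 1}
```

Running this per station would list exactly which rows collide, which could help when deciding whether size_frac_low, depth, or something else should go into the ID.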

For sediment, EMT21 UVIGO has illegal source_mat_ids but these are necessary, so we need to discuss how to change things to accommodate that.

Similar problem for water for VB IMEC. Different depths of sampling are creating non-unique IDs

cymon · 2024-08-29T11:53:18Z

(OK, I'm just adding this here rather than creating a new Issue.)

After the observatory 'sampling' sheets were updated and 'new_sampling' sheets renamed 'sampling' (this occurred on 28th August), there are only 2 'source_material_id's in the run_information sheets of Batch 1 & 2 that do not have matching 'source_mat_id's in the observatory sample sheets:

Missing source_mat_id is EMOBON_HCMR-1_Wa_210917_3um_blank row 72 batch 1
close matches are:
EMOBON_HCMR-1_Wa_210917_3um_blank2
EMOBON_HCMR-1_Wa_210917_3um_blank1
EMOBON_HCMR-1_Wa_210917_0.2um_blank2

Missing source_mat_id is EMOBON_HCMR-1_Wa_210917_0.2um_blank row 75 batch 1
close matches are
EMOBON_HCMR-1_Wa_210917_0.2um_blank2
EMOBON_HCMR-1_Wa_210917_0.2um_blank1
EMOBON_HCMR-1_Wa_230217_0.2um_blank2

I am going to assume that they match the "blank1" but this needs to be corrected in the 'replicate' field of the 'sampling' sheets
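For reference, a report of missing IDs with close matches like the one above can be produced with Python's standard `difflib` module. This is only a sketch of one way to do it and is not necessarily how the lists above were generated; `report_unmatched` is a hypothetical name.

```python
import difflib

def report_unmatched(run_ids, sheet_ids, n=3):
    """For each run_information ID absent from the sampling sheets,
    return up to n of the most similar sampling-sheet IDs."""
    sheet_set = set(sheet_ids)
    report = {}
    for run_id in run_ids:
        if run_id not in sheet_set:
            report[run_id] = difflib.get_close_matches(run_id, sheet_ids, n=n)
    return report

# IDs taken from the comment above.
sheet_ids = [
    "EMOBON_HCMR-1_Wa_210917_3um_blank1",
    "EMOBON_HCMR-1_Wa_210917_3um_blank2",
    "EMOBON_HCMR-1_Wa_210917_0.2um_blank2",
]
report = report_unmatched(["EMOBON_HCMR-1_Wa_210917_3um_blank"], sheet_ids)
```

`get_close_matches` ranks candidates by string similarity only, so it cannot say whether "blank" was meant to be "blank1" or "blank2"; that still has to be settled in the 'replicate' field.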

Also notice how the autoformatted "source_mat_id" in the sampling sheet, e.g. "EMOBON_HCMR-1_Wa_210628_3um_blank1" becomes "EMOBON_HCMR-1_Wa_210628_3um_blank" in the "measured" sheet even though it is copied from the correct cell in the "sampling" sheet - what is going on?
https://docs.google.com/spreadsheets/d/13DcVK2mzSxMJoFydSBaIMmj7Td1_JapEvcY2bmZTyLc/edit?gid=1225064690#gid=1225064690

cpavloud · 2024-09-02T08:04:53Z

I think I had created issues for these things that @kmexter is mentioning, individually in each observatory's repo and @melinalou should have corrected (at least most of them).
In short:
depth should not be a range, this will not work for the ENA submission, it needs to be a number.
size_frac_low being >20 is also wrong. It should just be 20. In this case there is no upper limit and it is normal because it is sampling using a mesh and not a filter (different SOP).

kmexter · 2024-09-02T08:42:45Z

clearly there is still some fixing to do then, as there are ranges and > in many logsheets.

melinalou · 2024-09-02T08:44:31Z

I can check this and maybe delete the ranges in depth and size_frac_low and up? or in every column with numbers?

kmexter · 2024-09-02T08:45:27Z

I think for the size_frac it should be clear what to do, but for the ranges - we cannot delete them as we need a value in there, so the observatory has to choose a value

melinalou · 2024-09-02T08:45:59Z

Yes that's right! I will change only the size_frac.

melinalou · 2024-09-02T09:37:38Z

Done.

cpavloud · 2024-09-02T10:06:00Z

size_frac should be a range
size_frac_low and size_frac_up should be numbers

also, @cymon you could add a QC step in your code to check if the values are ok, since size_frac_low should always be lower than size_frac_up (if both numbers exist)
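A minimal sketch of such a cross-field check. The function names are hypothetical; blank and NA cells are treated as "no number", so the comparison only runs when both numbers exist.

```python
def to_number(cell):
    """Return the cell as a float, or None for blank/NA cells."""
    text = "" if cell is None else str(cell).strip()
    if text in ("", "NA"):
        return None
    return float(text)  # raises ValueError for entries such as '>20'

def check_size_fractions(low_cell, up_cell):
    """Return an error message, or None when the pair passes the check."""
    try:
        low, up = to_number(low_cell), to_number(up_cell)
    except ValueError:
        return f"not a number: low={low_cell!r}, up={up_cell!r}"
    if low is not None and up is not None and low >= up:
        return f"size_frac_low ({low}) is not lower than size_frac_up ({up})"
    return None
```

A mesh sample with a lower limit of 20 and no upper limit passes, while a swapped pair or a ">20" entry is reported.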

cymon · 2024-09-02T10:53:55Z

On Mon, 2 Sept 2024 at 11:06, Christina Pavloudi ***@***.***> wrote:

> *size_frac* should be a range

size_frac is a range, e.g. 3-200, int dash int, which means it's a string type. Do you want to throw a validation error if the range is given with a float, e.g. 0.2-3? This is effectively the same question as below... BTW size_frac would be better if auto-formatted on the sheets...

> *size_frac_low* and *size_frac_up* should be numbers

The current Updated Definitions define these as integers, but they are often mis-specified as floats: so do you want to throw a validation error rather than keep the original floats? https://github.com/emo-bon/emo-bon-data-validation/blob/main/Batch1and2_combined_logsheets_2024-09-02.csv

> also, @cymon you could add a QC step in your code to check if the values are ok, since size_frac_low should always be lower than size_frac_up (if both numbers exist)

Yes, you can validate values based on the values in other fields: I'll add this check.

kmexter · 2024-09-02T11:39:05Z

size_frac can be read just as a string. it is size_frac_up and _low that we actually use for ena and triples
note: if you did not know already, the data types are also specified in https://github.com/emo-bon/observatory-profile/blob/main/logsheet_schema_extended.csv -> col3
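If size_frac is read as a string, the int-versus-float question raised above comes down to which pattern the string has to match. A minimal sketch with both options follows; the function name and the `allow_floats` switch are hypothetical, and which behaviour is wanted is still the open question in this thread.

```python
import re

INT_RANGE = re.compile(r"^(\d+)-(\d+)$")                      # e.g. 3-200
NUM_RANGE = re.compile(r"^(\d+(?:\.\d+)?)-(\d+(?:\.\d+)?)$")  # also 0.2-3

def parse_size_frac(text, allow_floats=True):
    """Split a size_frac string into (low, up) floats.

    Raises ValueError when the string is not a range of the allowed kind.
    """
    pattern = NUM_RANGE if allow_floats else INT_RANGE
    match = pattern.match(text.strip())
    if match is None:
        raise ValueError(f"size_frac is not a valid range: {text!r}")
    return float(match.group(1)), float(match.group(2))
```

With `allow_floats=False` a value such as 0.2-3 raises a validation error; with the default it is accepted and the original floats are kept.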

cymon · 2024-09-02T11:55:35Z

On Mon, 2 Sept 2024 at 12:39, Katrina Exter ***@***.***> wrote:

> size_frac can be read just as a string. it is size_frac_up and _low that we actually use for ena and triples
> note: if you did not know already, the data types are also specified in https://github.com/emo-bon/observatory-profile/blob/main/logsheet_schema_extended.csv -> col3

No, I didn't know that document existed... it'll take types directly from there rather than trying to interpret the examples in the Updated Definitions (which sometimes look wrong - but this may just be the way GoogleSheets is displaying the data).

kmexter assigned cymon, cpavloud, melinalou and melanthia Aug 23, 2024

kmexter mentioned this issue Aug 26, 2024

Swap size_frac_low and up emo-bon/governance-data#21

Closed
