Observatory logsheet errors in the googlesheets to be fixed by HQ #13

Open
kmexter opened this issue Aug 23, 2024 · 12 comments

@kmexter
Contributor

kmexter commented Aug 23, 2024

I was going over all the googlesheets (manually!) to see which ones we could manually harvest and turn into triples (bypassing Bram's QC code until it is fixed, and perhaps using Cymon's output for those identified observatories to create the triples from).

As a result of that, I have comments about QC errors for all stations, and rather than raising separate issues for each observatory, I am putting them here. These errors REALLY need to be fixed by the stations themselves, but EMO BON HQ (whoever that is) will need to coordinate that. Tell them this is urgent! It took me a whole #$%@#$ day to investigate this, the least they can do is fix their logsheets (yes, I am VERY annoyed!).

(Christina and Cymon - assigning you more so that you see this issue rather than because I expect you to do anything. But please, do read all of this text, because there are some questions in there that I would like you to think about/answer.)

In some cases, where I say there are missing values (for mandatory columns), they should ensure that they either put in a value, type in "expected MM-YY", or NA if there will never be a value.
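A minimal sketch of how this rule could be checked automatically. It assumes the column has already been read into a list of cell strings and that data rows start at spreadsheet row 2. The function names and the exact "expected MM-YY" pattern are illustrative assumptions, not part of any existing QC code.

```python
import re

# Hypothetical pattern for the "expected MM-YY" convention described above.
EXPECTED_RE = re.compile(r"^expected (0[1-9]|1[0-2])-\d{2}$", re.IGNORECASE)


def classify_mandatory_cell(cell):
    """Return 'missing', 'not_available', 'expected', or 'value'."""
    text = "" if cell is None else str(cell).strip()
    if text == "":
        return "missing"        # blank cell: the error case reported in this issue
    if text.upper() == "NA":
        return "not_available"  # there will never be a value
    if EXPECTED_RE.match(text):
        return "expected"       # a value is promised for month MM of year YY
    return "value"              # anything else is treated as an entered value


def missing_rows(column_values, first_row=2):
    """Spreadsheet row numbers of blank cells (assumes data starts at row 2)."""
    return [
        first_row + offset
        for offset, cell in enumerate(column_values)
        if classify_mandatory_cell(cell) == "missing"
    ]
```

Calling `missing_rows` on one mandatory column gives the row numbers that a station would still need to fill in.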

Note: for tidal stages it is not possible to select "expected" or "NA" as this column is created as a drop down. I strongly suggest that HCMR go to all logsheets and add 2 more options to all of them: "yet to be measured" and "not available", as there are a few stations that do seem to need these options for this column. Please, can EMOBON HQ confirm here when you have done this?

Sediment
UMF UmU
Sampling tab

  1. missing values in col AE
  2. are the various colours for themselves only or for us to understand - it is ok to use colour if only for yourself
  3. no size frac up and low for many rows > if there is to be no value here (as there was no sequential filtering) please enter NA
  4. no tidal stages

Measured tab

  1. missing redox (should be NA?)

ROSKOGO SBR
Sampling tab

  1. are the various colours for themselves only or for us to understand - it is ok to use colour if only for yourself
  2. missing depths
  3. missing size frac up and low -> if there is to be no value here (as there was no sequential filtering) please enter NA
  4. the names of the sampling and storing people are not complete - we cannot acknowledge anyone without a first name and surname
  5. some missing tides and depths
  6. some illegal env_material values - need ENVO IDs, so find one for "swab" please

Measured tab

  1. illegal values in col B - replace irrelevant with NA
  2. missing ph, soil temp, and redox

STHVN MSS
Sampling

  1. missing size fractions
  2. illegal date format in col x for some rows
  3. missing NA where there are no values for mandatory columns (O and W)
  4. no investigation type values

Measured tab

  1. no mandatory values entered

OOB Banylus
Sampling

  1. some missing values on col R, T, U, V

Measured

  1. 1 missing value and I put in a comment in the googlesheet there

EMT21 UVIGO
Sampling

  1. illegal source_mat_id -> this is something we may need to discuss. They have 2 different types of samples, with different filter sizes, for the same events, and this is something no other station has. As a result, using the source mat id equation as for the other stations, you end up with sample IDs that are repeated, and this is a big nono. So someone changed the equation to add on the filter size. In and of itself, this is OK, but this sample id will then not match those of the other stations and this may well cause lots of problems later in the automatic processing of emo bon data. Can you all look at this please - if we include the size_frac_up as part of the source_mat_id for these samples, then we need to do that for all the samples. It is easy to fix, but we need to decide if we will do that.

Water

EMT21 UVIGO
Sampling

  1. A bunch of source_mat_id in the final column will not form because the value in col R is NA - rows 187-191. Please fill in the correct values in R. See comment below for VB IMEC also.
  2. HQ then have to copy the entire sampling sheet (rename the current one to "bla" and then rename the new one to "sampling" and delete the "bla" one). This is the only way that you will have permission to create the new source_mat_ids for rows below 189 (Ioulia protected this column so only she can do it). Please do this for ALL stations (water and sediment) as we really need these IDs before we can do anything with the data

VB IMEC
Sampling

  1. There are also many NAs in column R which will make the source_mat_id funny, and one lot of 9999 as for EMT21. It actually is not illegal, however, do we want this? If HQ are OK with this formulation of these sieve fractions, I am OK with it also. UPDATE: now that col R has been changed to col Q this is not a problem any more.
  2. depth is given as a range, and that range is quite wide - 0-75 for example. Is this OK for us? I guess it is what it is and we just deal with it (these values will be carried thru to the transformed CSV files as a range, and in the triples will be a max and min, while those depths that are a single value will be recorded in the triples as a single value). This is something that the FE VRE code will have to deal with, if we think depth is a value people will want to sort on (@cymon, this comment is for you)
  3. There are samples with the same ID. The sample in row 361 has the same id as the sample in row 365. This is because they differ on depth, and depth is not a value in the sample ID. BIG PROBLEM that we have to solve. Note: if we want to include depth in the source_mat_id, we have to deal with the fact that some depths are ranges, and that will absolutely not work in the ID.
  4. Also issue with drag-n-drop of the source_mat_id final column: another reason why for all stations we need this sheet recreated so we can add values to this column
  5. There are completely blank rows. PLEASE REMOVE
  6. Row 803 has an illegal value in col A
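To illustrate the depth-range point above (item 2), here is a rough sketch of how a depth cell could be split into a minimum and maximum before it reaches the triples. The hyphen-separated format ("0-75") is taken from the example above. The function name is hypothetical, and returning an equal (min, max) pair for single values is a design choice of this sketch so that downstream code can always sort on the minimum. It is not how the existing transformation code is known to behave.

```python
import re

# Assumed range format: two non-negative numbers separated by a hyphen, e.g. "0-75".
RANGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")


def parse_depth(cell):
    """Return (min_depth, max_depth) as floats, or None if the cell is not numeric."""
    text = str(cell).strip()
    match = RANGE_RE.match(text)
    if match:
        first, second = float(match.group(1)), float(match.group(2))
        # Order the bounds so a range typed as "75-0" still gives (0, 75).
        return (min(first, second), max(first, second))
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. "NA", a blank cell, or free text
    return (value, value)  # single depth: min and max coincide
```

A row whose `parse_depth` result has different min and max values is one where the observatory would need to choose a single number if depth must be a single value.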

Measured

  1. no measurements at all?

UMF_UmU
Sampling

  1. some missing depths

STHVN MSS
No value at all....is this station "dead"? I see that we have not harvested this into GH, so I assume this station never happened in the end. In which case, perhaps remove the googlesheet?

ROSKOGO SBR
Sampling

  1. missing sample_store_location
  2. missing size_frac_up in row 42 - perhaps the value in col Q should actually be in col R? (I think so and if SBR do not respond, I suggest HQ just make the change themselves). Also for row 52, 62, and possibly more. Please identify all of these and correct them
  3. Question: but as this is a blank, should these size_frac_up and low not all be NA?
  4. xxx is not a valid ORCID (col J). If there is none, put NA or just leave blank.
  5. missing tidal_stages

Measured

  1. no measurements at all ?

PiEGetzo UPV/EHU
Sampling

  1. missing size fractions from row 25 - this affects the source_mat_id as the size_frac_up should be in there. As a result there are rows with the same source_mat_id - e.g. 25 and 27. This is a NONO. Need a value in column R. The size_frac_low is given as >20, which is maybe why size_frac_up is blank? But how can a lower limit be ">20"? That is impossible - the lower limit can be 20 and there can be an upper limit of NA or 9999... I think that @cpavloud should understand this better than I do, because you pointed out that the low and up should be swapped. Perhaps in doing that (if you have done that already - I don't know) the source_mat_id got messed up?
  2. there are two blanks for the same sample - row 29 and 30 for example. These need to be indicated as blank1 and blank2 in column L. Please identify all of these and fix them
  3. NOTE: I know that @cpavloud wanted to change the way that repeated blanks were recorded - not blank1 and blank2 but something else. I think I agreed with her BUT someone has to then make the necessary changes in the logsheets. @cpavloud can you please raise the necessary issue(s) on HQ as I don't have the time to do this work

NRMCB SZN
Sampling

  1. NAs in size_frac_up which affect the way the source_mat_id looks AND is illegal as it means that the id in row 54 is the same as the id in row 59 for example. Thought for all (including @cymon and @cpavloud): maybe we need to rethink the way the source_mat_ids are created????

MBAL4 MBA
Observatory

  1. I don't think the lat and long are specific enough?

Sampling

  1. again missing size_frac_up leading to odd-looking source_mat_id - see e.g. row 10. Can the value be added to col R for row 10, and for all blanks in this sheet? BUT: this is a blank, note, and as I asked above, I wondered if these needed size_fracs at all?

Measured

  1. most rows are missing mandatory values in col B. Are these expected to come or should they be NA (i.e. will never come)?

LMO LnU
Sampling

  1. again some NAs in the size_frac_up for blank samples making the source mat id look odd, but no repeated IDs as a result.
  2. note that the original ID (col A) has 3 in the name while the actual one (col AM) has 0.2 in the name, I think because these size fraction cols were swapped. Again, not illegal, just potentially confusing. This is for all rows

Measured

  1. some missing values in col B

IUIEilat1 UIU
Observatory

  1. not sure the lat is specific enough. If it is supposed to be 29.5000 exactly, then it should be entered as '29.50000 (otherwise googlesheets stupidly removes the trailing 0s)
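As a small illustration of the trailing-zero point, the sketch below counts the decimal places of a coordinate that was entered as text. The leading apostrophe is the marker that forces a cell to be stored as text, as suggested above. The function names and the threshold of 4 decimal places are assumptions made only for this example.

```python
def decimal_places(cell_text):
    """Count digits after the decimal point in a coordinate stored as text."""
    text = cell_text.strip().lstrip("'")  # drop the text-forcing apostrophe if present
    if "." not in text:
        return 0
    return len(text.split(".", 1)[1])


def is_specific_enough(cell_text, min_places=4):
    """Hypothetical check: min_places is an assumed threshold, not an agreed one."""
    return decimal_places(cell_text) >= min_places
```

A value that the spreadsheet has shortened to 29.5 fails this check, while the same value entered as text with its trailing zeros passes.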

FYI I think the UIU in the spreadsheet name should be IUI

HCMR-1-UBPC HCMR
Measured

  1. some missing col A values - use NA or "expected MM-YY"

ESC68N UiT
Measured

  1. for the rows with missing values, please do not write a comment in the parameter columns - only in the method columns. Put NA in there if there will be no value - cols D,F,H,J

DOORS BlackSea
No data - should this logsheet be removed from the drive?

BPNS Belgium
Sampling

  1. again NAs in size_frac_up but here it is a problem as it makes the source_mat_ids the same - row 13 and 18 for example. This is a NONO. Perhaps we should use size_frac_low in the sample id?

BERGEN UiB
Sampling

  1. looks good, but here we again need the tab to be re-created so we can drag-drop in column AM in order to create the new source_mat_ids, or else we will lose those data in the triples

AAOT CNR ISMAR
Sampling

  1. see comments in col R, potentially wrong values in there - check the entire columns Q and R - for example for row 172 the ID in col A says that the filter size is 10mu, but the values in Q and R are 20 and 2000, which do not match 10 at all, and apparently 2000 is an unexpectedly high value (according to the comment in cell 12R).

Measured

  1. missing values in column O, AN, AP, BD, BL, BR
@kmexter
Contributor Author

kmexter commented Aug 23, 2024

In the end
The stations that I think have logsheets that we can turn into triples (semi-manually) for sediment are
CCMAR, OOB, HCMR, Belgium, NRMCB; and for water are UMF-umU, ROSKOGO SBR if HQ fix issue 3 in sampling (which I think they can do without asking SBR), RFormosa CCMAR, Plenzia UPV/EHU, OSD74 CIIMAR, possibly MBAL4 MBA, probably LMO LnU, IUIEilat1 UIU, HCMR-1-UBPC, ESC68N UiT if someone fixes the NA in the rows with comments, and BERGEN UiB except the latest values which have no id. While some of these do have missing values, they do not appear to have wrong values (at least, as far as creating triples is concerned).

There is clearly something bad about having col R in the water logsheets, sampling tab, being part of the source mat id, as often that column (size_frac_up) is missing values or has NA in there. Sometimes this is not a problem (the ids look odd but are unique) but for many logsheets it is a problem as it creates duplicated IDs - and each row MUST have a unique ID. So that needs looking into.
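A minimal sketch of a uniqueness check for the source_mat_id column, assuming the column has been read into a list of strings in sheet order with data starting at spreadsheet row 2. The function name is hypothetical.

```python
from collections import defaultdict


def duplicate_ids(source_mat_ids, first_row=2):
    """Map each repeated source_mat_id to the spreadsheet rows where it appears."""
    rows_by_id = defaultdict(list)
    for offset, sample_id in enumerate(source_mat_ids):
        sample_id = (sample_id or "").strip()
        if sample_id:  # ignore cells where no ID was formed
            rows_by_id[sample_id].append(first_row + offset)
    # Keep only the IDs that occur on more than one row.
    return {sid: rows for sid, rows in rows_by_id.items() if len(rows) > 1}
```

An empty result means every row that has an ID has a unique one. Any entry in the result points at the rows that need a distinguishing value, whichever field is eventually chosen for that.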

For sediment, EMT21 UVIGO has illegal source_mat_ids but these are necessary, so we need to discuss how to change things to accommodate that.

Similar problem for water for VB IMEC. Different depths of sampling are creating non-unique IDs

@cymon

cymon commented Aug 29, 2024

(OK, I'm just adding this here rather than creating a new Issue.)

After the observatory 'sampling' sheets were updated and 'new_sampling' sheets renamed 'sampling' (this occurred on 28th August), there are only 2 'source_material_id's in the run_information sheets of Batch 1 & 2 that do not have matching 'source_mat_id's in the observatory sample sheets:

Missing source_mat_id is EMOBON_HCMR-1_Wa_210917_3um_blank row 72 batch 1
close matches are:
EMOBON_HCMR-1_Wa_210917_3um_blank2
EMOBON_HCMR-1_Wa_210917_3um_blank1
EMOBON_HCMR-1_Wa_210917_0.2um_blank2

Missing source_mat_id is EMOBON_HCMR-1_Wa_210917_0.2um_blank row 75 batch 1
close matches are
EMOBON_HCMR-1_Wa_210917_0.2um_blank2
EMOBON_HCMR-1_Wa_210917_0.2um_blank1
EMOBON_HCMR-1_Wa_230217_0.2um_blank2

I am going to assume that they match the "blank1" but this needs to be corrected in the 'replicate' field of the 'sampling' sheets
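For anyone wanting to reproduce this kind of "close matches" listing, here is one possible sketch using Python's standard-library `difflib`. Whether the listing above was produced this way is an assumption. The wrapper function name is hypothetical.

```python
import difflib


def close_source_mat_ids(missing_id, known_ids, n=3, cutoff=0.6):
    """Return up to n known IDs most similar to an ID that has no exact match."""
    return difflib.get_close_matches(missing_id, known_ids, n=n, cutoff=cutoff)


# IDs copied from the listing above.
known = [
    "EMOBON_HCMR-1_Wa_210917_3um_blank1",
    "EMOBON_HCMR-1_Wa_210917_3um_blank2",
    "EMOBON_HCMR-1_Wa_210917_0.2um_blank2",
]
candidates = close_source_mat_ids("EMOBON_HCMR-1_Wa_210917_3um_blank", known)
```

The candidates still need a human decision (here, blank1 versus blank2), so a match like this is a hint for correcting the 'replicate' field rather than an automatic fix.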

Also notice how the autoformatted "source_mat_id" in the sampling sheet, e.g. "EMOBON_HCMR-1_Wa_210628_3um_blank1" becomes "EMOBON_HCMR-1_Wa_210628_3um_blank" in the "measured" sheet even though it is copied from the correct cell in the "sampling" sheet - what is going on?
https://docs.google.com/spreadsheets/d/13DcVK2mzSxMJoFydSBaIMmj7Td1_JapEvcY2bmZTyLc/edit?gid=1225064690#gid=1225064690

@cpavloud

cpavloud commented Sep 2, 2024

I think I had created issues for these things that @kmexter is mentioning, individually in each observatory's repo, and @melinalou should have corrected them (at least most of them).
In short:
depth should not be a range, this will not work for the ENA submission, it needs to be a number.
size_frac_low being >20 is also wrong. It should just be 20. In this case there is no upper limit and it is normal because it is sampling using a mesh and not a filter (different SOP).

@kmexter
Contributor Author

kmexter commented Sep 2, 2024

clearly there is still some fixing to do then, as there are ranges and > in many logsheets.
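A rough sketch of how the remaining ranges and ">" entries could be located in columns that should hold plain numbers. The category names and the function name are made up for this example, and "NA" is accepted because it is the agreed marker for values that will never exist.

```python
import re

NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?$")
RANGE_RE = re.compile(r"^\d+(?:\.\d+)?\s*-\s*\d+(?:\.\d+)?$")


def check_numeric_cell(cell):
    """Return 'ok', 'na', 'blank', 'operator', 'range', or 'other'."""
    text = "" if cell is None else str(cell).strip()
    if text == "":
        return "blank"
    if text.upper() == "NA":
        return "na"
    if NUMBER_RE.match(text):
        return "ok"
    if text[0] in "<>":
        return "operator"  # e.g. ">20" in size_frac_low
    if RANGE_RE.match(text):
        return "range"     # e.g. "0-75" in depth
    return "other"
```

Cells reported as "operator" can be fixed mechanically, while cells reported as "range" need the observatory to pick a value.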

@melinalou

I can check this and maybe delete the ranges in depth and size_frac_low and up? or in every column with numbers?

@kmexter
Contributor Author

kmexter commented Sep 2, 2024

I think for the size_frac it should be clear what to do, but for the ranges - we cannot delete them as we need a value in there, so the observatory has to choose a value

@melinalou

melinalou commented Sep 2, 2024

Yes that's right! I will change only the size_frac.

@melinalou

Done.

@cpavloud

cpavloud commented Sep 2, 2024

size_frac should be a range
size_frac_low and size_frac_up should be numbers

also, @cymon you could add a QC step in your code to check if the values are ok, since size_frac_low should always be lower than size_frac_up (if both numbers exist)
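The suggested QC step could look roughly like the sketch below. It assumes both cells arrive as strings from the sheet and treats blank, "NA" or otherwise non-numeric cells as "no number", in which case there is nothing to compare. The function names are hypothetical and this is not taken from the existing code.

```python
def to_number(cell):
    """Return the cell as a float, or None if it is blank, NA, or not numeric."""
    try:
        return float(str(cell).strip())
    except ValueError:
        return None


def size_frac_ok(size_frac_low, size_frac_up):
    """True unless both values are numbers and low is not strictly below up."""
    low, up = to_number(size_frac_low), to_number(size_frac_up)
    if low is None or up is None:
        return True  # only one (or neither) bound is a number: nothing to check
    return low < up
```

A row where the two columns were swapped (for example 3 in size_frac_low and 0.2 in size_frac_up) fails the check, while a mesh sample with a lower limit of 20 and no upper limit passes.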

@cymon

cymon commented Sep 2, 2024 via email

@kmexter
Contributor Author

kmexter commented Sep 2, 2024

size_frac can be read just as a string. it is size_frac_up and _low that we actually use for ena and triples
note: if you did not know already, the data types are also specified in https://github.com/emo-bon/observatory-profile/blob/main/logsheet_schema_extended.csv -> col3

@cymon

cymon commented Sep 2, 2024 via email
