Updating definition of coordinate variable to account for NUG changes #174

Closed
martinjuckes opened this issue Jul 16, 2019 · 128 comments · Fixed by #531
Labels
enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format

Comments

@martinjuckes
Contributor

martinjuckes commented Jul 16, 2019

In NetCDF4, coordinate variables can be string valued or character arrays. This is a change from NetCDF3 --- and, because of this change, the section of the CF Convention which refers to the NetCDF definition of coordinate variables contains a contradiction.

Section 1.2 on terminology states that a Coordinate Variable is defined "precisely as it is defined in the NUG section on coordinate variables": this now implies string and character values are allowed. However, the following sentence in the definition of a Coordinate Variable states that it should be "numeric data type with values that are ordered monotonically".

We could resolve this contradiction by either (1) retaining the restriction to numeric data types and dropping precise equivalence with NUG or (2) retaining precise equivalence with NUG and allowing string and char coordinate variables. Initial discussion on the CF Discussions email list has two votes in favour of option 1. This would require minor changes to the text. In principle there would be no change to the conformance requirements, but the requirement for numeric data types does not appear to be represented in the conformance document and should be added.

If option (2) is taken, there is some ambiguity about the meaning of the monotonicity requirement which we would need to resolve.


PR #531 implements the decisions made by the following discussion.

@martinjuckes martinjuckes added the defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors label Jul 16, 2019
@JonathanGregory
Contributor

Thanks for raising this issue, Martin. I agree with option (1), which means this is a defect, as you have labelled it. Proposals to remedy defects are accepted by default if no-one objects within three weeks.
Jonathan

@davidhassell
Contributor

I support option 1) as well. Thanks for raising this, Martin.

@JimBiardCics
Contributor

I concur. Option 1) is the right choice. Do we need to add any clarifying verbiage regarding "label" coordinates?

In the old character array approach, a label coordinate variable was, by definition, an auxiliary coordinate since it was (almost) never 1D. A 1D string variable can meet the dimensional requirements for a coordinate variable. You can construct a variable with matching name and dimension name, for example, basin(basin). It seems we have three options:

A) State that this form is not allowed. Such variables would always need to have non-matching name and dimension name. This implies a cf-checker test that would fail if a 1D string variable had a matching dimension name.

B) State that this form is allowed, but that it will only be considered as an auxiliary coordinate for a variable if it is included in a coordinates attribute on the variable. This implies a cf-checker change to ignore 1D string variables like basin(basin) when building lists of coordinate variables.

C) State that string variables that would otherwise look like they could be coordinate variables are always auxiliary coordinate variables. This implies that a string variable such as basin(basin) would be understood as an auxiliary coordinate for a variable such as flow(time, basin) without the need to include it in a coordinates attribute. This likely implies a cf-checker change that would accept this form as valid.

In every case, there are implications for the data model, and software packages such as cdms that attempt to build coordinate domains for data variables will need to deal appropriately with 1D string variables that appear to match the requirements for a coordinate variable. (@taylor13, you might want to chime in.)
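
The practical difference between options A), B) and C) can be sketched in code. The following is a minimal illustration, not cf-checker code: the `Role` enum and the `role_of_string_variable` function are hypothetical names introduced here, and the function only covers a 1D string variable considered against a data variable that spans the same dimension.

```python
from enum import Enum
from typing import Tuple


class Role(Enum):
    """Possible outcomes for a 1D string variable (hypothetical labels)."""
    NOT_ALLOWED = "form not allowed"
    AUXILIARY = "auxiliary coordinate variable"
    UNRELATED = "not a coordinate of the data variable"


def role_of_string_variable(option: str, name: str, dims: Tuple[str, ...],
                            in_coordinates_attr: bool) -> Role:
    """Role of a 1D string variable under option "A", "B" or "C".

    `in_coordinates_attr` says whether the data variable names this
    variable in its `coordinates` attribute.
    """
    if option not in ("A", "B", "C"):
        raise ValueError(f"unknown option: {option!r}")

    name_matches_dimension = dims == (name,)
    if not name_matches_dimension:
        # e.g. string basinname(basin): the existing rule applies in all options.
        return Role.AUXILIARY if in_coordinates_attr else Role.UNRELATED

    # From here on the variable looks like string basin(basin).
    if option == "A":
        return Role.NOT_ALLOWED
    if option == "B":
        return Role.AUXILIARY if in_coordinates_attr else Role.UNRELATED
    return Role.AUXILIARY  # option C: auxiliary even without the attribute
```

For `string basin(basin)` that is not listed in a `coordinates` attribute, this sketch yields `NOT_ALLOWED` under A), `UNRELATED` under B) and `AUXILIARY` under C), which is the point on which the three options differ.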

@martinjuckes
Contributor Author

Hi Jim,

I hadn't spotted that problem. My preference is for (A).

As I understand it, the CF data model has a single namespace, so there can only be one basin object. In option (B), in the absence of a coordinates declaration, the NetCDF file has a dimension basin and a variable basin which, in the NetCDF data model, live in separate namespaces. I don't think we can accommodate this in CF without a significant change to the data model, which does not appear to be justified here.

Option (C) looks awkward to me. The size of an auxiliary coordinate is usually determined by its own coordinates. If there are no true coordinates, it is not really "auxiliary" to anything.

Perhaps @davidhassell can comment on the data model issues.

@martinjuckes
Contributor Author

See also #139, which is a proposed enhancement to support string variables.

@davidhassell
Contributor

I don't think that this is a data model issue, whichever option we choose.

The data model doesn't care how its constructs are encoded; all it needs is to be able to unambiguously identify its constructs from a file. For example, if we were to say in the conventions "when you see a string variable like basin(basin), interpret the variable basin as an auxiliary coordinate variable for the "discrete" axis basin" then that would fit in perfectly with the existing data model.

@JimBiardCics
Contributor

I agree with @davidhassell that whatever option we choose, it's not difficult to incorporate into the data model. @martinjuckes, I have to confess that I don't understand your argument against options B) or C). I prefer options B) and C) myself. I don't see a good reason to make the entirely natural choice of creating a 1D string coordinate variable named basin(basin) illegal just because it is not numeric. I would love to adopt option C), but I can see the argument for option B).

@martinjuckes
Contributor Author

OK, it is good to see that there are no obstacles from the data model side of things. My mistake there.

Adopting (B) or (C) would require a change to the definition of an Auxiliary Coordinate, which currently includes the statement that "Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s)" -- it would be good to know what alternative is being proposed.

@JonathanGregory : you took an interest in this topic during the email discussions --- do you have a preference for any of the options Jim has outlined above (on the 16th)?

@ethanrd
Member

ethanrd commented Jul 19, 2019

I like option one, i.e., sticking with the restriction that coordinate variables are 1-D numeric, monotonic.

To me this is more of a correction than a change because the main NUG section on coordinate variables is not actually particularly precise in its definition of coordinate variables. It says

It is legal for a variable to have the same name as a dimension. Such variables have no special meaning to the netCDF library. However there is a convention that such variables should be treated in a special way by software using this library.
...
If a dimension has a corresponding coordinate variable, then this provides an alternative, and often more convenient, means of specifying position along it. Current application packages that make use of coordinate variables commonly assume they are numeric vectors and strictly monotonic (all values are different and either increasing or decreasing).

The NUG Best Practices section on coordinate variables is a bit more precise. These should be made more consistent. There's discussion at Unidata on getting the NUG in its own repo so changes like this can be a bit more transparent.
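
The NUG wording quoted above, "strictly monotonic (all values are different and either increasing or decreasing)", translates directly into a short check. This is a minimal sketch on a plain Python sequence; the function name `is_strictly_monotonic` is introduced here for illustration, and treating an empty or single-valued sequence as monotonic is an assumption, since the quoted text does not address that case.

```python
from typing import Sequence


def is_strictly_monotonic(values: Sequence[float]) -> bool:
    """True if values are all different and either increasing or decreasing."""
    pairs = list(zip(values, values[1:]))
    increasing = all(a < b for a, b in pairs)
    decreasing = all(a > b for a, b in pairs)
    # With fewer than two values there are no pairs, so both flags are True
    # (an assumption of this sketch, not something the NUG text states).
    return increasing or decreasing
```

Repeated values fail the check in either direction, so `[0, 1, 1]` is rejected while `[3, 2, 1]` is accepted.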

@martinjuckes
Contributor Author

Hello All,

There appears to be consensus on point 1: treating the wording of the definition of a coordinate variable as a defect and modifying it to state clearly that CF requires coordinate variables to be of numeric data type.

I don't think we have identified a clear preference regarding Jim's suggestions about auxiliary coordinate variables. In going through the changes needed in the Conventions document I noticed that, near the start of Chapter 5, we have the sentences "We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional". It appears to me that the logic of this recommendation applies equally well to rank 1 auxiliary coordinates of non-numeric data. Do you agree with this, @JimBiardCics, or is there a reason to treat uni-dimensional auxiliary coordinates differently here?

That is,

dimensions:
   n = 1;
   basin = 4;
variables:
   float data(basin,n);
      data:coordinates = "basin";
   string basin(basin,n);

is recommended against. This is phrased as a restriction on multidimensional coordinate variables, but I believe it makes sense to treat it as applying to all auxiliary coordinate variables. This would be a slight variation on Jim's option A above, recommending that basin(basin) be avoided for auxiliary coordinates, rather than saying it is disallowed, and also leaving it open to have a simple data variable of the form basin(basin).

i.e. we have an option (D): If a string or character variable has a single dimension matching its own name, it will be treated as a data variable with an index dimension. It is recommended that such variables should not be used as auxiliary coordinate variables.

I also noticed that the first sentence of Section 1.2 is also out of date, in that it states that the terms defined come from the NUG. I think this has been wrong for some time, as most of the terms appear to be specific to CF.

I've drafted some proposed updates to the document, in the 4 places that I believe need updating:

1. Update 1st sentence of Section 1.2

Current:

The terms in this document that refer to components of a netCDF file are based on terms
 defined in the NetCDF User’s Guide (NUG) [NUG] NUG. Some of those definitions are
 repeated below for convenience.

Proposed:

The terms in this document that refer to components of a NetCDF file are defined below.
Some of these are as defined in the NetCDF User’s Guide (NUG) [NUG] NUG and are
repeated below for convenience. Terms which are introduced by NUG are
marked *[NUG]*, and terms which are introduced in NUG and modified here are marked
*[NUG->CF]*.

2. Update terms in Section 1.2 to indicate those based on NUG

  • ancestor group --> ancestor group [NUG]
  • CDL syntax --> CDL syntax [NUG]
  • cell --> cell [NUG]
  • coordinate variable --> coordinate variable [NUG -> CF]

I think these 4 are the only terms that have a specific meaning in NUG.

3. Update Definition of Coordinate Variable in Section 1.2

Current:

We use this term precisely as it is defined in the NUG section on coordinate variables.
It is a one-dimensional variable with the same name as its dimension [e.g., time(time) ],
and it is defined as a numeric data type with values that are ordered monotonically.
Missing values are not allowed in coordinate variables.

Proposed:

A one-dimensional variable with the same name as its dimension [e.g., time(time) ]
and numeric data type, with values that are ordered monotonically. Missing values
are not allowed in coordinate variables. This matches the definition of this term in
the NUG section on coordinate variables, except that CF does not allow
non-numeric data types.

4. Recommendation on Auxiliary Coordinates (Chapter 5)

Fourth paragraph of chapter 5:

Current

We recommend that the name of a multidimensional coordinate variable should not
match the name of any of its dimensions because that precludes supplying a coordinate
variable for the dimension.

Proposed (multidimensional --> auxiliary):

We recommend that the name of an auxiliary coordinate variable should not match
the name of any of its dimensions because that precludes supplying a coordinate
variable for the dimension.

@JimBiardCics
Contributor

@martinjuckes As I read CF, there is no such thing as a "multi-dimensional coordinate variable" that is anything but an auxiliary coordinate variable. There is no provision in CF for connecting an auxiliary coordinate variable with a data variable apart from including the name in a coordinates attribute on the data variable.

Relevant parts of Section 1.2 declare

auxiliary coordinate variable
Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

coordinate variable
We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time) ], and it is defined as a numeric data type with values that are ordered monotonically. Missing values are not allowed in coordinate variables

multidimensional coordinate variable
An auxiliary coordinate variable that is multidimensional.

scalar coordinate variable
A scalar variable (i.e. one with no dimensions) that contains coordinate data. Depending on context, it may be functionally equivalent either to a size-one coordinate variable (Section 5.7, "Scalar Coordinate Variables") or to a size-one auxiliary coordinate variable (Section 6.1, "Labels" and Section 9.2, "Collections, instances, and elements").

recommendation
Recommendations in this convention are meant to provide advice that may be helpful for reducing common mistakes. In some cases we have recommended rather than required particular attributes in order to maintain backwards compatibility with COARDS. An application must not depend on a dataset’s adherence to recommendations.

These definitions make it quite clear that a non-numeric variable cannot ever be a coordinate variable. I've got no problem with clearing up the wording in Section 1.2, but everything that follows is dependent on these definitions of terms.

Section 5 paragraph 4 states

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

I included the definition of "recommendation" earlier because this paragraph is a recommendation, not a requirement. Notice that the definition of a recommendation states "An application must not depend on a dataset’s adherence to recommendations." I can see a valid argument against human confusion for recommending against allowing a multidimensional auxiliary coordinate variable to have a dimension that has the same name as the variable name, but I think the assertion that such a construction precludes providing a coordinate variable for the dimension is incorrect, and deciding that a variable is a coordinate variable on the basis of a match between one dimension and the variable name is a particularly bad practice. An auxiliary coordinate variable can be fully compliant with CF and not follow this recommendation.

I think we should consider this recommendation to be defective. I'm going to break here on account of the length of this comment and continue in another.
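
The application bug that the second sentence of the Section 5 recommendation alludes to can be sketched as follows. Both function names are hypothetical and a variable is reduced to its name and its tuple of dimension names; a full CF check would also look at data type and values, which this sketch leaves out.

```python
from typing import Tuple


def naive_is_coordinate(name: str, dims: Tuple[str, ...]) -> bool:
    """Name match only: the practice criticised in the comment above."""
    return name in dims


def careful_is_coordinate(name: str, dims: Tuple[str, ...]) -> bool:
    """Name match plus the check that the variable is one dimensional."""
    return dims == (name,)


# basin(basin, n) is multidimensional, so it cannot be a coordinate variable.
looks_like_coordinate = naive_is_coordinate("basin", ("basin", "n"))
really_is_candidate = careful_is_coordinate("basin", ("basin", "n"))
```

The naive test accepts `basin(basin, n)` while the careful one rejects it; both agree on `time(time)`. An application written the careful way does not depend on datasets following the recommendation.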

@JimBiardCics
Contributor

I realized that I left a bit out above. The definitions in Section 1.2 also make it clear that a multidimensional numeric variable cannot ever be a coordinate variable. Both multidimensional numeric variables and string variables can be auxiliary coordinate variables.

@JimBiardCics
Contributor

@martinjuckes Now to the question of 1D string auxiliary coordinate variables. A 1D string variable with matching dimension and variable names is, per the Section 1.2 definitions (see my previous comment), a fully-compliant auxiliary coordinate variable. I believe that the construction

dimensions:
   basin = 4;
variables:
   float data(basin);
      data:coordinates = "basin";
   string basin(basin);

is valid according to the current version of CF. It satisfies all the requirements. It is also compliant with the current (non-binding) recommendation from Section 5 paragraph 4, which doesn't mention 1D string auxiliary coordinate variables.

I personally think it is fine for a 1D type string auxiliary coordinate variable to have matching variable and dimension names. It is evocative of its use as a 1D coordinate for a data variable (though it is not a "true" coordinate).

@JimBiardCics
Contributor

@martinjuckes After all that is said and done, I like your change to the Section 1.2 definition of coordinate variable. I disagree with your change to the Section 5 paragraph 4 recommendation that I believe to be defective. Here's an alternative suggestion.

We could change Section 5 paragraph 4 from

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because that precludes supplying a coordinate variable for the dimension. This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional.

to

We recommend that the name of a multidimensional [auxiliary] coordinate variable should not match the name of any of its dimensions because of the potential for such a construction to confuse users.

@martinjuckes
Contributor Author

Hi Jim,

(1) yes, it is clear that a multidimensional coordinate variable is always an auxiliary coordinate, and that the converse is not true;

(2) yes, it is clear that Section 5, para 4 is a recommendation, not a requirement;

(3) A construction of the form string basin(basin) is clearly not supported at the moment because string variables are not allowed. This will change with your proposed extension (pull request 140), but the usage you suggest above, in which string basin(basin) could be an auxiliary coordinate is clearly something that has never been allowed in the past and so surely must be considered as an extension. Would you like to propose it as an extension?

(4) Removing the sentence from Section 5 para 4 that states "This practice also avoids potential bugs in applications that determine coordinate variables by only checking for a name match between a dimension and a variable and not checking that the variable is one dimensional" is an interesting suggestion, but I don't see the relevance to this discussion. Why do you think our view on this point needs to change (it has been in there since version 1.0)? I'm in favour of clearing up text which is not needed, but I don't see the grounds for considering this sentence as a defect.

@JimBiardCics
Contributor

@martinjuckes Regarding your point (3): What convention prevents the construction string basin(basin) from being an auxiliary coordinate variable? I may well have missed it, but I haven't found one, myself. Character array variables have long been valid auxiliary coordinate variables. See Section 4.5 and Section 6.1. I see no need for an extension once the type string is accepted for use with variables.

Regarding your point (4): The sentence in the recommendation in Section 5 paragraph 4 that I am saying is defective is based on an appeal to a bad programming practice. The definition of recommendation contains the statement, "An application must not depend on a dataset’s adherence to recommendations." Applied to the sentence I am suggesting we remove, this definitional statement reads, "An application must not depend on a multidimensional [auxiliary] coordinate variable avoiding the use of a dimension that has the same name as the variable it is applied to." This directly contradicts the recommendation.

@martinjuckes
Contributor Author

Hello Jim,

Under the existing convention basin(basin) is a coordinate variable and is not allowed to be a string. That is the whole point of this discussion. Regarding point 4, I'm afraid I don't see the contradiction you allude to.

Perhaps it would help to have some other views on these points. @JonathanGregory , @ethanrd : do you have any views on Jim's suggestion that a variable of the form string basin(basin) should be allowed as an auxiliary coordinate variable (see this and preceding posts)? This is an alternative to point 4 of my proposed changes, in which I suggest extending the existing recommendation against using auxiliary coordinates of the form string basin(basin,n) to the single-dimension case.

@JonathanGregory
Contributor

Dear Martin and Jim
I'm inclined to agree with Martin that string basin(basin) should not be allowed as an auxiliary coordinate variable because it looks like a dimension (NUG) coordinate variable, in being 1D and having the name of its dimension. At present, such variables are 2D char variables and they have a name which differs from their dimension e.g. char basinname(basin,stringlength). The string version of such an aux coord var would be string basinname(basin).
That means that string basin(basin) would not be allowed at all in CF - it can't be a dimension coord var either because it's not numeric. Not being able to use this construction might seem regrettable, since it looks convenient, but on the whole I feel that it would confuse the convention if we allowed it as an aux coord var.
Best wishes
Jonathan

@taylor13

taylor13 commented Aug 1, 2019

Dear all,

I also think that if a string aux. coord. var. name and its dimension's name are identical, this could unnecessarily mislead some into thinking it is a coordinate variable (because of the NUG convention), so CF should not allow it.

best regards,
Karl

@JimBiardCics
Contributor

@martinjuckes @JonathanGregory I hear where you are coming from. I may have been somewhat unclear before. What I am trying to point out is that the Conventions don't currently proscribe such a form. There is no prohibition in the text against having a 1D variable of non-numeric type that has matching variable and dimension name. Such a variable cannot be a coordinate variable, by definition, because it is non-numeric. It meets all the requirements for a valid auxiliary coordinate variable. There is also no prohibition in the text against having a multidimensional variable with a dimension name that matches the variable name. Such a variable meets all the requirements for a valid auxiliary coordinate variable.

The only basis I have found for any assertion regarding such variables is the defective recommendation in Section 5 paragraph 4, which can't actually be regarded as proscriptive because it is a recommendation.

If we wish to proscribe a variable of the form string basin(basin), we can certainly do so, but we will need to change the language of the definition of auxiliary coordinate variable in Section 1.2 to directly prohibit such a variable having a name that is the same as the name of any of its dimensions. If that is the community consensus I'm OK with that. I can see valid arguments for all three of the options I laid out in my earlier comment.

@davidhassell
Contributor

Is there a backward compatibility issue here? If we allow string basin(basin) we are likely to break software. Given that there is not a scientific use-case here (right? or have I missed it?) I think it would be best to disallow string basin(basin).

@JimBiardCics
Contributor

@davidhassell It's possible. Software that didn't check on the type might end up doing something unexpected. An appeal to potential software problems is problematic, as software that properly implements CF as written should check the type of a variable like string basin(basin) and decide it is not a coordinate variable. But that doesn't mean that people haven't made naïve assumptions when writing code. A variable such as char basin(basin), which is currently the only "non-numeric" option available and which CF declares to be a scalar "string" variable, is certainly problematic for a few reasons.

There is not a scientific use case. You could say the same for a number of aspects of CF.

The assumption appears to have been that auxiliary coordinate variables wouldn't ever look like coordinate variables. That's probably why we have the recommendation in Section 5 paragraph 4. We just didn't write the conventions to expressly prohibit such a case.

@JimBiardCics
Contributor

If we don't want to allow 1D non-numeric auxiliary coordinate variables to have the form <type> <name>(<name>), I suggest that we change the definition of auxiliary coordinate variable in Section 1.2 from

auxiliary coordinate variable
Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s).

to

auxiliary coordinate variable
Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, an auxiliary coordinate variable must not have the same name as the name of any of its dimensions.
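
A checker for the proposed wording might look like the sketch below. It is only an illustration under stated assumptions: the function name, the `NUMERIC_TYPES` labels and the dictionary layout (variable name mapped to a type label and a tuple of dimension names) are all hypothetical, and the sketch assumes that a true coordinate variable may also appear in a `coordinates` attribute and should not be flagged.

```python
from typing import Dict, List, Tuple

# Assumed labels for numeric netCDF types; adjust to the types CF permits.
NUMERIC_TYPES = {"byte", "short", "int", "int64", "float", "double"}


def misnamed_auxiliary_coordinates(
    variables: Dict[str, Tuple[str, Tuple[str, ...]]],
    coordinates_attr: str,
) -> List[str]:
    """Names in a `coordinates` attribute that break the proposed rule."""
    offenders = []
    for name in coordinates_attr.split():
        if name not in variables:
            continue  # dangling references are a separate problem
        dtype, dims = variables[name]
        is_coordinate_variable = dims == (name,) and dtype in NUMERIC_TYPES
        if is_coordinate_variable:
            continue  # a genuine coordinate variable, not an auxiliary one
        if name in dims:
            offenders.append(name)  # auxiliary coordinate sharing a dimension name
    return offenders


example = {
    "time": ("double", ("time",)),
    "basin": ("string", ("basin",)),
    "basinname": ("string", ("basin",)),
}
flagged = misnamed_auxiliary_coordinates(example, "time basin basinname")
```

Here `basin` is flagged because it is a string variable named after its own dimension, while `basinname(basin)` and the numeric `time(time)` pass.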

@martinjuckes
Contributor Author

Hello All,

Thanks for those comments. I realise now that there was an error in my proposed new definition of the "coordinate variable" (item 3 here): it implied an unintended change in the interpretation of a variable int x(x) with non-monotonic values. This should be interpreted, I believe, as a coordinate variable with non-compliant values. My suggested text would have implied, rather unhelpfully, that it was merely a "data variable".

As Jim has pointed out, there is a choice about how we deal with string variables here. We could say string basin(basin) is a coordinate variable with an invalid data type, or we could say that it is not a coordinate variable, but still a valid data variable with an index dimension. Jonathan and David have argued for the 1st, and I think Karl's comments also point in that direction. I think Jim has been following the 2nd interpretation. I've created two revised definitions that set out these options below.

3(a) Revised proposal for Coordinate Variable:

Any one-dimensional variable with the same name as its dimension [e.g., time(time)] is
interpreted as a coordinate variable. Coordinate variables must have a numeric data
type and data values that are ordered monotonically without any missing values.
This matches the definition of this term in the NUG section on coordinate variables,
except that CF does not allow non-numeric data types.

3(b) Alternative revised proposal for Coordinate Variable:

Any one-dimensional variable with the same name as its dimension [e.g., time(time)]
and numeric data type is interpreted as a coordinate variable. Coordinate variables
must have data values that are ordered monotonically without any missing values.
This matches the definition of this term in the NUG section on coordinate variables,
except that CF does not interpret non-numeric variables as coordinate variables.

I also prefer 3(a), as it reduces the room for confusion which might arise if string basin(basin) is considered as a coordinate variable in NUG and a data variable in CF.
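
The two definitions differ only in how they interpret a non-numeric variable named after its single dimension. A minimal sketch of that difference follows; the function names, the returned strings and the `NUMERIC_TYPES` labels are hypothetical, and data values (monotonicity, missing values) are deliberately not examined.

```python
from typing import Tuple

# Assumed labels for numeric netCDF types.
NUMERIC_TYPES = {"byte", "short", "int", "int64", "float", "double"}


def interpret_3a(name: str, dims: Tuple[str, ...], dtype: str) -> str:
    """3(a): the name match alone makes it a coordinate variable."""
    if dims != (name,):
        return "not a coordinate variable"
    if dtype in NUMERIC_TYPES:
        return "coordinate variable"
    return "coordinate variable with invalid data type"


def interpret_3b(name: str, dims: Tuple[str, ...], dtype: str) -> str:
    """3(b): only numeric variables are interpreted as coordinate variables."""
    if dims == (name,) and dtype in NUMERIC_TYPES:
        return "coordinate variable"
    return "not a coordinate variable"
```

For `string basin(basin)`, `interpret_3a` reports a coordinate variable with an invalid data type, whereas `interpret_3b` treats it as something other than a coordinate variable. Both agree on `double time(time)`.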

If we accept 3(a), I'm not sure it is necessary to change the auxiliary coordinate variable definition. In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable. This applies to all data types. The relevant recommendation is in section 5. It might help the clarity of the document to repeat this in the definition text, but I think we should keep it as a recommendation rather than strengthen it to a firm rule.

On the other hand, if we are going to be precise there is a problem with the phrase "there is no relationship between the name of an auxiliary coordinate variable and the name(s) of its dimension(s)". It is trying to say that there is no semantic relationship, or no formal relationship within the CF data model. There could be other types of relationships: the point is that CF doesn't care about any relationships.

@JimBiardCics : what do you think of the following alternative for auxiliary coordinate variable:

Any netCDF variable that contains coordinate data, but is not a coordinate variable (in the sense of that term defined by the NUG and used by this standard - see below). Unlike coordinate variables, relationships between the name of the auxiliary coordinate and the names of its dimensions have no significance. It is recommended, however, that an auxiliary coordinate variable does not have the same name as the name of any of its dimensions.

@JimBiardCics
Contributor

@martinjuckes If we go with the majority of responders on this thread regarding the acceptability of string basin(basin), then I agree that something close to your option 3(a) is best. And I'm OK with going that way since I appear to be the only one that feels differently.

Regarding your statement:

I also prefer 3(a), as it reduces the room for confusion which might arise if string basin(basin) is considered as a coordinate variable in NUG and a data variable in CF.

NUG does consider string basin(basin) as a coordinate variable, so I'm not quite sure what you are getting at. I guess you see 3(a) as precluding any variable with the form <type> basin(basin) if the type is not numeric. I think that is overly restrictive in general. I much prefer being quite clear about what is an allowed coordinate variable and what is an allowed auxiliary coordinate variable. Anything else would be considered a data variable. We can then make a recommendation that variables of the form name(name), name(name, ...), or name(..., name, ...) are to be avoided since they could be misinterpreted by inattentive readers as coordinate variables.

My next comment will include suggested new text.

I disagree with your comments regarding auxiliary coordinate variables. First and foremost, a recommendation does not, per its own definition, define anything that a person writing software to read a netCDF file should depend on.

Regarding your statement:

In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable.

I'm guessing there's a typo in this sentence, because there's not an 'auxiliary coordinate' as opposed to an 'auxiliary coordinate variable'.

Looking at the rest of the paragraph that begins "If we accept 3(a) ...", I believe that if we are going to disallow string basin(basin), then we should disallow char basin(basin,len). They are functionally equivalent. The recommendation in Section 5 paragraph 4 was an attempt to discourage any such constructions, but it is a defective recommendation in a number of ways (as I have described in previous comments). Whatever we do, let's address the defective recommendation.

I agree that the phrase about relationship between variable and dimension name in the definition of an auxiliary coordinate variable is unclear. We should change it.

@JimBiardCics
Contributor

JimBiardCics commented Aug 5, 2019

@martinjuckes As I mentioned before, I am confused by your sentence

In the current convention the form basin(basin,n) is allowed as a data variable or an auxiliary coordinate, but generates a warning if used as an auxiliary coordinate variable.

As I read it again this morning (my time), I think I may see what you are getting at. Are you saying that the Conventions define a variable with the form <type> basin(basin) to be a valid data variable or auxiliary coordinate variable, but the cfchecker application generates a warning if the variable is used as an auxiliary coordinate variable?

@JimBiardCics
Contributor

How about this approach? Define a coordinate variable to be a 1D numeric monotonic variable with matching variable and dimension name that does not contain any fill or missing values. Define an auxiliary coordinate variable to be an N-D variable with a name that does not match any dimension name that contains data that is intended to be interpreted as coordinate information. Remove the recommendation from Section 5. This would allow someone to make a variable of the form <type> name(name) that isn't any sort of coordinate variable, but would actively prohibit constructions such as string basin(basin) or int basin(basin, len). In specific terms, make the changes below.

In Section 1.2 (and in the order below)

Coordinate Variable

A coordinate variable is a one-dimensional variable with a numeric type that has the same name as the name of its dimension (e.g., int time(time)). The contents of a coordinate variable shall be monotonic — that is, consistently either increasing or decreasing in value, and shall not contain fill or missing values. Such a variable functions as a domain axis for any variable that has the corresponding dimension name as one of its dimensions. This definition differs from the definition in the NUG which does not require numeric type, monotonicity, or lack of fill or missing values.

Auxiliary Coordinate Variable

An auxiliary coordinate variable is a variable containing coordinate information which does not meet all the requirements of a coordinate variable. An auxiliary coordinate variable shall not have a name matching any of the names of its dimensions. An auxiliary coordinate variable may have a non-numeric type (allowing it to represent a category or label axis), may be non-monotonic, and may contain fill and missing values.

In Section 5

Delete paragraph 4.
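
A minimal Python sketch of how the two proposed definitions above might translate into checks. The `Var` class, the set of numeric type names, and the use of `None` to mark a fill or missing value are all assumptions made for illustration; nothing here is taken from an existing CF tool.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed set of netCDF numeric type names (illustrative, not exhaustive).
NUMERIC_TYPES = {"byte", "short", "int", "int64", "float", "double"}


@dataclass
class Var:
    """Hypothetical in-memory stand-in for a netCDF variable."""
    name: str
    dtype: str
    dims: Tuple[str, ...]
    values: Sequence = ()  # None marks a fill or missing value


def is_monotonic(values):
    """True if values never decrease or never increase."""
    pairs = list(zip(values, values[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


def is_coordinate_variable(var):
    """Proposed rule: 1-D, numeric, named after its dimension,
    no fill or missing values, monotonic."""
    return (
        len(var.dims) == 1
        and var.dims[0] == var.name
        and var.dtype in NUMERIC_TYPES
        and all(v is not None for v in var.values)
        and is_monotonic(var.values)
    )


def may_be_auxiliary_coordinate(var):
    """Proposed rule: the name must not match any of its dimensions."""
    return var.name not in var.dims


time = Var("time", "int", ("time",), [0, 1, 2])
basin = Var("basin", "string", ("basin",), ["atlantic", "pacific"])
basin_name = Var("basin_name", "string", ("basin",), ["atlantic", "pacific"])
```

Under these rules `time` is a coordinate variable, `basin` is neither kind of coordinate, and `basin_name` is eligible to be an auxiliary coordinate variable.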

@davidhassell
Contributor

@JimBiardCics I like this approach.

I'd like to pepper in a few "strictly"s, and I'd rather shy away from your use of "domain axis" in the text, simply because a "domain axis construct" is a CF data model construct that does not map to a CF-netCDF coordinate variable.

How about (new text in italics):

A coordinate variable is a one-dimensional variable with a numeric type that has the same name as the name of its dimension (e.g., int time(time)). The contents of a coordinate variable shall be strictly monotonic — that is, consistently either strictly increasing or strictly decreasing in value, and shall not contain fill or missing values. Such a coordinate variable is able to unambiguously provide cell locations for any variable that has the corresponding dimension name as one of its dimensions. This definition differs from the definition in the NUG which does not require numeric type, monotonicity, or lack of fill or missing values.

and in the auxiliary coordinate paragraph: "non-strictly-monotonic", (if that makes grammatical sense!).
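
To make the effect of the added "strictly" concrete, here is a small Python sketch. The function name and the `strict` flag are illustrative choices only.

```python
def is_monotonic(values, strict=False):
    """Check for consistently increasing or decreasing values.

    With strict=True, repeated values are rejected, so every
    coordinate value identifies exactly one cell.
    """
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


repeated = [0.0, 1.0, 1.0, 2.0]
print(is_monotonic(repeated))               # monotonic, but with a repeat
print(is_monotonic(repeated, strict=True))  # the repeat breaks strictness
```

The repeated value passes the plain monotonic test but fails the strict one, which is the case the extra wording is meant to exclude.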

@JimBiardCics
Contributor

@JonathanGregory Respectfully, we don't all agree that a 1-D string-valued coordinate variable should not have the same name as its dimension. But I acknowledge that you and Martin have that view.

@ethanrd indicates that there is a general use case, in that netCDF-Java supports coordinate variables of the form "string x(x)". I have a use case in my own work. I am working on automated front detection. I generate netCDF files with data variables where one of the dimensions is the front type. I'd love to construct a dimension coordinate variable "string front_type(front_type)", but current CF understanding requires me to construct an auxiliary coordinate variable "string front_type_name(front_type)". It's not a problem, per se, but it is an example of a system where the "string x(x)" form would be a natural fit.

Given the strong historical conceptual connection between dimension coordinate variables and temporal and spatial axes, I understand the tendency to want to exclude non-numeric dimension coordinate variables. But in the wider world of state space domains there are "independent variable" bases that aren't represented as classical number line axes.

A workaround would be to use flag_values and flag_meanings to assign strings to each element of a variable of the form "int x(x)". Such a variable would pass the requirements for a dimension coordinate variable, and would be conceptually equivalent to a variable of the form "string x(x)".

But ... CF currently doesn't allow flag_values and flag_meanings to be used for coordinate variables. So that option is excluded.
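
For illustration only, this is roughly what the flag-based workaround would look like if it were permitted. The dictionary layout and the four front type names are made up for the example; `flag_values` and `flag_meanings` are the CF attributes, with `flag_meanings` written as a blank-separated string.

```python
def decode_flags(values, flag_values, flag_meanings):
    """Map integer codes to labels using CF-style flag attributes."""
    lookup = dict(zip(flag_values, flag_meanings.split()))
    return [lookup[v] for v in values]


# Hypothetical "int front_type(front_type)" variable with flag attributes.
front_type = {
    "dims": ("front_type",),
    "data": [0, 1, 2, 3],
    "attrs": {
        "flag_values": [0, 1, 2, 3],
        "flag_meanings": "cold warm stationary occluded",  # assumed labels
    },
}

labels = decode_flags(
    front_type["data"],
    front_type["attrs"]["flag_values"],
    front_type["attrs"]["flag_meanings"],
)
```

The integer data would satisfy the numeric and monotonic requirements while the attributes carry the labels.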

And, yes, this isn't the most important part of this discussion, and it could be spun off into its own issue in the name of getting to closure.

@JonathanGregory
Contributor

Dear Jim

Thanks for the example. Before the string data type was introduced, you would have used char front_type_name(front_type,stringmaxlength). Since it's an auxiliary coordinate variable, it is recommended not to give it the same name as its dimension (in chapter 5). My interpretation is that it's still an auxiliary coordinate variable when string-valued, though now 1D, and therefore should not have the same name as its dimension. I don't think it can be a dimension coordinate variable because it's not numeric, and Section 1.2 says [dimension] coordinate variables must be numeric. I think I'm correctly stating the convention as it stands. Maybe the word "should" is the problem in how I stated it. :-) In fact it's not an error for you to have string front_type(front_type) as an auxiliary coordinate variable (named by the coordinates attribute) but the CF checker should produce a warning, I think.
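
As a rough sketch of the older char form, the following Python packs labels into fixed-width rows, as char front_type_name(front_type, stringmaxlength) would store them. The NUL padding, the ASCII encoding, and the label values are assumptions for the example.

```python
def to_char_rows(labels):
    """Pack labels into equal-width byte rows (one row per label).

    Returns the rows and the width, which plays the role of the
    trailing string-length dimension.
    """
    encoded = [s.encode("ascii") for s in labels]
    width = max(len(e) for e in encoded)
    return [e.ljust(width, b"\x00") for e in encoded], width


def from_char_rows(rows):
    """Strip the padding and recover the original labels."""
    return [r.rstrip(b"\x00").decode("ascii") for r in rows]


rows, stringmaxlength = to_char_rows(["cold", "warm", "stationary"])
recovered = from_char_rows(rows)
```

The two-dimensional char form and the one-dimensional string form carry the same labels; only the storage layout differs.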

However, you would like string front_type to be a dimension coordinate variable (which doesn't have to be named in the coordinates attribute). For me, that means that the ordering is inherently meaningful, which in turn means we have to define a collating sequence as a CF standard, and I'm nervous of doing that, because it would be unreliable. I would guess that there is no inherent meaning to the order of a list of front types. Is that right? In a dimension coordinate variable, the values also have to be unique, but that's not difficult to enforce consistently.

If it's an auxiliary coordinate variable, then giving it a different name from its dimension is no more problematic than for any other 1D auxiliary coordinate variable. You remark that it's not really a problem. The choice is whether alleviating this annoyance is worth the cost of the possible confusion of having something that looks like a dimension coordinate variable but isn't, which is my objection to it.

Best wishes
Jonathan

@martinjuckes
Contributor Author

martinjuckes commented Mar 12, 2020

@JimBiardCics : for completeness I'd like to note that in addition to Jonathan and myself, Karl has expressed opposition to accepting string valued dimension coordinate variables (repeatedly), and David has expressed reservations.

Do you acknowledge that the current convention, and version 1.0, contain the statement that a coordinate variable "is defined as a numeric data type", and that this would appear to rule out string coordinate variables?

You are right in observing that we are talking past each other .. thank you for acknowledging that. I'm puzzled by your confidence that "anyone given that text" could work out which variables in a file were coordinate variables. My objection, I'd like to remind you, was that both applications and users need to be able to identify coordinate variables. Can I interpret the fact that you have not addressed the question about applications as an admission that your definition cannot be converted into a logical algorithm which can be run in applications? A tutorial is not good enough, we need a coherent set of logical rules.

@ethanrd
Member

ethanrd commented Mar 12, 2020

Hi all - While I agree that CF intends to only recognize numeric-valued coordinate variables, I think the reference to the NUG definition makes the CF statement somewhat ambiguous. (A careful reading of various sections helps make the intent clear. However, not everyone reads the specification text that carefully. I, for one, did not realize this restriction was intended and I’ve read many parts of the CF spec pretty carefully but never before, it turns out, while questioning my NUG coordinate variable assumptions.)

A bit of a side-note, or FYI:
We at Unidata are working to make the NUG a document that better supports community engagement. We are currently working to move it to its own GitHub repo to better support community input and discussions.

@JimBiardCics
Contributor

@JonathanGregory If "string front_type(front_type)" could be considered a valid auxiliary coordinate variable, I'd be fine with that. In fact, I think that is a great approach.

The message I have gotten repeatedly from this discussion is that there were strong feelings that this should not be allowed. I agree that, whatever flaws there may be in the current language, CF currently disallows non-numeric dimension coordinate variables.

As far as it goes, I don't think we should define a collating order if we allowed string dimension coordinate variables. In fact, I think that would be the wrong way to approach it.

@JimBiardCics
Contributor

@martinjuckes wrote:

Do you acknowledge that the current convention, and version 1.0, contain the statement that a coordinate variable "is defined as a numeric data type", and that this would appear to rule out string coordinate variables?

Yes, absolutely. The language in versions of CF developed before the string datatype was available did not envision string dimension coordinate variables and did exclude char dimension coordinate variables.

Can I interpret the fact that you have not addressed the question about applications as an admission that your definition cannot be converted into a logical algorithm which can be run in applications?

Not at all. In fact, I claim the exact opposite. I claim that the text as I wrote it is easily used by people to write software to create coordinate variables in netCDF files and software to read netCDF files and find coordinate variables.

@zklaus

zklaus commented Mar 13, 2020

Speaking as someone that has been trying to make sense of very diverse CF files with nothing but the CF-Convention in my hand, I have to say the fact that dimension coordinates can be identified by name and dimension being the same is a good thing.

It is very hard to correctly identify, for example, auxiliary coordinates and cell_measures because this status can not be inferred from the variables themselves, but only from analyzing all relevant variables. This is possible in an ad-hoc fashion, but hard to implement in a parser. It becomes harder when "all relevant variables" might be spread over several files or exist only in an object storage or similar.

Generally, the convention does a good job of telling people with data how to put this into netcdf files. It is far more difficult to work with in the other direction.
Keeping dimensional variables easily identifiable is a good step in that direction and so I personally strongly support forbidding string x(x).

In fact, I would like to see CF move in a direction where it becomes easier to identify the character of all variables, but that is a discussion for another day.
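
The asymmetry described above can be sketched in Python. The dictionary layout and variable names are hypothetical, and the data type check is left out for brevity: a dimension coordinate is recognised from the variable alone, while auxiliary coordinates are only found by scanning the `coordinates` attribute of every other variable.

```python
def dimension_coordinates(variables):
    """Decided per variable: the name equals its single dimension."""
    return {name for name, v in variables.items() if v["dims"] == (name,)}


def auxiliary_coordinates(variables):
    """Decided globally: collect names listed in any 'coordinates' attribute."""
    named = set()
    for v in variables.values():
        named.update(v.get("attrs", {}).get("coordinates", "").split())
    return {n for n in named if n in variables and variables[n]["dims"] != (n,)}


# Hypothetical file contents.
variables = {
    "time": {"dims": ("time",), "attrs": {}},
    "front_type_name": {"dims": ("front_type",), "attrs": {}},
    "frequency": {
        "dims": ("time", "front_type"),
        "attrs": {"coordinates": "front_type_name"},
    },
}

dim_coords = dimension_coordinates(variables)
aux_coords = auxiliary_coordinates(variables)
```

If the variables were spread over several files, the second function would need all of them loaded before it could give an answer, while the first would not.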

@davidhassell
Contributor

As far as I can tell this issue has no moderator as yet. I would be happy to take this on, if everyone else is OK with that. I will try to collate a summary of the points raised, sometime (hopefully early) next week.

Thanks, David

@martinjuckes
Contributor Author

Thanks David, I believe that would be very helpful. We have agreed to change it from a defect to an enhancement, but I don't appear to have the permission needed to effect that change, so please make that switch if you can.

@davidhassell davidhassell added enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format and removed defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors labels Mar 13, 2020
@taylor13

Yes, thanks for moderating, David.

@JonathanGregory
Contributor

I agree with @zklaus, meaning I haven't changed my mind. @JimBiardCics wrote "current CF understanding requires me to construct an auxiliary coordinate variable string front_type_name(front_type). It's not a problem, per se, but it is an example of a system where the string x(x) form would be a natural fit." We all appreciate the sentiment, but since the status quo is "not a problem, per se", I continue to think that we should leave things as they are. Although string x(x) might seem like a natural fit, it has the serious disadvantage of resembling a dimension coordinate variable, which it can't be (and Jim doesn't intend it to be).

@martinjuckes
Contributor Author

@davidhassell : this issue is still in need of a moderator -- is your offer from March 2020 still open? As far as I'm aware, the issue is still unresolved. I've revised the top comment to note an additional problem with a broken link in section 1.3 of the convention.

@davidhassell
Contributor

Hi @martinjuckes,

Yes, of course! I notice that my offer came just before everything changed last year, so I guess it got lost in the noise. I shall remind myself of the discussion thus far and post a summary.

Thanks,
David

@JonathanGregory
Contributor

Dear @davidhassell

Are you willing to summarise this issue?

Best wishes

Jonathan

@davidhassell
Contributor

Thanks Jonathan, I am indeed (finally!). I shall do it tomorrow.

@davidhassell
Contributor

davidhassell commented Jun 21, 2024

Hello,

Here is my summary of this issue. The arguments presented are quite subtle at times, and I would recommend re-reading the discussion yourself if you want to pick up on this, rather than relying on this very compressed representation. However, it will hopefully act as a good reminder or introduction to the topic.

The issue is about clarifying the definitions of CF-netCDF coordinate variables and auxiliary coordinate variables.

The CF conventions in section 1.3 terminology say

coordinate variable

We use this term precisely as it is defined in the NUG section on coordinate variables. It is a one-dimensional variable with the same name as its dimension [e.g., time(time)], and it is defined as a numeric data type with values in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables.

but the word "precisely" in the above is problematic, because the NUG definition also allows variables of the form char x(x, n_char) and string x(x) to be coordinate variables, but CF does not, as it limits its coordinate variables to be of a numeric data type.

The discussion, I think, coalesced into two basic questions:

  1. Should we update the CF definition of a coordinate variable to be exactly like the NUG, and allow 1-d string-valued and 2-d char-valued variables with the same name as one of their dimensions?
  2. If the answer to 1. is "no", then should we disallow variables of the form char x(x, n_char) and/or string x(x), where "disallow" means that the CF checker would raise an error if such a variable is found in a file.

I would say that the majority of support was for:

  1. No.
  2. We should (continue to) allow variables of the form char x(x, n_char), but should disallow string x(x), as the latter could be too easily confused by software applications relying on the variable and dimension being the same.
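
A minimal Python sketch of how a checker might apply the majority position above. The function name, the returned labels, and the set of numeric type names are assumptions for illustration, and the monotonicity and missing-value checks are omitted.

```python
# netCDF numeric type names assumed for this sketch.
NUMERIC = frozenset({
    "byte", "ubyte", "short", "ushort", "int", "uint",
    "int64", "uint64", "float", "double",
})


def classify(name, dtype, dims):
    """Apply the majority position to a variable's name, type and dimensions."""
    dims = tuple(dims)
    if dims == (name,):
        if dtype in NUMERIC:
            return "coordinate variable"
        if dtype == "string":
            return "error"  # string x(x) would be disallowed
    return "allowed"  # includes char x(x, n_char)


results = [
    classify("time", "double", ("time",)),
    classify("x", "string", ("x",)),
    classify("x", "char", ("x", "n_char")),
    classify("x_name", "string", ("x",)),
]
```

Only the one-dimensional string variable sharing its dimension's name is rejected; the char form and a differently named string variable both pass.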

There were plenty of other points raised in the discussion (such as whether or not to have better names for coordinate and auxiliary coordinate variables), but these, I feel, could be pursued elsewhere.

If you think I've missed out something important (quite possible!), please let me know and I'll update the summary.

Many thanks,
David

@JonathanGregory
Contributor

Dear @davidhassell

Thanks for this very useful and clear summary. I have reread the discussion quickly and I think the summary is correct, as well as consistent with memory (as far as memory goes). If we agree with the majority opinion in the early discussion, I think we need to modify the definition in 1.3. I suggest:

A coordinate variable is a one-dimensional variable with the same name as its dimension [e.g., time(time)]. In CF, a coordinate variable must be of a numeric data type (note that NUG does not have this requirement), and consequently CF does not permit a one-dimensional string-valued variable to have the same name as its dimension. The coordinate values must be in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables.

and we need a corresponding prohibition of string NAME(NAME) in the conformance document, as @martinjuckes remarked.

Best wishes

Jonathan

@ethanrd
Member

ethanrd commented Jul 9, 2024

I tripped over the "and consequently" phrase when I first read it. I think it's because the restriction against variables of the form string x(x) seems to me a deliberate decision rather than a natural consequence.

Perhaps simply move that phrase, minus the "consequently", to the end of the suggested text and add a bit of an explanation, something like:

"To avoid some complexity and possible confusion, CF does not permit a one-dimensional string-valued variable to have the same name as its dimension."

@JonathanGregory
Contributor

Dear @ethanrd

Thanks for your suggestion, which would make the text:

A coordinate variable is a one-dimensional variable with the same name as its dimension e.g., time(time). In CF, a coordinate variable must be of a numeric data type (note that NUG does not have this requirement). The coordinate values must be in strict monotonic order (all values are different, and they are arranged in either consistently increasing or consistently decreasing order). Missing values are not allowed in coordinate variables. To avoid confusion with coordinate variables, CF does not permit a one-dimensional string-valued variable to have the same name as its dimension.

Is that OK?

Jonathan

@ethanrd
Member

ethanrd commented Jul 10, 2024

Looks good. Thanks @JonathanGregory

@JonathanGregory
Contributor

I have created PR 531 with these changes. I have put the new requirement in Sect 2.5 "Variables" in the conformance document, with a reference to Sect 1.3, because it didn't seem logical to have a requirement corresponding to "Terminology".

If there are no further objections or comments requiring changes, this can be merged in three weeks from now (31st July).
