Composite Time Series Design document. #1103
base: develop
#. The members of a composite time series define a continuous range
#. The date ranges of a composite time series *MUST* not overlap
#. The date ranges of a composite time series *MUST* have any gaps
#. Data may have gaps, an explanation range should be provided.
Is this user defined? Will it allow for definition of a timeseries that may not have extents that cover that time period (e.g. there's a ~2 month gap between timeseries A that ends at 2014-01-03 12:00 and timeseries B which starts at 2014-03-14 12:00)? What does an explanation range look like (e.g. "no data, start 2014-01-03 12:00, end 2014-03-14 12:00")? Is that assigned automatically if there is a gap in the timeseries?
Ah, I definitely did not describe this clearly enough in the sample data.
A couple of things:
- A time series with missing values in it still counts as complete.
A gap here means that you have an end date for one member and then... and now I've realized that if you don't have a defined interval it's hard to determine this.
Perhaps that should change to SHOULD, since I don't think the system can meaningfully define what a "gap" between members is. Does the next start have to be the smallest time unit after the previous end (e.g. nanoseconds)? If not, what is acceptable?
Here's what I was thinking: how do we handle known gaps in service, such as accidental destruction (2 different SPK/SPN gauges have suffered alcohol-related removals from service)? One site at SPK was removed during most of a year because it had no water and kept suffering vandalism.
So the intent is "there's an oddly large amount of missing data, how do we report that?"
> Here's what I was thinking: how do we handle known gaps in service, such as accidental destruction (2 different SPK/SPN gauges have suffered alcohol-related removals from service)? One site at SPK was removed during most of a year because it had no water and kept suffering vandalism.
> So the intent is "there's an oddly large amount of missing data, how do we report that?"

Well, this is something that doesn't currently exist in CWMS at all, as I realized when developing a system to read in punch tapes. There are notes on the tapes for station maintenance, but I have nowhere to save that information in a meaningful, easy-to-access way.
I'd argue this is out of scope, since this is a useful feature beyond this discussion.
|Duration -\> 0 |Duration of average or total. may change over time with new members, duration will be indicated in the member definition|
+------------------------+------------------------------------------------------------------------------------------------------------------------+
|Version |As Normal CWMS TS ID |
+------------------------+------------------------------------------------------------------------------------------------------------------------+
I like these options, but it would be nice to differentiate between a POR timeseries that includes all best available intervals (e.g. daily inst, 4 hr, 1 hr, 15 minute) and a POR timeseries that includes the best available on a daily interval (e.g. 8 am inst or daily avg). MVP's merged TS denotes these as ~15Minutes and ~1Day but maybe there's a better way. I'm thinking some way like the USGS makes it easy to pick instantaneous value data vs. daily data.
That seems like the type of information to put in the version, if desired.
My assumption with the POR time series was that it should be suitable for "this is the full record as we have it" knowing that over time the official record has improved methods of measurement.
So the first few decades could be daily instantaneous, the next decade 12 hours, then 1 hour, then 15 minutes, and maybe things would change to an average or not. But if you go further down in the document you'll see that the returned time series values also include the members with their definitions.
So yes, you could make a composite time series that only included certain intervals and durations, but to the composite system itself it wouldn't care.
That said we could open up the definition to allow the interval and duration to be set, we would then need to decide if that is enforced.
For example:
- if the composite is 1Day, do we limit every member to 1Day. (This applies to Local Regular as well)
- if the composite is ~1Day, do we limit every member to ~1Day or allow others since ~1Day means most likely 1 day but could be different
I'm not opposed, and I don't think that adds too much complexity, but this is another one of those "more feedback from the group would be good" type things.
> That seems like the type of information to put in the version, if desired.
> My assumption with the POR time series was that it should be suitable for "this is the full record as we have it" knowing that over time the official record has improved methods of measurement.

This is completely different from my understanding of what we were going to have for POR. Honestly, I can't think of any use case where having everything jumbled into a single time series is remotely useful. As it is, I have a hard time wanting to classify readings from two different sensor types (e.g. bubbler vs shaft-encoder) into a single POR. It's not the same data. Yes, it represents the same real-world measurement, but how useful is it to have them together? You can't run any worthwhile scientific/mathematical analysis on the data, since different sensors respond in different ways and can throw off expectations.
Imagine training a ML model off bubbler recorded data, then trying to have it work with shaft-encoder data. It won't return expected results.
Also, what if we're actively recording the same measurement with two different sensors? Which do we put into the POR?
* Get, through existing TimeSeries classes.

Does it make sense to support writing directly to a composite time series. While the write of each element *could* be sent to the underlying member, this seems ripe for error when editing or updating any data. It is likely that any edits would always be to the most recent time series, and configured in some other system.
I agree that seems like a bad idea. It's probably better to write to the underlying timeseries.
+----------------------+------------------------------------------------------------------------------------------------------------------------+
|Duration -\> var |Duration of average or total may change over time with new members, duration will be indicated in the member definition |
+----------------------+------------------------------------------------------------------------------------------------------------------------+
|Version |As Normal CWMS TS ID |
Could a common version name denote it as authoritative? Or does just the existence of a composite timeseries imply that?
That's a possibility. I had put in a placeholder "is-authoritative" flag in the composite definition.
Though I agree the version is technically a good place for that, it does seem to get a bit... overused at times.
There are certainly arguments to be made in either case, so we'll wait for commentary from others to tip the scales.
Good summary of the design -- thanks for putting that together. Noted a few points of clarity / contention.
Both names have been discarded. We use "Virtual" in too many other places with a more direct meaning of that word.
For Period-of-Record, while that is the primary use-case, the concept is useful in other situations as well.

Hence generically have have a "composite time series"
I do think having "virtual" somewhere in the name is somewhat helpful in immediately indicating to the user that the time series doesn't actually exist. If I saw "composite time series" without any additional context my first thought would probably be a merged copy of other time series data.
Possibly "virtual composite time series" or "composite virtual time series"? Although there may be some advantage to keeping the name more succinct as in your recommendation. It will likely be in standard enough usage that users will catch on quickly to whatever the terminology is, so I'm not too worried either way.
I disagree. Adding both composite and virtual would be redundant.
Additionally, we use virtual in the context of Location and Location levels to mean something calculated that doesn't physically exist somewhere (e.g. a "virtual" gauge between two others on a river).
Whereas this really is just a composite. It is made of other time series that physically measure something.
@ktarbet came up with the Composite name, he may be able to add more to the argument.
Though as with other things, if enough users say I'm wrong and virtual makes more sense I'll accept the group think.
But I am a little worried about everyone eventually wanting a virtual time series that's more akin to the virtual location levels and then what do we do?
#. The definition of the composite time series is stored within the CWMS database
#. The members of a composite time series define a continuous range
#. The date ranges of a composite time series *MUST* not overlap
#. The date ranges of a composite time series *MUST* have any gaps
Is this missing a "not"? If so, I think it's reasonable to allow for gaps in the data. For example, if a gage gets washed out and isn't replaced for a month (and the new gage uses a different interval) then there simply wouldn't be data for the missing month. The user could theoretically extend the range of preceding or following record to include the gap, but that wouldn't feel very intuitive to me. I would expect that the interpreter could simply return gaps as missing data?
You are correct, the NOT is missing.
I wasn't thinking about the processing returning the missings, but that is a good idea, and it would require the gap to be defined... and probably some other information. So it may just be easier to put the information into the time series itself (e.g. in notes) and just let the missings be returned.
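To make the overlap and gap axioms discussed in this thread concrete, here is a minimal sketch of how member date ranges could be checked. The `Member` class and `find_overlaps_and_gaps` function are hypothetical names for illustration only, not part of CWMS or this PR, and the ranges are assumed to be start-inclusive and end-exclusive. The sample dates are the ones from the question earlier in this thread; the start of A and end of B are made up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for one member of a composite definition."""
    ts_id: str
    start: datetime  # assumed inclusive
    end: datetime    # assumed exclusive


def find_overlaps_and_gaps(members):
    """Return (overlaps, gaps) for a list of members.

    overlaps: (earlier_id, later_id) pairs where the later member starts
              before the furthest-reaching earlier member has ended.
    gaps:     (gap_start, gap_end) spans not covered by any member.
    """
    ordered = sorted(members, key=lambda m: m.start)
    overlaps, gaps = [], []
    if not ordered:
        return overlaps, gaps
    holder = ordered[0]  # member whose range currently extends furthest
    for nxt in ordered[1:]:
        if nxt.start < holder.end:
            overlaps.append((holder.ts_id, nxt.ts_id))
        elif nxt.start > holder.end:
            gaps.append((holder.end, nxt.start))
        if nxt.end > holder.end:
            holder = nxt
    return overlaps, gaps


# Timeseries A ends 2014-01-03 12:00 and B starts 2014-03-14 12:00.
members = [
    Member("A", datetime(2010, 1, 1), datetime(2014, 1, 3, 12)),
    Member("B", datetime(2014, 3, 14, 12), datetime(2020, 1, 1)),
]
overlaps, gaps = find_overlaps_and_gaps(members)
```

With a check like this, a gap could be reported as a warning (the SHOULD reading) while an overlap is rejected outright (the MUST reading); which of those the system enforces is still the open question in this thread.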
+----------------------+------------------------------------------------------------------------------------------------------------------------+

Option 2
Option 2 makes more sense to me. I'm unsure how Option 1 would deal with potentially varying parameter types among a set of composite data.
I'm a little curious about the implications of doing something like `<Location Id>.<Parameter>.Var.0.0.Composite`. That would lose the flexibility of using the version as a sort of unique identifier, but I can't really think of a use case for needing multiple composites for one location/parameter off of the top of my head. Using Option 2's style I'm not sure what I would use for the version besides "Composite" or "POR" -- I think it's probably somewhat rare to have a consistent source for a full set of POR data.
I think @msweier made a fairly decent case for why you might have more than one composite for a location+measure. It does make sense to me to have a "single authoritative" time series followed by "all data with interval X". Really depends on exactly what you're doing with the data.
As I commented elsewhere, I say make it work like aliases, so if you fetch a composite, it checks the composite list first, if not found there, then regular timeseries, or something like that.
Then you can have it arbitrarily named, such as `<Location Id>.<Parameter>.Var.0.0.Composite` or `<Location Id>.<Parameter>.Var.0.0.Composite;Lowess-SPK`
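A minimal sketch of the alias-style lookup suggested in this comment: check the composite definitions first, then fall back to regular time series. The `resolve_ts` function and the two dictionaries are hypothetical stand-ins for whatever storage the real system would use, and the sample names are invented for illustration.

```python
def resolve_ts(name, composites, regular):
    """Look a name up in the composite list first, then in regular time series.

    Returns a (kind, definition) tuple; raises KeyError if the name is unknown.
    Both lookups are plain dicts here, an assumption made for illustration.
    """
    if name in composites:
        return ("composite", composites[name])
    if name in regular:
        return ("regular", regular[name])
    raise KeyError(f"no time series named {name!r}")


# Hypothetical sample data; the names are arbitrary, as the comment suggests.
composites = {"Site.Stage.Var.0.0.Composite": ["member-1", "member-2"]}
regular = {"Site.Stage.Inst.1Hour.0.Raw": "stored values"}

kind, definition = resolve_ts("Site.Stage.Var.0.0.Composite", composites, regular)
```

Because the composite list is consulted first, a composite could keep the name of a time series it replaces, which is the behavior this comment is asking for.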
{
"office": "<string>",
"name": "<ts id name>",
"is-period-of-record": true, // or is authoritative. to distinguish between other possible use-cases?
I'm a little unclear on the relationship between the "composite time series" and the "authoritative time series". It makes sense that the authoritative TS will generally be a composite TS, but will the functionality for both be handled within this one construct? What would be the implications of setting "is-period-of-record" to true here? I feel like this would be more appropriate as "isAuthoritative" based on my assumptions.
Under the CMA paradigm a separate construct exists to link the "authoritative" TS with the parameter for a location. It seems like this will handle that automatically? e.g. if I have a Buckhorn.Flow-Outflow composite record with this set to true, this will automatically be returned when a user requests Flow-Outflow data for Buckhorn (also, is there another endpoint created for this)?
Just to clarify -- I don't have strong feelings either way, just trying to understand the intent.
I think `is-authoritative` is probably better, especially after the ideas @msweier provided in his responses.
Though the idea behind this flag is that it makes it easier to decide what should be rendered. And now that I think about it, I suspect A2W would always want to render the "authoritative", so it makes sense to actually use that language.
"start": "start date of this", | ||
"end": "end date of this range", | ||
"notes": "text", | ||
"version", "version date", // maybe not? could just use POR or period-of-record in the ts id version |
Not sure what this would be for -- seems like the version for an individual TSID member would just be the version of that TSID?
This is exactly why I'm wary of using "virtual" anywhere in the naming.
This is the "date based version" that some districts use, not the last portion of the time series id. Though I really should've called that field "version-date" instead of version.
My thought behind this is that we technically have two rather important periods-of-record that may exist:
1. The data as it was actually used to make a decision, which may either be not corrected, corrected in real-time, etc.
2. The data as corrected, say, using more detailed analysis or improved knowledge - e.g. in the moment you didn't know a tree had fallen in the river or something.

(1) can certainly qualify as an official "record" in the full sense of that legal term.
(2) seems to be what a student doing a research project would actually want.
Well, and depending on what the student is doing they might want both.
@MikeNeilson that's an interesting take too. I was thinking most people would be interested in the best available which would include the data from your second point as well as recent realtime data. But would you differentiate the two? The USGS uses "Approved" and "Preliminary" qualifiers, but our qualifiers don't quite translate as well. There's a discourse on the qualifier discussion. https://discourse.hecdev.net/t/best-practices-of-cwms-data-qualifier-codes/3805
"start-inclusive": true, | ||
"end-inclusive": false | ||
// suggest default of "start-inclusive": true, "end-inclusive": false | ||
// it may also make sense to just make this *always* the above and not let the user set it. |
I think just choosing a standard for this would suffice -- seems like a toggle would over-complicate things
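A small sketch of why a fixed start-inclusive, end-exclusive convention is enough: with half-open ranges, a timestamp that sits exactly on the boundary between two members belongs to exactly one of them. The `Member` class, `member_for` function, and sample ids are hypothetical and only illustrate the convention suggested in the diff.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Member:
    """Hypothetical member entry; mirrors the "start"/"end" fields in the diff."""
    ts_id: str
    start: datetime  # inclusive
    end: datetime    # exclusive


def member_for(timestamp: datetime, members: Sequence[Member]) -> Optional[Member]:
    """Return the member whose half-open range [start, end) holds the timestamp."""
    for m in members:
        if m.start <= timestamp < m.end:
            return m
    return None  # the timestamp falls in a gap or outside the record


boundary = datetime(2014, 1, 1)
members = [
    Member("old-member", datetime(1950, 1, 1), boundary),
    Member("new-member", boundary, datetime(2030, 1, 1)),
]
owner = member_for(boundary, members)
```

Here the boundary timestamp resolves to `new-member` only, so adjacent members can share an end/start value without violating the no-overlap axiom.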
* Get, through existing TimeSeries classes.

Does it make sense to support writing directly to a composite time series. While the write of each element *could* be sent to the underlying member, this seems ripe for error when editing or updating any data. It is likely that any edits would always be to the most recent time series, and configured in some other system.
I'm against allowing writes to the composite time series. That sounds like a recipe for disaster (and likely a pain to implement while handling edge cases). Writes should, in my opinion, be done to the underlying time series.
As the composite time series is comprised of multiple other time series should this always be an error to specify?
The marker for always latest or always first may make sense to allow, however, at the time series is supposed to be authoritative, that would add ambiguity.
I can't personally think of any realistic use cases for creating a composite of versioned time series data. Maybe not worth supporting unless someone can give an example?
See my comment on this (below?).
I say this to anyone: if you think a few things need clarity, use the "request changes" feature of the pull-request if you have the authority to do so. If not (e.g. the option just doesn't exist), go ahead and make it blindingly obvious (bold text, big red letters like in The Hitchhiker's Guide to the Galaxy, etc.) that something should change first. Especially in documents like this that will define things for years to come.
Time Series Catalog
-------------------

Time Series Catalog should show composite time series and allow searching by "authoritative"
Just FYI: we'd have to teach CWMS-Vue a new Interval definition or a new Parameter Type in order for these to show up properly. I know you called out the inevitable CWMS-Vue updates elsewhere, but I wanted to be explicit.
Hence generically have have a "composite time series"

Axioms
Is there any way to add a comment or remark to a timeseries (e.g. staff gage readings, gate computations, tailwater rating etc...) or to the composite timeseries itself? If not this could be separately managed in a CLOB, but it would be neat if you could optionally comment on the timeseries as you added them.
I'm expecting that when we add the more direct support for extracting the text timeseries along with a value time series that the time series "notes" will just come along.
Alternatively one can also just take the member time series and go retrieve that (definitely not ideal though.)
As for the 2nd part of that. There is a "notes" field for each member.
+----------------------+------------------------------------------------------------------------------------------------------------------------+
|Parameter Type |As Normal CWMS TS ID, Instantaneous, average, total, etc |
+----------------------+------------------------------------------------------------------------------------------------------------------------+
|Interval -\> Composite| Marker that this time series does not have a fix information and is build of various member time series. |
I'm of the opinion that:
- duration MUST be the same for all composite members. If you change the window size of an averaging function, then you are producing different data. It's not the same as simply swapping out a sensor.
- interval MUST be the same for all composite members. Trying to retrieve a composite time series where you have no idea what you're going to get is a nightmare to code for. If you want to combine different intervals, create a computation to create the requisite interval data from other intervals.
Tying back into an earlier section, that means the composite doesn't have to be irregular, and could potentially be regular, lrts, or prts depending on the sources.
If all timeseries in a composite are hourly regular, for example, empty values could be generated to fill in the gaps.
Now, that leads to the section I highlighted:
I say make the naming somewhat arbitrary, like it is now. Allow it to operate like an alias, so the user creates the name they want the timeseries to be labeled with, then says "this is a composite".
For example, that would allow the user to just specify .Composite in the version, and prevent confusion by overloading the other parameters.
<Location Id>.<Parameter>.<Parameter Type>.<Interval>.<Duration>.Composite
If that's not feasible, then perhaps adding to the interval, like lrts did:
Something like 1HourComp:
<Location Id>.<Parameter>.<Parameter Type>.1HourComp.<Duration>.<Version>
Otherwise, how would you specify composite data for different intervals and types, and keep them separate?
At SPK we have period-of-record data like this:
New Bullards Bar.Elev.Inst.1Hour.0.POR
New Bullards Bar.Elev.Inst.~1Day.0.POR
Currently that's a separate timeseries with duplicate data. But it doesn't have to be.
They could be changed internally to be composites, then maintain the same names, and everything else matches the rest of the system (parameter type, interval, duration).
Otherwise, you end up with something like:
New Bullards Bar.Elev.Inst.Composite.0.POR - What is that? Hourly, daily? What if I only want daily data?
New Bullards Bar.Elev.Composite.~1Day.0.POR - Is that averaged data, or instantaneous?
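To compare the two naming proposals in this comment, here is a rough sketch that splits a TS ID into its six dot-separated parts and reports where a composite marker appears. `TsId`, `parse_ts_id`, and `composite_marker` are hypothetical helpers written for this illustration, not existing CWMS functions, and the `Comp` interval suffix is only the idea floated above.

```python
from typing import NamedTuple, Optional


class TsId(NamedTuple):
    location: str
    parameter: str
    parameter_type: str
    interval: str
    duration: str
    version: str


def parse_ts_id(ts_id: str) -> TsId:
    """Split a TS ID of the form Location.Parameter.Type.Interval.Duration.Version."""
    parts = ts_id.split(".")
    if len(parts) != 6:
        raise ValueError(f"expected 6 dot-separated parts, got {len(parts)}: {ts_id!r}")
    return TsId(*parts)


def composite_marker(ts: TsId) -> Optional[str]:
    """Report which proposal (if any) marks this ID as a composite."""
    if ts.version == "Composite":      # proposal: ".Composite" as the version
        return "version"
    if ts.interval.endswith("Comp"):   # proposal: interval suffix such as 1HourComp
        return "interval"
    return None


por = parse_ts_id("New Bullards Bar.Elev.Inst.1Hour.0.POR")
```

Under either proposal the parameter type, interval, and duration stay readable from the name, which is the point being made about `Inst.Composite.0.POR` versus `Inst.1Hour.0.POR`.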
----------------------------

As the composite time series is comprised of multiple other time series should this always be an error to specify?
The marker for always latest or always first may make sense to allow, however, at the time series is supposed to be authoritative, that would add ambiguity.
You have to define what "authoritative" means in order to properly answer this question.
At SPK we have two different types of data: real time, and analyzed/processed data. The analyzed data isn't available in our CWMS system until at least 6 months after the water year ends.
Do you want the data as it was in the system when operational decisions were made, or what the 'true' system state that data represents was?
#. Composite Time Series are Irregular
#. The definition of the composite time series is stored within the CWMS database
#. The members of a composite time series define a continuous range
#. The date ranges of a composite time series *MUST* not overlap
> The members of a composite time series define a continuous range
This could be worked around by using versioning in the underlying implementation. If there's ambiguity (which I 100% guarantee will happen), then pick whichever is the most recent version.
Alternatively, for a simpler implementation, it could use the "archived" flag on the timeseries. If a timeseries is archived, then we can decide on a behavior based on the start-end dates of that timeseries.
Or, let the creator specify priority order.
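The last option, a creator-specified priority order, could look roughly like the sketch below. `PrioritizedMember` and `pick_member` are hypothetical names, the "lower number wins" rule is an assumption made for illustration, and the sample ids are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class PrioritizedMember:
    """Hypothetical member entry with a creator-assigned priority."""
    ts_id: str
    start: datetime  # inclusive
    end: datetime    # exclusive
    priority: int    # assumption: lower number means higher priority


def pick_member(timestamp: datetime,
                members: Sequence[PrioritizedMember]) -> Optional[PrioritizedMember]:
    """Among members covering the timestamp, return the highest-priority one."""
    candidates = [m for m in members if m.start <= timestamp < m.end]
    if not candidates:
        return None
    return min(candidates, key=lambda m: m.priority)


# Two members overlap during 2014; the creator ranks the newer sensor first.
members = [
    PrioritizedMember("old-sensor", datetime(2000, 1, 1), datetime(2015, 1, 1), priority=2),
    PrioritizedMember("new-sensor", datetime(2014, 1, 1), datetime(2030, 1, 1), priority=1),
]
chosen = pick_member(datetime(2014, 6, 1), members)
```

This trades the strict no-overlap axiom for an explicit tie-break, so overlapping definitions stay valid as long as the priorities are unambiguous.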
Initial design work to meet the needs of #956 as previously discussed.