Investigating forecast year bug not caught by jenkins #424
Replies: 10 comments
-
This is a good topic. I do not think that blanketing the problem with more testing
examples will solve it. I think the solution should rely more upon unit testing of
features. I do a lot of this while developing new features: I create a specific
example that allows me to iteratively change inputs relevant to that feature and
examine the results. It's all ad hoc and rather undocumented, except for the
occasional note that I put into the Issue commentary.
A story to illustrate. Right now I am working on development of Mark Maunder's new
natural mortality option, which relies upon having access to the age-maturity
vector. This got me into the need to change the order of operations for maturity
relative to natural mortality. That is a very far-reaching change, so if I broke
something, our current Jenkins test should catch it. So fine. But I also have had to
tweak how and when age-maturity is calculated (previously only in report code), and
I realized that the best place to calculate it is in the age-fecundity routine.
While in the age-fecundity routine, I saw that the code for age-fecundity was
complex because, in integrating length fecundity with the age-length key, it was
invoking the arrays created to dynamically hold the range of lengths for each age.
But we never use that feature (ALK tolerance from the starter file), so I removed
that possibility (for now). I'll put a note in the Issue log, but a change like this
is what creates testing trouble, because one of the examples might be using ALK
tolerance.
So, I will restore the complex code for fecundity; create a new issue to look
specifically at that topic in the future; and continue with a specific testbed to
examine the performance of the new feature.
This leads to an obvious question: which such testbeds need archiving, and how and
where?
-
Thanks, Rick! A while back I was trying to figure out how to harness the power of the one-off tests you create while developing; I put some of them in this Google Drive folder: https://drive.google.com/drive/u/0/folders/1chN-Q1AsJcn99vaag98dkE3TuGvc6k6u. I'm not sure if this is the best or easiest way to archive them, but that is one idea? As far as using those tests, I'm honestly just not sure how to transform them into an automated testing routine. While there are unit tests for features, they aren't unit tests for functions, which makes it a little tricky; I think we would still have to stick with running the full model and a standardized comparison? And your story illustrates how everything can be intertwined in SS... But again, maybe I'm just not thinking creatively enough about how to use these in automated testing...
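For concreteness, here is a minimal sketch of the "run the full model and compare" step described above. It assumes the reference and candidate runs sit in sibling folders, and that r4ss::SS_output() returns a derived_quants table with Label/Value columns (column names have changed between r4ss versions); the folder names and the tolerance are illustrative only.

```r
library(r4ss)

ref <- SS_output("models/simple/reference", verbose = FALSE, printstats = FALSE)
new <- SS_output("models/simple/candidate", verbose = FALSE, printstats = FALSE)

# join the derived quantities from the two runs and compute relative differences
comp <- merge(
  ref$derived_quants[, c("Label", "Value")],
  new$derived_quants[, c("Label", "Value")],
  by = "Label", suffixes = c("_ref", "_new")
)
comp$rel_diff <- abs(comp$Value_new - comp$Value_ref) /
  pmax(abs(comp$Value_ref), 1e-10)

# flag anything that moved by more than 0.01 percent
flagged <- comp[comp$rel_diff > 1e-4, ]
print(flagged[order(-flagged$rel_diff), ])
```

The catch discussed in this thread is exactly the choice of that tolerance: set it tight enough to catch a ~0.1% forecast change and it will also flag harmless numerical noise for manual review.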
-
I should also say, rather than adding more models to the test set, I was thinking of modifying the forecast files of existing test set models so that they better cover the forecasting options. This is sort of in the "unit testing features" mindset, I think...
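A hedged sketch of what that could look like, using r4ss's SS_readforecast()/SS_writeforecast(). The list element names used here ($Forecast, $Nforecastyrs) and the folder paths are assumptions to check against the r4ss documentation, and the modified forecast file would still need the rest of the model files copied alongside it.

```r
library(r4ss)

# read the forecast file from an existing test model
fore <- SS_readforecast("test-models/vermillion_snapper/forecast.ss",
                        verbose = FALSE)

# exercise a different forecast configuration than the original model used
fore$Forecast <- 2        # e.g., switch the forecast option
fore$Nforecastyrs <- 10   # and lengthen the forecast period

# write it into a copy of the model folder so both variants stay in the test set
out_dir <- "test-models/vermillion_snapper_forecast2"
dir.create(out_dir, showWarnings = FALSE)
SS_writeforecast(fore, dir = out_dir, overwrite = TRUE, verbose = FALSE)
```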
-
That folder is good and I should use it; we can improve it once we start using it.
The shortcoming of the current test routine is that it doesn't test the newly
developed feature in the context of all the other features found in the examples,
because those are old examples that have not implemented the new feature. Then once
you implement the new feature, of course the results will change.
-
Ah, good point, I see what you mean now, Rick, about new features not being tested in the automated routine. We could instead put the tests in a repository rather than Google Drive, so that they could be pulled in by an automated testing routine? But the number of models, and probably the associated run times, would increase pretty quickly... This seems like a good path to explore, though. We basically would get these tests "for free" since you are already developing them.
-
Free, but huge over time, so slow to run, and hard to maintain if some future change
requires a mandatory input change for all models.
How about a collection of test routines: one to run through all the growth options,
one to run through all the selectivity options, etc.? Then testing could be directed
towards the feature that was most impacted. This would complement, but not replace,
a standard test set.
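As a rough illustration of how such feature-directed collections could be wired up (the folder layout, the "ss3" executable name, the Unix-style shell call, and the success check are all assumptions, not an existing convention):

```r
# map each feature area to its own set of small testbed models
feature_testbeds <- list(
  growth      = list.dirs("testbeds/growth", recursive = FALSE),
  selectivity = list.dirs("testbeds/selectivity", recursive = FALSE),
  forecast    = list.dirs("testbeds/forecast", recursive = FALSE)
)

run_testbed_set <- function(dirs, exe = "ss3") {
  for (d in dirs) {
    # run without the Hessian to keep the per-feature checks fast
    system(paste("cd", shQuote(d), "&&", exe, "-nohess"))
    # minimal success check: the model produced a report file
    stopifnot(file.exists(file.path(d, "Report.sso")))
  }
}

# a selectivity change only needs to exercise the selectivity collection
run_testbed_set(feature_testbeds$selectivity)
```

The standard test set would still run on every change; these collections would only be pulled in when the corresponding feature code is touched.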
-
Rick, sorry to hear about the complexities you're discovering with the new M option, but it's good that they are helping us think about ways of making this process better. Now that Kathryn has improved the r4ss functions for reading and writing model input files, it would be relatively straightforward for any new feature to go through our set of test models and turn on that feature in all of them. However, this would just be a minimal test of whether the model could run, because any new feature (like the new M option) which had an impact on the model results would cause lots of differences from our reference output values, and there's no easy way to confirm that those differences were intended.

The mention of the "ALK tolerance" option brings up another worthwhile topic (which could be discussed separately if that would be better). As we've discussed before, the development of SS hasn't had a research track separate from the production track. If it had, the ALK tolerance could have been explored in a research context to confirm that the extra complexity had enough payoff to be worth including. Rick, I'm sure you did enough testing at the time you added that option to decide that it was worth including, but of course you couldn't foresee the situation you're now facing. I couldn't track down any email discussion that we surely had about the feature when it was first added, but I do remember creating the attached r4ss figure, which has been useful for other purposes. However, that figure shows long tails in spite of a 0.001 tolerance added to the "Simple" example model, and the AGE_LENGTH_KEY is indeed showing values well below 0.001, so maybe the option isn't working right anyway. Therefore, turning off the ALK tolerance option seems like a totally reasonable step at this point to make implementing the new M option easier.
-
Thanks for the reminder, Ian. I do recall that, which is why, when I encountered the code in one place still using arrays relevant to the ALK_tolerance option, I thought I might simplify. The big issue with this new M option is the order of operations for maturity, growth, and M. I think I will separate the maturity code out of the WtLen function to give me more flexibility. Messy stuff could still happen in a multi-season model in which SS needs M in season 1, but that M depends on maturity that is calculated only in spawn_season, which is season 2 of the previous year. Then with settlements, morphs, and platoons with different length-at-age by season, maturity-at-age due to length maturity changes for each of them by season; etc.
-
ss3sim might be helpful here in the research mode: we could have a single set of files for a generic OM, run the model with the feature turned off 100 times, then turn the feature on to an extreme and run it again 100 times. This would basically be the same as using the SS bootstrap capability to generate data sets, except that ss3sim has an easy structure for changing the input files and summarizing the results across replicates between the OM and the EM.
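For reference, a rough sketch of the bootstrap-based equivalent mentioned above (ss3sim packages this kind of workflow up much more cleanly). It assumes a single OM folder whose starter file already requests bootstrap data files, so data.ss_new holds bootstrap sections starting at section 3; an EM folder with its own control/starter/forecast files that expect a data file named data.ss; and an "ss3" executable on the PATH. All folder names and the replicate count are illustrative only.

```r
library(r4ss)

n_reps <- 100
for (i in seq_len(n_reps)) {
  # pull the i-th bootstrap data set out of the OM's data.ss_new
  boot <- SS_readdat("om/data.ss_new", section = i + 2, verbose = FALSE)

  # set up an EM folder for this replicate and drop the bootstrap data into it
  rep_dir <- file.path("em_runs", sprintf("rep%03d", i))
  dir.create(rep_dir, recursive = TRUE, showWarnings = FALSE)
  file.copy(c("em/control.ss", "em/starter.ss", "em/forecast.ss"), rep_dir)
  SS_writedat(boot, file.path(rep_dir, "data.ss"), overwrite = TRUE)

  system(paste("cd", shQuote(rep_dir), "&& ss3 -nohess"))
}
# repeat with the feature switched on in the OM, then compare the two sets of
# estimates across replicates
```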
-
That's a good point, Kelli!
-
Opening a team discussion rather than posting on #130 as this is somewhat of a tangent.
Only the vermillion_snapper model in our testing set used a negative value in the benchmark years; there was indeed a small change in forecasting after fixing the bug (as Ian predicted there would be: #130 (comment)), but it was too small to be flagged by the testing routine (~0.1%).
Not sure what to do about this finding, as I think flagging small changes like this will cause more manual inspection of the results and catch few problems. Perhaps the right approach is to try to expand coverage of SS model features? For example, this would be pretty easy to do for the forecasting period by modifying forecasting files of models already in the test set. Trying to expand coverage in the main years of the model would be trickier.
If anyone has ideas, I would be glad to hear them!