
As a developer, I want to understand why data ingest takes so much longer with an unsorted CSV than with a sorted CSV #21

Open
HankHerr-NOAA opened this issue Jun 28, 2024 · 3 comments

@HankHerr-NOAA (Contributor)

This relates to VLab User Support ticket #131828. The unsorted and sorted CSV files have been uploaded here:

https://drive.google.com/drive/folders/1-mBAjDUNf9COiw0dzly7mJ2aQg2BDSFD

Using a standalone pointing to a database and running on the NWC ised-dev1 machine, the evaluation took 1h 6m to complete using the unsorted data (where time series are written by time first, and then feature). Using the sorted data (where time series are written by feature first, and then time), the evaluation took 2m 21s. Both evaluations were run against a freshly cleaned database. The declaration used with the sorted data is below; to run against the unsorted data, just modify the predicted source accordingly.

Why such a stark difference, roughly a factor of 28? If it points to a code change worth making, this ticket can be resolved once that change is made. Otherwise, it can be resolved once we understand the underlying cause and decide that no change is needed.

Thanks,

Hank

=====================================

label: HEFS Evaluations RSA
observed:
  label: USGS Streamflow Observations
  sources:
  - interface: usgs nwis
    uri: https://nwis.waterservices.usgs.gov/nwis/iv
  variable:
    name: '00060'
  feature_authority: nws lid
  type: observations
predicted:
  label: HEFS RSA Forecast Test
  sources: [omitted]/sorted_ALL_HEFS.tgz
  variable:
    name: QINE
  feature_authority: nws lid
  type: ensemble forecasts
features:
  - {observed: '11335000', predicted: MHBC1}
reference_dates:
  minimum: 2022-12-01T11:00:00Z
  maximum: 2023-03-31T12:00:00Z
valid_dates:
  minimum: 2022-12-01T11:00:00Z
  maximum: 2023-04-09T12:00:00Z
reference_date_pools:
  period: 1
  frequency: 1
  unit: days
lead_times:
  minimum: 0
  maximum: 72
  unit: hours
time_scale:
  function: mean
  period: 24
  unit: hours
values:
  minimum: 0.0
probability_thresholds: 
  values: [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]
  operator: greater
  apply_to: observed
metrics:
  - name: sample size
  - name: mean error
  - name: box plot of errors
  - name: mean square error
  - name: brier skill score
ensemble_average: mean
duration_format: days
output_formats:
  - format: csv2
  - format: png
  - format: pairs
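
For illustration only, below is a minimal sketch of the re-sorting described above: rewriting a time-major CSV into feature-major order so that all events for a given feature form one contiguous block. The column names `location` and `value_date` are assumptions, not the actual HEFS CSV headers; ensemble forecasts would also need the reference date and ensemble member as sort keys.

```python
# Minimal sketch (not WRES code): rewrite a time-major CSV into
# feature-major order, so all events for one feature are contiguous.
# The headers "location" and "value_date" are assumed for
# illustration; substitute the actual HEFS CSV column names.
import csv
from operator import itemgetter

def sort_feature_first(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        header = reader.fieldnames

    # Feature is the primary key, valid time the secondary key.
    # ISO-8601 timestamps sort chronologically as plain strings.
    rows.sort(key=itemgetter("location", "value_date"))

    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows)

sort_feature_first("unsorted_ALL_HEFS.csv", "sorted_ALL_HEFS.csv")
```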
@HankHerr-NOAA (Contributor, Author)

For local NWC access to the data, evaluation declarations, and output, see the directory issue131828 in the standard location.

Hank

@HankHerr-NOAA (Contributor, Author)

My test was run using revision 20240627-b58855f-dev in a repository whose remote had just been changed to GitHub.

Hank

@james-d-brown (Collaborator)

(The underlying reason is ingest of one continuous time series vs. a very large number of very small, one-event time series. But there is a question beneath that, concerning why this difference in topology makes such a big difference to ingest time: there is some kind of ingest contention, probably related to source locking, but TBD.)
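
As a toy illustration of why that topology could matter, suppose each distinct time series pays a fixed per-series cost during ingest (e.g., acquiring a source lock and committing a transaction) on top of a small per-event cost. The numbers below are invented for the sketch and are not measurements of WRES:

```python
# Toy cost model (illustrative assumptions, not WRES internals):
# every ingested time series pays a fixed overhead, e.g., for
# source locking and a transaction commit, plus a per-event cost.
PER_SERIES_OVERHEAD_MS = 50.0  # assumed lock/commit cost per series
PER_EVENT_COST_MS = 0.1        # assumed cost per ingested event

def ingest_cost_ms(n_series: int, events_per_series: int) -> float:
    per_series = PER_SERIES_OVERHEAD_MS + events_per_series * PER_EVENT_COST_MS
    return n_series * per_series

n_events = 100_000
one_big = ingest_cost_ms(1, n_events)     # one continuous series
many_small = ingest_cost_ms(n_events, 1)  # one event per series
print(f"1 series x {n_events} events: {one_big / 1000:.1f} s")
print(f"{n_events} series x 1 event: {many_small / 1000:.1f} s")
# With one-event series, the ratio approaches
# PER_SERIES_OVERHEAD_MS / PER_EVENT_COST_MS, i.e., about 500x here.
```

Under a model like this, the observed 28-fold slowdown would imply a more modest per-series overhead; identifying the actual cost (lock contention vs. per-source commit overhead) is the open question above.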
