File metrics fixes #11189

Open
wants to merge 2 commits into
base: develop
from

qqmyers
Member

@qqmyers qqmyers commented Jan 24, 2025

As reported in a community email by LauraH@DANS, the total from the /api/info/metrics/files/monthly endpoint doesn't match the output from the /api/info/metrics/files endpoint.

Investigating the queries in the code, it appears that the /monthly endpoint was over-counting files by counting a file once per released dataset version (with counts in the month of the release) IF the versions were published more than a month apart.

Conversely, it also appears that the /files and /toMonth endpoints were under-counting files: they did not count files that exist in a published dataset version but not in the latest published version.

This PR updates the queries to avoid these issues.
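
A toy illustration of the two miscounts described above, written as plain Python rather than the actual SQL in MetricsServiceBean.java. The `Version` tuple, the sample data, and the function names are all hypothetical, and "published more than a month apart" is approximated here as "released in different calendar months".

```python
from collections import namedtuple

# Hypothetical toy data: one dataset with three released versions.
Version = namedtuple("Version", ["number", "release_month", "file_ids"])

versions = [
    Version(1, "2024-01", {"a", "b"}),       # a and b first published
    Version(2, "2024-03", {"a", "b", "c"}),  # c added; a and b republished
    Version(3, "2024-06", {"a", "c"}),       # b removed from the latest version
]

def old_monthly_total(versions):
    # Over-count: a file is counted again in every release month it appears in.
    return len({(f, v.release_month) for v in versions for f in v.file_ids})

def old_files_total(versions):
    # Under-count: only files in the latest released version are counted.
    latest = max(versions, key=lambda v: v.number)
    return len(latest.file_ids)

def unique_released_files(versions):
    # Intended count: each file that appears in any released version, once.
    return len(set().union(*(v.file_ids for v in versions)))

print(old_monthly_total(versions))      # 7
print(old_files_total(versions))        # 2
print(unique_released_files(versions))  # 3
```

With this history the monthly sum (7) and the total (2) disagree with each other and with the number of unique released files (3), which is the mismatch reported.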

What this PR does / why we need it:

Which issue(s) this PR closes:

  • Closes #

Special notes for your reviewer: See the comment below if you'd like an AI-assisted analysis of what the monthly query does.

Suggestions on how to test this: Probably the easiest way to test is to start with an empty DB so you can keep a manual count of files. To test all the cases here, you should have a dataset with multiple versions that add files, with versions published more than a month apart (or dates edited appropriately), and some files deleted from the latest published version.

In all cases, the sum from the monthly time series should match the total from the /files API call, and both should total to the number of unique files in all released versions.
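
The invariant above can be sketched on toy data before checking it against a real installation. This is only a model of the intended behavior, not the PR's queries: the `releases` list and both function names are made up for illustration, and months are assumed to be `YYYY-MM` strings so that they sort chronologically.

```python
from collections import Counter

# Hypothetical release history: (release month, file ids in that version).
releases = [
    ("2024-01", {"a", "b"}),
    ("2024-03", {"a", "b", "c"}),
    ("2024-06", {"a", "c", "d"}),  # b removed, d added
]

def monthly_new_files(releases):
    """Count each file once, in the month of the first release containing it."""
    first_seen = {}
    for month, file_ids in sorted(releases, key=lambda r: r[0]):
        for file_id in file_ids:
            # setdefault keeps the earliest month because releases are sorted.
            first_seen.setdefault(file_id, month)
    return Counter(first_seen.values())

def total_unique_files(releases):
    """Count every file that appears in any released version, once."""
    return len({f for _, file_ids in releases for f in file_ids})

series = monthly_new_files(releases)
# The sum of the monthly series must equal the overall total.
assert sum(series.values()) == total_unique_files(releases)
```

Republishing a and b in March adds nothing to the March count, and b still contributes to the total even though it is gone from the latest version.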

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

/monthly to only count when files are first added to a version (and not
add to the count when they are republished in a new version)

/toMonth and / - count files available in prior released versions (but
removed from the latest) as well as all of those in the latest version
(as of the given date)
@qqmyers
Member Author

qqmyers commented Jan 24, 2025

Perhaps useful, and humorous:

In plain English, what does this query do?
AI:
This query is designed to count the number of files in released dataset versions up to a specific month, with optional filtering for a specific dataverse. Here's a breakdown of what the query does in plain English:
1. It starts by selecting all filemetadata entries that are associated with released dataset versions.
2. It uses a subquery to find the most recent released version of each dataset up to the specified month (yyyymm). This is done by:
  • Selecting distinct dataset IDs
  • Joining with the dataset table
  • Joining with the filemetadata table to ensure only datasets with files are counted
  • Filtering for released versions
  • Filtering for datasets released up to the specified month
  • Excluding harvested datasets
  • Ordering by version number to get the latest version
3. If a specific dataverse (d) is provided, it further filters the results to include only datasets owned by that dataverse or its sub-dataverses.
4. The main query then counts all filemetadata entries associated with these selected dataset versions.
In essence, this query is answering the question: "How many files were there in the latest released versions of all datasets (optionally within a specific dataverse) up to a given month?"
The query takes into account:
  • Only released dataset versions
  • The latest version of each dataset up to the specified month
  • Only locally created datasets (not harvested)
  • Optionally, datasets within a specific dataverse hierarchy
  • Only counting files that were present in the latest released version as of the specified month
This provides a count of unique files available in the system (or a specific dataverse) at a given point in time, considering only the latest released versions of datasets.
References (1)
1. MetricsServiceBean.java - 314-330

Are you sure? I thought it counted files that were present in any released version as of the specified month.
AI:
I apologize for the confusion in my previous explanation. You're absolutely right, and I thank you for catching that. Let me provide a more accurate interpretation of the query:
You are correct that this query counts files that were present in any released version as of the specified month, not just the latest version. Here's a revised explanation:
1. The subquery selects all released dataset versions up to the specified month (yyyymm).
2. It includes all released versions, not just the latest one, by the specified month.
3. The main query then counts all filemetadata entries associated with these selected dataset versions.
So, more accurately, this query is answering the question: "How many files were there in all released versions of datasets (optionally within a specific dataverse) up to a given month?"
Key points:
  • It includes files from all released versions, not just the latest.
  • A file could be counted multiple times if it appears in multiple released versions of a dataset.
  • It still only considers released versions and non-harvested datasets.
  • The date filter (releasetime <= yyyymm) applies to the dataset version release time, not the individual file.
Thank you for prompting me to review this. It's an important distinction that the query is counting all files in all released versions, potentially including multiple versions of the same file if it appears in multiple released dataset versions.
References (1)
1. MetricsServiceBean.java - 314-330

That still seems wrong. The distinct clause assures a file is only counted once, regardless of how many released dataset versions it is in.
AI:
You're absolutely right, and I sincerely apologize for my continued misinterpretation. Thank you for your patience and for pointing this out. Let's break down the query correctly:
1. The subquery:
    select DISTINCT ON (datasetversion.dataset_id) datasetversion.id
    from datasetversion
    ...
    order by datasetversion.dataset_id, datasetversion.versionnumber desc, datasetversion.minorversionnumber
This is indeed selecting only one datasetversion per dataset, specifically the latest released version up to the specified month.
2. The DISTINCT ON clause ensures that only one row per dataset_id is returned, and the ORDER BY clause ensures it's the latest version.
3. The main query then counts the files in these selected latest versions.
So, the correct interpretation is:
This query counts the number of unique files in the latest released version of each dataset up to the specified month (yyyymm). 
More specifically:
  • It selects the latest released version of each dataset up to the given month.
  • It then counts the files in these latest versions.
  • Each file is counted only once, even if it appears in multiple versions of a dataset.
  • It excludes harvested datasets.
  • If a dataverse is specified, it only includes datasets within that dataverse hierarchy.
Thank you again for your persistence. This query is indeed carefully constructed to provide an accurate count of unique files in the latest released versions as of the specified month, without double-counting files that appear in multiple versions.
References (1)
1. MetricsServiceBean.java - 314-330

You're so close. Doesn't the 'and filemetadata.datafile_id=fm.datafile_id' clause in the subquery mean that we find the latest version the file is in, which may not be the latest overall version?
AI:
You're absolutely right, and I sincerely thank you for your persistence. This is a crucial detail I overlooked, and it significantly changes the interpretation of the query. Let's correct the explanation:
The clause and filemetadata.datafile_id=fm.datafile_id in the subquery is indeed key to understanding what this query actually does. You're correct in pointing out that this means we're finding the latest version for each file, not necessarily the latest overall version of the dataset.
Here's the correct interpretation:
1. The subquery selects the latest released version for each unique file (datafile_id) up to the specified month, not the latest version of each dataset.
2. The DISTINCT ON clause, combined with the ORDER BY, ensures we get the latest version for each file, but different files from the same dataset might come from different versions.
3. The main query then counts these file entries.
So, more accurately, this query answers the question:
"How many unique files were available in any released version up to the given month, counting each file in its latest released version?"
Key points:
  • Each unique file is counted once.
  • The version selected for each file is the latest one in which that specific file appears, up to the specified month.
  • This might not be the latest overall version of the dataset if the file was removed or modified in a later version.
  • It still excludes harvested datasets and can be filtered by dataverse.
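
For readers who want to see that last interpretation concretely, here is a rough Python emulation of "pick the latest released version per file, as of a month". It is a simplification, not a translation of the real query: the `rows` data and the function name are hypothetical, only a single integer version number is modeled, and the harvested and dataverse filters are left out.

```python
# Hypothetical rows: (datafile_id, version_number, release_month), one per
# filemetadata entry in a released dataset version.
rows = [
    ("a", 1, "2024-01"), ("b", 1, "2024-01"),
    ("a", 2, "2024-03"), ("b", 2, "2024-03"), ("c", 2, "2024-03"),
    ("a", 3, "2024-06"), ("c", 3, "2024-06"),  # b is gone from version 3
]

def latest_version_per_file(rows, as_of_month):
    """Keep the first row per file after sorting by version, newest first."""
    eligible = [r for r in rows if r[2] <= as_of_month]
    picked = {}
    for file_id, version, _ in sorted(eligible, key=lambda r: (r[0], -r[1])):
        # Only the first (highest-version) row for each file is kept.
        picked.setdefault(file_id, version)
    return picked

as_of_june = latest_version_per_file(rows, "2024-06")
print(as_of_june)       # {'a': 3, 'b': 2, 'c': 3}
print(len(as_of_june))  # 3 unique files, including the removed file b
```

File b is picked from version 2 while a and c come from version 3, which is the "different files from the same dataset might come from different versions" case.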

@qqmyers qqmyers added the Size: 10 A percentage of a sprint. 7 hours. label Jan 24, 2025
@qqmyers qqmyers marked this pull request as ready for review January 24, 2025 21:35
Labels
Size: 10 A percentage of a sprint. 7 hours.
Projects
Status: Ready for Triage
1 participant