Expanding dataset validity information in DBS #118

todor-ivanov · 2025-01-22T08:15:16Z

Impact of the new feature
This request aims to improve data (version) management and to protect Open Data during tape deletion campaigns.

Is your feature request related to a problem? Please describe.
Original issue created in Jira: https://its.cern.ch/jira/browse/CMSTRANSF-857

We have multiple versions of valid datasets in the system. It's necessary to differentiate them for efficient disk space management and tape deletions.

Use cases:

Latest processings

For each dataset, we need to know whether it has been superseded by another dataset. This is especially important for MiniAOD and NanoAOD, which may have many different versions. The best way to achieve this is to save the ID/name of the new dataset. For this to work, PdmV needs to supply the list of datasets that the current request supersedes, and the workflow management system needs to handle that information properly.

It's important to keep in mind that some data and MC don't get reprocessed, so we cannot simply rely on campaigns or RegExp patterns.

Open data

Open data needs to be protected separately, as a unique use case.

Describe the solution you'd like

As initially suggested in the Jira ticket, the solution should revolve around providing 2 more flags to DBS for each dataset:

reference to the superseding dataset (string)
open data flag (bool)

And later extended to the following set of 4 flags:

isOpenData (bool),
isLegacyData (bool),
isImportant (bool),
whyImportant (string)

The set of flags and flag data types are still subject to discussion.
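
To make the discussion concrete, the extended set of flags could be modelled roughly as below. This is only a sketch: the class name `DatasetFlags` is hypothetical, and the rule that `whyImportant` must be filled whenever `isImportant` is set is an assumption, not something agreed in the ticket.

```python
from dataclasses import dataclass, asdict


@dataclass
class DatasetFlags:
    """Hypothetical container for the four flags listed above."""
    isOpenData: bool = False
    isLegacyData: bool = False
    isImportant: bool = False
    whyImportant: str = ""

    def __post_init__(self):
        # Assumed rule (not agreed anywhere): an important dataset must say why.
        if self.isImportant and not self.whyImportant.strip():
            raise ValueError("whyImportant is required when isImportant is True")


# Example values are made up for illustration.
flags = DatasetFlags(isOpenData=True, isImportant=True, whyImportant="Open Data release")
print(asdict(flags))
```

If the flags and their data types change during the discussion, only the field list in this sketch would need to follow.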

Describe alternatives you've considered
No alternatives suggested

Additional context
N/A

todor-ivanov · 2025-01-22T08:34:06Z

Just to summarize the solution that we discussed and proposed with @vkuznet here and here:

We are about to:

Create a separate table just for those flags and link it to the relevant datasets in the existing tables
Create the needed DBS APIs for:
- Write: to populate the new information
- Read: to fetch the flags based on dataset name

The information should be populated manually for the time being. Once we see it working we may think of automating the process through the WM system.
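
The plan above (a separate flags table linked to the datasets, plus a Write and a Read API) could look roughly like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the real DBS backend. The table names, column names, the functions `write_flags` and `read_flags`, and the dataset name are all hypothetical, chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,
    dataset    TEXT UNIQUE NOT NULL
);
-- Separate table holding only the flags, linked to the existing datasets.
CREATE TABLE dataset_flags (
    dataset_id     INTEGER PRIMARY KEY REFERENCES datasets(dataset_id),
    is_open_data   INTEGER NOT NULL DEFAULT 0,
    is_legacy_data INTEGER NOT NULL DEFAULT 0,
    is_important   INTEGER NOT NULL DEFAULT 0,
    why_important  TEXT
);
""")


def write_flags(conn, dataset, is_open_data=False, is_legacy_data=False,
                is_important=False, why_important=None):
    """Write API sketch: populate (or overwrite) the flags of one dataset."""
    row = conn.execute(
        "SELECT dataset_id FROM datasets WHERE dataset = ?", (dataset,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown dataset: {dataset}")
    conn.execute(
        "INSERT OR REPLACE INTO dataset_flags VALUES (?, ?, ?, ?, ?)",
        (row[0], int(is_open_data), int(is_legacy_data),
         int(is_important), why_important),
    )
    conn.commit()


def read_flags(conn, dataset):
    """Read API sketch: fetch the flags by dataset name, or None if unset."""
    row = conn.execute(
        "SELECT f.is_open_data, f.is_legacy_data, f.is_important, f.why_important "
        "FROM datasets d JOIN dataset_flags f ON f.dataset_id = d.dataset_id "
        "WHERE d.dataset = ?",
        (dataset,),
    ).fetchone()
    if row is None:
        return None
    return {
        "isOpenData": bool(row[0]),
        "isLegacyData": bool(row[1]),
        "isImportant": bool(row[2]),
        "whyImportant": row[3],
    }


# Made-up dataset name, populated manually as described above.
name = "/ExamplePrimary/ExampleProcessed-v1/MINIAOD"
conn.execute("INSERT INTO datasets (dataset) VALUES (?)", (name,))
write_flags(conn, name, is_open_data=True)
print(read_flags(conn, name))
```

Keeping the flags in their own table means that datasets without any flags simply have no row there, which is why `read_flags` returns `None` in that case.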

todor-ivanov · 2025-01-22T09:08:22Z

Hi Dima,

As was briefly mentioned in the private e-mails exchanged last night, an initial set of flags could indeed help a lot during the next deletion campaign, even before the full functionality is delivered. I am therefore pushing harder to at least create the new database schema and test it in integration. Could you check the set of flags as proposed/summarized in the issue description here and give your opinion:

Would those suffice? And could they be changed later, or must whatever we decide now stay forever (I presume the latter)?

todor-ivanov · 2025-01-29T08:01:17Z

I am quoting Dima from the Jira issue here, just as an update to the current one to keep it in line with the discussions happening in Jira as well.

"""
We had a meeting last week with Todor Trendafilov Ivanov and Hasan Ozturk and we converged on the following additional fields that need to be added to the main table DATASETS:

isOpendata (bool) - potentially any dataset may become public. False by default
isProtected (bool) - special purpose flag to be used by DM to exclude datasets from deletion campaigns. False by default.
expirationDate (int) - will be used together with isProtected. Null by default.
Comment (VARCHAR2(250)) - general purpose field needed for important information especially for invalidation.

We found no reason to create an additional table just to add these 4 extra columns. VARCHAR2 is an efficient string format using only actual string size plus 1 byte for string length.

We have also reviewed the option to add information about superseded datasets, but we concluded that it will be very hard to make it work automatically since there is no mechanism to track relationships between datasets in different campaigns. Therefore, if people want to protect datasets that are not available in other campaigns or have other unique properties, we will use the new isProtected flag.

Valentin Y Kuznetsov, do you see any issue with this proposal?
"""
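
To illustrate how the four extra columns quoted above might be used in a deletion campaign, here is a minimal sketch, again with in-memory SQLite standing in for the real database. Several things in it are assumptions rather than agreed behaviour: the snake_case column names, the interpretation of the expiration date as a Unix timestamp after which the protection lapses (with Null meaning the protection never expires), the rule that open data is always excluded from deletion, and the helper `deletion_candidates` itself.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (dataset_id INTEGER PRIMARY KEY, dataset TEXT UNIQUE NOT NULL)"
)
# Add the four extra columns directly to the main table (hypothetical names).
for ddl in (
    "ALTER TABLE datasets ADD COLUMN is_open_data INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE datasets ADD COLUMN is_protected INTEGER NOT NULL DEFAULT 0",
    "ALTER TABLE datasets ADD COLUMN expiration_date INTEGER DEFAULT NULL",
    "ALTER TABLE datasets ADD COLUMN dataset_comment TEXT",
):
    conn.execute(ddl)


def deletion_candidates(conn, now=None):
    """Return dataset names that a deletion campaign may consider.

    Assumed rules: open data is never a candidate; a protected dataset
    becomes a candidate only once its expiration date has passed.
    """
    if now is None:
        now = int(time.time())
    rows = conn.execute(
        "SELECT dataset FROM datasets "
        "WHERE is_open_data = 0 "
        "AND (is_protected = 0 "
        "     OR (expiration_date IS NOT NULL AND expiration_date <= ?)) "
        "ORDER BY dataset",
        (now,),
    ).fetchall()
    return [r[0] for r in rows]


# Made-up datasets: plain, open data, protected forever, protection lapsed.
conn.executemany(
    "INSERT INTO datasets (dataset, is_open_data, is_protected, expiration_date, dataset_comment) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("/A/plain-v1/MINIAOD", 0, 0, None, None),
        ("/B/open-v1/MINIAOD", 1, 0, None, "released as open data"),
        ("/C/kept-v1/NANOAOD", 0, 1, None, "not available in other campaigns"),
        ("/D/lapsed-v1/NANOAOD", 0, 1, 1000, "temporary protection"),
    ],
)
conn.commit()
print(deletion_candidates(conn, now=2000))
```

With these example rows, only the plain dataset and the one whose protection has lapsed are returned; the open-data dataset and the one protected without an expiration date are skipped.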

todor-ivanov self-assigned this Jan 22, 2025

todor-ivanov added this to WMCore quarterly developments Jan 22, 2025

todor-ivanov moved this to In Progress in WMCore quarterly developments Jan 22, 2025
