Expanding dataset validity information in DBS #118

Open
todor-ivanov opened this issue Jan 22, 2025 · 3 comments
@todor-ivanov
Contributor

Impact of the new feature
This request aims to improve data (version) management and to protect Open Data during tape deletion campaigns.

Is your feature request related to a problem? Please describe.
Original issue created in Jira: https://its.cern.ch/jira/browse/CMSTRANSF-857

We have multiple versions of valid datasets in the system. It's necessary to differentiate them for efficient disk space management and tape deletions.

Use cases:

  • Latest processings

For each dataset, we need to know whether it has been superseded by another dataset. This is especially important for MiniAOD and NanoAOD, which may have many different versions. The best way to achieve that is to store the id/name of the superseding dataset. For this to work, we need PdmV to specify which datasets the current request supersedes, and the workflow management system has to handle that information properly.

Keep in mind that some data and MC do not get reprocessed, so we cannot rely simply on campaigns or RegExp patterns.

  • Open data

Open data needs to be protected separately, as a unique use case.

Describe the solution you'd like

As initially suggested in the Jira ticket, the solution should revolve around providing 2 more flags to DBS for each dataset:

  1. reference to the superseding dataset (string)
  2. open data flag (bool)

This was later extended to the following set of 4 flags:

  1. isOpenData (bool),
  2. isLegacyData (bool),
  3. isImportant (bool),
  4. whyImportant (string)

The set of flags and flag data types are still subject to discussion.
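
To make the four-flag proposal more concrete, here is a minimal Python sketch of how such a record could look on the client side. The class name `DatasetFlags`, the helper `protects_from_deletion`, and both validation rules are assumptions made for illustration only; none of them is part of DBS or of the proposal itself. The meaning of `isLegacyData` is not defined in this issue, so the sketch deliberately leaves it out of the deletion decision.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetFlags:
    """Hypothetical container for the four proposed flags."""
    isOpenData: bool = False
    isLegacyData: bool = False
    isImportant: bool = False
    whyImportant: Optional[str] = None

    def __post_init__(self):
        # Assumed rule (not stated in the issue): a reason and the
        # isImportant flag must always be set together.
        if self.isImportant and not self.whyImportant:
            raise ValueError("isImportant requires a whyImportant reason")
        if self.whyImportant and not self.isImportant:
            raise ValueError("whyImportant given but isImportant is False")


def protects_from_deletion(flags: DatasetFlags) -> bool:
    """Assumed policy: open data and important datasets are kept.

    isLegacyData is ignored here because its semantics are still open.
    """
    return flags.isOpenData or flags.isImportant
```

Coupling `isImportant` with `whyImportant` is one possible way to keep the free-text reason from drifting out of sync with the flag; whether DBS should enforce it is part of the open discussion.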

Describe alternatives you've considered
No alternatives suggested

Additional context
N/A

@todor-ivanov todor-ivanov self-assigned this Jan 22, 2025
@todor-ivanov
Contributor Author

Just to summarize the solution which we have discussed and proposed with @vkuznet here and here:

We are about to:

  • Create a separate table dedicated to preserving those flags, and link it to the relevant datasets in the existing tables
  • Create the needed DBS APIs for:
    • Write: to populate the new information
    • Read: to fetch the flags based on dataset name

The information should be populated manually for the time being. Once we see it working, we may think of automating the process through the WM system.
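
The separate-table approach with a write and a read path could be sketched roughly as follows. This is only an illustration using Python's built-in `sqlite3` as a stand-in for the real DBS database; the table names (`datasets`, `dataset_flags`), the column names, and the functions `write_flags` and `read_flags` are all hypothetical and do not correspond to existing DBS tables or APIs.

```python
import sqlite3

# Hypothetical schema: a flags table linked one-to-one to an existing
# datasets table through its primary key.
SCHEMA = """
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,
    dataset    TEXT UNIQUE NOT NULL
);
CREATE TABLE dataset_flags (
    dataset_id     INTEGER PRIMARY KEY REFERENCES datasets(dataset_id),
    is_open_data   INTEGER NOT NULL DEFAULT 0,
    is_legacy_data INTEGER NOT NULL DEFAULT 0,
    is_important   INTEGER NOT NULL DEFAULT 0,
    why_important  TEXT
);
"""


def write_flags(conn, dataset, is_open_data=False, is_legacy_data=False,
                is_important=False, why_important=None):
    """Write path: insert or overwrite the flags of a known dataset."""
    row = conn.execute(
        "SELECT dataset_id FROM datasets WHERE dataset = ?", (dataset,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown dataset: {dataset}")
    conn.execute(
        "INSERT OR REPLACE INTO dataset_flags "
        "(dataset_id, is_open_data, is_legacy_data, is_important, why_important) "
        "VALUES (?, ?, ?, ?, ?)",
        (row[0], int(is_open_data), int(is_legacy_data),
         int(is_important), why_important),
    )
    conn.commit()


def read_flags(conn, dataset):
    """Read path: fetch the flags by dataset name, or None if not set."""
    row = conn.execute(
        "SELECT f.is_open_data, f.is_legacy_data, f.is_important, f.why_important "
        "FROM datasets d JOIN dataset_flags f ON f.dataset_id = d.dataset_id "
        "WHERE d.dataset = ?",
        (dataset,),
    ).fetchone()
    if row is None:
        return None
    return {
        "is_open_data": bool(row[0]),
        "is_legacy_data": bool(row[1]),
        "is_important": bool(row[2]),
        "why_important": row[3],
    }
```

With manual population in mind, an operator would call `write_flags` once per dataset, and a deletion campaign would call `read_flags` by dataset name before deciding what to delete.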

@todor-ivanov
Contributor Author

Hi Dima,

As briefly mentioned in the private e-mails exchanged last night, an initial set of flags could indeed help a lot during the next deletion campaign, even before the full functionality is delivered. I am therefore now pushing harder to at least create the new database schema and test it in integration. Could you check the set of flags as proposed/summarized in the issue description here, and give your opinion:

  • Would those suffice, and could they be changed later, or must whatever we decide now stay forever (I presume the latter)?

@todor-ivanov
Contributor Author

I am quoting Dima from the Jira issue here, to keep this issue in line with the discussions happening in Jira as well.

"""
We had a meeting last week with Todor Trendafilov Ivanov and Hasan Ozturk and we converged on the following additional fields that need to be added to the main table DATASETS:

  • isOpendata (bool) - potentially any dataset may become public. False by default
  • isProtected (bool) - special purpose flag to be used by DM to exclude datasets from deletion campaigns. False by default.
  • expirationDate (int) - will be used together with isProtected. Null by default.
  • Comment (VARCHAR2(250)) - general purpose field needed for important information especially for invalidation.

We found no reason to create an additional table just to add these 4 extra columns. VARCHAR2 is an efficient string format using only actual string size plus 1 byte for string length.

We have also reviewed the option to add information about superseded datasets, but we concluded that it will be very hard to make it work automatically since there is no mechanism to track relationships between datasets in different campaigns. Therefore, if people want to protect datasets that are not available in other campaigns or have other unique properties, we will use the new isProtected flag.

Valentin Y Kuznetsov, do you see any issue with this proposal?
"""
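
To illustrate how the protection flag and the expiration field from the quoted proposal might interact during a deletion campaign, here is a minimal Python sketch. Several points are assumptions that the quote does not settle: the expiration field (written `expirationDate` here) is treated as a Unix timestamp in seconds, a null expiration is treated as "protected indefinitely", and the 250 limit of the comment is checked in bytes, which is Oracle's default length semantics for VARCHAR2. The names `DatasetRow` and `protection_active` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional

COMMENT_MAX_BYTES = 250  # mirrors the proposed VARCHAR2(250)


@dataclass
class DatasetRow:
    """Hypothetical view of a DATASETS row with the four proposed columns."""
    dataset: str
    isOpendata: bool = False
    isProtected: bool = False
    expirationDate: Optional[int] = None  # assumed: Unix seconds, None = no expiry
    comment: Optional[str] = None

    def __post_init__(self):
        # Assumed check: reject comments that would not fit the column.
        if self.comment is not None:
            if len(self.comment.encode("utf-8")) > COMMENT_MAX_BYTES:
                raise ValueError("comment exceeds 250 bytes")


def protection_active(row: DatasetRow, now: Optional[int] = None) -> bool:
    """Return True while the dataset should be excluded from deletion.

    Assumed policy: isProtected with a null expirationDate never expires;
    otherwise protection ends once `now` reaches expirationDate.
    """
    if not row.isProtected:
        return False
    if row.expirationDate is None:
        return True
    current = int(time.time()) if now is None else now
    return current < row.expirationDate
```

Passing `now` explicitly keeps the check reproducible, for example when a campaign evaluates all datasets against one fixed cutoff time.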

Labels
None yet
Projects
Status: In Progress

1 participant