DBS API For Duplicate Lumi Section Detection #105

Open
hassan11196 opened this issue Nov 23, 2023 · 3 comments
@hassan11196
Member

Due to recent user reports of duplicate events in their datasets [1], we are considering introducing a duplicate check for each new dataset announced by Production.

Previously, PnR checked each dataset before announcing it using this logic [2], but that check is quite expensive because it requests all lumis for the dataset and analyzes them for duplicates. If duplicate events were found, we used the following logic to remove the files containing duplicates [3].
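To illustrate why the client-side check is expensive, here is a minimal sketch of that kind of analysis. It is not the code from checkor.py: the function name `find_duplicate_lumis` and the input shape (a mapping from file name to `(run, lumi)` pairs) are assumptions made for illustration. The point is that every lumi of every file in the dataset has to be fetched and held in memory before any duplicate can be reported.

```python
from collections import defaultdict


def find_duplicate_lumis(file_lumis):
    """Return {(run, lumi): sorted file names} for lumis present in more than one file.

    file_lumis is a hypothetical mapping {file_name: iterable of (run, lumi)},
    i.e. the full lumi listing of a dataset already fetched by the client.
    """
    seen = defaultdict(set)
    for file_name, lumis in file_lumis.items():
        for run, lumi in lumis:
            seen[(run, lumi)].add(file_name)
    # Keep only the (run, lumi) pairs that show up in two or more files.
    return {key: sorted(files) for key, files in seen.items() if len(files) > 1}


# Made-up file names and lumis, for illustration only.
example = {
    "/store/a.root": [(1, 10), (1, 11)],
    "/store/b.root": [(1, 11), (1, 12)],
    "/store/c.root": [(2, 10)],
}
duplicates = find_duplicate_lumis(example)
print(duplicates)  # {(1, 11): ['/store/a.root', '/store/b.root']}
```

Doing the same grouping inside DBS, next to the data, would avoid shipping the full lumi list to the client for every dataset.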

That is why, based on the discussion in the "DBS API to detect duplicate lumis" email thread, we are requesting a new, optimized DBS API that can check a dataset for duplicate events, along with a way to remove them.

Requirements:

  • The API should provide a way to check individual datasets for duplicate luminosity sections.
  • It should be optimized for performance, considering the large volume of files in DBS.
  • The default granularity we need is the dataset level: for each output dataset, we want to find duplicate lumis. Would making the granularity configurable through a parameter (dataset, block, file, run) increase the complexity of the API? A configurable granularity would be quite helpful, as it would allow greater flexibility.
  • A way to remove the duplicates, either through an existing API or through a parameter in the proposed API, e.g. remove_duplicates=True.
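As a rough illustration of the configurable granularity, the sketch below shows one possible reading: a lumi counts as a duplicate when the same (run, lumi) pair appears more than once within the same scope value. Everything here is hypothetical, including the function name `duplicate_lumis`, the `granularity` parameter, and the record layout. It is not an existing DBS API, and it does not cover the removal step.

```python
from collections import Counter

# Scopes named in the requirements; the record keys below are assumptions.
SCOPES = ("dataset", "block", "file", "run")


def duplicate_lumis(records, granularity="dataset"):
    """Return {scope_value: [(run, lumi), ...]} for lumis repeated within a scope.

    records is a hypothetical list of dicts with the keys
    "dataset", "block", "file", "run" and "lumi".
    """
    if granularity not in SCOPES:
        raise ValueError(f"granularity must be one of {SCOPES}")
    # Count how often each (run, lumi) occurs inside each scope value.
    counts = Counter((rec[granularity], rec["run"], rec["lumi"]) for rec in records)
    result = {}
    for (scope, run, lumi), n in counts.items():
        if n > 1:
            result.setdefault(scope, []).append((run, lumi))
    return result


# Made-up records, for illustration only.
records = [
    {"dataset": "/A/B/C", "block": "/A/B/C#1", "file": "f1.root", "run": 1, "lumi": 5},
    {"dataset": "/A/B/C", "block": "/A/B/C#2", "file": "f2.root", "run": 1, "lumi": 5},
    {"dataset": "/A/B/C", "block": "/A/B/C#2", "file": "f3.root", "run": 1, "lumi": 6},
]
by_dataset = duplicate_lumis(records, "dataset")
by_block = duplicate_lumis(records, "block")
```

In this example the dataset-level check reports lumi (1, 5), while the block-level check reports nothing because the two copies sit in different blocks. This shows that a narrower granularity can miss duplicates that span blocks or files.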

[1] https://its.cern.ch/jira/browse/CMSPROD-85
[2] https://gitlab.cern.ch/CMSProductionReprocessing/WmAgentScripts/-/blob/master/Unified/checkor.py?ref_type=heads#L1216
[3] https://gitlab.cern.ch/CMSProductionReprocessing/WmAgentScripts/-/blob/master/Unified/checkor.py?ref_type=heads#L1245

@haozturk @vkuznet @todor-ivanov

@vkuznet
Contributor

vkuznet commented Nov 23, 2023 via email

@todor-ivanov

@vkuznet I am about to start working on this, but someone needs to fix my role in this GH repository so that I can assign myself to the issue. I believe you can do that.

@hassan11196
Member Author

Hi @todor-ivanov,
Any updates on this issue?

Thanks.
