DBS API For Duplicate Lumi Section Detection #105
Comments
Whoever works on this issue should implement it via two APIs:
- the search functionality should be based on the GET HTTP method,
- while the lumi-removal functionality should be based on the PUT HTTP method, because we will update the dataset content rather than remove it.
Please do not combine two different actions in one API, and follow RESTful principles;
see this page
https://github.com/dmwm/dbs2go/blob/master/docs/apis.md
where the DBS APIs are documented, and update it accordingly. A rough sketch of the two endpoints is given below.
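For illustration, here is a minimal Go sketch of how the two endpoints could be kept separate. The paths `/duplicatelumis` and `/removeduplicatelumis`, the port, and the JSON shapes are assumptions for this sketch, not the actual dbs2go routing or database layer:

```go
// Sketch only: endpoint names, port, and payload shapes are assumptions,
// not the actual dbs2go routing or database layer.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// duplicateLumisHandler searches a dataset for duplicate lumi sections.
// GET keeps the operation read-only, in line with REST semantics.
func duplicateLumisHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		http.Error(w, "only GET is allowed", http.StatusMethodNotAllowed)
		return
	}
	dataset := r.URL.Query().Get("dataset")
	if dataset == "" {
		http.Error(w, "missing dataset parameter", http.StatusBadRequest)
		return
	}
	// Placeholder: a real handler would query the DBS backend here.
	duplicates := []map[string]any{}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(duplicates)
}

// removeDuplicateLumisHandler updates dataset content, hence PUT.
func removeDuplicateLumisHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "only PUT is allowed", http.StatusMethodNotAllowed)
		return
	}
	var req struct {
		Dataset string `json:"dataset"`
	}
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// Placeholder: mark/update the duplicated file-lumi entries here.
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"status": "ok", "dataset": req.Dataset})
}

func main() {
	http.HandleFunc("/duplicatelumis", duplicateLumisHandler)
	http.HandleFunc("/removeduplicatelumis", removeDuplicateLumisHandler)
	log.Fatal(http.ListenAndServe(":8989", nil))
}
```

Keeping the search on GET makes it safe and cacheable, while PUT signals an idempotent update of existing dataset content rather than a deletion.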
On Nov 23, 2023, Muhammad Hassan Ahmed wrote:
Due to recent reports from users about duplicate events in their datasets [1], we are considering adding a duplicate check for each new dataset announced by Production.
Previously, PnR used to check each dataset before announcing it through this logic [2], but it is quite expensive, as it requests all lumis for the dataset and analyzes them for duplicates. If duplicate events were found, we used the following logic to remove the files containing them [3].
That is why, based on the discussion in the "DBS API to detect duplicate lumis" email thread, we are requesting a new, optimized API from DBS that can check a dataset for duplicate events, along with a way to remove them.
Requirements:
- The API should provide a way to check individual datasets for duplicate luminosity sections.
- It should be optimized for performance, considering the large volume of files in DBS.
- The default granularity we require is the dataset level: for each output dataset, we want to find its duplicate lumis. However, would making the granularity configurable via a parameter (dataset, block, file, run) increase the complexity of the API? Having this configurable would be quite helpful, as it allows for greater flexibility.
- A way to remove the duplicates, either through an existing API or via a parameter in the newly proposed API, e.g. remove_duplicates=True (see the detection sketch after this list).
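As a rough illustration of the detection step, the following Go sketch groups (run, lumi) pairs by file and flags those seen in more than one file. The `FileLumi` type and the `findDuplicateLumis` name are hypothetical, and a real implementation would likely push this grouping into a database query for performance:

```go
// Sketch only: assumes file/run/lumi rows have already been fetched
// from the DBS backend; type and function names are illustrative.
package main

import "fmt"

// FileLumi is one (file, run, lumi) row as DBS would return it.
type FileLumi struct {
	File string
	Run  int
	Lumi int
}

type runLumi struct{ Run, Lumi int }

// findDuplicateLumis returns every (run, lumi) pair that appears in
// more than one file, together with the offending files.
func findDuplicateLumis(rows []FileLumi) map[runLumi][]string {
	seen := make(map[runLumi][]string)
	for _, r := range rows {
		key := runLumi{r.Run, r.Lumi}
		seen[key] = append(seen[key], r.File)
	}
	dups := make(map[runLumi][]string)
	for key, files := range seen {
		if len(files) > 1 {
			dups[key] = files
		}
	}
	return dups
}

func main() {
	rows := []FileLumi{
		{"/store/a.root", 1, 10},
		{"/store/b.root", 1, 10}, // duplicate of lumi 10 in run 1
		{"/store/c.root", 1, 11},
	}
	for key, files := range findDuplicateLumis(rows) {
		fmt.Printf("run %d lumi %d duplicated in %v\n", key.Run, key.Lumi, files)
	}
}
```

The same grouping would apply unchanged at block or file granularity: only the scope of the rows fetched from DBS changes, which is why a configurable granularity parameter may not add much complexity.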
[1] https://its.cern.ch/jira/browse/CMSPROD-85
[2] https://gitlab.cern.ch/CMSProductionReprocessing/WmAgentScripts/-/blob/master/Unified/checkor.py?ref_type=heads#L1216
[3] https://gitlab.cern.ch/CMSProductionReprocessing/WmAgentScripts/-/blob/master/Unified/checkor.py?ref_type=heads#L1245
@haozturk @vkuznet @todor-ivanov
@vkuznet I am about to work on this, but someone must fix my role in this GH repository, so I can assign myself to the issue. I believe you can do that.
Hi @todor-ivanov, thanks.