RequestManager fun (more retries than the FTS can cope with) #6136
-
Simon pointed out to me today that our FTS server is in a bit of a meltdown due to a replicate and register request gone terribly wrong.
So there are two problems. We noticed that MaxAttempts in the RequestManager seems to be set to 256 by default (@chaen being overly optimistic that stuff will get fixed eventually?), and even worse, the documentation states: "MaxAttempts (default 1024): Maximum attempts to try an Operation, after what, it fails. Note that this only works for Operations with Files (the others are tried forever)." I'm not sure if this comes under "Operations with Files" and will stop after 256 tries (though it's been going since February, so I'm not getting my hopes up), or if this is one of these "eternal" operations, in which case I'd like to request the implementation of some kind of sanity-preserving limit, please.
-
Hi Daniela,
For the size of the column, the answer is always the same: we have to set a value somewhere, and there will always be somebody with a longer name. But nothing prevents you from changing the size of the column in your DB. Now, I agree that there should be some consistency across all the DBs 🤔
For the default, there are different defaults. The number of retries is configured in the CS on a per-operation basis. But if you do not specify anything there, you get the hard-coded 1024: https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/RequestManagementSystem/private/OperationHandlerBase.py#L230
However, for the other operations, the occurrences of this were sufficiently limited that I have not bothered implementing a fix yet. And I still would not know what to test nor how to implement it... any idea?
Last point: I absolutely do not get the relation between your FTS server meltdown and the failing accounting request 🤔
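For illustration, here is a sketch of what a per-operation MaxAttempts override in the CS might look like. Only the MaxAttempts option name comes from this thread; the section path and operation name below are assumptions, so check your own CS layout before copying:

```
# Hypothetical CS fragment -- the section path and handler name are
# assumptions; only the MaxAttempts option is taken from the discussion.
Systems
{
  RequestManagement
  {
    Agents
    {
      RequestExecutingAgent
      {
        OperationHandlers
        {
          ReplicateAndRegister
          {
            MaxAttempts = 256  # lower this to stop hammering the FTS server
          }
        }
      }
    }
  }
}
```

If no value is set anywhere in the CS, the hard-coded default of 1024 from OperationHandlerBase.py applies.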
-
For the last point:
-
Related, I tried to cancel one of the dodgy requests:
I feel ignored. I checked again this morning and it looks the same. How do I get its status (which status?) into some kind of final state?
-
We just found that our database is cluttered up with very old requests (see below) that still seem to be listed as 'Scheduled'. We'd like to implement a feature (defaulting to False, as usual) that sets the status to Cancelled if a request hasn't run in e.g. 60 days (also configurable).
@chaen Is there any reason we shouldn't be doing this?
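As a minimal sketch of the rule being proposed (the function name, parameters, and flag below are illustrative, not an existing DIRAC API):

```python
from datetime import datetime, timedelta

# Hypothetical helper -- not an existing DIRAC API. It only captures the
# proposed rule: a request still 'Scheduled' after max_age_days should be
# moved to 'Cancelled', guarded by a feature flag defaulting to False.
def should_cancel(last_update: datetime, now: datetime,
                  max_age_days: int = 60,
                  cancel_stale_requests: bool = False) -> bool:
    """Return True if the stale-request cleanup should cancel this request."""
    if not cancel_stale_requests:  # feature off by default, as proposed
        return False
    return now - last_update > timedelta(days=max_age_days)
```

With the flag enabled, a request last touched in February and checked in mid-April (over 60 days later) would be cancelled; with the flag left at its default, nothing changes.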
-
I don't think there is a good reason for not doing this (the RequestCleaningAgent is probably a good place for that). But I am extremely puzzled by how you managed to end up in this situation, and I'd like to have it fixed. We run hundreds of thousands of requests, and at the moment I have 2 requests that have been listed as 'Scheduled' for more than a few hours, and that is because Glasgow has been in downtime since February. So clearly, something odd is going on.
-
To be fair, a lot of these jobs were from February this year, and maybe that's just what happens when you leave the default 256 tries in DIRAC? Simon and I are going to poke around a bit longer to see if we can understand what is going on.
-
I would consider this fixed (at least for GridPP) by: #6148