RequestManager fun (more retries than the FTS can cope with) #6136
-
Simon pointed out to me today that our FTS server is in a bit of a meltdown due to a replicate and register request gone terribly wrong.
So there are two problems. We noticed that MaxAttempts in the RequestManager seems to be set to 256 by default (@chaen being overly optimistic that stuff will get fixed eventually?), and even worse, the documentation states: "MaxAttempts (default 1024): Maximum attempts to try an Operation, after what, it fails. Note that this only works for Operations with Files (the others are tried forever)." I'm not sure if this comes under "Operations with Files" and will stop after 256 tries (though it's been going since February, so I'm not getting my hopes up), or if this is one of these "eternal" operations, in which case I'd like to request the implementation of some kind of sanity-preserving limit, please.
-
Hi Daniela,
For the size of the column, the answer is always the same: we have to set a value somewhere, and there will always be somebody with a longer name. But nothing prevents you from changing the size of the column in your DB. Now, I agree that there should be some consistency across all the DBs 🤔
For the default, there are different defaults. The number of retries is configured in the CS on a per-operation basis. But if you do not specify anything there, you get the hard-coded 1024: https://github.com/DIRACGrid/DIRAC/blob/integration/src/DIRAC/RequestManagementSystem/private/OperationHandlerBase.py#L230
However, for the other operations, the occurrences of this were sufficiently limited that I have not bothered implementing a fix yet. And I still would not know what to test nor how to implement it... any idea?
Last point: I absolutely do not get the relation between your FTS server meltdown and the failing accounting request 🤔
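For illustration, here is a sketch of what a per-operation MaxAttempts override in the CS might look like. Only the MaxAttempts option name comes from this thread; the section path and operation name below are assumptions, so check your own CS layout before copying:

```
# Hypothetical CS fragment -- the section path and handler name are
# assumptions; only the MaxAttempts option is taken from the discussion.
Systems
{
  RequestManagement
  {
    Agents
    {
      RequestExecutingAgent
      {
        OperationHandlers
        {
          ReplicateAndRegister
          {
            MaxAttempts = 256  # lower this to stop hammering the FTS server
          }
        }
      }
    }
  }
}
```

If no value is set anywhere in the CS, the hard-coded default of 1024 from OperationHandlerBase.py applies.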
-
For the last point:
-
Related, I tried to cancel one of the dodgy requests:
I feel ignored. I checked again this morning and it looks the same. How do I get its status (which status?) into some kind of final state?
-
We just found that our database is cluttered up with very old requests (see below) that still seem to be listed as 'Scheduled'. We'd like to implement a feature (defaulting to False, as usual) that sets the status to Cancelled if a request hasn't run in e.g. 60 days (also configurable).
@chaen Is there any reason we shouldn't be doing this?
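As a minimal sketch of the rule being proposed (the function name, parameters, and flag below are illustrative, not an existing DIRAC API):

```python
from datetime import datetime, timedelta

# Hypothetical helper -- not an existing DIRAC API. It only captures the
# proposed rule: a request still 'Scheduled' after max_age_days should be
# moved to 'Cancelled', guarded by a feature flag defaulting to False.
def should_cancel(last_update: datetime, now: datetime,
                  max_age_days: int = 60,
                  cancel_stale_requests: bool = False) -> bool:
    """Return True if the stale-request cleanup should cancel this request."""
    if not cancel_stale_requests:  # feature off by default, as proposed
        return False
    return now - last_update > timedelta(days=max_age_days)
```

With the flag enabled, a request last touched in February and checked in mid-April (over 60 days later) would be cancelled; with the flag left at its default, nothing changes.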
-
I don't think there is a good reason for not doing this (the RequestCleaningAgent is probably a good place for that). But I am extremely puzzled by how you managed to end up in this situation, and I'd like to have it fixed. We run hundreds of thousands of requests, and at the moment I have 2 requests that have been listed as 'Scheduled' for more than a few hours, and that is because Glasgow has been in downtime since February. So clearly, something odd is going on.
-
To be fair, a lot of these jobs were from February this year, and maybe that's just what happens when you leave the default 256 tries in DIRAC? Simon and I are going to poke around a bit longer to see if we can understand what is going on.
-
I would consider this fixed (at least for GridPP) by: #6148