Release 0.6.3 #481
Comments
IMAP
A bit slower overall than 0.6.2, notably: unselect, searchUnseen, searchDeleted, fetchBody. Maybe related to the search override.
JMAP-Draft
A bit better overall than 0.6.2: same 99th percentile, better 95th percentile and mean.
JMAP-RFC
Hung 3 times.
No error log on the James side; not sure what is happening yet. Gatling log: https://app.zenhub.com/files/325713638/acca925e-4e76-45f3-b5ec-639344ec8c51/download My run params: /root/upn/james/james-gatling, 6500 users, 15m user injection, 20m scenario. |
You could try to do a jmap rfc run without the changed params you did recently on the helm james conf. If it finishes, it means it's one of those. If not then it's more related to the release? |
A flame graph would be nice for IMAP too IMO |
Hmm regarding IMAP being a bit slower... simple question first. Do you do a little warmup run first before doing the main one or not? |
Sure I do, for any case. |
Alright, flame graph it is then :) |
So with the same JMAP RFC simulation, TMail 0.6.2 was OK whereas TMail 0.6.3 (without the changed configuration) hangs. I suspect JMAP RFC in 0.6.3 degraded a bit when handling many simultaneous connections (only a few hangs though, ~10 of 6500 users). I adapted the Gatling simulation to quit early for users who cannot open a WebSocket connection or whose HTTP connection is closed prematurely: linagora/james-gatling#151 This should keep those users from attempting further actions and hanging the simulation. I am not sure this makes the simulation 100% stable yet, but it is at least better than everything hanging as it does right now.
JMAP-RFC result
Basically the same to me. |
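The early-quit idea described above can be sketched as follows. This is only an illustration of the control flow, not the Gatling DSL and not the content of linagora/james-gatling#151; `VirtualUser`, `run_scenario`, `flaky_connect` and the step names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class VirtualUser:
    """Hypothetical stand-in for one simulated user (not a Gatling type)."""
    name: str
    actions_run: List[str] = field(default_factory=list)


def run_scenario(user: VirtualUser,
                 connect: Callable[[VirtualUser], bool],
                 steps: List[str]) -> str:
    """Run the steps for one user, quitting early if the connection fails."""
    if not connect(user):
        # Early exit: no further actions are attempted for this user,
        # so a failed connection cannot keep the whole simulation waiting.
        return "exited-early"
    for step in steps:
        user.actions_run.append(step)
    return "completed"


def flaky_connect(user: VirtualUser) -> bool:
    # ASSUMPTION for the demo: only "user-2" fails to open its connection.
    return user.name != "user-2"


users = [VirtualUser(f"user-{i}") for i in range(1, 4)]
outcomes = {u.name: run_scenario(u, flaky_connect, ["getMailboxes", "queryEmails"])
            for u in users}
print(outcomes)
```

A user that exits early simply drops out of the run instead of blocking on actions that need the connection.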
I re-tested IMAP with a practically unlimited concurrent request limit of 2000 for IMAP -> the perf is the same. -> 200 is an appropriate value for our preprod env. |
Some error logs collected
LDAP
"level":"ERROR","thread":"blocking-call-wrapper-535","logger":"org.apache.james.user.ldap.ReadOnlyLDAPUser","message":"Unexpected error upon authentication for [email protected]","context":"default","exception":"com.unboundid.ldap.sdk.LDAPException: An error occurred while attempting to connect to server 51.91.141.10:389: IOException(LDAPException(resultCode=82 (local error), errorMessage='A thread was interrupted while waiting for the connect thread to establish a connection to 51.91.141.10:389
Another one:
"level":"ERROR","thread":"blocking-call-wrapper-536","logger":"org.apache.james.jmap.core.ProblemDetails","message":"Unexpected error upon API request","context":"default","exception":"org.apache.james.user.api.UsersRepositoryException: Unable check user existence from ldap\n\tat org.apache.james.user.ldap.ReadOnlyLDAPUsersDAO.getUserByName(ReadOnlyLDAPUsersDAO.java:285)\n\tat com.linagora.tmail.combined.identity.CombinedUserDAO.test(CombinedUserDAO.java:84)\n\tat com.linagora.tmail.combined.identity.CombinedUsersRepository.test(CombinedUsersRepository.java:20)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1(BasicAuthenticationStrategy.scala:132)\n\tat org.apache.james.jmap.http.BasicAuthenticationStrategy.$anonfun$isValid$1$adapted(BasicAuthenticationStrategy.scala:132)\n\tat reactor.core.publisher.MonoCallable.call(MonoCallable.java:92)\n\tat reactor.core.publisher.FluxSubscribeOnCallable$CallableSubscribeOnSubscription.run(FluxSubscribeOnCallable.java:227)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)\n\tat reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)\n\tat java.base/java.util.concurrent.FutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)\n\tat java.base/java.lang.Thread.run(Unknown Source)\nCaused by: com.unboundid.ldap.sdk.LDAPSearchException: Search processing was interrupted while waiting for a response from server 51.91.141.10:389.\n\tat com.unboundid.ldap.sdk.AbstractConnectionPool.search(AbstractConnectionPool.java:2160)\n\
These 2 errors only happen when running JMAP RFC with 10k users (we usually run it with 6.5k users). With that many users we overload the LDAP server -> some blocking threads hang (waiting for an LDAP response) beyond the reactor TTL (60s) and then get interrupted?
Cassandra timeout
"level":"ERROR","thread":"s1-io-0","mdc":{"protocol":"IMAP","mailbox":"INBOX","selectedMailbox":"baf2cdb0-307a-11eb-ad60-d91e26006be3","ip":"10.2.2.0","action":"STATUS","sessionId":"SID-tkgtbjglvbwf","user":"[email protected]","parameters":"StatusDataItems{statusItems=[MESSAGES, RECENT, UID_NEXT, UNSEEN]}"},"logger":"org.apache.james.imap.processor.AbstractMailboxProcessor","message":"Unexpected error during IMAP processing","context":"default","exception":"com.datastax.oss.driver.api.core.servererrors.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded). In case this was generated during read repair, the consistency level is not representative of the actual consistency.\n"
A few IMAP STATUS commands fail because of this. Maybe we overload Cassandra, so it fails on requests requiring a high consistency level?
RabbitMQ
"level":"ERROR","thread":"AMQP Connection 51.210.36.126:5672","logger":"com.rabbitmq.client.impl.ForgivingExceptionHandler","message":"An unexpected connection driver error occurred","context":"default","exception":"java.net.SocketTimeoutException: Timeout during Connection negotiation\n\tat com.rabbitmq.client.impl.AMQConnection.handleSocketTimeout(AMQConnection.java:835)\n\tat com.rabbitmq.client.impl.AMQConnection.readFrame(AMQConnection.java:747)\n\tat com.rabbitmq.client.impl.AMQConnection.access$300(AMQConnection.java:47)\n\tat com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:666)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n
Likely a driver error rather than James itself. Could it be because RabbitMQ was overloaded?
Rspamd
"level":"ERROR","thread":"spooler-388","logger":"org.apache.james.mailetcontainer.impl.ProcessorImpl","message":"Exception calling org.apache.james.rspamd.RspamdScanner: reactor.netty.http.client.PrematureCloseException: Connection prematurely closed BEFORE response\n
I observe that this only happens when Rspamd is down or overloaded. Maybe we just need to make sure mail processing still operates when Rspamd is unavailable (or does it already?).
Do you consider these blockers? |
Maybe we should have configuration within gatling-jmap on the authentication method?
Could be a start? I can live with this flow for now but can you please create a ticket on |
All our requests use the QUORUM consistency level. If needed, I can re-explain in detail why. In short: a good compromise between consistency and availability. That's somewhat of a standard practice in the industry. Likely yes, it shows Cassandra can be the bottleneck. I would be interested to get the error count too. Are we speaking of 10? 10,000? Finally, by offloading some of the searches to Cassandra we might load Cassandra a bit more than we should. |
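For reference, the "2 responses were required but only 1 replica responded" part of the timeout is consistent with the usual quorum arithmetic, floor(RF / 2) + 1. A minimal sketch, assuming a replication factor of 3; the thread does not state the actual value, and a replication factor of 2 would also require 2 responses.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must answer a QUORUM request: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be >= 1")
    return replication_factor // 2 + 1


def tolerated_slow_replicas(replication_factor: int) -> int:
    """Replicas that may be down or too slow before QUORUM requests fail."""
    return replication_factor - quorum(replication_factor)


# ASSUMPTION: replication factor 3 (not stated in the thread).
RF = 3
print(f"RF={RF}: quorum={quorum(RF)}, tolerated={tolerated_slow_replicas(RF)}")
```

Under that assumption a read needs 2 of 3 replicas, so a request times out as soon as two replicas are too slow to answer, which fits an overloaded cluster.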
What is suspicious is that it even tries to reconnect. What's the time of the log? Was it when re-deploying/scaling James? I already noticed scaling/re-deploying james can knock out the RabbitMQ cluster. |
Mailet configuration allows fine-grained error management. We should ensure all default configuration is written so that we ignore error of the Can you open a ticket on |
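The "ignore the error and keep processing" behaviour discussed here for the spam scanner can be sketched generically as a fail-open wrapper. This is not James mailet configuration or API; `scan_fail_open`, `process`, and the dict-based mail are hypothetical names used only to illustrate the idea.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("spam-scan-sketch")


def scan_fail_open(mail: dict, scanner: Callable[[dict], bool]) -> Optional[bool]:
    """Return the scanner's spam verdict, or None if the scanner is unreachable."""
    try:
        return scanner(mail)
    except (ConnectionError, TimeoutError) as error:
        # Fail open: log the problem and let mail processing continue.
        logger.warning("scanner unavailable, skipping scan: %s", error)
        return None


def process(mail: dict, scanner: Callable[[dict], bool]) -> dict:
    """Deliver the mail unless the scanner positively flags it as spam."""
    verdict = scan_fail_open(mail, scanner)
    mail["spam"] = verdict  # None means "not scanned"
    mail["delivered"] = verdict is not True
    return mail


def unavailable_scanner(mail: dict) -> bool:
    # ASSUMPTION for the demo: the scanner closes the connection prematurely.
    raise ConnectionError("Connection prematurely closed BEFORE response")


result = process({"subject": "hello"}, unavailable_scanner)
print(result["delivered"], result["spam"])
```

The design choice is that an unavailable scanner costs a missed spam check rather than a stuck or failed mail.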
None of these are blockers! But there are some really valuable lessons to be learned here. I'd be happy to dig a bit more into the unknown parts here, especially the Cassandra errors. Thanks for this very valuable report. |
Regarding Cassandra, if this is #481 (comment) with 1 failed status then it does not seem too bad to me; we should likely avoid spending too much time on this. |
Regarding Cassandra, search overrides for Maybe a little configuration tweak to be made: get rid of (ticket?) |
just a few ~10
There are only 3 consecutive ones whose timing I am not sure of. Could be as you said. Tickets: |