Messages hang in pending state when database connection fails #3041

Open
Ronnnn opened this issue Jul 10, 2024 · 4 comments

@Ronnnn

Ronnnn commented Jul 10, 2024

Describe the bug

In my Postal 3.3.4 installation the database sometimes loses its connection:

Jul 10 08:55:44 post mariadbd[626]: 2024-07-10 6:55:44+0000 482 [Warning] Aborted connection 482 to db: 'unconnected' user: 'postal' host: 'localhost' (Got timeout reading communication packets)

That timeout needs to be fixed on my server... BUT Postal is not able to handle it either. The message the worker is trying to send hangs in the pending state after this:

2024-07-10 06:55:45 +0000 DEBUG  acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:55:45 +0000 INFO   running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:55:45 +0000 INFO   scheduling task to next run at 2024-07-10 06:56:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 DEBUG  acquired task role by renewing it component=worker thread=tasks
2024-07-10 06:56:45 +0000 INFO   running task component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:56:45 +0000 INFO   scheduling task to next run at 2024-07-10 06:57:45 UTC component=worker thread=tasks task=SendNotificationsScheduledTask
2024-07-10 06:57:03 +0000 INFO   starting message unqueue component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 INFO   queue latency is 2.470646619796753s component=worker thread=work0 original_queued_message=1134326
2024-07-10 06:57:03 +0000 ERROR  Mysql2::Error::ConnectionError (MySQL server has gone away) component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `handle_interrupt' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:326:in `query_on_connection' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:318:in `block in query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/connection_pool.rb:20:in `use' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:345:in `with_mysql' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:317:in `query' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:177:in `select' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/message.rb:12:in `find_one' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/lib/postal/message_db/database.rb:47:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/models/concerns/has_message.rb:12:in `message' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:42:in `check_message_exists' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:16:in `block (2 levels) in process' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/base.rb:38:in `catch_stops' component=worker thread=work0
2024-07-10 06:57:03 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:14:in `block in process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer/initial_processor.rb:11:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/message_dequeuer.rb:9:in `process' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:67:in `block in process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/relation/delegation.rb:88:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:65:in `process_messages' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/jobs/process_queued_messages_job.rb:14:in `call' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:164:in `block (4 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/lib/ruby/3.2.0/benchmark.rb:311:in `realtime' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:164:in `block (3 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:285:in `capture_errors' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:161:in `block (2 levels) in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:160:in `each' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:160:in `block in work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/activerecord-7.0.8.1/lib/active_record/connection_adapters/abstract/connection_pool.rb:215:in `with_connection' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:159:in `work' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:142:in `block (3 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:141:in `loop' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:141:in `block (2 levels) in start_work_thread' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/group_set.rb:18:in `call_without_id' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /usr/local/bundle/gems/klogger-logger-1.4.0/lib/klogger/logger.rb:72:in `tagged' component=worker thread=work0
2024-07-10 06:57:04 +0000 ERROR  /opt/postal/app/app/lib/worker/process.rb:139:in `block in start_work_thread' component=worker thread=work0
2024-07-10 06:57:06 +0000 INFO   starting message unqueue component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   queue latency is 4.297773838043213s component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   found 16 associated messages to process at the same time batch_key=outgoing-gmail.com component=worker thread=work1 original_queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   parsing message content as it hasn't been parsed before component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Connected to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Sending message 1::4766770 to <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327
2024-07-10 06:57:06 +0000 INFO   Accepted by <CUT> log_id=Y75FMTZH component=worker thread=work1 original_queued_message=1134327 queued_message=1134327

Now the message hangs in the pending state because it is still locked by the failing worker process:

select id,message_id,locked_by,locked_at,retry_after,created_at,updated_at from queued_messages where message_id=4766769;
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| id      | message_id | locked_by                                  | locked_at                  | retry_after | created_at                 | updated_at                 |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
| 1134326 |    4766769 | host:mail pid:9 thread:12 22c1cddf97171ce2 | 2024-07-10 06:57:03.968458 | NULL        | 2024-07-10 06:57:01.510895 | 2024-07-10 06:57:01.510895 |
+---------+------------+--------------------------------------------+----------------------------+-------------+----------------------------+----------------------------+
1 row in set (0.001 sec)

The only way to get it through now is to set the locked_at and locked_by fields to NULL.
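
For example, a manual unlock along these lines gets it moving again (using the message_id from the row above):

-- Manual workaround: clear the stale lock so another worker can pick the message up again.
UPDATE queued_messages
   SET locked_by = NULL,
       locked_at = NULL
 WHERE message_id = 4766769;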

Expected behaviour

The worker should clean up the message properly so it can be picked up by another process once the database connection is back.

Environment details

  • OS: Ubuntu 22.04.4 LTS
  • Postal version: 3.3.4
@willpower232
Collaborator

I think this has been a problem since forever and I am not sure there is an easy solution. It is probably easier to solve now that the queue is in the database as opposed to RabbitMQ.

Many people find they need to run multiple workers to consume their queue in a timely manner, so process identification is tricky: the workers are not able to communicate with each other, and I do not think Docker would be able to tell each worker about every other worker.

One solution could be a configurable lock retry setting, so that if locked_at is more than, say, an hour ago, the message is picked up again (see the sketch below).
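
Very roughly, the dequeue query could then treat old locks as expired. A hypothetical sketch (not the query Postal actually runs), assuming a one-hour cut-off:

-- Hypothetical: a message whose lock is older than a configurable cut-off is treated
-- as abandoned and becomes eligible for another worker to pick up.
SELECT id, message_id
  FROM queued_messages
 WHERE locked_by IS NULL
    OR locked_at < NOW() - INTERVAL 1 HOUR   -- the cut-off would be the new setting
 LIMIT 1;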

@Ronnnn
Author

Ronnnn commented Jul 10, 2024

Thank you again for the quick responses. Really appreciate it. Might it be an idea to give the database layer some kind of retry mechanism?

https://github.com/postalserver/postal/blob/da90e75036c27482699921613d838f4058a100e7/lib/postal/message_db/database.rb

Disclaimer: untested code not written by a human ;p

def query(query)
  retries = 0
  begin
    with_mysql do |mysql|
      query_on_connection(mysql, query)
    end
  rescue Mysql2::Error::ConnectionError => e
    if retries < 1 # Limit the number of retries to 1
      retries += 1
      logger.warn "Connection error on query, retrying... (Attempt #{retries + 1})"
      # Assumes with_mysql checks a usable connection back out of the pool on the retry.
      retry
    else
      logger.error "Repeated connection errors on query, giving up. Error: #{e.message}"
      raise
    end
  end
end

@notz

notz commented Jul 15, 2024

Seems to be a duplicate of #3023

@memcorrupt

+1 - this issue also happens to me

postal logs:

postal-web-1  | 2024-09-26 20:52:15 +0000 ERROR  POST /api/v1/send/message (500) event=request transaction=[snip] controller=LegacyAPI::SendController action=message format=html method=POST path=/api/v1/send/message request_id=[snip] ip_address=[snip] status=500 db_runtime=6.752912521362305 exception_class=Mysql2::Error::ConnectionError exception_message=MySQL server has gone away exception_backtrace=/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `_query'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:151:in `block in query'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `handle_interrupt'\n/usr/local/bundle/gems/mysql2-0.5.6/lib/mysql2/client.rb:150:in `query'

mariadb logs:

[Warning] Aborted connection 34409 to db: 'unconnected' user: 'x' host: 'localhost' (Got timeout reading communication packets)
