
Proxysql 2.5.5 cluster crashing after losing access to a Mysql 5.7 replica #4475

Closed
cmejiat104 opened this issue Mar 20, 2024 · 11 comments · Fixed by #4481


@cmejiat104

Hello,

I have a ProxySQL cluster with 3 nodes in front of a Percona MySQL Server 5.7 setup configured with one MASTER and two REPLICAS (asynchronous replication).

The servers run inside a private cloud; the environment comprises three zones (similar to regions). In Zone 01, we have two servers running proxysql01 and mysql01, with the same configuration in Zone 02 and Zone 03.

Recently, a modification in one of our firewalls isolated Zone 02, so proxysql01 and proxysql03 could not talk to the proxysql02 node, and none of them could reach the mysql02 replica.

All three ProxySQL nodes started crashing and generating core dumps, and even after access to Zone 02 was restored, the instances kept crashing until the / filesystem was full.

These are the messages found in Proxysql logs:

proxysql01:

19:47:05 MySQL_Monitor.cpp:7186:monitor_read_only_process_ready_tasks(): [ERROR] Timeout on read_only check for mysql02:3306 after 2002ms. If the server is overload, increase mysql-monitor_read_only_timeout.
19:47:08 MySQL_Monitor.cpp:7612:monitor_replication_lag_process_ready_tasks(): [ERROR] Timeout on replication lag health check for mysql02:3306 after 3003ms. If the server is overload, increase mysql-monitor_replication_lag_timeout.
19:47:11 MySQL_Monitor.cpp:1581:monitor_read_only_thread(): [ERROR] Timeout on read_only check for mysql02:3306 after 3002ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Can't connect to MySQL server on '10.172.51.57' (110).
19:47:23 MySQL_Monitor.cpp:6986:monitor_ping_process_ready_tasks(): [ERROR] Timeout on ping check for mysql02:3306 after 1001ms. If the server is overload, increase mysql-monitor_ping_timeout.
19:47:26 MySQL_Monitor.cpp:1581:monitor_read_only_thread(): [ERROR] Timeout on read_only check for mysql02:3306 after 3002ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Can't connect to MySQL server on '10.172.51.57' (110).
19:47:31 MySQL_Monitor.cpp:1581:monitor_read_only_thread(): [ERROR] Timeout on read_only check for mysql02:3306 after 3002ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Can't connect to MySQL server on '10.172.51.57' (110).
19:47:36 MySQL_Monitor.cpp:1581:monitor_read_only_thread(): [ERROR] Timeout on read_only check for mysql02:3306 after 3003ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Can't connect to MySQL server on '10.172.51.57' (110).
19:47:36 MySQL_Monitor.cpp:1761:monitor_read_only_thread(): [ERROR] Server mysql02:3306 missed 3 read_only checks. Assuming read_only=1
19:47:41 MySQL_Monitor.cpp:1581:monitor_read_only_thread(): [ERROR] Timeout on read_only check for mysql02:3306 after 3002ms. Unable to create a connection. If the server is overload, increase mysql-monitor_connect_timeout. Error: timeout on creating new connection: Can't connect to MySQL server on '10.172.51.57' (110).
19:47:41 MySQL_Monitor.cpp:1761:monitor_read_only_thread(): [ERROR] Server mysql02:3306 missed 3 read_only checks. Assuming read_only=1
19:47:42 MySQL_Monitor.cpp:3123:monitor_ping(): [ERROR] Server mysql02:3306 missed 4 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
proxysql: MySQL_Session.cpp:3371: void MySQL_Session::handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_STMT_EXECUTE(PtrSize_t&): Assertion `0' failed.
Error: signal 6:
/usr/bin/proxysql(_Z13crash_handleri+0x2a)[0x5c6d4a]
/lib64/libc.so.6(+0x36400)[0x7f268f519400]
/lib64/libc.so.6(gsignal+0x37)[0x7f268f519387]
/lib64/libc.so.6(abort+0x148)[0x7f268f51aa78]
/lib64/libc.so.6(+0x2f1a6)[0x7f268f5121a6]
/lib64/libc.so.6(+0x2f252)[0x7f268f512252]
/usr/bin/proxysql(_ZN13MySQL_Session75handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_STMT_EXECUTEER10_PtrSize_t+0x5af)[0x64907f]
/usr/bin/proxysql(_ZN13MySQL_Session20get_pkts_from_clientERbR10_PtrSize_t+0x6e9)[0x651929]
/usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0xa4)[0x652824]
/usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x47c)[0x6328bc]
/usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x5fa)[0x633a2a]
/usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0x6c)[0x5bf20c]
/lib64/libpthread.so.0(+0x7ea5)[0x7f26908ffea5]
/lib64/libc.so.6(clone+0x6d)[0x7f268f5e1b0d]
 ---- /usr/bin/proxysql(_Z13crash_handleri+0x2a) [0x5c6d4a] : crash_handler(int)
 ---- /usr/bin/proxysql(_ZN13MySQL_Session20get_pkts_from_clientERbR10_PtrSize_t+0x6e9) [0x651929] : MySQL_Session::get_pkts_from_client(bool&, _PtrSize_t&)
  ---- /usr/bin/proxysql(_ZN13MySQL_Session7handlerEv+0xa4) [0x652824] : MySQL_Session::handler()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread20process_all_sessionsEv+0x47c) [0x6328bc] : MySQL_Thread::process_all_sessions()
 ---- /usr/bin/proxysql(_ZN12MySQL_Thread3runEv+0x5fa) [0x633a2a] : MySQL_Thread::run()
 ---- /usr/bin/proxysql(_Z24mysql_worker_thread_funcPv+0x6c) [0x5bf20c] : mysql_worker_thread_func(void*)
To report a crashing bug visit: https://github.com/sysown/proxysql/issues
For support visit: https://proxysql.com/services/support/
19:47:47 main.cpp:1300:ProxySQL_daemonize_phase3(): [ERROR] ProxySQL crashed. Restarting!
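
As an aside, the timeouts named in the monitor messages above correspond to ProxySQL admin variables that can be raised from the admin interface. A sketch of how that might look (the values here are purely illustrative, not recommendations for this environment):

```sql
-- Illustrative only: raising the monitor timeouts mentioned in the log.
UPDATE global_variables SET variable_value='3000'
 WHERE variable_name='mysql-monitor_read_only_timeout';
UPDATE global_variables SET variable_value='4000'
 WHERE variable_name='mysql-monitor_replication_lag_timeout';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```

Note that in this incident the timeouts were a symptom of the network partition, not the cause of the crash, so tuning them would not by itself have prevented the assert.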

Proxysql version:

ProxySQL version 2.5.5-percona-1.1, codename Truls

Mysql version:

5.7.43-47-log Percona Server 

OS version:

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Cores content:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/proxysql --idle-threads -c /etc/proxysql.cnf'.
Program terminated with signal 6, Aborted.
#0  0x00007f31aab55387 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install proxysql2-2.5.5-1.1.el7.x86_64
(gdb) bt
#0  0x00007f31aab55387 in raise () from /lib64/libc.so.6
#1  0x00007f31aab56a78 in abort () from /lib64/libc.so.6
#2  0x00007f31aab4e1a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f31aab4e252 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000064907f in MySQL_Session::handler___status_WAITING_CLIENT_DATA___STATE_SLEEP___MYSQL_COM_STMT_EXECUTE (
    this=this@entry=0x7f319ed45e00, pkt=...) at MySQL_Session.cpp:3371
#5  0x0000000000651929 in MySQL_Session::get_pkts_from_client (this=this@entry=0x7f319ed45e00, wrong_pass=@0x7f31a33fbaba: false,
    pkt=...) at MySQL_Session.cpp:4047
#6  0x0000000000652824 in MySQL_Session::handler (this=this@entry=0x7f319ed45e00) at MySQL_Session.cpp:4773
#7  0x00000000006328bc in MySQL_Thread::process_all_sessions (this=this@entry=0x7f31a1d1e800) at MySQL_Thread.cpp:3959
#8  0x0000000000633a2a in MySQL_Thread::run (this=this@entry=0x7f31a1d1e800) at MySQL_Thread.cpp:3423
#9  0x00000000005bf20c in mysql_worker_thread_func (arg=0x7f31a998e260) at main.cpp:419
#10 0x00007f31abf3bea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f31aac1db0d in clone () from /lib64/libc.so.6
(gdb)

When we noticed the problem, our workaround was stopping all three ProxySQL services, cleaning up the / filesystem on each server, and starting the ProxySQL service only on node 01; once that one was up and running, I did the same with the other two nodes.

We don't understand why the two healthy ProxySQL nodes were not able to keep running and instead crashed in an endless loop.

I hope you can give me some advice.

Thank you in advance.

Regards,

@renecannao
Contributor

Is your client using JDBC with auto-reconnect enabled?

@cmejiat104
Author

Is your client using JDBC with auto-reconnect enabled?

Do you mean the client our users or services use to access the ProxySQL cluster? If so, I'm not sure, as we have many services from different sources, and they may not all use the same client configuration.

@cmejiat104
Author

I talked to the developers and they said:

@renecannao
Contributor

Hi @cmejiat104 , the important bit in my previous question is actually "with auto-reconnect enabled?"
I have seen this problem (I will describe it shortly) with several customers using JDBC and auto-reconnect enabled, and it is possible that other libraries have the same problem.
For reference, JDBC auto-reconnect documentation says that the feature is not recommended:
https://dev.mysql.com/doc/connector-j/en/connector-j-connp-props-high-availability-and-clustering.html#cj-conn-prop_autoReconnect

Now, what is the issue that crashed ProxySQL?
The client tries to execute a prepared statement that doesn't exist. ProxySQL currently asserts when this happens (this is your backtrace).
When does a client try to execute a prepared statement that doesn't exist?
We have seen many scenarios in which a client connected to a proxysql instance has a statement prepared, then loses connectivity with proxysql (for whatever reason), reconnects (because of autoReconnect), and tries to execute a prepared statement that no longer exists: proxysql asserts and crashes.

What happens when you have multiple proxysql instances and a buggy client with autoReconnect tries to execute a prepared statement that doesn't exist? After crashing the first proxysql instance, the client will auto-reconnect to the 2nd instance and try to execute the non-existent prepared statement there too, crashing the 2nd instance as well.
If you have more proxysql instances, this client will crash them all.
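
The failure mode described above can be modeled with a small sketch. This is an illustrative model, not ProxySQL's actual code: the proxy keeps a per-session map of prepared-statement ids, the crashing path is equivalent to aborting when a lookup misses, and the robust behavior (what the eventual fix amounts to) is answering the client with an "unknown prepared statement handler" error instead. The class and method names here are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a session's prepared-statement registry.
public class StmtRegistrySketch {
    private final Map<Long, String> prepared = new HashMap<>();
    private long nextId = 1;

    // PREPARE: register the statement and hand the client an id.
    public long prepare(String sql) {
        long id = nextId++;
        prepared.put(id, sql);
        return id;
    }

    // EXECUTE, robust variant: an unknown id (e.g. one from a previous
    // session, replayed after auto-reconnect) produces an error reply
    // to the client rather than an assert/abort of the whole proxy.
    public String execute(long id) {
        String sql = prepared.get(id);
        if (sql == null) {
            return "ERROR 1243: Unknown prepared statement handler (" + id + ")";
        }
        return "OK: executing " + sql;
    }
}
```

After auto-reconnect, the client presents a statement id the new session never prepared; the lookup misses, and the robust variant degrades to a client-visible error instead of taking down the process.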

I don't know the exact implementation in the libraries you are using, but if you have autoReconnect (or an equivalently named property) enabled, please disable it.

For further reference, auto-reconnect is now deprecated in the MySQL C API as well:
https://dev.mysql.com/doc/c-api/8.0/en/c-api-auto-reconnect.html
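
As a concrete illustration of the advice above (hostname, port, and database name are placeholders, not this environment's real topology), a Connector/J URL that leaves auto-reconnect disabled might look like:

```java
// Sketch: build a JDBC URL with autoReconnect explicitly disabled, so a
// dropped connection fails fast instead of silently reconnecting and
// replaying stale prepared-statement handles. autoReconnect=false is
// already Connector/J's default; stating it documents the intent.
public class JdbcUrlExample {
    public static String buildUrl(String host, int port, String db) {
        return "jdbc:mysql://" + host + ":" + port + "/" + db
             + "?autoReconnect=false";
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("proxysql01", 6033, "appdb"));
    }
}
```

With auto-reconnect off, the application sees the broken connection as an error and can re-prepare its statements on a fresh connection, rather than replaying stale handles.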

That said, I acknowledge that this must also be considered a serious bug in proxysql, because a buggy application/driver shouldn't be able to crash proxysql.
We have this on our todo list.

@cmejiat104
Author

cmejiat104 commented Mar 21, 2024

Thank you for your replies.

Unfortunately, modifying the services' clients is not possible at this moment, as this is a production environment.

We have a UAT environment that was also impacted, although in that case we have only two ProxySQL instances (proxysql01 and proxysql02), while the MySQL backend has three nodes (mysql01, mysql02 and mysql03), as in Prod. In UAT, proxysql01 crashed only once and proxysql02 twice; after that, both stayed up and running. Of course, for UAT this happened outside business hours, so only a few batch processes were running, whereas for Prod it was in the middle of peak hours.

During the network issue, connections to and from proxysql02 and mysql02 dropped (connectivity between the two of them was fine). In Prod, mysql02 was a MySQL replica, while in UAT mysql02 was the master. BTW, I don't have any query rules configured, so all transactions always go to the MySQL master.

@cmejiat104
Author

Might this other issue also be related to my ProxySQL crashes?

#3371

@renecannao
Contributor

It is the same issue, yes

@cmejiat104
Author

Ok, thanks! But the fix mentioned in the comments is not in place yet, right?

@renecannao
Contributor

The fix mentioned in the comments is not in place yet because it is not complete.

@renecannao
Contributor

This issue is closed by #4481

@cmejiat104
Author

Thank you!
