Regression in 2.3.1 compared to 2.2.8 with virtual_server #2435

bleve · 2024-06-26T17:07:54Z

Describe the bug

When keepalived.service starts after machine boot with virtual_servers configured, keepalived fails to start.

After the first start attempt, only the ip_vs module is loaded.

When keepalived.service is restarted, the second start manages to load the ip_vs_rr module and keepalived starts properly.

To Reproduce

Happens every time the machine is rebooted: on first startup the virtual_server code fails to start.

Expected behavior

I'd expect all modules to be loaded automatically and the service to work.

Keepalived version

2.3.1

Output of keepalived -v

Keepalived v2.3.1 (05/24,2024)

Copyright(C) 2001-2024 Alexandre Cassen, <[email protected]>

Built with kernel headers for Linux 5.14.0
Running on Linux 5.14.0-456.foo9.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May 29 11:54:04 EEST 2024
Distro: Foobar Linux 9 (0b1001)

configure options: --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --enable-snmp --enable-snmp-rfc --enable-nftables --disable-iptables --enable-json --runstatedir=/run --with-tmp-dir=/run/keepalived --with-init=systemd build_alias=x86_64-redhat-linux-gnu host_alias=x86_64-redhat-linux-gnu PKG_CONFIG_PATH=:/usr/lib64/pkgconfig:/usr/share/pkgconfig CC=gcc CFLAGS=-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection LDFLAGS=-Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 

Config options:  NFTABLES LVS VRRP VRRP_AUTH VRRP_VMAC JSON OLD_CHKSUM_COMPAT SNMP_V3_FOR_V2 SNMP_VRRP SNMP_CHECKER SNMP_RFCV2 SNMP_RFCV3 INIT=systemd SYSTEMD_NOTIFY

System options:  VSYSLOG MEMFD_CREATE IPV6_MULTICAST_ALL IPV4_DEVCONF LIBNL3 RTA_ENCAP RTA_EXPIRES RTA_NEWDST RTA_PREF FRA_SUPPRESS_PREFIXLEN FRA_SUPPRESS_IFGROUP FRA_TUN_ID RTAX_CC_ALGO RTAX_QUICKACK RTEXT_FILTER_SKIP_STATS FRA_L3MDEV FRA_UID_RANGE RTAX_FASTOPEN_NO_COOKIE RTA_VIA FRA_PROTOCOL FRA_IP_PROTO FRA_SPORT_RANGE FRA_DPORT_RANGE RTA_TTL_PROPAGATE IFA_FLAGS LWTUNNEL_ENCAP_MPLS LWTUNNEL_ENCAP_ILA NET_LINUX_IF_H_COLLISION LIBIPTC_LINUX_NET_IF_H_COLLISION LIBIPVS_NETLINK IPVS_DEST_ATTR_ADDR_FAMILY IPVS_SYNCD_ATTRIBUTES IPVS_64BIT_STATS IPVS_TUN_TYPE IPVS_TUN_CSUM IPVS_TUN_GRE VRRP_IPVLAN IFLA_LINK_NETNSID GLOB_BRACE GLOB_ALTDIRFUNC INET6_ADDR_GEN_MODE VRF SO_MARK

Distro (please complete the following information):

Name: Foobar Linux
Version: 9
Architecture: x86_64

Details of any containerisation or hosted service (e.g. AWS)
QEMU/KVM VM

Configuration file:

global_defs {
    notification_email {
        root
    }
    enable_script_security
    notification_email_from root
    script_user root root
    smtp_connect_timeout 30
    smtp_server localhost
    v3_checksum_as_v2
    vrrp_garp_interval 0.5
    vrrp_gna_interval 0.5
    vrrp_higher_prio_send_advert true
    vrrp_startup_delay 10
    vrrp_version 3
}

vrrp_sync_group foo-ipvs {
    group {
        ipv4
        ipv6
    }
    #smtp_alert yes
    notify_master "/usr/libexec/keepalived/notify master"
    notify_backup "/usr/libexec/keepalived/notify backup"
    notify_fault "/usr/libexec/keepalived/notify fault"
}

vrrp_instance ipv4 {
    @foo-ipvs-01 priority 100
    @foo-ipvs-02 priority 99
    advert_int 1
    garp_lower_prio_repeat 1
    interface enp1s0
    nopreempt
    state BACKUP
    virtual_router_id 19
    virtual_ipaddress {
        172.23.50.19/24 dev enp1s0
    }
}

vrrp_instance ipv6 {
    @foo-ipvs-01 priority 100
    @foo-ipvs-02 priority 99
    advert_int 1
    garp_lower_prio_repeat 1
    interface enp1s0
    nopreempt
    state BACKUP
    virtual_router_id 19
    virtual_ipaddress {
        2001:db8:c02d:f50::19/64 dev enp1s0 preferred_lft 0
    }
}

virtual_server 172.23.50.19 80 {
    delay_loop 10
    protocol TCP
    lvs_sched rr
    lvs_method DR
    persistence_timeout 30
    real_server 172.23.50.20 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 5
            connect_port 80
        }
    }
    real_server 172.23.50.21 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 5
            connect_port 80
        }
    }
}

virtual_server 2001:db8:c02d:f50::19 80 {
    delay_loop 10
    protocol TCP
    lvs_sched rr
    lvs_method DR
    persistence_timeout 30
    real_server 2001:db8:c02d:f50::20 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 5
            connect_port 80
        }
    }
    real_server 2001:db8:c02d:f50::21 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 5
            connect_port 80
        }
    }
}

Notify and track scripts

In this case the notify script was not configured, so it was a no-op.

System Log entries

Jun 26 17:00:29 foo-ipvs-01 systemd[1]: Starting LVS and VRRP High Availability Monitor...
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Starting Keepalived v2.3.1 (05/24,2024)
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Running on Linux 5.14.0-456.foo9.x86_64 #1 SMP PREEMPT_DYNAMIC Wed May 29 11:54:04 EEST 2024 (built for Linux 5.14.0)
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Command line: '/usr/sbin/keepalived' '--dont-fork' '--log-detail'
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Opening file '/etc/keepalived/keepalived.conf'.
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Configuration file /etc/keepalived/keepalived.conf
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: NOTICE: setting config option max_auto_priority should result in better keepalived performance
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Starting Healthcheck child process, pid=784
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Starting VRRP child process, pid=785
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Initializing ipvs
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Registering Kernel netlink reflector
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Registering Kernel netlink command channel
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Delaying startup for 10 seconds
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Assigned address 172.23.50.20 for interface enp1s0
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Assigned address fe80::5054:ff:fecb:4b13 for interface enp1s0
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: (ipv6) the first IPv6 VIP address should be link local
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Registering gratuitous ARP shared channel
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: Registering gratuitous NDISC shared channel
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: (ipv4) removing VIPs.
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: (ipv6) removing VIPs.
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: (ipv4) Entering BACKUP STATE (init)
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: (ipv6) Entering BACKUP STATE (init)
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: VRRP sockpool: [ifindex(  2), family(IPv4), proto(112), fd(13,14) multicast, address(224.0.0.18)]
Jun 26 17:00:29 foo-ipvs-01 Keepalived_vrrp[785]: VRRP sockpool: [ifindex(  2), family(IPv6), proto(112), fd(15,16) multicast, address(ff02::12)]
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: IPVS: Can't initialize ipvs: No such file or directory
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Shutting down service [172.23.50.20]:tcp:80 from VS [172.23.50.19]:tcp:80
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Shutting down service [172.23.50.21]:tcp:80 from VS [172.23.50.19]:tcp:80
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Shutting down service [2001:db8:c02d:f50::20]:tcp:80 from VS [2001:db8:c02d:f50::19]:tcp:80
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Shutting down service [2001:db8:c02d:f50::21]:tcp:80 from VS [2001:db8:c02d:f50::19]:tcp:80
Jun 26 17:00:29 foo-ipvs-01 Keepalived_healthcheckers[784]: Stopped - used (self/children) 0.001919/0.008294 user time, 0.000000/0.008293 system time
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: pid 784 exited with permanent error FATAL. Terminating
Jun 26 17:00:29 foo-ipvs-01 systemd[1]: keepalived.service: Main process exited, code=exited, status=1/FAILURE
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: CPU usage (self/children) user: 0.005101/0.010213 system: 0.007496/0.009083
Jun 26 17:00:29 foo-ipvs-01 systemd[1]: keepalived.service: Failed with result 'exit-code'.
Jun 26 17:00:29 foo-ipvs-01 Keepalived[783]: Stopped Keepalived v2.3.1 (05/24,2024)
Jun 26 17:00:29 foo-ipvs-01 systemd[1]: keepalived.service: Unit process 785 (keepalived) remains running after unit stopped.
Jun 26 17:00:29 foo-ipvs-01 systemd[1]: Failed to start LVS and VRRP High Availability Monitor.
Jun 26 17:00:30 foo-ipvs-01 Keepalived_vrrp[785]: Stopped - used (self/children) 0.001142/0.003573 user time, 0.003401/0.013423 system time

Did keepalived coredump?

No

Additional context

With version 2.2.8 exactly the same configuration worked just fine, so this is a regression since that release.


pqarmitage · 2024-06-29T17:04:35Z

Unfortunately I cannot test this issue on Foobar Linux since I do not have a subscription for it. However, nothing has changed in keepalived in relation to module loading since keepalived v2.2.8, and I have never come across this problem (nor has it been reported) before.

Once the ip_vs kernel module is loaded (and keepalived does this if necessary) then any of the other kernel modules (such as ip_vs_rr) should be automatically loaded by the kernel; certainly keepalived has never had, and has never needed to have, functionality to load any of the other ipvs modules (the ipvsadm utility likewise only loads the ip_vs module).

I have looked at the code that handles loading the ip_vs module, and in certain circumstances errno may not be set appropriately so the "no such file or directory" error may be incorrect, but it doesn't look like that to me. I will tidy that up soon.

My guess about what is occurring at startup is that the ip_vs module is being loaded by keepalived, but that the subsequent call of ipvs_init() (see ipvs_start() in keepalived/check/ipvswrapper.c) is occurring too quickly after loading the module and returns an error; the keepalived_healthchecker process then terminates. When you restart keepalived, since the ip_vs module is already loaded, the problem no longer occurs.

As a workaround you can add a startup script. For example, add:

global_defs {
    startup_script /etc/keepalived/keepalived_start.sh
}

and /etc/keepalived/keepalived_start.sh:

#!/bin/bash

modprobe ip_vs

If necessary you could add a loop after the modprobe to check until the ip_vs module has loaded.
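
A minimal sketch of such a loop follows. It assumes that the ip_vs module showing up in /proc/modules is a good enough signal that loading has finished. The function names module_loaded and wait_for_module, the one-second polling interval, and the retry count are all hypothetical choices for illustration, not part of keepalived.

```shell
#!/bin/bash
# Sketch only: helper functions for a startup script such as
# /etc/keepalived/keepalived_start.sh. Function names are hypothetical.

# module_loaded NAME [FILE]
# Succeeds if NAME is listed in FILE (default: /proc/modules).
# Each line of /proc/modules starts with the module name followed by a
# space, so "ip_vs " does not match the "ip_vs_rr " line.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

# wait_for_module NAME TRIES [FILE]
# Polls once per second, up to TRIES times; fails if NAME never appears.
wait_for_module() {
    _tries=$2
    while [ "$_tries" -gt 0 ]; do
        module_loaded "$1" "$3" && return 0
        _tries=$((_tries - 1))
        sleep 1
    done
    return 1
}
```

In the startup script, the modprobe line would then be followed by something like `wait_for_module ip_vs 5 || exit 1`, so the script gives up after about five seconds instead of waiting forever.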

bleve · 2024-06-29T17:24:11Z

I guess it is a timing issue. You can test with CentOS Stream 9; it has the same kernel and the same tooling.

bleve · 2024-06-29T19:04:51Z

But again: there were no changes other than building 2.3.1 instead of 2.2.8, with the same options, and this problem appeared.

pqarmitage · 2024-06-30T09:34:59Z

I can reproduce this on CentOS Stream 9. If I remove the ip_vs module before running keepalived, then the problem occurs. I will investigate further.

pqarmitage · 2024-07-01T10:50:05Z

It appears that if we need to install the ip_vs module (via keepalived_modprobe), then the first ipvs_nl_send_message() call fails, but it succeeds thereafter.

On most distros the call genl_ctrl_resolve(sock, IPVS_GENL_NAME) loads the ip_vs module, so we don't need to call keepalived_modprobe(), and we don't need to retry the ipvs_nl_send_message(). On the other hand, if genl_ctrl_resolve() does not load the ip_vs module, then we need to make the second call of ipvs_nl_send_message(). It appears that RHEL-based distros (including CentOS Stream but not Fedora) do not load the ip_vs module when genl_ctrl_resolve() is called, but most other distros do.

It appears that if there is an entry:
alias net-pf-16-proto-16-family-IPVS ip_vs
in /lib/modules/{KERNEL_VER}/modules.alias then genl_ctrl_resolve() will cause the ip_vs module to be loaded, and if there is no such entry, then it does not load the ip_vs module. Whether this is cause and effect, or whether they are both caused by some other aspect of the kernel configuration I do not know.
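
To see which case a given machine falls into, the alias entry can be looked up directly. The sketch below only reports whether the line quoted above exists for the running kernel. As noted, it is not established that the entry itself is what makes genl_ctrl_resolve() load the module, so treat the result as a hint rather than a diagnosis. The helper name has_ipvs_genl_alias is hypothetical.

```shell
#!/bin/bash
# Sketch only: report whether modules.alias for the running kernel
# contains the generic-netlink alias for ip_vs quoted above.

# has_ipvs_genl_alias FILE  (hypothetical helper name)
# Succeeds if FILE contains the IPVS genetlink alias line.
has_ipvs_genl_alias() {
    grep -Eq '^alias[[:space:]]+net-pf-16-proto-16-family-IPVS[[:space:]]+ip_vs[[:space:]]*$' "$1" 2>/dev/null
}

alias_file="/lib/modules/$(uname -r)/modules.alias"
if has_ipvs_genl_alias "$alias_file"; then
    echo "alias present in $alias_file"
else
    echo "alias not found in $alias_file"
fi
```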

It was commit c7bade7 that caused the problem, although on the face of it it shouldn't have made any difference. Moving the check for !msg to be the first code in ipvs_nl_send_message() meant that the rather bizarre ipvs_nl_send_message(NULL, NULL, NULL) in ipvs_init() no longer called open_nl_sock(). Moving the check for !msg to after the call of open_nl_sock() resolved this issue, but was clearly wrong since nlmsg_free(msg) was then called with a NULL pointer (although nlmsg_free() does check for that). The solution is to call ipvs_nl_send_message() twice in ipvs_getinfo() if we have loaded the ip_vs module.

Commit a0b6d3b resolves this issue.

pqarmitage closed this as completed Jul 1, 2024
