ttaaefs_peerip causes silent node failure when set with an FQDN #12

Open
TM553432 opened this issue Feb 18, 2025 · 2 comments

Comments

@TM553432

Discovered while setting up both realtime and fullsync NextGenRepl between two fully functional clusters of identical node count, with every node running KV 3.2.4. As part of setting up fullsync, the ttaaefs_peerip setting on the source cluster has to point to the fullsync peer in the sink cluster. Because an FQDN had been set up for each node, we put the FQDN rather than the IP address in ttaaefs_peerip on the source cluster. This caused Riak to fall over silently, with nothing in the logs to indicate the cause. Changing the setting to a hard-coded IPv4 address fixed it.
In an environment where IP addresses can change, this seems at odds with Riak's ability to use an FQDN in the nodename.
To replicate:

  1. Create two clusters.
  2. Set up NextGenRepl as one would normally.
  3. On a source node, set ttaaefs_peerip to an FQDN that points to a sink node.
  4. Restart Riak on the source node you updated.
  5. Check whether Riak started. In our case it had not, and nothing was logged.
  6. On the same source node, change ttaaefs_peerip to the IPv4 address of the same sink node.
  7. Restart Riak on the source node you updated.
  8. Check whether Riak started. In our case it had.
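
As a sketch of steps 3 and 6, the two riak.conf variants would look roughly like this. The hostname and address are placeholders rather than values from our clusters (192.0.2.10 comes from the documentation address range), and only one of the two lines would be active at a time.

```
## Steps 3-5: FQDN of the sink node (Riak does not start)
ttaaefs_peerip = sink1.example.net

## Steps 6-8: IPv4 address of the same sink node (Riak starts)
# ttaaefs_peerip = 192.0.2.10
```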
TM553432 changed the title from "Riak and Listener are correct but the node is not reachable" to "ttaaefs_peerip causes silent node failure when set with an FQDN" on Feb 18, 2025
@martinsumner
Contributor

The configuration schema specifically requires it to be an IP address, and uses a validation function to confirm:

https://github.com/OpenRiak/riak_kv/blob/openriak-3.2/priv/riak_kv.schema#L1110-L1129

So there should be some sort of Cuttlefish error at startup, but that operator-facing message may have been lost in the upgrade of relx.
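
To illustrate why an FQDN cannot pass an IP-address-only check, here is a minimal sketch. It is not the validator from the linked schema: the module and function names are hypothetical, and it only assumes a check along the lines of OTP's inet:parse_address/1, which accepts literal IPv4 and IPv6 addresses and nothing else.

```erlang
-module(peerip_check).
-export([classify/1]).

%% Hypothetical helper for illustration only; not part of riak_kv.
%% Reports whether a candidate ttaaefs_peerip value is a literal IP address.
-spec classify(string()) -> ip_address | not_an_ip_address.
classify(Value) ->
    case inet:parse_address(Value) of
        %% Literal IPv4 or IPv6 address, e.g. "192.0.2.10"
        {ok, _Addr} -> ip_address;
        %% Anything else, including hostnames such as "fqdn.example.net"
        {error, einval} -> not_an_ip_address
    end.
```

In an Erlang shell, peerip_check:classify("fqdn.example.net") returns not_an_ip_address and peerip_check:classify("192.0.2.10") returns ip_address, which matches the behaviour one would expect from an IP-only validator.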

It might be that an FQDN would work, and this is just a schema issue. Looking at the code, the parsed string is simply passed to the Riak Erlang client's start_link function, which can take an FQDN. You could test this by setting the FQDN via advanced.config, which bypasses the IP address validator in the riak.conf schema:

%% In advanced.config, as an entry in the top-level list:
{riak_kv, [
    {ttaaefs_peerip, "fqdn.example.net"}
]}

@martinsumner
Contributor

You may find riak chkconfig useful to confirm riak.conf is correct before trying to run riak - https://docs.riak.com/riak/kv/latest/using/admin/riak-cli/index.html#chkconfig.
