
[Bug]: start the overlay mesh before the routing is added #3227

Open
1 task done
atanas18 opened this issue Nov 27, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@atanas18

atanas18 commented Nov 27, 2024

What happened?

On a server reboot, the overlay mesh is started before the necessary ip route rules are added (the rules that route that IP traffic through the netmaker interface).
Because we have hundreds of mesh nodes on RFC1918 addresses (behind NAT), on reboot a public node (one not behind NAT) starts searching for peers in the overlay mesh via their RFC1918 addresses. This triggers a Hetzner "Netscan detected" abuse report, because hundreds of requests go out to RFC1918 subnets. Hetzner does not like traffic to RFC1918 subnets over the public interface, and while the route is not yet up, the node initially sends hundreds of TCP connection attempts trying to reach the peers behind NAT. Once the route is up, the problem stops and RFC1918 traffic is no longer sent over the public interface.
This behavior started after updating from 0.24.1 to 0.25.0 and also happens on 0.26.0. On 0.24.1 and earlier we didn't have this problem, so I guess something changed between these versions.

Thanks.

Version

v0.25.0

What OS are you using?

Linux

Relevant log output

No response

Contributing guidelines

  • Yes, I did.
@atanas18 atanas18 added the bug Something isn't working label Nov 27, 2024
@yabinma
Collaborator

yabinma commented Nov 28, 2024

@atanas18 can you please share more details?

  1. Did the issue happen on the Netmaker server side or the netclient side?
  2. Did it happen after a Netmaker server restart or after a client machine restart?
  3. How are the ip routes added after the restart? netclient is brought up by the system daemon; for example, on Ubuntu that's systemd.
  4. Any screenshots or logs would be helpful for the investigation.

Thanks.

@atanas18
Author

atanas18 commented Nov 28, 2024

Hi @yabinma

  1. happened on netclient side
  2. on a client machine restart
  3. systemd service
  4. the log (from journalctl -u netclient):

Nov 27 08:53:33 systemd[1]: Starting netclient.service - Netclient Daemon...
Nov 27 08:53:50 systemd[1]: Started netclient.service - Netclient Daemon.
Nov 27 08:53:50 netclient[73807]: daemon called
Nov 27 08:53:50 netclient[73807]: [netclient] 2024-11-27 08:53:50 Starting firewall...
Nov 27 08:53:50 netclient[73807]: [netclient] 2024-11-27 08:53:50 iptables is supported
Nov 27 08:53:50 netclient[73807]: [netclient] 2024-11-27 08:53:50 adding forwarding rule
Nov 27 08:53:51 netclient[73807]: {"time":"2024-11-27T08:53:51.040946126Z","level":"ERROR","source":"daemon.go 229}","msg":"fail to pull config from server","error":"server config not found"}
Nov 27 08:53:51 netclient[73807]: [netclient] 2024-11-27 08:53:51 adding addresses to netmaker interface
Nov 27 08:54:52 netclient[73807]: [netclient] 2024-11-27 08:54:52 flushing netmaker rules...
Nov 27 08:54:53 netclient[73807]: [netclient] 2024-11-27 08:54:53 Starting firewall...
Nov 27 08:54:53 netclient[73807]: [netclient] 2024-11-27 08:54:53 iptables is supported
Nov 27 08:54:53 netclient[73807]: [netclient] 2024-11-27 08:54:53 adding forwarding rule
Nov 27 08:54:55 netclient[73807]: completed pull for server netmaker.domain.tld
Nov 27 08:54:55 netclient[73807]: [netclient] 2024-11-27 08:54:55 adding addresses to netmaker interface
Nov 27 08:54:55 netclient[73807]: [netclient] 2024-11-27 08:54:55 initialized endpoint detection on port 51821
Nov 27 08:54:55 netclient[73807]: [netclient] 2024-11-27 08:54:55 initialized endpoint detection on port 51821
Nov 27 08:55:00 netclient[73807]: [netclient] 2024-11-27 08:55:00 adding addresses to netmaker interface
Nov 27 08:55:53 netclient[73807]: [netclient] 2024-11-27 08:55:53 adding addresses to netmaker interface

This client is currently on v0.26.0.

thanks

@yabinma
Collaborator

yabinma commented Nov 28, 2024

@atanas18, if the ip route change is managed by a systemd service as well, can you please add a dependency in the systemd configuration file? In /etc/systemd/system/netclient.service, add your service to the After= section, so that netclient starts after your ip route change service.
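If the routes come from your own systemd unit, a drop-in like this could express the ordering (a sketch; `my-routes.service` is a hypothetical placeholder for whatever unit actually adds your routes):

```ini
# /etc/systemd/system/netclient.service.d/override.conf (drop-in sketch)
# "my-routes.service" below is a placeholder for your route-setup unit.
[Unit]
After=network-online.target my-routes.service
Wants=network-online.target
```

Run `systemctl daemon-reload` and restart netclient for the drop-in to take effect.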

@atanas18
Author

@yabinma ah, maybe I didn't understand your 3rd question correctly. I meant the ip route that netclient itself adds, not custom routes that I add myself.
This one:

172.16.10.0/24 dev netmaker proto kernel scope link src 172.16.10.11

@yabinma
Collaborator

yabinma commented Nov 28, 2024


@atanas18, my bad, I may have misunderstood the issue.

172.16.10.0/24 dev netmaker proto kernel scope link src 172.16.10.11: yes, this route is added after the netmaker interface comes up.
How does that cause the problem in this case? Please help me understand the issue.

@atanas18
Author

I sat down and rethought the situation, and I think I misled you somehow.
We have many clients behind NAT (but say the current client is not in that NAT; it only has a public IP). netclient tries to open many connections to the RFC1918 IPs to reach the netclients behind the NAT, which it cannot and shouldn't do. I think the current netclient should first check whether it has an interface (or a route) toward those RFC1918 subnets before trying to reach them; otherwise there is no point, as there is no port forwarding from the NAT's public IP to the clients behind it, and only hole punching works, once a client behind the NAT tries to reach the current client.
I hope you understand me. I think this is the real problem.
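The pre-check proposed here could be sketched roughly as follows (an illustration of the idea only, not netclient's actual logic; the function name `should_dial` and the interface name `netmaker` are assumptions):

```shell
# Sketch of the pre-check idea above (not netclient's real code).
# Assumptions: overlay interface is named "netmaker"; the route table is
# passed in as text so the decision logic is easy to demonstrate.
should_dial() {
  peer="$1"     # candidate peer endpoint
  routes="$2"   # e.g. the output of `ip route`
  case "$peer" in
    10.*|192.168.*|172.1[6-9].*|172.2[0-9].*|172.3[01].*)
      # RFC1918 endpoint: only dial if a route via the overlay exists,
      # so we never probe private ranges over the public interface.
      printf '%s\n' "$routes" | grep -q 'dev netmaker'
      ;;
    *)
      return 0  # public endpoint: always fine to dial
      ;;
  esac
}

routes="default via 203.0.113.1 dev eth0"
should_dial 10.0.0.5 "$routes" && echo dial || echo skip  # prints "skip"
```

Once the overlay route (such as `172.16.10.0/24 dev netmaker ...`) is present, the same call would return success and the peer would be dialed.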

@yabinma
Collaborator

yabinma commented Nov 28, 2024

@atanas18, I hope my understanding is correct.
By default, all hosts registered in the same network are designed to connect to each other, whether they are behind a firewall/NAT or on a public IP.
But your requirement may be fulfilled with the ACL feature.
All resources can communicate with each other by default because of the default policy.

You could create a customized policy: tag all the nodes behind NAT as a group and allow them to communicate with each other, and set up a different policy for the nodes with public IPs.
This way you can control access among all the nodes in the network.

CC @abhishek9686

@atanas18
Author

Well, in the end they must all communicate with each other; it's just that in the beginning the nodes with only public IPs should not try to reach the ones with only private IPs :) that's my point. When the nodes with private IPs try to reach the public ones, hole punching is enough to establish communication both ways. But when a node with only a public IP tries to reach the ones with private IPs, it's a no-go anyway: it can't reach them for sure, and it just generates traffic that Hetzner catches, flags as port scanning, and sends an abuse email about. There's really no point in trying to reach RFC1918 IPs if you only have a public interface (except through the netmaker interface, of course; that one can be RFC1918, but either way it shouldn't make the first attempt to reach the clients through itself).

Is your proposal going to work for what I'm trying to explain? Can I make the group behind NAT able to communicate with EVERYTHING, and then have the nodes with only public IPs also able to communicate with the ones behind NAT, but only and ONLY after the ones behind NAT initiate the hole punch? The nodes with public IPs should not try to reach the ones behind NAT before that (otherwise, as I said, it generates an abuse ticket at Hetzner, which I have to answer and explain so they don't shut off the server; without an explanation, that's their final decision: shutting off the server).

Thanks.

@yabinma
Collaborator

yabinma commented Nov 28, 2024

@atanas18, what traffic is captured before the netmaker interface comes up? Are there source and destination IPs in the Hetzner scan report?

As I checked the code, there is no peer communication before the netmaker interface is up.
After the interface is up, there is endpoint detection, but that happens after the interface comes up, not before.
You may disable endpoint detection and check again: set ENDPOINT_DETECTION=false in the netmaker.env file on the Netmaker server side, then restart the Netmaker server and test again.
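For reference, the change would look something like this (the exact location of netmaker.env depends on how the server was installed, so treat the path as an assumption):

```ini
# netmaker.env on the Netmaker server (location varies by install)
ENDPOINT_DETECTION=false
```

After editing, restart the server, for example with `docker compose restart netmaker` on a Docker Compose deployment, so the setting cascades down to the netclients.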

@atanas18
Author

atanas18 commented Nov 29, 2024

Doesn't ENDPOINT_DETECTION affect only the server? I'll check the documentation on that. The problem for me is definitely on the client side.
Also, I'm not sure this happens before the netmaker interface is up. It could happen after the interface is up; the point is that since the Hetzner clients are not in the same 10.0.0.0/16 NAT range, a Hetzner client shouldn't try to connect over those 10.0.0.0/16 IPs to the clients behind the NAT.

@abhishek9686
Member


@atanas18 the client will update its peer endpoint to a private IP only if it's able to communicate over it; otherwise it uses the public IP.

@yabinma
Collaborator

yabinma commented Nov 29, 2024


ENDPOINT_DETECTION is a server-side setting, but it cascades to the client side and changes netclient's behavior.

@yabinma
Collaborator

yabinma commented Dec 4, 2024

@atanas18, have you had a chance to try turning off the ENDPOINT_DETECTION option?

Projects
None yet
Development

No branches or pull requests

4 participants