cluster: use hostname for TLS validation instead of resolved IP#5266
cluster: use hostname for TLS validation instead of resolved IP#5266toller892 wants to merge 1 commit into
Conversation
When peers are specified by hostname (e.g. --cluster.peer myhost:9094), the resolvePeers function resolves the hostname to an IP address. The TLS transport then validates the certificate against this IP address, requiring IP SANs in the certificate. This is unexpected since certificates are typically bound to hostnames via DNS SANs. Fix by tracking the mapping from resolved IP:port back to the original hostname. When dialing a peer via TLS, set ServerName to the original hostname so that certificate validation checks against DNS SANs. The mapping is maintained across peer refresh cycles so that reconnections also use the correct hostname for TLS validation. Fixes prometheus#5112
📝 WalkthroughWalkthroughThe PR modifies the cluster gossip stack to validate TLS peer certificates against user-provided hostnames instead of resolved IP addresses. Peer addresses are resolved to IPs while hostnames are tracked separately and passed to the TLS transport for proper ServerName configuration during certificate verification. ChangesTLS peer hostname mapping for certificate validation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cluster/tls_transport.go`:
- Around line 60-64: peerHostnames on TLSTransport is accessed concurrently
(written in SetPeerHostnames and read in WriteTo/DialTimeout) causing a data
race; add a sync.RWMutex field (e.g., peerHostnamesMtx) to TLSTransport, lock it
for writing inside SetPeerHostnames when replacing/updating the map, and use
RLock/RUnlock when reading in WriteTo and DialTimeout by adding a helper method
getPeerHostname(addr string) (returns hostname, bool) that acquires the read
lock and returns the map value; update all reads to call getPeerHostname and
ensure writes use the write lock to eliminate the race.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: cbe7d159-74c6-43f6-869e-f80819664b28
📒 Files selected for processing (4)
cluster/cluster.gocluster/cluster_test.gocluster/connection_pool.gocluster/tls_transport.go
| // peerHostnames maps resolved IP:port addresses back to the original | ||
| // hostname provided by the user. This is used to set the TLS ServerName | ||
| // to the hostname instead of the IP, enabling proper certificate | ||
| // validation against DNS SANs. | ||
| peerHostnames map[string]string |
There was a problem hiding this comment.
Data race on peerHostnames map.
SetPeerHostnames writes the map from Peer.refresh() (runs every 15s), while WriteTo (Line 220) and DialTimeout (Lines 239-246) read from it concurrently via memberlist goroutines. There is no synchronization protecting this field.
🔒 Proposed fix using sync.RWMutex
Add a mutex to protect peerHostnames:
type TLSTransport struct {
ctx context.Context
cancel context.CancelFunc
logger *slog.Logger
bindAddr string
bindPort int
done chan struct{}
listener net.Listener
packetCh chan *memberlist.Packet
streamCh chan net.Conn
connPool *connectionPool
tlsServerCfg *tls.Config
tlsClientCfg *tls.Config
+ peerHostnamesMtx sync.RWMutex
// peerHostnames maps resolved IP:port addresses back to the original
// hostname provided by the user. This is used to set the TLS ServerName
// to the hostname instead of the IP, enabling proper certificate
// validation against DNS SANs.
peerHostnames map[string]stringUpdate SetPeerHostnames:
func (t *TLSTransport) SetPeerHostnames(hostnames map[string]string) {
+ t.peerHostnamesMtx.Lock()
+ defer t.peerHostnamesMtx.Unlock()
t.peerHostnames = hostnames
}Add a helper to safely read hostname:
func (t *TLSTransport) getPeerHostname(addr string) (string, bool) {
t.peerHostnamesMtx.RLock()
defer t.peerHostnamesMtx.RUnlock()
host, ok := t.peerHostnames[addr]
return host, ok
}Then use t.getPeerHostname(addr) in WriteTo and DialTimeout.
Also applies to: 145-150, 220-220, 238-246
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cluster/tls_transport.go` around lines 60 - 64, peerHostnames on TLSTransport
is accessed concurrently (written in SetPeerHostnames and read in
WriteTo/DialTimeout) causing a data race; add a sync.RWMutex field (e.g.,
peerHostnamesMtx) to TLSTransport, lock it for writing inside SetPeerHostnames
when replacing/updating the map, and use RLock/RUnlock when reading in WriteTo
and DialTimeout by adding a helper method getPeerHostname(addr string) (returns
hostname, bool) that acquires the read lock and returns the map value; update
all reads to call getPeerHostname and ensure writes use the write lock to
eliminate the race.
Problem
When
--cluster.peerspecifies a hostname (e.g.myhost.example.com:9094), theresolvePeersfunction resolves it to an IP address. The TLS transport then validates the peer's certificate against this IP, requiring IP SANs. This fails for certificates that only have DNS SANs — which is the common case.Users are forced to configure
tls_client_config.server_nameas a workaround, even though the hostname was already provided to--cluster.peer.Fix
Track the mapping from resolved IP:port back to the original hostname during
resolvePeers. When dialing via TLS, setServerNameto the original hostname so certificate validation checks against DNS SANs.The mapping is maintained across peer refresh cycles so reconnections also use the correct hostname.
Changes:
cluster/cluster.go:resolvePeersnow returns amap[string]stringmapping resolved IP:port to original hostname. ThePeerstruct stores a reference to the TLS transport to update the mapping on refresh.cluster/tls_transport.go: AddedpeerHostnamesfield andSetPeerHostnamesmethod.DialTimeoutandWriteTouse the hostname for TLSServerNamewhen available.cluster/connection_pool.go:borrowConnectionaccepts an optional hostname parameter for TLSServerName.cluster/cluster_test.go: AddedTestResolvePeersHostnameMappingtest.Fixes #5112
Summary by CodeRabbit
New Features
Tests