Skip to content

cluster: use hostname for TLS validation instead of resolved IP#5266

Open
toller892 wants to merge 1 commit into
prometheus:mainfrom
toller892:fix/tls-hostname-validation
Open

cluster: use hostname for TLS validation instead of resolved IP#5266
toller892 wants to merge 1 commit into
prometheus:mainfrom
toller892:fix/tls-hostname-validation

Conversation

@toller892

@toller892 toller892 commented Jun 2, 2026

Copy link
Copy Markdown

Problem

When --cluster.peer specifies a hostname (e.g. myhost.example.com:9094), the resolvePeers function resolves it to an IP address. The TLS transport then validates the peer's certificate against this IP, requiring IP SANs. This fails for certificates that only have DNS SANs — which is the common case.

tls: failed to verify certificate: x509: cannot validate certificate for 49.12.57.136 because it doesn't contain any IP SANs

Users are forced to configure tls_client_config.server_name as a workaround, even though the hostname was already provided to --cluster.peer.

Fix

Track the mapping from resolved IP:port back to the original hostname during resolvePeers. When dialing via TLS, set ServerName to the original hostname so certificate validation checks against DNS SANs.

The mapping is maintained across peer refresh cycles so reconnections also use the correct hostname.

Changes:

  • cluster/cluster.go: resolvePeers now returns a map[string]string mapping resolved IP:port to original hostname. The Peer struct stores a reference to the TLS transport to update the mapping on refresh.
  • cluster/tls_transport.go: Added peerHostnames field and SetPeerHostnames method. DialTimeout and WriteTo use the hostname for TLS ServerName when available.
  • cluster/connection_pool.go: borrowConnection accepts an optional hostname parameter for TLS ServerName.
  • cluster/cluster_test.go: Added TestResolvePeersHostnameMapping test.

Fixes #5112

Summary by CodeRabbit

  • New Features

    • Enhanced TLS peer hostname mapping and validation for secure cluster communication
    • Improved peer resolution with hostname tracking and updates
  • Tests

    • Added test coverage for peer hostname mapping validation

When peers are specified by hostname (e.g. --cluster.peer myhost:9094),
the resolvePeers function resolves the hostname to an IP address. The TLS
transport then validates the certificate against this IP address, requiring
IP SANs in the certificate. This is unexpected since certificates are
typically bound to hostnames via DNS SANs.

Fix by tracking the mapping from resolved IP:port back to the original
hostname. When dialing a peer via TLS, set ServerName to the original
hostname so that certificate validation checks against DNS SANs.

The mapping is maintained across peer refresh cycles so that reconnections
also use the correct hostname for TLS validation.

Fixes prometheus#5112
@toller892 toller892 requested a review from a team as a code owner June 2, 2026 01:02
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The PR modifies the cluster gossip stack to validate TLS peer certificates against user-provided hostnames instead of resolved IP addresses. Peer addresses are resolved to IPs while hostnames are tracked separately and passed to the TLS transport for proper ServerName configuration during certificate verification.

Changes

TLS peer hostname mapping for certificate validation

Layer / File(s) Summary
TLS transport hostname mapping infrastructure
cluster/tls_transport.go, cluster/connection_pool.go
TLSTransport adds a peerHostnames map and SetPeerHostnames method. borrowConnection accepts a hostname parameter and uses it to clone and configure TLS ServerName for proper certificate validation in both WriteTo (packet transmission) and DialTimeout (stream connections).
Peer resolution with hostname extraction
cluster/cluster.go
resolvePeers now returns resolved addresses alongside a hostnames map that associates each resolved IP:port with its original hostname. Error paths are updated to match the new three-value return signature.
Peer struct and Create integration
cluster/cluster.go
Peer struct gains a tlsTransport field. Create captures the peerHostnames from resolvePeers and initializes the TLS transport via SetPeerHostnames before storing it on the instance.
Peer refresh with hostname updates
cluster/cluster.go
Peer.refresh uses the updated resolvePeers signature and calls SetPeerHostnames on the stored TLS transport when TLS is configured, keeping hostname mappings current across resolution refreshes.
Hostname mapping tests
cluster/cluster_test.go
Test suite imports net and adds TestResolvePeersHostnameMapping to verify that IP peers map to themselves and localhost resolves to an IP with a correct back-reference in the hostname mapping.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: using hostname instead of resolved IP for TLS validation.
Description check ✅ Passed The PR description is comprehensive and well-structured with problem statement, fix explanation, and detailed list of changes. It addresses the issue thoroughly.
Linked Issues check ✅ Passed The code changes fully address the requirements from issue #5112: track hostname-to-IP mappings during peer resolution and use the hostname for TLS ServerName validation instead of the resolved IP.
Out of Scope Changes check ✅ Passed All changes are directly related to the linked issue #5112: supporting hostname-based TLS validation. No out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cluster/tls_transport.go`:
- Around line 60-64: peerHostnames on TLSTransport is accessed concurrently
(written in SetPeerHostnames and read in WriteTo/DialTimeout) causing a data
race; add a sync.RWMutex field (e.g., peerHostnamesMtx) to TLSTransport, lock it
for writing inside SetPeerHostnames when replacing/updating the map, and use
RLock/RUnlock when reading in WriteTo and DialTimeout by adding a helper method
getPeerHostname(addr string) (returns hostname, bool) that acquires the read
lock and returns the map value; update all reads to call getPeerHostname and
ensure writes use the write lock to eliminate the race.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cbe7d159-74c6-43f6-869e-f80819664b28

📥 Commits

Reviewing files that changed from the base of the PR and between 45996cb and 806eba6.

📒 Files selected for processing (4)
  • cluster/cluster.go
  • cluster/cluster_test.go
  • cluster/connection_pool.go
  • cluster/tls_transport.go

Comment thread cluster/tls_transport.go
Comment on lines +60 to +64
// peerHostnames maps resolved IP:port addresses back to the original
// hostname provided by the user. This is used to set the TLS ServerName
// to the hostname instead of the IP, enabling proper certificate
// validation against DNS SANs.
peerHostnames map[string]string

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Data race on peerHostnames map.

SetPeerHostnames writes the map from Peer.refresh() (runs every 15s), while WriteTo (Line 220) and DialTimeout (Lines 239-246) read from it concurrently via memberlist goroutines. There is no synchronization protecting this field.

🔒 Proposed fix using sync.RWMutex

Add a mutex to protect peerHostnames:

 type TLSTransport struct {
 	ctx          context.Context
 	cancel       context.CancelFunc
 	logger       *slog.Logger
 	bindAddr     string
 	bindPort     int
 	done         chan struct{}
 	listener     net.Listener
 	packetCh     chan *memberlist.Packet
 	streamCh     chan net.Conn
 	connPool     *connectionPool
 	tlsServerCfg *tls.Config
 	tlsClientCfg *tls.Config

+	peerHostnamesMtx sync.RWMutex
 	// peerHostnames maps resolved IP:port addresses back to the original
 	// hostname provided by the user. This is used to set the TLS ServerName
 	// to the hostname instead of the IP, enabling proper certificate
 	// validation against DNS SANs.
 	peerHostnames map[string]string

Update SetPeerHostnames:

 func (t *TLSTransport) SetPeerHostnames(hostnames map[string]string) {
+	t.peerHostnamesMtx.Lock()
+	defer t.peerHostnamesMtx.Unlock()
 	t.peerHostnames = hostnames
 }

Add a helper to safely read hostname:

func (t *TLSTransport) getPeerHostname(addr string) (string, bool) {
	t.peerHostnamesMtx.RLock()
	defer t.peerHostnamesMtx.RUnlock()
	host, ok := t.peerHostnames[addr]
	return host, ok
}

Then use t.getPeerHostname(addr) in WriteTo and DialTimeout.

Also applies to: 145-150, 220-220, 238-246

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cluster/tls_transport.go` around lines 60 - 64, peerHostnames on TLSTransport
is accessed concurrently (written in SetPeerHostnames and read in
WriteTo/DialTimeout) causing a data race; add a sync.RWMutex field (e.g.,
peerHostnamesMtx) to TLSTransport, lock it for writing inside SetPeerHostnames
when replacing/updating the map, and use RLock/RUnlock when reading in WriteTo
and DialTimeout by adding a helper method getPeerHostname(addr string) (returns
hostname, bool) that acquires the read lock and returns the map value; update
all reads to call getPeerHostname and ensure writes use the write lock to
eliminate the race.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

clustering: peer TLS certificates are validated against their IP address instead of their hostname

1 participant