Problem Description
I'm experiencing a recurring issue with long-lived gRPC connections being reset after periods of inactivity. The application initializes its gRPC clients during startup and stores them in a configuration map.
After approximately 38 minutes of inactivity, subsequent calls fail with:
UNAVAILABLE: io exception
Caused by: io.grpc.netty.shaded.io.netty.channel.unix.Errors$NativeIoException: readAddress(..) failed: Connection reset by peer
16:26:53 - Successful gRPC call made
17:04:26 - Call fails with "Connection reset by peer" (after ~38 minutes of inactivity)
The gRPC client application and the gRPC server are both deployed on Kubernetes, in separate namespaces.
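My current suspicion is that some intermediate hop (kube-proxy conntrack, a firewall, or a load balancer) silently drops the idle TCP connection and the client only notices on the next call. For reference, this is roughly the channel configuration I have been considering as a sketch; keepAliveWithoutCalls and idleTimeout are assumptions on my part that I have not yet verified actually fix the problem:

// Sketch only: channel options I am considering for surviving idle periods (untested)
private ManagedChannel buildChannel(String host, int port) {
    return ManagedChannelBuilder.forAddress(host, port)
            .usePlaintext()
            .keepAliveTime(120, TimeUnit.SECONDS)    // send periodic HTTP/2 PING frames
            .keepAliveTimeout(60, TimeUnit.SECONDS)  // close the connection if a PING ack doesn't arrive in time
            .keepAliveWithoutCalls(true)             // also ping when no RPCs are in flight (the idle case here)
            .idleTimeout(5, TimeUnit.MINUTES)        // let the channel go IDLE and reconnect on the next call
            .build();
}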
Example code:
@Configuration
public class ServiceConfig {

    // Host/port lookups populated from application configuration (details omitted)
    private Map<String, String> hostMap = new HashMap<>();
    private Map<String, Integer> portMap = new HashMap<>();

    private Map<String, ServiceGrpcClient> clientMap = new HashMap<>();

    @PostConstruct
    public void init() {
        // Eagerly create one client per configured host at startup
        hostMap.forEach((key, host) -> {
            ServiceGrpcClient client = new ServiceGrpcClient(host, portMap.get(key));
            clientMap.put(key, client);
        });
    }

    public ServiceGrpcClient getClient(String key) {
        return clientMap.get(key);
    }
}
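For context on the eager-vs-lazy question at the end, this is roughly the lazy variant I have been weighing. It is only a sketch: the computeIfAbsent approach and the switch to ConcurrentHashMap are my own assumptions, not something in our current code:

// Sketch of a lazy alternative (not our current code)
private final Map<String, ServiceGrpcClient> clientMap = new ConcurrentHashMap<>();

public ServiceGrpcClient getClient(String key) {
    // Create the client (and its channel) on first use instead of at startup
    return clientMap.computeIfAbsent(key,
            k -> new ServiceGrpcClient(hostMap.get(k), portMap.get(k)));
}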
@Slf4j
@Component
public class GenericServiceClient {

    @Value("${grpc.service.host}")
    private String host;

    @Value("${grpc.service.port}")
    private Integer port;

    private String serverHost;
    private int targetPort;

    private ManagedChannel channel;
    private ServiceGrpc.ServiceBlockingStub blockingStub;

    public GenericServiceClient(String serverHost, int target) {
        this.serverHost = serverHost;
        this.targetPort = target;
        this.init();
    }

    public void init() {
        log.debug("Connecting to Service: {}, port: {}", this.serverHost, this.targetPort);
        this.blockingStub = ServiceGrpc.newBlockingStub(this.getManagedChannel());
    }

    private ManagedChannel getManagedChannel() {
        this.channel = ManagedChannelBuilder.forAddress(this.serverHost, this.targetPort)
                .usePlaintext()
                .keepAliveTime(120, TimeUnit.SECONDS)
                .keepAliveTimeout(60, TimeUnit.SECONDS)
                .build();
        return this.channel;
    }

    public ServiceGrpc.ServiceBlockingStub getBlockingStub() {
        return blockingStub;
    }

    // Other methods that interact with the service can be added here
}
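Related to the stale-connection question below, one idea I had is a method on GenericServiceClient that checks the channel's connectivity state before handing out the stub. I am not sure this is a recommended pattern, so treat it as a sketch of what I mean rather than something I have in production:

// Sketch: check connectivity state before using a cached stub (unsure if this is the right approach)
public ServiceGrpc.ServiceBlockingStub getHealthyStub() {
    ConnectivityState state = channel.getState(true); // true = ask the channel to reconnect if it is IDLE
    if (state == ConnectivityState.TRANSIENT_FAILURE || state == ConnectivityState.SHUTDOWN) {
        log.warn("Channel to {}:{} is {}, rebuilding it", serverHost, targetPort, state);
        channel.shutdownNow();
        init(); // rebuild the channel and the stub
    }
    return blockingStub;
}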
Is it a best practice to create gRPC clients at startup (eager loading), or should they be created lazily?
Are there best practices for handling idle connection timeouts in gRPC when clients are created at startup?
What are the recommended keepalive settings to prevent this issue?
Is there a way to detect stale connections before attempting to use them?
How do others handle this in production environments with firewalls and load balancers?