In your Kubernetes clusters, you may see readiness/liveness probes failing even though the application (and its health checks) works fine. This scenario can be complex to troubleshoot, and that is what Raftech is sharing with you today: a case we encountered while managing a project for one of our customers.
Intro to probes
Navigating the complexities of distributed systems and microservices frameworks often involves the automatic detection of malfunctioning applications, transferring requests to operational systems, and rectifying faulty elements. Implementing health checks can be a practical strategy to uphold system dependability. Within the Kubernetes environment, these health checks are set up using probes that assess the condition of each pod.
By default, Kubernetes follows the lifecycle of the pod and begins to route traffic its way when the containers shift from the Pending to the Running phase. The kubelet also keeps an eye on application crashes, restarting the container to rectify the issue. A common misconception among developers is that this basic arrangement is sufficient, particularly when the application within the pod employs daemon process managers (like PM2 for Node.js). However, Kubernetes may consider a pod healthy and ready for requests immediately after all containers start. This can cause problems if the application needs to perform certain initialization tasks, establish database connections, or load data before processing application logic. The discrepancy between when the application is actually ready and when Kubernetes deems it ready can cause problems, especially when scaling the deployment: unprepared applications may start receiving traffic, leading to errors.
This is where Kubernetes probes become valuable, as they determine when a container is prepared to receive traffic and when it should undergo a restart. As of Kubernetes version 1.16, there are three types of probes available. This article explores these different probe types, and delves into best practices.
The probes available for us are:
- Readiness – determines whether the pod is ready to receive traffic
- Liveness – determines whether the pod is deadlocked or otherwise stuck
- Startup – gives the pod time to start up before the above checks kick in
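As a quick illustration, the three probe types can sit side by side in a container spec. This is a minimal sketch, not the configuration from our case: the image name, port 8080, and the /health path are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
    - name: app
      image: example/app:latest    # placeholder image
      ports:
        - containerPort: 8080
      startupProbe:                # holds off the other probes until the app has booted
        httpGet:
          path: /health
          port: 8080
        failureThreshold: 30       # up to 30 * 2s = 60s allowed for startup
        periodSeconds: 2
      readinessProbe:              # gates traffic from Services
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 5
      livenessProbe:               # restarts the container if it hangs
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
```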
“DOs” and “DON’Ts” for Kubernetes probes
- If your service exposes an HTTP endpoint – use the ReadinessProbe
- If your service uses multiple ports, the health check should target the port serving real traffic (not an admin or metrics port)
- Make sure you understand the configuration of your probes
- Don’t use the same specs for your readiness and liveness probes – they serve different purposes and usually need different endpoints and thresholds
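To make the last two points concrete, here is a sketch of readiness and liveness probes with deliberately different semantics. The endpoints /ready and /live and port 8080 are assumptions for illustration, not part of our customer's setup.

```yaml
# /ready may fail while a dependency (e.g. the database) is unreachable;
# that only removes the pod from Service endpoints, no restart happens.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080        # the port actually serving traffic, not an admin port
  periodSeconds: 5
# /live should fail only when the process itself is broken,
# because a failing liveness probe triggers a container restart.
livenessProbe:
  httpGet:
    path: /live
    port: 8080
  periodSeconds: 10
```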
You can find more on probes in the official documentation.
Probes failing …
In most cases, failing probes indicate that the application (or its probe configuration) is incorrect. In the scenario we experienced, the probe was configured as follows:
```yaml
livenessProbe:
  httpGet:
    path: /_health
    port: app-port
    scheme: HTTP
  failureThreshold: 3
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 1
```
With the above, our logs started showing entries of the form:
net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
We observed that the issue never occurred in our dev/test environments and manifested only during high-load moments. This led us to conclude that, under load, the pod could not accept and process the probe request within the specified timeout (timeoutSeconds: 1 in our case).
One of the solutions we came across was to increase the timeout to 10 seconds. However, we did not feel this would be a suitable long-term solution.
An alternative was to use the exec handler instead of httpGet, and that has worked out perfectly for us.
```yaml
livenessProbe:
  failureThreshold: 2
  exec:
    command:
      - wget
      - -q
      - -O
      - /dev/null
      - http://127.0.0.1:9009/health
  initialDelaySeconds: 5
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```
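One caveat with exec probes: the command runs inside the container, so the binary (wget here) must exist in the image, and each probe run forks a process, which adds a small overhead per period. If slow startup under load is part of the problem, the startup probe mentioned earlier can also help by suspending the other checks until the app has booted. A minimal sketch, assuming the same health endpoint on port 9009:

```yaml
# While the startupProbe is running, liveness and readiness checks are
# suspended, so a slow-starting container is not killed prematurely.
startupProbe:
  httpGet:
    path: /health
    port: 9009
  failureThreshold: 30   # allows up to 30 * 2s = 60s for startup
  periodSeconds: 2
```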