Traffic Health

Customizing Health for Request Traffic.

There are times when Kiali’s default thresholds for traffic health do not work well for a particular situation. For example, at times 404 response codes are expected. Kiali has the ability to set powerful, fine-grained overrides for health configuration.

Default Configuration

By default Kiali uses the traffic rate configuration shown below. Application errors have minimal tolerance while client errors have a higher tolerance reflecting that some level of client errors is often normal (e.g. 404 Not Found):

  • For http protocol 4xx are client errors and 5xx codes are application errors.
  • For grpc protocol all 1-16 are errors (0 is success).

So, for example, if the rate of application errors is >= 0.1% Kiali will show Degraded health and if > 10% will show Failure health.

# ...
  health_config:
    rate:
      - namespace: ".*"
        kind: ".*"
        name: ".*"
        tolerance:
          - code: "^5\\d\\d$"
            direction: ".*"
            protocol: "http"
            degraded: 0
            failure: 10
          - code: "^4\\d\\d$"
            direction: ".*"
            protocol: "http"
            degraded: 10
            failure: 20
          - code: "^[1-9]$|^1[0-6]$"
            direction: ".*"
            protocol: "grpc"
            degraded: 0
            failure: 10
# ...

Custom Configuration

Custom health configuration is specified in the Kiali CR. To see the supported configuration syntax for health_config see the Kiali CR Reference.

Kiali applies the first matching rate configuration (namespace, kind, etc) and calculates the status for each tolerance. The reported health will be the status with highest priority (see below).

Rate OptionDefinitionDefault
namespaceMatching Namespaces (regex).* (match all)
kindMatching Resource Types (workload|app|service) (regex).* (match all)
nameMatching Resource Names (regex).* (match all)
toleranceArray of tolerances to apply.
Tolerance Option Definition Default
code Matching Response Status Codes (regex) [1] required
direction Matching Request Directions (inbound|outbound) (regex) .* (match all)
protocol Matching Request Protocols (http|grpc) (regex) .* (match all)
degraded Degraded Threshold(% matching requests >= value) 0
failure Failure Threshold (% matching requests >= value) 0

[1] The status code typically depends on the request protocol. The special code -, a single dash, is used for requests that don’t receive a response, and therefore no response code.

Kiali reports traffic health with the following top-down status priority :

Priority Rule (value=% matching requests) Status
1 value >= FAILURE threshold FAILURE
2 value >= DEGRADED threshold AND value < FAILURE threshold DEGRADED
3 value > 0 AND value < DEGRADED threshold HEALTHY
4 value = 0 HEALTHY
5 No traffic No Health Information

Examples

These examples use the repo https://github.com/kiali/demos/tree/master/error-rates.

In this repo we can see 2 namespaces: alpha and beta (Demo design).

Alpha

Where nodes return the responses (You can configure responses here):

App (alpha/beta) Code Rate
x-server 200 9
x-server 404 1
y-server 200 9
y-server 500 1
z-server 200 8
z-server 201 1
z-server 201 1

The applied traffic rate configuration is:
# ...
health_config:
  rate:
   - namespace: "alpha"
     tolerance:
       - code: "404"
         failure: 10
         protocol: "http"
       - code: "[45]\\d[^\\D4]"
         protocol: "http"
   - namespace: "beta"
     tolerance:
       - code: "[4]\\d\\d"
         degraded: 30
         failure: 40
         protocol: "http"
       - code: "[5]\\d\\d"
         protocol: "http"
# ...

After Kiali adds default configuration we have the following (Debug Info Kiali):

{
  "healthConfig": {
    "rate": [
      {
        "namespace": "/alpha/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/404/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/[45]\\d[^\\D4]/",
            "degraded": 0,
            "failure": 0,
            "protocol": "/http/",
            "direction": "/.*/"
          }
        ]
      },
      {
        "namespace": "/beta/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/[4]\\d\\d/",
            "degraded": 30,
            "failure": 40,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/[5]\\d\\d/",
            "degraded": 0,
            "failure": 0,
            "protocol": "/http/",
            "direction": "/.*/"
          }
        ]
      },
      {
        "namespace": "/.*/",
        "kind": "/.*/",
        "name": "/.*/",
        "tolerance": [
          {
            "code": "/^5\\d\\d$/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/^4\\d\\d$/",
            "degraded": 10,
            "failure": 20,
            "protocol": "/http/",
            "direction": "/.*/"
          },
          {
            "code": "/^[1-9]$|^1[0-6]$/",
            "degraded": 0,
            "failure": 10,
            "protocol": "/grpc/",
            "direction": "/.*/"
          }
        ]
      }
    ]
  }
}

What are we applying?

  • For namespace alpha, all resources

  • Protocol http if % requests with error code 404 are >= 10 then FAILURE, if they are > 0 then DEGRADED

  • Protocol http if % requests with others error codes are> 0 then FAILURE.

  • For namespace beta, all resources

  • Protocol http if % requests with error code 4xx are >= 40 then FAILURE, if they are >= 30 then DEGRADED

  • Protocol http if % requests with error code 5xx are > 0 then FAILURE

  • For other namespaces Kiali will apply the defaults.

  • Protocol http if % requests with error code 5xx are >= 20 then FAILURE, if they are >= 0.1 then DEGRADED

  • Protocol grpc if % requests with error code match /^[1-9]$|^1[0-6]$/ are >= 20 then FAILURE, if they are >= 0.1 then DEGRADED

Alpha Beta