Kubernetes Metrics

This guide focuses on monitoring the Kubernetes platform itself: Pod health, resource utilization, node status, and cluster state. These metrics are essential for understanding the runtime health of any Kubernetes workload, including Defakto components.

Overview

Monitoring the Kubernetes runtime provides visibility into:

Resource utilization - CPU, memory, network, and storage usage
Pod health - Readiness, restarts, and lifecycle events
Node status - Capacity, conditions, and pressure indicators
Cluster state - Overall health and resource availability

This platform-level monitoring complements application-specific metrics (see Server Metrics and Agent Metrics) to provide comprehensive observability.

Kubernetes Runtime Metrics

Monitoring pod and container metrics is critical to understanding the operational health of your Defakto installation. The Kubernetes runtime reports the host resources made available to Defakto components and how much of those resources the components actually use.

Essential Kubernetes Metrics

Monitor these key metrics for Defakto pods and the cluster:

Pod-Level Metrics

CPU usage - container_cpu_usage_seconds_total
Memory usage - container_memory_working_set_bytes
Memory limits - container_spec_memory_limit_bytes
CPU throttling - container_cpu_cfs_throttled_seconds_total
Network I/O - container_network_receive_bytes_total, container_network_transmit_bytes_total
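Most of the series above are cumulative counters, so they are usually read as rates or ratios rather than raw values. The following is a minimal sketch, assuming the cAdvisor series carry the standard namespace, pod, and container labels and that agents run in the spirl-system namespace used elsewhere in this guide. The 5m window is an illustrative choice, not a recommendation.

```promql
# CPU cores consumed per pod (rate of a cumulative seconds counter)
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{namespace="spirl-system", container!=""}[5m])
)

# Seconds of CFS throttling per second, per container
rate(container_cpu_cfs_throttled_seconds_total{namespace="spirl-system", container!=""}[5m])

# Working set as a fraction of the memory limit; "> 0" drops containers without a limit
container_memory_working_set_bytes{namespace="spirl-system", container!=""}
  / (container_spec_memory_limit_bytes{namespace="spirl-system", container!=""} > 0)

# Network receive throughput per pod, in bytes per second
sum by (namespace, pod) (
  rate(container_network_receive_bytes_total{namespace="spirl-system"}[5m])
)
```

The container!="" matcher is there because the kubelet typically also exports a pod-level series with an empty container label, and including it would count the same usage twice.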

Pod Health & Lifecycle

Pod restarts - kube_pod_container_status_restarts_total
Pod status - kube_pod_status_phase (Running, Pending, Failed, etc.)
Container ready - kube_pod_container_status_ready
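OOM kills - kube_pod_container_status_terminated_reason (reason="OOMKilled") while the container is still terminated; kube_pod_container_status_last_terminated_reason keeps the reason visible after the container restarts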
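As a sketch of how these state series can be queried, the expressions below look for OOM-killed and not-ready containers. They assume the spirl-system and tdd-* namespaces used in this guide. kube_pod_container_status_last_terminated_reason is a related kube-state-metrics series that retains the termination reason after the container has restarted.

```promql
# Containers whose most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{namespace=~"spirl-system|^tdd-.*", reason="OOMKilled"} == 1

# Containers that are currently not ready
kube_pod_container_status_ready{namespace=~"spirl-system|^tdd-.*"} == 0
```

Both expressions return one series per affected container, so an empty result is the healthy case.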

Node-Level Metrics

Node capacity - kube_node_status_capacity (CPU, memory, pods)
Node allocatable - kube_node_status_allocatable
Node conditions - kube_node_status_condition (Ready, DiskPressure, MemoryPressure)
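Disk pressure - kube_node_status_condition with condition="DiskPressure" and status="true"; watch for nodes under disk pressure, which can affect the agents running on them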
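The node conditions can be checked with expressions like the following. This is an illustrative sketch: PIDPressure is a standard node condition that is not listed above, and whether to alert on each condition is left to the operator.

```promql
# Nodes currently reporting a pressure condition
kube_node_status_condition{condition=~"DiskPressure|MemoryPressure|PIDPressure", status="true"} == 1

# Nodes that are not Ready
kube_node_status_condition{condition="Ready", status="true"} == 0
```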

Recommended: kube-state-metrics

kube-state-metrics is useful for collecting Kubernetes object state metrics. It exposes metrics about Kubernetes API objects like pods, deployments, nodes, and more.

Installing kube-state-metrics

Using Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-state-metrics prometheus-community/kube-state-metrics \
  --namespace kube-system \
  --create-namespace

Key Metrics from kube-state-metrics

Particularly useful for Defakto monitoring:

For Trust Domain Servers (deployed in tdd-* namespaces):

# Trust Domain Server replica status 
kube_deployment_spec_replicas{namespace=~"^tdd-.*"}
kube_deployment_status_replicas{namespace=~"^tdd-.*"}

# Pod readiness for servers
kube_pod_status_ready{namespace=~"^tdd-.*"}

# Resource requests vs limits for servers
kube_pod_container_resource_requests{namespace=~"^tdd-.*"}
kube_pod_container_resource_limits{namespace=~"^tdd-.*"}

For Agents (deployed in spirl-system namespace):

# Count of agents running vs desired
kube_daemonset_status_desired_number_scheduled{namespace="spirl-system"}
kube_daemonset_status_number_ready{namespace="spirl-system"}

# Agent controller deployment status (webhook handler)
kube_deployment_spec_replicas{namespace="spirl-system"}
kube_deployment_status_replicas{namespace="spirl-system"}

# Pod readiness across agent namespace
kube_pod_status_ready{namespace="spirl-system"}

# Resource requests vs limits for agents
kube_pod_container_resource_requests{namespace="spirl-system"}
kube_pod_container_resource_limits{namespace="spirl-system"}
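The desired and actual series above become more useful when compared directly. The sketch below assumes the same namespaces as the queries above. kube_deployment_status_replicas_available is a kube-state-metrics series not listed above, and the thresholds are illustrative only.

```promql
# Agent DaemonSet pods that should be running but are not ready
kube_daemonset_status_desired_number_scheduled{namespace="spirl-system"}
  - kube_daemonset_status_number_ready{namespace="spirl-system"} > 0

# Trust Domain Server deployments with fewer available replicas than desired
kube_deployment_spec_replicas{namespace=~"^tdd-.*"}
  > kube_deployment_status_replicas_available{namespace=~"^tdd-.*"}
```

Each expression returns a result only when there is a shortfall, which makes it a reasonable starting point for an alert condition.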

Recommended: metrics-server

metrics-server provides near-real-time resource utilization metrics for pods and nodes. It is required for kubectl top commands and for horizontal pod autoscaling based on CPU or memory.

Installing metrics-server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

For local/development clusters where the kubelet serves self-signed certificates, you may also need to patch the deployment after applying the manifest above:

# Patch for self-signed certificates (development only)
kubectl patch deployment metrics-server -n kube-system --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

Using metrics-server

Once installed, use kubectl to view resource usage:

# View pod resource usage
kubectl top pods -n spirl-system

# View node resource usage
kubectl top nodes

Recommended: cAdvisor Metrics

cAdvisor (Container Advisor) provides container resource usage and performance metrics. It's built into the kubelet, so it's automatically available in most Kubernetes distributions.

Key cAdvisor Metrics

cAdvisor exposes metrics through the kubelet on each node:

container_cpu_usage_seconds_total - Total CPU time consumed
container_memory_working_set_bytes - Current working set memory
container_memory_rss - Resident set size
container_network_receive_bytes_total - Network bytes received
container_network_transmit_bytes_total - Network bytes transmitted

Prometheus can scrape these directly from the kubelet /metrics/cadvisor endpoint.

Kubernetes Platform Metrics

Beyond Defakto-specific metrics, monitor the platform:

Cluster Health:

# Nodes ready
sum(kube_node_status_condition{condition="Ready",status="true"})

# Pods in Failed state
sum(kube_pod_status_phase{namespace=~"spirl-system|^tdd-.*",phase="Failed"})

# Pods not ready (should be 0 for healthy systems)
sum(kube_pod_status_ready{namespace=~"spirl-system|^tdd-.*",condition="false"})

# Defakto pods restarting
rate(kube_pod_container_status_restarts_total{namespace=~"spirl-system|^tdd-.*"}[5m])

Resource Saturation:

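# CPU usage as a percentage of the CPU limit, per pod
# (container!="" excludes the pod-level series, which would otherwise double-count usage;
# pods without a limit return no result)
100 * sum(rate(container_cpu_usage_seconds_total{namespace=~"spirl-system|^tdd-.*",container!=""}[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="cpu"}) by (namespace, pod)

# Memory usage as a percentage of the memory limit, per pod
100 * sum(container_memory_working_set_bytes{namespace=~"spirl-system|^tdd-.*",container!=""}) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="memory"}) by (namespace, pod)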

Capacity Planning:

Node CPU and memory allocatable vs requested
Network bandwidth trends
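The capacity items above could be expressed along these lines. This is a sketch that assumes kube-state-metrics v2 metric names, where requests and allocatable values carry resource and node labels. The 5m window is illustrative.

```promql
# Fraction of each node's allocatable CPU already requested by pods
sum by (node) (kube_pod_container_resource_requests{resource="cpu"})
  / sum by (node) (kube_node_status_allocatable{resource="cpu"})

# Fraction of each node's allocatable memory already requested by pods
sum by (node) (kube_pod_container_resource_requests{resource="memory"})
  / sum by (node) (kube_node_status_allocatable{resource="memory"})

# Cluster-wide network throughput in bytes per second, for trend graphs
sum(rate(container_network_receive_bytes_total[5m]))
sum(rate(container_network_transmit_bytes_total[5m]))
```

Requests from pods that have already completed may still be counted, so treat the ratios as an upper bound on what is actually reserved.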

Next Steps

Server Metrics - Application-specific metrics for Trust Domain Servers
Agent Metrics - Application-specific metrics for SPIRL Agents
Review All Metrics - Complete metrics reference
