Kubernetes Metrics

This guide focuses on monitoring the Kubernetes platform itself: Pod health, resource utilization, node status, and cluster state. These metrics are essential for understanding the runtime health of any Kubernetes workload, including Defakto components.

Overview

Monitoring the Kubernetes runtime provides visibility into:

  • Resource utilization - CPU, memory, network, and storage usage
  • Pod health - Readiness, restarts, and lifecycle events
  • Node status - Capacity, conditions, and pressure indicators
  • Cluster state - Overall health and resource availability

This platform-level monitoring complements application-specific metrics (see Server Metrics and Agent Metrics) to provide comprehensive observability.

Kubernetes Runtime Metrics

Pod and container metrics are critical to understanding the operational health of your Defakto installation. The Kubernetes runtime reports both the host resources made available to Defakto components and how much of those resources the components actually use.

Essential Kubernetes Metrics

Monitor these key metrics for Defakto pods and the cluster:

Pod-Level Metrics

  • CPU usage - container_cpu_usage_seconds_total
  • Memory usage - container_memory_working_set_bytes
  • Memory limits - container_spec_memory_limit_bytes
  • CPU throttling - container_cpu_cfs_throttled_seconds_total
  • Network I/O - container_network_receive_bytes_total, container_network_transmit_bytes_total

Pod Health & Lifecycle

  • Pod restarts - kube_pod_container_status_restarts_total
  • Pod status - kube_pod_status_phase (Running, Pending, Failed, etc.)
  • Container ready - kube_pod_container_status_ready
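  • OOM kills - kube_pod_container_status_last_terminated_reason with reason="OOMKilled" (kube_pod_container_status_terminated_reason reports the reason only while the container is still in the terminated state, so it can miss containers that have already restarted)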

Node-Level Metrics

  • Node capacity - kube_node_status_capacity (CPU, memory, pods)
  • Node allocatable - kube_node_status_allocatable
  • Node conditions - kube_node_status_condition (Ready, DiskPressure, MemoryPressure)
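  • Disk pressure - kube_node_status_condition with condition="DiskPressure" and status="true"; watch for nodes under disk pressure, which can affect the agents running on them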
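
The node conditions listed above can also be read directly from the API server, which is a quick way to cross-check what kube-state-metrics reports. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name nodes_under_pressure is hypothetical and not part of any Defakto or Kubernetes tooling.

```shell
# Hypothetical helper: list nodes that report any *Pressure condition as True.
nodes_under_pressure() {
  # kubectl prints one line per node: "<name> <Type>=<Status> <Type>=<Status> ..."
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{range .status.conditions[*]}{" "}{.type}={.status}{end}{"\n"}{end}' |
    awk '{
      for (i = 2; i <= NF; i++)
        if ($i ~ /Pressure=True$/) print $1, $i
    }'
}

# Usage: prints nothing when no node is under memory, disk, or PID pressure.
#   nodes_under_pressure
```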

kube-state-metrics is useful for collecting Kubernetes object state metrics. It exposes metrics about Kubernetes API objects like pods, deployments, nodes, and more.

Installing kube-state-metrics

Using Helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-state-metrics prometheus-community/kube-state-metrics \
--namespace kube-system \
--create-namespace

Key Metrics from kube-state-metrics

These metrics are particularly useful for monitoring Defakto:

For Trust Domain Servers (deployed in tdd-* namespaces):

# Trust Domain Server replica status 
kube_deployment_spec_replicas{namespace=~"^tdd-.*"}
kube_deployment_status_replicas{namespace=~"^tdd-.*"}

# Pod readiness for servers
kube_pod_status_ready{namespace=~"^tdd-.*"}

# Resource requests vs limits for servers
kube_pod_container_resource_requests{namespace=~"^tdd-.*"}
kube_pod_container_resource_limits{namespace=~"^tdd-.*"}

For Agents (deployed in spirl-system namespace):

# Count of agents running vs desired
kube_daemonset_status_desired_number_scheduled{namespace="spirl-system"}
kube_daemonset_status_number_ready{namespace="spirl-system"}

# Agent controller deployment status (webhook handler)
kube_deployment_spec_replicas{namespace="spirl-system"}
kube_deployment_status_replicas{namespace="spirl-system"}

# Pod readiness across agent namespace
kube_pod_status_ready{namespace="spirl-system"}

# Resource requests vs limits for agents
kube_pod_container_resource_requests{namespace="spirl-system"}
kube_pod_container_resource_limits{namespace="spirl-system"}
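
The desired-versus-ready comparison for agents can also be made without Prometheus, by reading the DaemonSet status fields that kube-state-metrics derives these metrics from. This is a sketch under the assumption that kubectl has read access to the agent namespace; daemonsets_not_ready is a hypothetical helper name.

```shell
# Hypothetical helper: print DaemonSets whose ready count is below the desired count.
daemonsets_not_ready() {
  ns="${1:-spirl-system}"
  # One line per DaemonSet: "<name> <desiredNumberScheduled> <numberReady>"
  kubectl get daemonset -n "$ns" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.desiredNumberScheduled}{" "}{.status.numberReady}{"\n"}{end}' |
    awk '$3 < $2 { printf "%s: %d/%d ready\n", $1, $3, $2 }'
}

# Usage: prints nothing when every agent DaemonSet is fully ready.
#   daemonsets_not_ready spirl-system
```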

metrics-server provides real-time resource utilization metrics for pods and nodes. It is required for kubectl top commands and for horizontal pod autoscaling based on CPU or memory.

Installing metrics-server

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

For local or development clusters where the kubelet serves a self-signed certificate, you may also need to disable kubelet TLS verification:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Patch for self-signed certificates (development only)
kubectl patch deployment metrics-server -n kube-system --type='json' \
-p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
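
After installing (and, if needed, patching) metrics-server, it can help to confirm that the metrics API is registered and available before relying on kubectl top. The sketch below assumes the standard APIService name v1beta1.metrics.k8s.io that metrics-server registers; metrics_api_available is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the metrics API reports Available=True.
metrics_api_available() {
  status="$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}')"
  [ "$status" = "True" ]
}

# Usage:
#   metrics_api_available && echo "metrics API ready" || echo "metrics API not ready yet"
```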

Using metrics-server

Once installed, use kubectl to view resource usage:

# View pod resource usage
kubectl top pods -n spirl-system

# View node resource usage
kubectl top nodes
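
For a quick spot check, the output of kubectl top can be filtered to show only pods above a memory threshold. This is a minimal sketch, assuming kubectl top reports memory in Mi as current kubectl versions do; top_pods_over_mem and the 256 Mi threshold in the usage line are illustrative choices, not Defakto recommendations.

```shell
# Hypothetical helper: list pods whose memory usage exceeds a threshold given in Mi.
top_pods_over_mem() {
  ns="$1"; limit_mi="$2"
  # --no-headers columns: NAME CPU(cores) MEMORY(bytes), e.g. "agent-x 5m 42Mi"
  kubectl top pods -n "$ns" --no-headers |
    awk -v limit="$limit_mi" '{
      mem = $3; sub(/Mi$/, "", mem)
      if (mem + 0 > limit + 0) print $1, $3
    }'
}

# Usage:
#   top_pods_over_mem spirl-system 256
```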

cAdvisor (Container Advisor) provides container resource usage and performance metrics. It's built into the kubelet, so it's automatically available in most Kubernetes distributions.

Key cAdvisor Metrics

cAdvisor exposes metrics through the kubelet on each node:

  • container_cpu_usage_seconds_total - Total CPU time consumed
  • container_memory_working_set_bytes - Current working set memory
  • container_memory_rss - Resident set size
  • container_network_receive_bytes_total - Network bytes received
  • container_network_transmit_bytes_total - Network bytes transmitted

Prometheus can scrape these directly from the kubelet /metrics/cadvisor endpoint.
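
To inspect the same endpoint by hand, the API server can proxy the request to the kubelet, so no direct network access to the node is needed. The following sketch assumes the caller is permitted to use the nodes/proxy subresource; cadvisor_metric is a hypothetical helper name and the node name in the usage line is a placeholder.

```shell
# Hypothetical helper: print one cAdvisor metric for one namespace from one node.
cadvisor_metric() {
  node="$1"; metric="$2"; ns="$3"
  # The API server proxies the kubelet's /metrics/cadvisor endpoint.
  kubectl get --raw "/api/v1/nodes/${node}/proxy/metrics/cadvisor" |
    grep "^${metric}{" | grep "namespace=\"${ns}\""
}

# Usage (replace <node-name> with a name from "kubectl get nodes"):
#   cadvisor_metric <node-name> container_memory_working_set_bytes spirl-system
```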

Kubernetes Platform Metrics

Beyond Defakto-specific metrics, monitor the platform:

Cluster Health:

# Nodes ready
sum(kube_node_status_condition{condition="Ready",status="true"})

# Pods in Failed state
sum(kube_pod_status_phase{namespace=~"spirl-system|^tdd-.*",phase="Failed"})

# Pods not ready (should be 0 for healthy systems)
sum(kube_pod_status_ready{namespace=~"spirl-system|^tdd-.*",condition="false"})

# Defakto pods restarting
rate(kube_pod_container_status_restarts_total{namespace=~"spirl-system|^tdd-.*"}[5m])

Resource Saturation:

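# CPU usage as a percentage of the CPU limit (pods without a limit return no result)
# container!="" drops the pod-level cgroup series so usage is not double counted
100 * sum(rate(container_cpu_usage_seconds_total{namespace=~"spirl-system|^tdd-.*",container!=""}[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="cpu"}) by (namespace, pod)

# Memory usage as a percentage of the memory limit
100 * sum(container_memory_working_set_bytes{namespace=~"spirl-system|^tdd-.*",container!=""}) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{namespace=~"spirl-system|^tdd-.*",resource="memory"}) by (namespace, pod)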
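
The queries in this section are written for the Prometheus expression browser or a dashboard, but they can also be run from a shell through the Prometheus HTTP API. This is a sketch only: PROM_URL and its localhost default are assumptions about where your Prometheus server is reachable, and prom_query is a hypothetical helper name.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption; point it at your own Prometheus server.
prom_query() {
  # --get sends the URL-encoded query as a query-string parameter.
  curl -sS --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage: returns the JSON response from /api/v1/query.
#   prom_query 'sum(kube_node_status_condition{condition="Ready",status="true"})'
```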

Capacity Planning:

  • Node CPU and memory allocatable vs requested
  • Network bandwidth trends

Next Steps