Signing Key Rotation Runbook

This runbook covers a complete planned rotation of the signing keyset for a single trust domain deployment: generating a replacement keyset, activating it, retiring the old one, and removing it from the trust bundle after agents have renewed their credentials.

Defakto rotates signing keys automatically on a schedule. Use this runbook when you need to rotate on demand, for example during a KMS migration, to meet a compliance rotation requirement, or as part of security incident response. Operators should rehearse this manual procedure regularly in a non-production environment.

For background on the keyset lifecycle and how agents respond to rotation, see Signing Key Rotation. For full command syntax and flag reference, see spirlctl: Signing Key Rotation.


Before You Begin

You will need:

  • spirlctl installed and authenticated (spirlctl login)
  • The trust domain name and deployment name you intend to rotate
  • Permission to run keyset mutation commands in the target trust domain

Additionally, export these environment variables. All commands in this runbook use them:

export TRUST_DOMAIN=<trust-domain>
export DEPLOYMENT_NAME=<deployment>
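
Because every command below depends on these two variables, it can help to fail fast when one is missing. The following is a minimal sketch; require_rotation_env is a hypothetical helper name and not part of spirlctl.

```shell
# Hypothetical helper (not part of spirlctl): refuse to continue when either
# variable used throughout this runbook is unset or empty.
require_rotation_env() {
    if [ -z "${TRUST_DOMAIN:-}" ]; then
        echo "TRUST_DOMAIN is not set" >&2
        return 1
    fi
    if [ -z "${DEPLOYMENT_NAME:-}" ]; then
        echo "DEPLOYMENT_NAME is not set" >&2
        return 1
    fi
    echo "Rotating deployment '$DEPLOYMENT_NAME' in trust domain '$TRUST_DOMAIN'"
}
```

Calling require_rotation_env before Step 1 prints the target being rotated, which also gives you a chance to spot a wrong trust domain or deployment before any mutation runs.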

The Rotation Sequence

After an initial inspection of the current state, a complete planned rotation follows five stages, in order:

prepare → activate → taint → (wait) → remove

Each command is a blocking remote call with a 120-second server-side timeout. It does not return until the mutation and trust bundle publication are complete. Commands do not retry automatically. If a command fails, run the keyset list command (shown in Step 1) to observe the resulting state before deciding whether to proceed.
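
The "run once, then inspect on failure" behavior described above can be wrapped in a small function. This is a sketch under assumptions: run_keyset_step is a hypothetical name, and it uses only the subcommands and flags shown in this runbook.

```shell
# Hypothetical wrapper (not part of spirlctl): run one keyset subcommand
# exactly once, and print the current keyset states if it fails.
# Arguments are the subcommand plus any positional argument,
# for example: run_keyset_step activate "$NEW_KEYSET_ID"
run_keyset_step() {
    if spirlctl trust-domain deployment keyset "$@" \
        --trust-domain "$TRUST_DOMAIN" \
        --deployment-name "$DEPLOYMENT_NAME"; then
        return 0
    fi
    # No automatic retry: show the state so the operator can decide.
    echo "Step '$1' failed; current keyset state:" >&2
    spirlctl trust-domain deployment keyset list \
        --trust-domain "$TRUST_DOMAIN" \
        --deployment-name "$DEPLOYMENT_NAME" >&2
    return 1
}
```

The wrapper deliberately does not retry, so the decision to proceed stays with the operator.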


Step 1 — Inspect Current State

Before making any changes, confirm the current state:

spirlctl trust-domain deployment keyset list \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Example output:

Deployment: default
Signing Keys:
Type    Keyset ID       Issued At                        Expires At                       State
X.509   ks_abc1234xyz   2025-10-15 09:23:41 +0000 UTC    2026-01-13 09:23:41 +0000 UTC    Active
JWT     ks_abc1234xyz   2025-10-15 09:23:41 +0000 UTC    2026-01-13 09:23:41 +0000 UTC    Active

If the output already shows a PREPARED keyset, decide whether to use it or generate a fresh one.

Export the ID of the currently ACTIVE keyset so that you can taint and remove it later:

export OLD_KEYSET_ID=<active-keyset-id>
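
If you prefer not to copy the ID by hand, the list output can be filtered instead. The sketch below assumes the output keeps the whitespace-separated layout of the example above, with the keyset ID in the second column and the state in the last. active_keyset_id is a hypothetical helper name.

```shell
# Hypothetical helper: read keyset list output on stdin and print the ID of
# each keyset whose last column is "Active" (case-insensitive), once per ID.
# Assumes the column layout of the example output in this runbook.
active_keyset_id() {
    awk 'tolower($NF) == "active" && !seen[$2]++ { print $2 }'
}

# Possible usage, relying only on the list command shown in Step 1:
#   OLD_KEYSET_ID=$(spirlctl trust-domain deployment keyset list \
#       --trust-domain "$TRUST_DOMAIN" \
#       --deployment-name "$DEPLOYMENT_NAME" | active_keyset_id)
#   export OLD_KEYSET_ID
```

Check the extracted value against the list output before continuing, since a change in the output format would silently break the filter.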

Step 2 — Prepare a New Keyset

Generate a fresh signing keyset and inject it into the trust bundle:

spirlctl trust-domain deployment keyset prepare \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Example output:

Keyset ks_def5678uvw prepared.

Export the new keyset ID from the output:

export NEW_KEYSET_ID=<new-keyset-id>
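
The new ID can also be extracted from the prepare output rather than copied manually. This is a minimal sketch that assumes the output is exactly of the form shown in the example ("Keyset <id> prepared."); prepared_keyset_id is a hypothetical helper name.

```shell
# Hypothetical helper: read prepare output on stdin and print the keyset ID
# from a line of the form "Keyset <id> prepared.".
prepared_keyset_id() {
    awk '$1 == "Keyset" && $3 == "prepared." { print $2 }'
}

# Possible usage: capture the output once, then parse it.
#   prepare_output=$(spirlctl trust-domain deployment keyset prepare \
#       --trust-domain "$TRUST_DOMAIN" \
#       --deployment-name "$DEPLOYMENT_NAME")
#   NEW_KEYSET_ID=$(printf '%s\n' "$prepare_output" | prepared_keyset_id)
#   export NEW_KEYSET_ID
```

Capturing the output in a variable first avoids running prepare twice, which would generate a second keyset.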

The new keyset is now PREPARED. It is included in the trust bundle and trusted for verification, but not yet signing any SVIDs. The bundle is published immediately, so agents will pick it up on their next refresh.


Step 3 — Activate the New Keyset

Promote the PREPARED keyset to ACTIVE. The currently active keyset automatically steps down to PREPARED:

spirlctl trust-domain deployment keyset activate $NEW_KEYSET_ID \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Example output:

Keyset ks_def5678uvw activated.

After this command:

  • All new SVIDs (both X.509 and JWT) are signed by the new keyset.
  • The old keyset is now PREPARED. It is still trusted for verification and remains in the bundle.
  • SVIDs issued before activation continue to verify until they expire or are renewed.

Verify the expected state:

spirlctl trust-domain deployment keyset list \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Confirm the new keyset ID is shown as ACTIVE and the old keyset ID is shown as PREPARED.
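
This check can be scripted against the list output. The sketch below is an assumption-laden example: it presumes the same column layout as the Step 1 example and that states are printed as words such as "Active" and "Prepared" (only "Active" appears in the example output). keyset_has_state is a hypothetical helper name.

```shell
# Hypothetical helper: keyset_has_state ID STATE
# Reads keyset list output on stdin and succeeds only if at least one row
# exists for ID and every such row ends in STATE (case-insensitive).
keyset_has_state() {
    awk -v id="$1" -v want="$2" '
        $2 == id { rows++; if (tolower($NF) == tolower(want)) ok++ }
        END { exit ((rows > 0 && rows == ok) ? 0 : 1) }
    '
}

# Possible usage after activation:
#   states=$(spirlctl trust-domain deployment keyset list \
#       --trust-domain "$TRUST_DOMAIN" \
#       --deployment-name "$DEPLOYMENT_NAME")
#   printf '%s\n' "$states" | keyset_has_state "$NEW_KEYSET_ID" ACTIVE &&
#       printf '%s\n' "$states" | keyset_has_state "$OLD_KEYSET_ID" PREPARED
```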


Step 4 — Taint the Old Keyset

Mark the old keyset as TAINTED to signal that it should be retired. Agents then proactively renew any X.509-SVIDs that it signed:

spirlctl trust-domain deployment keyset taint $OLD_KEYSET_ID \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Example output:

Keyset ks_abc1234xyz tainted.

The updated bundle is published immediately. After agents pick it up:

  • X.509-SVIDs: Each agent applies a random delay of up to 60 seconds before renewing the X.509 SVIDs it has cached for workloads. A per-agent rate limit caps taint-triggered renewals at 20 per second. In typical fleets, allow 2 to 5 minutes for renewal to propagate fully across all agents.
  • JWT-SVIDs: These do not renew proactively. They remain valid until their normal expiry (24 hours by default).
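
The X.509 figures above allow a rough per-agent estimate. As a sketch, assume the random delay and the rate limit are the only constraints and ignore the time an agent needs to fetch the updated bundle. An agent with N cached SVIDs then finishes within about 60 + ceil(N / 20) seconds. max_renewal_seconds is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the worst-case seconds for one agent to renew
# its cached X.509-SVIDs after picking up the tainted bundle.
# Assumes only the two limits stated above apply; bundle refresh time is ignored.
max_renewal_seconds() {
    cached_svids=$1
    max_jitter=60        # upper bound of the random delay, in seconds
    rate_per_second=20   # per-agent cap on taint-triggered renewals
    # Integer ceiling of cached_svids / rate_per_second.
    drain=$(( (cached_svids + rate_per_second - 1) / rate_per_second ))
    echo $(( max_jitter + drain ))
}
```

For example, max_renewal_seconds 500 prints 85. Treat the result as a lower bound on how long to wait, not as confirmation that renewal has finished.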

Step 5 — Wait for Propagation

Before removing the old keyset from the bundle, confirm that agents have renewed their X.509-SVIDs. Removing too soon can leave workloads using SVIDs that no longer verify, resulting in outages due to authentication failures.

Wait at least 2 minutes after taint for a typical fleet. For larger fleets, or if agents are known to be slow to reconnect, wait longer.

To confirm propagation, check the agent and server metrics for signs that the post-taint renewal wave has settled. Both spirl_agent_mint_svid_total and spirl_server_mint_svid_total will show a transient spike in successful X.509 mints immediately after the taint. Once both metrics return to their normal baseline, renewal is complete. See the Metrics Configuration reference for the full list of agent and server metrics.
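
One way to judge "returned to baseline" is to compare the mint rate after the taint with a rate recorded beforehand. The sketch below assumes you can read two samples of a counter such as spirl_server_mint_svid_total from your monitoring system. How you obtain the samples is not covered here, and the tolerance percentage is an illustrative assumption. Both function names are hypothetical.

```shell
# Hypothetical helper: mints_per_minute FIRST_SAMPLE SECOND_SAMPLE INTERVAL_SECONDS
# Converts two readings of a monotonically increasing counter into a
# per-minute rate using integer arithmetic.
mints_per_minute() {
    echo $(( ($2 - $1) * 60 / $3 ))
}

# Hypothetical helper: renewal_settled CURRENT_RATE BASELINE_RATE TOLERANCE_PERCENT
# Succeeds when the current rate is no more than TOLERANCE_PERCENT above baseline.
renewal_settled() {
    [ $(( $1 * 100 )) -le $(( $2 * (100 + $3) )) ]
}
```

For example, counter readings of 1000 and 1600 taken 30 seconds apart give a rate of 1200 per minute. Against a baseline of 100 per minute and a tolerance of 10 percent, that rate is still a spike, so renewal_settled fails.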


Step 6 — Remove the Old Keyset

Evict the TAINTED keyset from the trust bundle:

spirlctl trust-domain deployment keyset remove $OLD_KEYSET_ID \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Example output:

Keyset ks_abc1234xyz removed.

Danger: Removing a keyset is permanent. Any SVID still signed by the removed keyset will fail verification immediately. Do not run this step until you are confident X.509 renewal has propagated across your fleet.

After removal, the bundle no longer includes the old keyset. Relying parties will reject JWTs signed by it as soon as they fetch the updated bundle.

Confirm the final state:

spirlctl trust-domain deployment keyset list \
--trust-domain $TRUST_DOMAIN \
--deployment-name $DEPLOYMENT_NAME

Only the new keyset should appear, in the ACTIVE state.


Troubleshooting

"The request failed due to an internal error, please try again."

Two situations produce this error and cannot be distinguished from the message alone:

  1. Concurrent on-demand operations. Two keyset mutation commands targeting the same deployment cannot run at the same time. If a second command runs while the first is still in flight, it fails immediately.
  2. Race with automatic rotation. Defakto's scheduler runs automatic rotations on its own cadence. If an automatic rotation and an on-demand command both attempt a mutation, the second commit fails.

In both cases, run the keyset list command to see the current keyset states. If the intended mutation already happened, no further action is needed. If it did not, wait a moment and retry.
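
The "check, then retry only if needed" guidance can be sketched as a loop. Everything in this example beyond the spirlctl subcommands and flags shown in this runbook is an assumption: the helper name retry_keyset_step, the limit of three attempts, the RETRY_DELAY_SECONDS variable with its 10-second default, and the state labels in the list output. It is intended for activate and taint, where the target keyset ID is known in advance.

```shell
# Hypothetical helper: retry_keyset_step ID STATE SUBCOMMAND [ARG...]
# Reruns a failed mutation only when the list output does not already show
# keyset ID in STATE. Attempt count and delay are illustrative assumptions.
retry_keyset_step() {
    want_id=$1
    want_state=$2
    shift 2
    attempt=1
    while [ "$attempt" -le 3 ]; do
        if spirlctl trust-domain deployment keyset "$@" \
            --trust-domain "$TRUST_DOMAIN" \
            --deployment-name "$DEPLOYMENT_NAME"; then
            return 0
        fi
        # The command failed: check whether the mutation happened anyway.
        if spirlctl trust-domain deployment keyset list \
            --trust-domain "$TRUST_DOMAIN" \
            --deployment-name "$DEPLOYMENT_NAME" |
            awk -v id="$want_id" -v want="$want_state" '
                $2 == id && tolower($NF) == tolower(want) { found = 1 }
                END { exit (found ? 0 : 1) }'; then
            echo "Keyset $want_id is already $want_state; nothing to retry." >&2
            return 0
        fi
        # Pause before the next attempt, but not after the last one.
        [ "$attempt" -ge 3 ] || sleep "${RETRY_DELAY_SECONDS:-10}"
        attempt=$((attempt + 1))
    done
    return 1
}
```

A call such as retry_keyset_step "$NEW_KEYSET_ID" ACTIVE activate "$NEW_KEYSET_ID" would stop as soon as either the command succeeds or the list output shows the keyset as active.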

"Error: cannot taint active key set"

You attempted to taint the currently ACTIVE keyset directly. Only a PREPARED keyset can be tainted. To retire the active keyset, first activate a successor (the active keyset steps down to PREPARED), then taint it. This is the sequence in Steps 3 and 4 above.

"Error: cannot activate tainted key set"

You attempted to activate a keyset that is in the TAINTED state. Only PREPARED keysets can be activated. Prepare a new keyset first.
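
The two errors above follow from the keyset state rules, which can be encoded as a pre-flight check. This sketch covers only the transitions this runbook describes (activate and taint from PREPARED, remove from TAINTED) and conservatively rejects everything else, which may be stricter than the server. check_transition is a hypothetical helper name.

```shell
# Hypothetical helper: check_transition STATE ACTION
# Succeeds only for the state/action pairs described in this runbook.
# Anything else is rejected, which may be stricter than the server.
check_transition() {
    case "$1:$2" in
        PREPARED:activate | PREPARED:taint | TAINTED:remove)
            return 0
            ;;
        *)
            echo "Refusing to $2 a keyset in state $1" >&2
            return 1
            ;;
    esac
}
```

For example, check_transition ACTIVE taint fails, matching the "cannot taint active key set" error, while check_transition PREPARED taint succeeds.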

Command timed out or connection interrupted

The 120-second server-side timeout elapsed, or the client connection was lost before the server returned. The server-side operation is also bounded by its own execution timeout, so a timed-out command is not left running indefinitely in the background. Run the keyset list command to check the current state, because the mutation may have already completed. Retry only if the intended change is not reflected.

Agents are not renewing after taint

Allow 2 to 5 minutes in typical fleets before concluding something is wrong. Check:

  • Agent version. Taint-aware renewal requires Agent v0.37.0 or later. Older agents do not renew proactively; they renew at normal SVID expiry instead.
  • Agent logs. Look for a bundle refresh event in the agent logs after the taint. If the bundle refresh is not occurring, there may be a server connectivity issue; see the Agent Runbook.
  • JWT consumers. JWT-SVIDs do not renew proactively, which is expected. They rotate at normal expiry or upon keyset removal.
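
The agent version requirement in the list above can be checked with a small comparison. This sketch assumes version strings of the form v0.37.0 (an optional leading "v", then dot-separated numeric major and minor components). How you obtain an agent's version is not covered here, and agent_supports_taint_renewal is a hypothetical helper name.

```shell
# Hypothetical helper: agent_supports_taint_renewal VERSION
# Succeeds when VERSION is v0.37.0 or later. Assumes a version string with an
# optional leading "v" and numeric major and minor components.
agent_supports_taint_renewal() {
    version=${1#v}          # strip an optional leading "v"
    major=${version%%.*}    # text before the first dot
    rest=${version#*.}      # text after the first dot
    minor=${rest%%.*}       # text before the next dot
    [ "$major" -gt 0 ] || [ "$minor" -ge 37 ]
}
```

For example, agent_supports_taint_renewal v0.36.9 fails, while v0.37.0 and v1.2.0 both succeed.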

A workload failed verification after remove

The removed keyset has been evicted from the bundle, so any SVID it signed will fail verification. This is the intended outcome. Affected workloads recover as soon as they obtain a fresh SVID signed by the current active keyset. To prevent this, always allow enough propagation time between taint and remove.


Escalation

If an issue cannot be resolved using this runbook:

  1. Collect the output of spirlctl trust-domain deployment keyset list for the affected deployment
  2. Note the exact error message and which step produced it
  3. Record the trust domain name and deployment name
  4. Contact Defakto Support with this information
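
The information in the list above can be gathered into a single report before you contact support. This is a minimal sketch that uses only the list command shown in this runbook. The helper name collect_escalation_info and the report layout are assumptions.

```shell
# Hypothetical helper: collect_escalation_info FAILED_STEP ERROR_MESSAGE
# Prints the details requested above, followed by the current keyset list.
collect_escalation_info() {
    echo "Trust domain: $TRUST_DOMAIN"
    echo "Deployment: $DEPLOYMENT_NAME"
    echo "Failed step: $1"
    echo "Error message: $2"
    echo "Keyset list output:"
    spirlctl trust-domain deployment keyset list \
        --trust-domain "$TRUST_DOMAIN" \
        --deployment-name "$DEPLOYMENT_NAME" 2>&1 ||
        echo "(keyset list command failed)"
}
```

Redirecting the output to a file, for example collect_escalation_info "Step 4" "Error: cannot taint active key set" > keyset-escalation.txt, produces a report you can attach to the support request.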