Kagent & Khook

It’s difficult to achieve true availability in production.

Most teams are still far from being production-ready, struggling to manage complex Kubernetes environments where things can go wrong at any moment: pods crash, memory limits are exceeded, probes fail, and workloads get stuck pending.

That’s where agentic AI comes in: autonomous agents that can observe what’s happening, reason about the cause, and take corrective action.

And for building and deploying those agents inside Kubernetes, that’s where Kagent comes in.

Kagent

Kagent is an open-source toolkit designed to help engineers build, deploy, and run AI-powered solutions directly within Kubernetes. It enables teams to create intelligent agents, use and deploy MCP (Model Context Protocol) servers, and coordinate multi-agent workflows through agent-to-agent (A2A) collaboration.

[Figure: Kagent overview]

Kagent supports OpenAI, Anthropic, Azure OpenAI, Vertex AI, Gemini, Ollama, and BYO OpenAI-compatible endpoints out of the box.

[Figure: supported model providers]

Alongside Kagent, we have Khook—a Kubernetes controller that listens to cluster events and routes them to your agents on the Kagent platform.

Once you’ve built and deployed your agent with Kagent, Khook makes it truly autonomous: no human prompts required.

When issues like PodRestart, PodPending, OOMKill, or ProbeFailed occur, Khook automatically triggers an AI agent that analyzes the situation and executes safe, policy-guided fixes autonomously: patching manifests, creating or deleting resources, scaling workloads, or tuning probes.

In today’s demo, we’ll see this autonomy in action: an AI agent detecting and fixing these common Kubernetes issues automatically.

With that context in place, let’s dive into the architecture and see how the pieces fit together.

We’ll start with the solution’s high-level architecture and workflow, so you can follow along even if you’re not familiar with Kubernetes, and then zoom into the Kubernetes-specific view.

Note: For this demo, we’ll use the Qwen 3 235B (generic) model managed on Scaleway.

Architecture & workflow

[Figure: high-level architecture and workflow]

Kubernetes View

[Figure: Kubernetes-level view of the components]

So let’s begin our demo together!

We’ll start by installing Kagent—either with the CLI or via a Helm chart. For step-by-step instructions, see the official docs: https://kagent.dev/docs/kagent/introduction/installation
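If you go the Helm route, the install is roughly the sketch below. The OCI chart locations are an assumption based on the official docs linked above, so treat this as illustrative and follow the installation page for the current chart names and versions.

# Assumed chart locations - verify against the installation docs
helm install kagent-crds oci://ghcr.io/kagent-dev/kagent/helm/kagent-crds \
  --namespace kagent --create-namespace
helm install kagent oci://ghcr.io/kagent-dev/kagent/helm/kagent \
  --namespace kagent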

For this demo, we’ll use the BYO OpenAI-compatible provider and configure the model Qwen 3 235B.

Load your provider key into the shell

export PROVIDER_API_KEY=Dgs...

Create a Kubernetes Secret for Kagent


kubectl create secret generic kagent-my-provider -n kagent --from-literal PROVIDER_API_KEY=$PROVIDER_API_KEY

Register your model with Kagent (ModelConfig)


kubectl apply -f - <<EOF
apiVersion: kagent.dev/v1alpha2
kind: ModelConfig
metadata:
  name: qwen3-235b
  namespace: kagent
spec:
  apiKeySecret: kagent-my-provider
  apiKeySecretKey: PROVIDER_API_KEY
  model: qwen3-235b-a22b-instruct-2507
  provider: OpenAI
  openAI:
    baseUrl: "https://api.scaleway.ai/f3d8...../v1"
EOF
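Before moving on, it’s worth a quick check that the ModelConfig was accepted (the plural resource name below is assumed from the CRD kind):

kubectl -n kagent get modelconfigs.kagent.dev qwen3-235b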

Now that the model configuration is complete, we can begin setting up our agent.

This agent will be configured based on the ModelConfig we just applied and the MCP server provided by Kagent.

When you install Kagent, it automatically provides a remote MCP server for Kubernetes, known as kagent-tool-server.

The kagent-tool-server offers built-in support for multiple components.

[Figure: components supported by kagent-tool-server]

We will build our agent using the tools provided by kagent-tool-server, focusing exclusively on the Kubernetes toolset.
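If you want to see what the tool server exposes before wiring it into an agent, you can inspect the RemoteMCPServer resource itself; the plural resource name here is assumed from the kind and apiGroup referenced in the Agent manifest below.

# List the MCP servers Kagent registered at install time
kubectl -n kagent get remotemcpservers.kagent.dev

# Inspect the built-in Kubernetes tool server
kubectl -n kagent describe remotemcpservers.kagent.dev kagent-tool-server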

We can build the agent either from the UI or with a manifest file; for this guide, we’ll use a manifest.

kubectl apply -f - <<EOF
apiVersion: kagent.dev/v1alpha2
kind: Agent
metadata:  
  name: khook-demo
  namespace: kagent
spec:
  declarative:
    modelConfig: qwen3-235b
    stream: true
    systemMessage: |-
      You are KUBE-MIND, an autonomous AI agent specialized in Kubernetes management, monitoring, and automation.
      You have full reasoning capabilities and can autonomously decide which tools to use to complete tasks efficiently and safely.
      You understand, plan, execute, and verify all Kubernetes operations.
    tools:
    - mcpServer:
        apiGroup: kagent.dev
        kind: RemoteMCPServer
        name: kagent-tool-server
        toolNames:
        - k8s_check_service_connectivity
        - k8s_patch_resource
        - k8s_remove_annotation
        - k8s_annotate_resource
        - k8s_remove_label
        - k8s_label_resource
        - k8s_create_resource
        - k8s_create_resource_from_url
        - k8s_get_events
        - k8s_get_available_api_resources
        - k8s_get_cluster_configuration
        - k8s_describe_resource
        - k8s_delete_resource
        - k8s_get_resource_yaml
        - k8s_execute_command
        - k8s_apply_manifest
        - k8s_get_resources
        - k8s_get_pod_logs
      type: McpServer
  description: A Kubernetes expert AI agent specializing in cluster operations, troubleshooting, and maintenance.
  type: Declarative
EOF

Alternatively, to build your agent through the UI:

kubectl -n kagent port-forward svc/kagent-ui 8080:8080

Visit http://localhost:8080
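Either way, once the agent is created you can confirm the resource was accepted (again, the plural resource name is assumed from the CRD kind):

kubectl -n kagent get agents.kagent.dev khook-demo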

Now that the agent is deployed, let’s test its autonomy using Khook—install it by following the steps in the official repo: https://github.com/kagent-dev/khook

Let’s create the first hook:

kubectl apply -f - <<EOF

apiVersion: kagent.dev/v1alpha2
kind: Hook
metadata:
  name: pod-monitor
  namespace: kagent
  labels:
    environment: development
    monitoring-type: basic
spec:
  eventConfigurations:
  # Monitor pod restarts
  - eventType: pod-restart
    agentRef:
      name: khook-demo
    prompt: |
      AUTONOMOUS MODE: Pod {{.ResourceName}} restarted at {{.EventTime}}. Analyze and fix immediately.
      INSTRUCTIONS:
      • Use all available tools to diagnose and resolve the issue
      • Focus on common issues: crashes, resource limits, health checks, configuration
      • Annotate fixed resources with: kagentFix=<timestamp>
      • If patching fails: delete and recreate resources
      • Never ask for permission - assume autonomous execution
      • Complete remediation without human approval
  # Monitor OOM kills
  - eventType: oom-kill
    agentRef:
      name: khook-demo
    prompt: |
      AUTONOMOUS MODE: OOM kill for {{.ResourceName}} at {{.EventTime}}. Analyze memory and optimize immediately.
      INSTRUCTIONS:
      • Use all available tools to diagnose and resolve memory issues
      • Focus on memory leaks, inefficient algorithms, large allocations, GC issues
      • Annotate fixed resources with: kagentFix=<timestamp>
      • If patching fails: delete and recreate resources
      • Never ask for permission - assume autonomous execution
      • Complete remediation without human approval
  - eventType: probe-failed
    agentRef:
      name: khook-demo
    prompt: |
      Health probe failed for {{.ResourceName}} at {{.EventTime}}.
      Please check application health and configuration.
      After analysis - use all available tools to try and resolve. Annotate the updated resources with "kagentFix: <dateTime>"
      - If a resource can't be patched - delete it and recreate as needed. Don't ask for permission. Assume autonomous execution.
      Autonomous remediation: proceed with the best possible way to remediate. Don't ask for approval.

EOF

In the hook, you specify which agent should receive each event and provide a guidance prompt to help the agent choose the right action.
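A quick sanity check that the hook exists and is being reconciled (plural resource name assumed from the Hook kind):

kubectl -n kagent get hooks.kagent.dev pod-monitor
kubectl -n kagent describe hooks.kagent.dev pod-monitor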

With the hook in place, we’ll apply test manifests to induce ProbeFailed, OOMKilled, and Pod-Restart conditions. Khook will forward each event to our chosen agent, and the agent will attempt the correction as instructed.

kubectl apply -n kagent -f - <<EOF

# ============================================
# 1) PROBE-FAILED (readiness fails; no restart)
# ============================================
apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-failed
  labels:
    scenario: probe-failed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: probe-failed
  template:
    metadata:
      labels:
        app: probe-failed
        scenario: probe-failed
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
          # LIVENESS: points to a valid path so container is NOT restarted
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          # READINESS: intentionally invalid path -> stays "NotReady"
          readinessProbe:
            httpGet:
              path: /does-not-exist
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1
---
# ==============================
# 2) OOM-KILL (trigger OOMKilled)
# ==============================
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-kill
  labels:
    scenario: oom-kill
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-kill
  template:
    metadata:
      labels:
        app: oom-kill
        scenario: oom-kill
    spec:
      containers:
        - name: memhog
          image: python:3.11-alpine
          # Allocate ~200Mi then sleep; limit is 64Mi => OOMKill
          command: ["sh","-c"]
          args:
            - |
              python - <<'PY'
              import time
              blocks = []
              # Allocate ~200 MiB
              for _ in range(200):
                  blocks.append("x" * 1024 * 1024)
              time.sleep(3600)
              PY
          resources:
            requests:
              memory: "32Mi"
              cpu: "50m"
            limits:
              memory: "64Mi"
              cpu: "200m"
---
# ==================================
# 3) POD-RESTART (CrashLoopBackOff)
# ==================================
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-restart
  labels:
    scenario: pod-restart
spec:
  replicas: 1
  selector:
    matchLabels:
      app: pod-restart
  template:
    metadata:
      labels:
        app: pod-restart
        scenario: pod-restart
    spec:
      containers:
        - name: crasher
          image: busybox:1.36
          command: ["sh","-c"]
          args:
            - echo "Starting, then crashing in 5s..."; sleep 5; exit 1
          # (Optional) Add a simple liveness probe so CrashLoop shows clearly
          livenessProbe:
            exec:
              command: ["sh","-c","echo alive"]
            initialDelaySeconds: 3
            periodSeconds: 10
EOF

After applying those manifests, we’ll observe how the agent behaves.
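While the scenarios play out, a couple of standard kubectl commands are enough to watch the failures appear and the fixes land; nothing here is Kagent-specific.

# Watch the demo pods fail and recover
kubectl -n kagent get pods -w

# See the underlying events (probe failures, OOM kills, restarts)
kubectl -n kagent get events --sort-by=.lastTimestamp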

We’ll examine each deployment one by one.

We’ll begin with the probe-failed deployment.

kubectl get deployment probe-failed -o yaml -n kagent

[Screenshot: probe-failed deployment YAML after remediation]

The agent corrects the readiness probe from path: /does-not-exist to path: /, and you can see this change reflected in the annotation.
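Instead of reading the full YAML, you can pull out just the patched readiness path and the fix annotation; this assumes the agent annotated the Deployment itself, as instructed by the hook’s kagentFix prompt.

# Patched readiness probe path
kubectl -n kagent get deployment probe-failed -o jsonpath='{.spec.template.spec.containers[0].readinessProbe.httpGet.path}{"\n"}'

# Annotations, including the kagentFix timestamp
kubectl -n kagent get deployment probe-failed -o jsonpath='{.metadata.annotations}{"\n"}'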

Then we’ll move on to the oom-kill use case.

kubectl get deployment oom-kill -o yaml -n kagent

Description

The agent increases the pod’s memory resources, and the workload now runs successfully, as shown in the figure.
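To confirm the new resource settings without scanning the whole manifest, and to check that the pod is no longer being OOM-killed:

# Updated container resources
kubectl -n kagent get deployment oom-kill -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'

# Pod should be Running with no recent restarts
kubectl -n kagent get pods -l app=oom-kill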

Next, we’ll tackle the final scenario—pod-restart.

kubectl get deployment pod-restart -o yaml -n kagent

Description

The agent replaces the previous liveness command

["sh","-c","echo alive"]

with

sh -c 'ps aux | grep "sleep 3600" | grep -v grep || exit 1'

which verifies that the target sleep 3600 process is running (implying the agent also swapped the crashing startup command for a long-running one) and keeps the pod healthy; the workload now runs correctly.
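The same kind of spot check works for this last scenario: the patched liveness command and the pod’s restart behaviour.

# Patched liveness probe command
kubectl -n kagent get deployment pod-restart -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.exec.command}{"\n"}'

# Restart count should stop climbing once the fix lands
kubectl -n kagent get pods -l app=pod-restart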

Conclusion

Kagent streamlines the building and deployment of Kubernetes-native AI agents, while Khook makes those agents truly autonomous by routing cluster events directly to them. Together, they transform ops from reactive firefighting to policy-guided, self-healing workflows. That said, there are clear next steps and challenges ahead: multitenancy, tool breadth (MCP servers for other SRE tools), and security.

Once multitenancy, tool breadth, and event routing catch up, AI-driven SRE won’t just be a demo; it’ll be how we operate: agents that observe, reason, act, verify, and improve reliability over time.
