Kubernetes operators have become a powerful tool for automating complex operational tasks in Kubernetes environments. While basic deployments can be managed with native resources, operators allow us to encode domain-specific knowledge and automation into Kubernetes-native extensions. Whether you need to automate database failovers, handle complex application lifecycles, or orchestrate custom resources, operators provide a proven pattern for solving these challenges.
In this post, I’ll walk through creating a simple yet practical Pod Restarter operator using KubeBuilder, and by the end of this tutorial you’ll understand:
- How to use KubeBuilder to scaffold an operator project
- How to define custom resources (CRDs)
- How to implement reconciliation logic
- How to test and deploy your operator
Kubernetes operators #
A Kubernetes Operator extends Kubernetes by using custom resources to make managing applications more automated, efficient, and reliable. It packages operational tasks into software that runs inside the cluster and builds on the core Kubernetes resource and controller concepts to automate the entire lifecycle of the application it manages. The benefits of creating and using operators are:
- Automation: Eliminate manual operational tasks
- Consistency: Apply best practices every time
- Self-healing: Automatically detect and fix issues
- Declarative Management: Define the desired state
- Domain Knowledge: Encode operational expertise into reusable software
Controllers #
A Kubernetes cluster has built-in controllers as part of the kube-controller-manager that manage Kubernetes native resources, such as:
- ReplicaSet controller: Ensures the specified number of pod replicas are running.
- Deployment controller: Manages ReplicaSets and handles rolling updates.
- DaemonSet controller: Ensures a pod runs on all or selected nodes.
- Service controller: Manages service endpoints and load balancers.
- Node controller: Monitors node health and handles evictions.
The controllers run a non-terminating control loop that watches the shared state of the cluster through the API server and makes changes to move the current state towards the desired state.
A custom operator uses the same controller pattern to extend Kubernetes with domain-specific knowledge for managing custom resources, for example managing custom workflows, managing external systems (databases, DNS records, cloud resources), implementing complex application logic, or automating application-specific operational knowledge.
Every operator IS a controller, but not every controller is called an operator.
An Operator contains one or more specialized controllers that:
- Manage custom resources via Custom Resource Definitions (CRDs).
  - A CRD is the Kubernetes manifest that tells Kubernetes about the new resource type. It's the "schema" that Kubernetes uses to validate custom resources.
  - A Custom Resource (CR) is an instance of a CRD: a custom API object that represents the application's desired state.
- Implement the controller logic in a control loop (sketched below) that:
  - Watches the desired state defined in the custom resource.
  - Observes the current state.
  - Reconciles the difference by taking action to move the current state towards the desired state.
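To make the control loop concrete, here is a rough, generic sketch of a controller-runtime reconciler. It is not the PodRestarter controller built later in this post; Widget, examplev1, and WidgetReconciler are hypothetical names used only for illustration, and the ctrl and client aliases refer to the usual controller-runtime packages.

// Minimal sketch of the reconcile pattern with controller-runtime.
// Widget / examplev1 / WidgetReconciler are placeholders, not real types.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Observe the desired state: read the custom resource.
	widget := &examplev1.Widget{}
	if err := r.Get(ctx, req.NamespacedName, widget); err != nil {
		// The object may have been deleted; nothing to do in that case.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Observe the current state: list/get the objects the Widget manages
	//    (pods, deployments, external systems, ...).

	// 3. Reconcile: create, update, or delete objects so the current state
	//    matches widget.Spec, and record progress in widget.Status.

	// 4. Requeue so the loop never terminates and drift is corrected later.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}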
Difference between a controller and operator #
To understand the difference between controllers and operators, let’s examine two popular Kubernetes projects: External-DNS and Cert-Manager.
External-DNS
External-DNS syncs Kubernetes resources with external DNS providers. It’s considered a controller rather than a full operator because:
- It watches Kubernetes native resources (Services, Ingresses) without defining custom resources
- It performs a specific, focused task rather than managing a complete application lifecycle
- It translates Kubernetes resources into DNS records in external providers
Cert-Manager
Cert-Manager manages TLS certificates in Kubernetes and qualifies as an operator because:
- It defines Custom Resource Definitions (CRDs) like Certificate, Issuer, and ClusterIssuer
- It manages the complete certificate lifecycle from issuance through renewal
- It encodes operational knowledge about certificate management (including handling ACME challenges, interacting with certificate authorities, and managing PKI)
While Cert-Manager is a relatively straightforward operator compared to something like a database operator, it fits the operator pattern by extending Kubernetes with custom resources and implementing domain-specific knowledge about certificate management.
Now that we understand what operators are, let’s explore the KubeBuilder framework that will help us build one.
KubeBuilder #
Kubebuilder streamlines operator development by providing a structured framework that integrates with controller-runtime and Kubernetes APIs. It abstracts repetitive setup tasks, enabling efficient development of maintainable Kubernetes extensions so that you only have to focus on the business logic.
A Kubebuilder project represents the entire Go application that becomes the Operator and consists of:
- API (Custom Resource): Go structs that define your resource structure
- CRD: YAML manifest generated from Go structs that teaches Kubernetes about your resource
- Controller: The reconciliation logic that watches resources and makes changes
KubeBuilder’s scaffolding approach offers several advantages:
- Reduced Boilerplate: Instead of writing hundreds of lines of setup code, KubeBuilder generates the project structure and boilerplate, letting you focus on business logic.
- Standardized Structure: All KubeBuilder projects follow a consistent layout, making it easier for developers to navigate and understand the codebase.
- Best Practices Built-in: The generated code incorporates community best practices for Kubernetes controllers, including proper error handling, logging, and metrics.
- Simplified API Management: KubeBuilder handles CRD generation, validation markers, and conversion webhooks, simplifying the management of custom APIs.
- Incremental Development: You can start with a simple controller and incrementally add features like webhooks, multiple API versions, and additional controllers.
Without KubeBuilder, you would need to:
- Set up the project structure manually
- Configure Go modules and dependencies
- Write boilerplate code for controllers
- Create RBAC manifests by hand
- Generate CRDs from Go structs manually
KubeBuilder handles all of these tasks automatically, making it much faster to develop and maintain Kubernetes operators.
Prerequisites #
I am using WSL to run Kubebuilder, which needs the following prerequisites:
- Go (version v1.24.5+)
- Make
- Docker (for building images and backing the KinD cluster)
- Kubectl (version v1.11.3+)
- KinD (for testing the operator in a local cluster)
Install Go
# Download the latest Go version
GO_VERSION=$(curl -sL 'https://go.dev/VERSION?m=text' | head -n 1)
wget https://go.dev/dl/${GO_VERSION}.linux-amd64.tar.gz
# Remove the old Go installation if it exists
if [ -d '/usr/local/go' ]; then
rm -rf '/usr/local/go'
fi
# Extract and install
sudo tar -C /usr/local -xzf ${GO_VERSION}.linux-amd64.tar.gz
# Remove the downloaded tarball
rm ${GO_VERSION}.linux-amd64.tar.gz
# Set the PATH Variable
echo 'export PATH=$HOME/go/bin:/usr/local/go/bin:$PATH' >> ~/.profile
source ~/.profile
# Verify the installation
go version
Install Make
sudo apt update
sudo apt install make
make --version
Install Docker
sudo apt update
# Install dependencies
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent
# Then add the GPG key for the official Docker repository to your system:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/docker-ce-archive-keyring.gpg > /dev/null
# Add the Docker repository to APT sources
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-ce-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker-ce.list > /dev/null
# Install Docker
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
# Check that it’s running
sudo systemctl status docker
# If you want to avoid typing `sudo` whenever you run the `docker` command, add your username to the `docker` group
sudo usermod -aG docker ${USER}
logout
# Check groups assignments for user
groups
Install KinD
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo cp ./kind /usr/local/bin/kind
rm -rf kind
Install Kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
kubectl version --client
Installation #
Download and install Kubebuilder
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
kubebuilder version
With our development environment ready, we can move on to creating our first Kubernetes operator.
Pod restarter operator #
We will start by creating a simple operator called pod-restarter-operator. Building it will give an in-depth look at how Kubebuilder works and how an operator works. The code for the operator can be found here: pod-restarter-operator.
The pod restarter operator automatically restarts pods on a schedule. The logic is simple and easy to understand, has no external dependencies, and is built entirely from Kubernetes-native building blocks: Custom Resources, Controllers, and the reconciliation loop.
%%{init: {
'theme': 'dark',
'themeVariables': {
'primaryTextColor': '#FFFFFF',
'lineColor': '#E2E8F0',
'noteTextColor': '#FFFFFF',
'actorTextColor': '#FFFFFF',
'signalColor': '#FFFFFF',
'signalTextColor': '#FFFFFF'
}
}}%%
sequenceDiagram
participant API as Kubernetes API
participant Op as Pod Restarter Operator
participant Pods as Pods
participant K8s as Kubernetes (ReplicaSet)
Note over API,Op: Operator constantly watches PodRestarter resources
API->>Op: Notify of PodRestarter change
Op->>API: Get PodRestarter resource
API-->>Op: Return PodRestarter
Op->>API: List pods matching selector
API-->>Op: Return matching pods
Op->>Op: Check if interval elapsed
alt Interval Not Elapsed
Op->>API: Update status
Op->>Op: Requeue reconciliation
else Interval Elapsed
Op->>API: Delete pods (using strategy)
API->>Pods: Delete pods
Note over Pods,K8s: Kubernetes built-in controllers handle recreation
K8s->>API: Create replacement pods
Op->>API: Update PodRestarter status
Op->>Op: Requeue reconciliation
end
The operator will:
- Watch a custom resource called
PodRestarter - Finds pods matching a label selector in a namespace
- Restarts them every X minutes (configurable) by deleting the pods and letting Kubernetes recreate them
- Tracks restart counts and last restart time in status
- Total restarts performed
- Last restart timestamp
- Next scheduled restart
- Count of matching pods
- Current conditions
The operator automatically restarts pods based on:
- Label selector (which pods to restart)
- Interval (how often, 1-1440 minutes)
- Strategy (how to restart)
  - all: all pods at once
  - rolling: one pod at a time
  - random-one: one random pod
- MaxConcurrent (limit simultaneous restarts)
- Suspend (pause/resume functionality)
The operator will have:
- A PodRestarter CRD.
- A controller that watches these resources and restarts pods accordingly.
The operator needs minimal permissions:
- PodRestarters: CRUD operations
- Pods: Read and Delete only
The pod restarter will be configured under the demo.com domain in the resilience group:
resilience.demo.com/v1
└───┬────┘ └──┬───┘└┬┘
    │         │     │
    │         │     └─ Version
    │         └─ Domain (created with kubebuilder init)
    └─ Group (logical grouping of resources)
The full API for the Operator would then be: resilience.demo.com/v1/PodRestarter
If needed, I can add more Groups and Kinds under the demo.com domain.
Creating a Kubebuilder project #
The first step is initializing the project and setting up the basic project structure by running kubebuilder init in the repository:
kubebuilder init \
--domain demo.com \
--repo github.com/AshwinSarimin/pod-restarter-operator \
--project-name pod-restarter-operator
- domain: Sets the API Group domain for the CRDs. All APIs will be organized under this domain.
- repo: Defines the version control repository location.
- project-name: Sets the name for the operator project; it is used for naming components and files (Makefile targets, Dockerfile, and deployment manifests) and affects the directory structure.
After running the command, Kubebuilder will create the files and folder structure that follow Kubernetes operator development best practices:
pod-restarter-operator/
├── cmd/
│ └── main.go # Operator entry point
├── config/
│ ├── default/ # Kustomize configs
│ ├── manager/ # Deployment configs
│ ├── network-policy/ # Netpol configs
│ └── rbac/ # RBAC permissions
├── test/ # e2e tests
├── Dockerfile # Container build
├── go.mod # Go module name and dependencies
├── Makefile # Build automation
└── PROJECT # Tracks Kubebuilder metadata
cmd/main.go is the main entry point for the Operator.
The Makefile will be used for automating several steps, like generating manifests, creating a KinD cluster or pushing the controller to a container registry.
Having set up the project structure, the next step is to define our custom resource API.
Create an API #
The next step is to create a new API group and version, and a new CRD (Kind):
kubebuilder create api \
--group resilience \
--version v1 \
--kind PodRestarter \
--resource \
--controller
The command scaffolds the API types for the PodRestarter Kind, from which the Custom Resource Definition (CRD) is generated, along with a sample Custom Resource (CR). It creates the API with the group resilience.demo.com and version v1, which uniquely identifies the new PodRestarter Kind.
The following files are created in the project:
pod-restarter-operator/
├── api/
│   └── v1/
│       ├── podrestarter_types.go            # CRD definition             # ← This must be configured
│       └── zz_generated.deepcopy.go         # Generated
├── config/
│   ├── crd/                                 # CRDs that will be generated
│   ├── rbac/                                # Permissions
│   ├── samples/                             # Example CR                 # ← This must be configured
│   └── manager/                             # Kustomize files for deployment
└── internal/
    └── controller/
        ├── podrestarter_controller.go       # Main reconciliation logic  # ← This must be configured
        └── podrestarter_controller_test.go  # Test stub
The most important files here are api/v1/podrestarter_types.go, where the API is defined, and internal/controller/podrestarter_controller.go, where the reconciliation logic is implemented for this Kind (CRD).
It also updates cmd/main.go to register the new types.
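Stripped down to the essentials, that registration in cmd/main.go looks roughly like the sketch below. This is a trimmed illustration of what the scaffold does, not the generated file itself (which also configures flags, metrics, health probes, and leader election), and it assumes the module path from the kubebuilder init command above.

// Trimmed sketch of the scaffolded cmd/main.go; flags, metrics and health
// probes are omitted. Paths assume the repo used during kubebuilder init.
package main

import (
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"

	resiliencev1 "github.com/AshwinSarimin/pod-restarter-operator/api/v1"
	"github.com/AshwinSarimin/pod-restarter-operator/internal/controller"
)

var scheme = runtime.NewScheme()

func main() {
	// Register built-in and custom types so the client can (de)serialize them.
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(resiliencev1.AddToScheme(scheme))

	// The manager hosts the controller, shared caches, and clients.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		os.Exit(1)
	}

	// Wire the PodRestarter reconciler into the manager.
	if err := (&controller.PodRestarterReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}

	// Run until the process receives a termination signal.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}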
Configure the API #
The api/v1/podrestarter_types.go file defines what users can configure in the PodRestarter CR. The content of this file must be replaced with the new logic for the PodRestarter Operator, which will contain:
- Simple spec with selector and interval
- Status with restart counts and timestamps
- Validation markers
The full code for api/v1/podrestarter_types.go:
/*
Copyright 2025.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// PodRestarterSpec defines the desired state of PodRestarter
type PodRestarterSpec struct {
// Selector is the label selector to find pods to restart
// +kubebuilder:validation:Required
Selector metav1.LabelSelector `json:"selector"`
// IntervalMinutes is how often to restart pods (in minutes)
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=1440
// +kubebuilder:default=5
IntervalMinutes int32 `json:"intervalMinutes,omitempty"`
// Strategy defines how to restart pods
// - "all": Restart all matching pods at once
// - "rolling": Restart one pod at a time
// - "random-one": Restart one random pod
// +kubebuilder:validation:Enum=all;rolling;random-one
// +kubebuilder:default="all"
Strategy string `json:"strategy,omitempty"`
// MaxConcurrent limits how many pods to restart at once
// 0 means no limit
// +kubebuilder:validation:Minimum=0
// +kubebuilder:default=0
MaxConcurrent int32 `json:"maxConcurrent,omitempty"`
// Suspend will pause pod restarts when true
// +kubebuilder:default=false
Suspend bool `json:"suspend,omitempty"`
}
// PodRestarterStatus defines the observed state of PodRestarter
type PodRestarterStatus struct {
// TotalRestarts is the total number of pod restarts performed
TotalRestarts int32 `json:"totalRestarts,omitempty"`
// LastRestartTime is when pods were last restarted
LastRestartTime *metav1.Time `json:"lastRestartTime,omitempty"`
// NextRestartTime is when pods will be restarted next
NextRestartTime *metav1.Time `json:"nextRestartTime,omitempty"`
// MatchingPods is the current number of pods matching the selector
MatchingPods int32 `json:"matchingPods,omitempty"`
// Conditions represent the latest available observations
Conditions []metav1.Condition `json:"conditions,omitempty"`
// ObservedGeneration reflects the generation of the most recently observed PodRestarter
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:scope=Namespaced
// +kubebuilder:printcolumn:name="Strategy",type=string,JSONPath=`.spec.strategy`
// +kubebuilder:printcolumn:name="Interval",type=integer,JSONPath=`.spec.intervalMinutes`
// +kubebuilder:printcolumn:name="Matching Pods",type=integer,JSONPath=`.status.matchingPods`
// +kubebuilder:printcolumn:name="Total Restarts",type=integer,JSONPath=`.status.totalRestarts`
// +kubebuilder:printcolumn:name="Last Restart",type=date,JSONPath=`.status.lastRestartTime`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// PodRestarter is the Schema for the podrestarters API
type PodRestarter struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec PodRestarterSpec `json:"spec,omitempty"`
Status PodRestarterStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
// PodRestarterList contains a list of PodRestarter
type PodRestarterList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []PodRestarter `json:"items"`
}
func init() {
SchemeBuilder.Register(&PodRestarter{}, &PodRestarterList{})
}
Each time the API definitions are edited, the CRD manifests must be (re)generated:
make manifests
This ensures that the CRD manifests and related Kustomize files in the config folder are created or updated based on the API definition(s).
Configure the Controller #
The internal/controller/podrestarter_controller.go file contains the reconciliation logic. The content of this file must be replaced with the new logic for the PodRestarter Operator, which will:
- Find pods matching the selector
- Check whether the interval has elapsed
- Delete pods according to the strategy
- Update the status
The full code for internal/controller/podrestarter_controller.go:
/*
Copyright 2025.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package controller
import (
"context"
"fmt"
"math/rand"
"time"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
resiliencev1 "github.com/AshwinSarimin/pod-restarter-operator/api/v1"
)
const (
ConditionTypeReady = "Ready"
)
// PodRestarterReconciler reconciles a PodRestarter object
type PodRestarterReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;delete
// Reconcile is the main reconciliation loop
func (r *PodRestarterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// Fetch the PodRestarter instance
podRestarter := &resiliencev1.PodRestarter{}
if err := r.Get(ctx, req.NamespacedName, podRestarter); err != nil {
if errors.IsNotFound(err) {
log.Info("PodRestarter resource not found. Ignoring since object must be deleted")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get PodRestarter")
return ctrl.Result{}, err
}
// Check if suspended
if podRestarter.Spec.Suspend {
log.Info("PodRestarter is suspended, skipping reconciliation")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "Suspended", "Pod restarts are suspended")
if err := r.Status().Update(ctx, podRestarter); err != nil {
log.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
return ctrl.Result{RequeueAfter: r.getInterval(podRestarter)}, nil
}
// Find matching pods
podList := &corev1.PodList{}
listOpts := []client.ListOption{
client.InNamespace(podRestarter.Namespace),
}
if podRestarter.Spec.Selector.MatchLabels != nil {
listOpts = append(listOpts, client.MatchingLabels(podRestarter.Spec.Selector.MatchLabels))
}
if err := r.List(ctx, podList, listOpts...); err != nil {
log.Error(err, "Failed to list pods")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "ListFailed", fmt.Sprintf("Failed to list pods: %v", err))
if statusErr := r.Status().Update(ctx, podRestarter); statusErr != nil {
log.Error(statusErr, "Failed to update status")
}
return ctrl.Result{}, err
}
// Update matching pods count
podRestarter.Status.MatchingPods = int32(len(podList.Items))
podRestarter.Status.ObservedGeneration = podRestarter.Generation
// Calculate if it's time to restart
interval := r.getInterval(podRestarter)
shouldRestart := false
if podRestarter.Status.LastRestartTime == nil {
shouldRestart = true
log.Info("First restart - will restart pods immediately")
} else {
timeSinceLastRestart := time.Since(podRestarter.Status.LastRestartTime.Time)
shouldRestart = timeSinceLastRestart >= interval
log.V(1).Info("Checking restart interval",
"timeSinceLastRestart", timeSinceLastRestart,
"interval", interval,
"shouldRestart", shouldRestart)
}
if shouldRestart {
if len(podList.Items) == 0 {
log.Info("No pods matching selector, skipping restart")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "NoPodsFound", "No pods match the selector")
} else {
// Restart pods based on strategy
restarted, err := r.restartPods(ctx, podRestarter, podList.Items)
if err != nil {
log.Error(err, "Failed to restart pods")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "RestartFailed", fmt.Sprintf("Failed to restart pods: %v", err))
if statusErr := r.Status().Update(ctx, podRestarter); statusErr != nil {
log.Error(statusErr, "Failed to update status")
}
return ctrl.Result{}, err
}
log.Info("Successfully restarted pods", "count", restarted, "strategy", podRestarter.Spec.Strategy)
// Update status
podRestarter.Status.TotalRestarts += int32(restarted)
now := metav1.Now()
podRestarter.Status.LastRestartTime = &now
nextRestart := metav1.NewTime(now.Add(interval))
podRestarter.Status.NextRestartTime = &nextRestart
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "Restarted", fmt.Sprintf("Restarted %d pod(s)", restarted))
}
} else {
nextRestart := podRestarter.Status.LastRestartTime.Time.Add(interval)
log.V(1).Info("Not time to restart yet", "nextRestartTime", nextRestart)
podRestarter.Status.NextRestartTime = &metav1.Time{Time: nextRestart}
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "Waiting", "Waiting for next restart interval")
}
// Update status
if err := r.Status().Update(ctx, podRestarter); err != nil {
log.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
// Requeue after interval
return ctrl.Result{RequeueAfter: interval}, nil
}
// restartPods restarts pods based on the strategy
func (r *PodRestarterReconciler) restartPods(ctx context.Context, podRestarter *resiliencev1.PodRestarter, pods []corev1.Pod) (int, error) {
log := log.FromContext(ctx)
restarted := 0
strategy := podRestarter.Spec.Strategy
if strategy == "" {
strategy = "all"
}
maxConcurrent := int(podRestarter.Spec.MaxConcurrent)
if maxConcurrent == 0 {
maxConcurrent = len(pods)
}
switch strategy {
case "random-one":
if len(pods) > 0 {
randomIndex := rand.Intn(len(pods))
pod := pods[randomIndex]
log.Info("Restarting pod (random-one strategy)", "pod", pod.Name)
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
return restarted, fmt.Errorf("failed to delete pod %s: %w", pod.Name, err)
}
}
restarted++
}
case "rolling":
for i, pod := range pods {
if restarted >= maxConcurrent {
log.Info("Reached maxConcurrent limit", "restarted", restarted, "maxConcurrent", maxConcurrent)
break
}
log.Info("Restarting pod (rolling strategy)", "pod", pod.Name, "index", i+1, "total", len(pods))
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
log.Error(err, "Failed to delete pod", "pod", pod.Name)
continue
}
}
restarted++
}
case "all":
fallthrough
default:
for _, pod := range pods {
if restarted >= maxConcurrent && maxConcurrent != len(pods) {
log.Info("Reached maxConcurrent limit", "restarted", restarted, "maxConcurrent", maxConcurrent)
break
}
log.Info("Restarting pod (all strategy)", "pod", pod.Name)
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
log.Error(err, "Failed to delete pod", "pod", pod.Name)
continue
}
}
restarted++
}
}
return restarted, nil
}
// getInterval returns the restart interval duration
func (r *PodRestarterReconciler) getInterval(podRestarter *resiliencev1.PodRestarter) time.Duration {
minutes := podRestarter.Spec.IntervalMinutes
if minutes == 0 {
minutes = 5
}
return time.Duration(minutes) * time.Minute
}
// setCondition sets a condition on the PodRestarter status
func (r *PodRestarterReconciler) setCondition(podRestarter *resiliencev1.PodRestarter, conditionType string, status metav1.ConditionStatus, reason, message string) {
condition := metav1.Condition{
Type: conditionType,
Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
ObservedGeneration: podRestarter.Generation,
}
found := false
for i, existingCondition := range podRestarter.Status.Conditions {
if existingCondition.Type == conditionType {
if existingCondition.Status != status {
podRestarter.Status.Conditions[i] = condition
} else {
podRestarter.Status.Conditions[i].Message = message
podRestarter.Status.Conditions[i].Reason = reason
podRestarter.Status.Conditions[i].LastTransitionTime = metav1.Now()
}
found = true
break
}
}
if !found {
podRestarter.Status.Conditions = append(podRestarter.Status.Conditions, condition)
}
}
// SetupWithManager sets up the controller with the Manager
func (r *PodRestarterReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&resiliencev1.PodRestarter{}).
Complete(r)
}
The controller contains several functions, of which Reconcile() is the most important. It is triggered whenever:
- A PodRestarter resource is created, updated, or deleted
- The configured interval has elapsed (the controller requeues itself with RequeueAfter)
- Any other resource the controller is set up to watch changes (the scaffolded SetupWithManager only watches PodRestarter objects; see the sketch after this list for how pod events could be watched as well)
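If the operator should also react immediately when matching pods change (for example, when a pod is deleted manually), SetupWithManager could be extended with an additional watch. The snippet below is a hypothetical sketch rather than part of the controller above: mapPodToRestarters is a helper you would have to write yourself, and it needs the handler, reconcile, and types packages as extra imports.

// Hypothetical extension: also watch Pods and map pod events back to the
// PodRestarter objects in the same namespace (not part of the scaffold above).
func (r *PodRestarterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&resiliencev1.PodRestarter{}).
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(r.mapPodToRestarters)).
		Complete(r)
}

// mapPodToRestarters enqueues a reconcile request for every PodRestarter in
// the namespace of the pod that triggered the event.
func (r *PodRestarterReconciler) mapPodToRestarters(ctx context.Context, obj client.Object) []reconcile.Request {
	restarters := &resiliencev1.PodRestarterList{}
	if err := r.List(ctx, restarters, client.InNamespace(obj.GetNamespace())); err != nil {
		return nil
	}
	requests := make([]reconcile.Request, 0, len(restarters.Items))
	for _, pr := range restarters.Items {
		requests = append(requests, reconcile.Request{
			NamespacedName: types.NamespacedName{Name: pr.Name, Namespace: pr.Namespace},
		})
	}
	return requests
}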
After the API and Controller are configured, the next step is to regenerate the manifests, test and build the operator locally:
Generate manifests
# Generate CRD and RBAC manifests
make manifests
# Generates:
# - config/crd/bases/resilience.demo.com_podrestarters.yaml
# - config/rbac/role.yaml (RBAC permissions)
# Generate DeepCopy code
make generate
# Generates:
# - zz_generated.deepcopy.go (DeepCopy methods)
make manifests reads the Kubebuilder markers
in the code and generates Kubernetes manifests from Go types.
Validate and test
# Format all Go code according to standard style
make fmt
# Static analysis to find common mistakes
make vet
# Run all tests
make test
Expected output:
ok github.com/AshwinSarimin/pod-restarter-operator/api/v1 0.234s
ok github.com/AshwinSarimin/pod-restarter-operator/internal/controller 2.456s
Build the operator locally
# Build
make build
# Compiles the operator binary
# Output: bin/manager
If the build is successful, you'll see:
go build -o bin/manager cmd/main.go
The binary is now in bin/manager, but it doesn't do anything useful yet.
To summarize, we have the following configurations:
| Concept | Info |
|---|---|
| Kubebuilder Project | The entire Go application |
| API (Custom Resource) | New resource type definition (PodRestarter struct in podrestarter_types.go) |
| Controller | The reconciliation logic (Reconcile() function in podrestarter_controller.go) |
| CRD | Kubernetes manifest describing the API |
KubeBuilder markers #
KubeBuilder markers are special comments in Go code that the Kubebuilder tool reads to generate Kubernetes manifests and configuration. They follow a specific format using Go comment syntax and act as annotations that provide instructions to the code generator.
For example in the internal/controller/podrestarter_controller.go file there are markers defined to generate RBAC rules for the Operator:
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;delete
These markers specify the API group the controller needs permissions for (resilience.demo.com), define the resource types in that group (podrestarters and its subresources), and list the verbs the controller needs on each resource.
The markers generate ClusterRole rules in the controller's RBAC manifests, for example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-editor-role
rules:
- apiGroups:
- resilience.demo.com
resources:
- podrestarters
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
- apiGroups:
- resilience.demo.com
resources:
- podrestarters/status
verbs:
- get
Other frequently used markers include:
Resource Definition Markers:
// +kubebuilder:resource:scope=Namespaced,shortName=pr
Defines properties of the CRD, including scope (Namespaced or Cluster) and shortName for kubectl.
Subresource Markers:
// +kubebuilder:subresource:status
Enables the status subresource, allowing separate status updates.
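This is why the controller above writes status through the status client. As an illustrative snippet in the context of the reconciler (using the project's types): spec and status changes go through different calls, and with the subresource enabled a regular Update ignores changes made to .status.

// Spec/metadata changes go through the regular client ...
podRestarter.Spec.Suspend = true
if err := r.Update(ctx, podRestarter); err != nil {
	return ctrl.Result{}, err
}

// ... while status changes must go through the status subresource client.
podRestarter.Status.TotalRestarts++
if err := r.Status().Update(ctx, podRestarter); err != nil {
	return ctrl.Result{}, err
}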
Validation Markers:
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=100
Adds OpenAPI validation rules to the CRD.
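For example, the IntervalMinutes field defined earlier combines validation and default markers; the API server rejects values outside 1-1440 and fills in 5 when the field is omitted:

// IntervalMinutes is how often to restart pods (in minutes)
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=1440
// +kubebuilder:default=5
IntervalMinutes int32 `json:"intervalMinutes,omitempty"`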
Printcolumn Markers:
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
Defines additional columns for kubectl get command output.
Webhook Markers:
// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=fail
Configures webhook settings for validation or mutation.
These markers are converted to YAML when executing make manifests, which generates all the necessary Kubernetes resources for the operator to function with the correct permissions and configurations.
After implementing the controller logic, we need to test our operator in a real Kubernetes environment.
Test the Operator #
The operator can be tested locally in a KinD cluster:
- Create KinD cluster
# Create cluster
kind create cluster --name test-cluster
# Verify
kubectl cluster-info --context test-cluster
kubectl get nodes
- Install CRDs
# Install the PodRestarter CRD
make install
# Verify
kubectl get crd podrestarters.resilience.demo.com
kubectl describe crd podrestarters.resilience.demo.com
- Run operator locally
This optional step lets you run the operator locally and follow the PodRestarter operator's activity against the KinD cluster from the local terminal. What it does:
- Compiles and runs the operator on the local machine
- Connects to the Kind cluster via kubeconfig
- Watches for PodRestarter resources
- Logs all activity to the terminal
Open a terminal and run:
make run
Expected output:
2025-01-07T10:30:00Z INFO setup starting manager
2025-01-07T10:30:01Z INFO Starting EventSource {"controller": "podrestarter"}
2025-01-07T10:30:01Z INFO Starting Controller {"controller": "podrestarter"}
2025-01-07T10:30:01Z INFO Starting workers {"controller": "podrestarter", "worker count": 1}
Leave this terminal running as a log viewer.
- Deploy demo application
- Open a new terminal.
- Deploy a demo app with 3 replicas
kubectl create deployment demo-app --image=nginx:alpine --replicas=3
- Update the sample CR
Update config/samples/resilience_v1_podrestarter.yaml:
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-sample
spec:
selector:
matchLabels:
app: demo-app # Must match the demo application's pod label
intervalMinutes: 2
strategy: rolling
maxConcurrent: 1
suspend: false
- Create the PodRestarter CR
kubectl apply -f config/samples/resilience_v1_podrestarter.yaml
- Check status
In the operator logs terminal, you will see that 1 pod has been restarted by the operator:
2025-10-17T17:01:15+02:00 INFO Restarting pod (rolling strategy) {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "pod": "demo-app-56bb8cfcd7-mprqn", "index": 1, "total": 3}
2025-10-17T17:01:15+02:00 INFO Reached maxConcurrent limit {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "restarted": 1, "maxConcurrent": 1}
2025-10-17T17:01:15+02:00 INFO Successfully restarted pods {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "count": 1, "strategy": "rolling"}
2025-10-17T17:01:15+02:00 DEBUG Checking restart interval {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "22db838a-ea40-482e-a990-36a791924d0f", "timeSinceLastRestart": "453.161346ms", "interval": "2m0s", "shouldRestart": false}
2025-10-17T17:01:15+02:00 DEBUG Not time to restart yet {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "22db838a-ea40-482e-a990-36a791924d0f", "nextRestartTime": "2025-10-17T17:03:15+02:00"}
In the cluster you will see that one pod is terminated and recreated by the Deployment.
NAME READY STATUS RESTARTS AGE
demo-app-56bb8cfcd7-c4zbc 1/1 Running 0 3m17s
demo-app-56bb8cfcd7-hkpmt 1/1 Running 0 9m31s
demo-app-56bb8cfcd7-lm4m5 1/1 Running 0 9m31s
demo-app-56bb8cfcd7-lm4m5 1/1 Terminating 0 10m
demo-app-56bb8cfcd7-whq5p 0/1 Pending 0 0s
demo-app-56bb8cfcd7-whq5p 0/1 Pending 0 0s
demo-app-56bb8cfcd7-whq5p 0/1 ContainerCreating 0 0s
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-whq5p 1/1 Running 0 0s
The status of the PodRestarter CR shows information about matching pods, total restarts, and the last restart.
# Check status:
kubectl get podrestarter
# Check detailed status:
kubectl describe podrestarter podrestarter-sample
- Test different strategies
The operator also supports other restart strategies to try out:
All strategy (restart all at once):
kubectl apply -f - <<EOF
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
name: restarter-all
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: all
EOF
Random-one strategy (one random pod):
kubectl apply -f - <<EOF
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
name: restarter-random
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: random-one
EOF
- Test suspend feature
The operator includes a suspend feature, handled in the controller code. It can be enabled by redeploying the PodRestarter CR:
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-sample
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: rolling
maxConcurrent: 1
suspend: true # Change this to true
kubectl apply -f config/samples/resilience_v1_podrestarter.yaml
Or by patching the current PodRestarter CR:
kubectl patch podrestarter podrestarter-sample -p '{"spec":{"suspend":true}}' --type=merge
# Check status
kubectl get podrestarter podrestarter-sample
The pods should now stop restarting and the operator logs would show:
2025-10-17T17:35:00Z INFO PodRestarter is suspended, skipping reconciliation
Resuming restarts can be done by changing the suspend value to false:
kubectl patch podrestarter podrestarter-sample -p '{"spec":{"suspend":false}}' --type=merge
Run the Operator #
Now that the operator is ready, it can be used in other clusters. For this I use my homelab cluster to run the operator and an Azure Container Registry to store the controller image.
Kubebuilder comes with built-in commands for building and pushing images to a registry, and configuring the cluster.
- Build image
make docker-build IMG=<REGISTRY_NAME>/pod-restarter:v0.1.0
#For example
make docker-build IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
The command uses the Dockerfile in the Kubebuilder project, a multi-stage build that compiles the Go binary and then copies it into a minimal, distroless runtime image. The runtime stage contains only the binary, which keeps the image small, speeds up deployment, and minimizes the attack surface.
- Push to registry
To push an image to a private registry (like an Azure Container Registry), make sure you have the proper permissions.
az login --tenant xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
az acr login --name teknologieur1acr
make docker-push IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
- Install CRDs on cluster
# Switch to the cluster context
kubectl config get-contexts
kubectl config use-context homelab-admin@homelab
# Install CRDs
make install
- Deploy Operator
# Deploy operator
make deploy IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
# Verify
kubectl get pods -n pod-restarter-operator-system
kubectl logs -n pod-restarter-operator-system deployment/pod-restarter-operator-controller-manager -f
- Cleanup (optional)
make undeploy
make uninstall
KubeBuilder common commands #
# Initialize project
kubebuilder init --domain demo.com --repo github.com/you/pod-restarter-operator
# Create API
kubebuilder create api --group resilience --version v1 --kind PodRestarter
# Generate CRD manifests from Go structs
make manifests
# Install CRDs in cluster
make install
# Run operator locally (for development)
make run
# Build and push Docker image
make docker-build docker-push IMG=<REGISTRY_NAME>/podrestarter-operator:v1
# Deploy operator to cluster
make deploy IMG=<REGISTRY_NAME>/podrestarter:v1
# Uninstall CRDs
make uninstall
# Undeploy operator
make undeploy
Conclusion #
In this post, we’ve explored how to build a Kubernetes operator using KubeBuilder. We’ve seen how to scaffold a new project, define custom resources, implement reconciliation logic, and deploy the operator to a Kubernetes cluster.
The Pod Restarter operator we’ve built demonstrates several key concepts:
- Creating custom resources with validation
- Implementing a controller with reconciliation logic
- Working with Kubernetes resources programmatically
- Using status subresources to report on operator activities
- Testing and deploying operators
While this example is relatively simple, it shows the power of the operator pattern for extending Kubernetes with custom behavior. The same principles can be applied to build more complex operators that manage databases, handle application-specific scaling, or automate infrastructure provisioning.
Additional resources #
- KubeBuilder Book - The official documentation for KubeBuilder
- Operator Pattern - Kubernetes documentation on operators
- Controller Runtime - The library that powers KubeBuilder
- API Conventions