Kubernetes operators have become a powerful tool for automating complex operational tasks in Kubernetes environments. While basic deployments can be managed with native resources, operators allow us to encode domain-specific knowledge and automation into Kubernetes-native extensions. Whether you need to automate database failovers, handle complex application lifecycles, or orchestrate custom resources, operators provide a proven pattern for solving these challenges.
In this post, I’ll walk through creating a simple yet practical Pod Restarter operator using KubeBuilder, and by the end of this tutorial you’ll understand:
- How to use KubeBuilder to scaffold an operator project
- How to define custom resources (CRDs)
- How to implement reconciliation logic
- How to test and deploy your operator
Kubernetes operators #
A Kubernetes Operator extends Kubernetes by using custom resources to make managing applications more automated, efficient, and reliable. It packages operational tasks into software that runs inside the cluster and builds on the core Kubernetes resource and controller concepts to automate the entire lifecycle of the application it manages. The benefits of creating and using operators are:
- Automation: Eliminate manual operational tasks
- Consistency: Apply best practices every time
- Self-healing: Automatically detect and fix issues
- Declarative Management: Define the desired state
- Domain Knowledge: Encode operational expertise into reusable software
Controllers #
A Kubernetes cluster has built-in controllers as part of the kube-controller-manager that manage Kubernetes native resources, such as:
- ReplicaSet controller: Ensures the specified number of pod replicas are running.
- Deployment controller: Manages ReplicaSets and handles rolling updates.
- DaemonSet controller: Ensures a pod runs on all or selected nodes.
- Service controller: Manages service endpoints and load balancers.
- Node controller: Monitors node health and handles evictions.
The controllers run a non-terminating control loop that watches the shared state of the cluster through the API server and makes changes to move the current state towards the desired state.
A custom operator uses the same controller pattern to extend Kubernetes with domain-specific knowledge for managing custom resources, for example managing custom workflows, managing external systems (databases, DNS records, cloud resources), implementing complex application logic, or automating application-specific operational knowledge.
Every operator IS a controller, but not every controller is called an operator.
An Operator contains one or more specialized controllers that:
- Manage custom resources via Custom Resource Definitions (CRDs).
  - A CRD is the Kubernetes manifest that tells Kubernetes about the new resource type. It's the "schema" that Kubernetes uses to validate custom resources.
  - A Custom Resource (CR) is an instance of a CRD: a custom API object that represents the application's desired state.
- Implement the controller logic in a control loop (sketched below) that:
  - Watches the desired state defined in the custom resource.
  - Observes the current state.
  - Reconciles the difference by taking action to move the current state towards the desired state.
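To make the control loop concrete, here is a rough, generic sketch of a controller-runtime reconciler. It is not the PodRestarter controller built later in this post; Widget, examplev1, and WidgetReconciler are hypothetical names used only for illustration, and the ctrl and client aliases refer to the usual controller-runtime packages.

// Minimal sketch of the reconcile pattern with controller-runtime.
// Widget / examplev1 / WidgetReconciler are placeholders, not real types.
func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Observe the desired state: read the custom resource.
	widget := &examplev1.Widget{}
	if err := r.Get(ctx, req.NamespacedName, widget); err != nil {
		// The object may have been deleted; nothing to do in that case.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Observe the current state: list/get the objects the Widget manages
	//    (pods, deployments, external systems, ...).

	// 3. Reconcile: create, update, or delete objects so the current state
	//    matches widget.Spec, and record progress in widget.Status.

	// 4. Requeue so the loop never terminates and drift is corrected later.
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}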
Difference between a controller and operator #
To understand the difference between controllers and operators, let’s examine two popular Kubernetes projects: External-DNS and Cert-Manager.
External-DNS
External-DNS syncs Kubernetes resources with external DNS providers. It’s considered a controller rather than a full operator because:
- It watches Kubernetes native resources (Services, Ingresses) without defining custom resources
- It performs a specific, focused task rather than managing a complete application lifecycle
- It translates Kubernetes resources into DNS records in external providers
Cert-Manager
Cert-Manager manages TLS certificates in Kubernetes and qualifies as an operator because:
- It defines Custom Resource Definitions (CRDs) like Certificate, Issuer, and ClusterIssuer
- It manages the complete certificate lifecycle from issuance through renewal
- It encodes operational knowledge about certificate management (including handling ACME challenges, interacting with certificate authorities, and managing PKI)
While Cert-Manager is a relatively straightforward operator compared to something like a database operator, it fits the operator pattern by extending Kubernetes with custom resources and implementing domain-specific knowledge about certificate management.
Now that we understand what operators are, let’s explore the KubeBuilder framework that will help us build one.
KubeBuilder #
Kubebuilder streamlines operator development by providing a structured framework that integrates with controller-runtime and Kubernetes APIs. It abstracts repetitive setup tasks, enabling efficient development of maintainable Kubernetes extensions so that you only have to focus on the business logic.
A Kubebuilder project represents the entire Go application that becomes the Operator and consists of:
- API (Custom Resource): Go structs that define your resource structure
- CRD: YAML manifest generated from Go structs that teaches Kubernetes about your resource
- Controller: The reconciliation logic that watches resources and makes changes
KubeBuilder’s scaffolding approach offers several advantages:
- Reduced Boilerplate: Instead of writing hundreds of lines of setup code, KubeBuilder generates the project structure and boilerplate, letting you focus on business logic.
- Standardized Structure: All KubeBuilder projects follow a consistent layout, making it easier for developers to navigate and understand the codebase.
- Best Practices Built-in: The generated code incorporates community best practices for Kubernetes controllers, including proper error handling, logging, and metrics.
- Simplified API Management: KubeBuilder handles CRD generation, validation markers, and conversion webhooks, simplifying the management of custom APIs.
- Incremental Development: You can start with a simple controller and incrementally add features like webhooks, multiple API versions, and additional controllers.
Without KubeBuilder, you would need to:
- Set up the project structure manually
- Configure Go modules and dependencies
- Write boilerplate code for controllers
- Create RBAC manifests by hand
- Generate CRDs from Go structs manually
KubeBuilder handles all of these tasks automatically, making it much faster to develop and maintain Kubernetes operators.
Prerequisites #
I am using WSL to run Kubebuilder, which needs the following prerequisites:
- Go (version v1.24.5+)
- Make
- Docker (for building images and backing the KinD cluster)
- Kubectl (version v1.11.3+)
- KinD (for testing the operator in a local cluster)
Install Go
# Download the latest Go version
GO_VERSION=$(curl -sL 'https://go.dev/VERSION?m=text' | head -n 1)
wget https://go.dev/dl/${GO_VERSION}.linux-amd64.tar.gz
# Remove the old Go installation if it exists
if [ -d '/usr/local/go' ]; then
rm -rf '/usr/local/go'
fi
# Extract and install
sudo tar -C /usr/local -xzf ${GO_VERSION}.linux-amd64.tar.gz
# Remove the downloaded tarball
rm ${GO_VERSION}.linux-amd64.tar.gz
# Set the PATH Variable
echo 'export PATH=$HOME/go/bin:/usr/local/go/bin:$PATH' >> ~/.profile
source ~/.profile
# Verify the installation
go version
Install Make
sudo apt update
sudo apt install make
make --version
Install Docker
sudo apt update
# Install dependencies
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent
# Then add the GPG key for the official Docker repository to your system:
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor | sudo tee /usr/share/keyrings/docker-ce-archive-keyring.gpg > /dev/null
# Add the Docker repository to APT sources
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-ce-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker-ce.list > /dev/null
# Install Docker
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
# Check that it’s running
sudo systemctl status docker
# If you want to avoid typing `sudo` whenever you run the `docker` command, add your username to the `docker` group
sudo usermod -aG docker ${USER}
logout
# Check groups assignments for user
groups
Install KinD
[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo cp ./kind /usr/local/bin/kind
rm -rf kind
Install Kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
kubectl version --client
Installation #
Download and install Kubebuilder
curl -L -o kubebuilder "https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)"
chmod +x kubebuilder && sudo mv kubebuilder /usr/local/bin/
kubebuilder version
With our development environment ready, we can move on to creating our first Kubernetes operator.
Pod restarter operator #
We will start by creating a simple operator called pod-restarter-operator. Building it will give an in-depth look at how Kubebuilder works and how an operator works. The code for the operator can be found here: pod-restarter-operator.
The pod restarter operator automatically restarts pods on a schedule. The logic is simple and easy to understand, has no external dependencies, and is built entirely from Kubernetes-native building blocks: Custom Resources, Controllers, and the reconciliation loop.
%%{init: {
'theme': 'dark',
'themeVariables': {
'primaryTextColor': '#FFFFFF',
'lineColor': '#E2E8F0',
'noteTextColor': '#FFFFFF',
'actorTextColor': '#FFFFFF',
'signalColor': '#FFFFFF',
'signalTextColor': '#FFFFFF'
}
}}%%
sequenceDiagram
participant API as Kubernetes API
participant Op as Pod Restarter Operator
participant Pods as Pods
participant K8s as Kubernetes (ReplicaSet)
Note over API,Op: Operator constantly watches PodRestarter resources
API->>Op: Notify of PodRestarter change
Op->>API: Get PodRestarter resource
API-->>Op: Return PodRestarter
Op->>API: List pods matching selector
API-->>Op: Return matching pods
Op->>Op: Check if interval elapsed
alt Interval Not Elapsed
Op->>API: Update status
Op->>Op: Requeue reconciliation
else Interval Elapsed
Op->>API: Delete pods (using strategy)
API->>Pods: Delete pods
Note over Pods,K8s: Kubernetes built-in controllers handle recreation
K8s->>API: Create replacement pods
Op->>API: Update PodRestarter status
Op->>Op: Requeue reconciliation
end
The operator will:
- Watch a custom resource called
PodRestarter - Finds pods matching a label selector in a namespace
- Restarts them every X minutes (configurable) by deleting the pods and letting Kubernetes recreate them
- Tracks restart counts and last restart time in status
- Total restarts performed
- Last restart timestamp
- Next scheduled restart
- Count of matching pods
- Current conditions
The operator automatically restarts pods based on:
- Label selector (which pods to restart)
- Interval (how often, 1-1440 minutes)
- Strategy (how to restart)
  - all: all pods at once
  - rolling: one pod at a time
  - random-one: one random pod
- MaxConcurrent (limit simultaneous restarts)
- Suspend (pause/resume functionality)
The operator will have:
- A PodRestarter CRD.
- A controller that watches these resources and restarts pods accordingly.
The operator needs minimal permissions:
- PodRestarters: CRUD operations
- Pods: Read and Delete only
The pod restarter will be configured under the demo.com domain in the resilience group:
resilience.demo.com/v1
└───┬────┘ └──┬───┘└┬┘
    │         │     │
    │         │     └─ Version
    │         └─ Domain (created with kubebuilder init)
    └─ Group (logical grouping of resources)
The full API for the Operator would then be: resilience.demo.com/v1/PodRestarter
If needed, I can add more Groups and Kinds under the demo.com domain.
Creating a Kubebuilder project #
The first step is initializing the project and setting up the basic project structure by running kubebuilder init in the repository:
kubebuilder init \
--domain demo.com \
--repo github.com/AshwinSarimin/pod-restarter-operator \
--project-name pod-restarter-operator
- domain: Sets the API Group domain for the CRDs. All APIs will be organized under this domain.
- repo: Defines the version control repository location.
- project-name: Sets the name for the operator project; it is used for naming components and files (Makefile targets, Dockerfile, and deployment manifests) and affects the directory structure.
After running the command, Kubebuilder will create the files and folder structure that follow Kubernetes operator development best practices:
pod-restarter-operator/
├── cmd/
│ └── main.go # Operator entry point
├── config/
│ ├── default/ # Kustomize configs
│ ├── manager/ # Deployment configs
│ ├── network-policy/ # Netpol configs
│ └── rbac/ # RBAC permissions
├── test/ # e2e tests
├── Dockerfile # Container build
├── go.mod # Go module name and dependencies
├── Makefile # Build automation
└── PROJECT # Tracks Kubebuilder metadata
cmd/main.go is the main entry point for the Operator.
The Makefile will be used for automating several steps, like generating manifests, creating a KinD cluster or pushing the controller to a container registry.
Having set up the project structure, the next step is to define our custom resource API.
Create an API #
The next step is to create a new API group and version, and a new CRD (Kind):
kubebuilder create api \
--group resilience \
--version v1 \
--kind PodRestarter \
--resource \
--controller
The command scaffolds the API types for the PodRestarter Kind, from which the Custom Resource Definition (CRD) is generated, along with a sample Custom Resource (CR). It creates the API with the group resilience.demo.com and version v1, which uniquely identifies the new PodRestarter Kind.
The following files are created in the project:
pod-restarter-operator/
├── api/
│   └── v1/
│       ├── podrestarter_types.go            # CRD definition             # ← This must be configured
│       └── zz_generated.deepcopy.go         # Generated
├── config/
│   ├── crd/                                 # CRDs that will be generated
│   ├── rbac/                                # Permissions
│   ├── samples/                             # Example CR                 # ← This must be configured
│   └── manager/                             # Kustomize files for deployment
└── internal/
    └── controller/
        ├── podrestarter_controller.go       # Main reconciliation logic  # ← This must be configured
        └── podrestarter_controller_test.go  # Test stub
The most important files here are api/v1/podrestarter_types.go, where the API is defined, and internal/controller/podrestarter_controller.go, where the reconciliation logic is implemented for this Kind (CRD).
It also updates cmd/main.go to register the new types.
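Stripped down to the essentials, that registration in cmd/main.go looks roughly like the sketch below. This is a trimmed illustration of what the scaffold does, not the generated file itself (which also configures flags, metrics, health probes, and leader election), and it assumes the module path from the kubebuilder init command above.

// Trimmed sketch of the scaffolded cmd/main.go; flags, metrics and health
// probes are omitted. Paths assume the repo used during kubebuilder init.
package main

import (
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"

	resiliencev1 "github.com/AshwinSarimin/pod-restarter-operator/api/v1"
	"github.com/AshwinSarimin/pod-restarter-operator/internal/controller"
)

var scheme = runtime.NewScheme()

func main() {
	// Register built-in and custom types so the client can (de)serialize them.
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(resiliencev1.AddToScheme(scheme))

	// The manager hosts the controller, shared caches, and clients.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		os.Exit(1)
	}

	// Wire the PodRestarter reconciler into the manager.
	if err := (&controller.PodRestarterReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		os.Exit(1)
	}

	// Run until the process receives a termination signal.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}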
Configure the API #
The api/v1/podrestarter_types.go file defines what users can configure in the PodRestarter CR. The content of this file must be replaced with the new logic for the PodRestarter Operator, which will contain:
- Simple spec with selector and interval
- Status with restart counts and timestamps
- Validation markers
The full code for api/v1/podrestarter_types.go:
/*
Copyright 2025.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package v1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// PodRestarterSpec defines the desired state of PodRestarter
type PodRestarterSpec struct {
// Selector is the label selector to find pods to restart
// +kubebuilder:validation:Required
Selector metav1.LabelSelector `json:"selector"`
// IntervalMinutes is how often to restart pods (in minutes)
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=1440
// +kubebuilder:default=5
IntervalMinutes int32 `json:"intervalMinutes,omitempty"`
// Strategy defines how to restart pods
// - "all": Restart all matching pods at once
// - "rolling": Restart one pod at a time
// - "random-one": Restart one random pod
// +kubebuilder:validation:Enum=all;rolling;random-one
// +kubebuilder:default="all"
Strategy string `json:"strategy,omitempty"`
// MaxConcurrent limits how many pods to restart at once
// 0 means no limit
// +kubebuilder:validation:Minimum=0
// +kubebuilder:default=0
MaxConcurrent int32 `json:"maxConcurrent,omitempty"`
// Suspend will pause pod restarts when true
// +kubebuilder:default=false
Suspend bool `json:"suspend,omitempty"`
}
// PodRestarterStatus defines the observed state of PodRestarter
type PodRestarterStatus struct {
// TotalRestarts is the total number of pod restarts performed
TotalRestarts int32 `json:"totalRestarts,omitempty"`
// LastRestartTime is when pods were last restarted
LastRestartTime *metav1.Time `json:"lastRestartTime,omitempty"`
// NextRestartTime is when pods will be restarted next
NextRestartTime *metav1.Time `json:"nextRestartTime,omitempty"`
// MatchingPods is the current number of pods matching the selector
MatchingPods int32 `json:"matchingPods,omitempty"`
// Conditions represent the latest available observations
Conditions []metav1.Condition `json:"conditions,omitempty"`
// ObservedGeneration reflects the generation of the most recently observed PodRestarter
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:scope=Namespaced
// +kubebuilder:printcolumn:name="Strategy",type=string,JSONPath=`.spec.strategy`
// +kubebuilder:printcolumn:name="Interval",type=integer,JSONPath=`.spec.intervalMinutes`
// +kubebuilder:printcolumn:name="Matching Pods",type=integer,JSONPath=`.status.matchingPods`
// +kubebuilder:printcolumn:name="Total Restarts",type=integer,JSONPath=`.status.totalRestarts`
// +kubebuilder:printcolumn:name="Last Restart",type=date,JSONPath=`.status.lastRestartTime`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// PodRestarter is the Schema for the podrestarters API
type PodRestarter struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec PodRestarterSpec `json:"spec,omitempty"`
Status PodRestarterStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
// PodRestarterList contains a list of PodRestarter
type PodRestarterList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []PodRestarter `json:"items"`
}
func init() {
SchemeBuilder.Register(&PodRestarter{}, &PodRestarterList{})
}
Each time the API definitions are edited, the CRD manifests must be (re)generated:
make manifests
This ensures that the CRD manifests and related Kustomize files in the config folder are created or updated based on the API definition(s).
Configure the Controller #
The internal/controller/podrestarter_controller.go file contains the reconciliation logic. The content of this file must be replaced with the new logic for the PodRestarter Operator, which will:
- Find pods matching the selector
- Check whether the interval has elapsed
- Delete pods according to the strategy
- Update the status
The full code for internal/controller/podrestarter_controller.go:
/*
Copyright 2025.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/
package controller
import (
"context"
"fmt"
"math/rand"
"time"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
resiliencev1 "github.com/AshwinSarimin/pod-restarter-operator/api/v1"
)
const (
ConditionTypeReady = "Ready"
)
// PodRestarterReconciler reconciles a PodRestarter object
type PodRestarterReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;delete
// Reconcile is the main reconciliation loop
func (r *PodRestarterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
log := log.FromContext(ctx)
// Fetch the PodRestarter instance
podRestarter := &resiliencev1.PodRestarter{}
if err := r.Get(ctx, req.NamespacedName, podRestarter); err != nil {
if errors.IsNotFound(err) {
log.Info("PodRestarter resource not found. Ignoring since object must be deleted")
return ctrl.Result{}, nil
}
log.Error(err, "Failed to get PodRestarter")
return ctrl.Result{}, err
}
// Check if suspended
if podRestarter.Spec.Suspend {
log.Info("PodRestarter is suspended, skipping reconciliation")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "Suspended", "Pod restarts are suspended")
if err := r.Status().Update(ctx, podRestarter); err != nil {
log.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
return ctrl.Result{RequeueAfter: r.getInterval(podRestarter)}, nil
}
// Find matching pods
podList := &corev1.PodList{}
listOpts := []client.ListOption{
client.InNamespace(podRestarter.Namespace),
}
if podRestarter.Spec.Selector.MatchLabels != nil {
listOpts = append(listOpts, client.MatchingLabels(podRestarter.Spec.Selector.MatchLabels))
}
if err := r.List(ctx, podList, listOpts...); err != nil {
log.Error(err, "Failed to list pods")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "ListFailed", fmt.Sprintf("Failed to list pods: %v", err))
if statusErr := r.Status().Update(ctx, podRestarter); statusErr != nil {
log.Error(statusErr, "Failed to update status")
}
return ctrl.Result{}, err
}
// Update matching pods count
podRestarter.Status.MatchingPods = int32(len(podList.Items))
podRestarter.Status.ObservedGeneration = podRestarter.Generation
// Calculate if it's time to restart
interval := r.getInterval(podRestarter)
shouldRestart := false
if podRestarter.Status.LastRestartTime == nil {
shouldRestart = true
log.Info("First restart - will restart pods immediately")
} else {
timeSinceLastRestart := time.Since(podRestarter.Status.LastRestartTime.Time)
shouldRestart = timeSinceLastRestart >= interval
log.V(1).Info("Checking restart interval",
"timeSinceLastRestart", timeSinceLastRestart,
"interval", interval,
"shouldRestart", shouldRestart)
}
if shouldRestart {
if len(podList.Items) == 0 {
log.Info("No pods matching selector, skipping restart")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "NoPodsFound", "No pods match the selector")
} else {
// Restart pods based on strategy
restarted, err := r.restartPods(ctx, podRestarter, podList.Items)
if err != nil {
log.Error(err, "Failed to restart pods")
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionFalse, "RestartFailed", fmt.Sprintf("Failed to restart pods: %v", err))
if statusErr := r.Status().Update(ctx, podRestarter); statusErr != nil {
log.Error(statusErr, "Failed to update status")
}
return ctrl.Result{}, err
}
log.Info("Successfully restarted pods", "count", restarted, "strategy", podRestarter.Spec.Strategy)
// Update status
podRestarter.Status.TotalRestarts += int32(restarted)
now := metav1.Now()
podRestarter.Status.LastRestartTime = &now
nextRestart := metav1.NewTime(now.Add(interval))
podRestarter.Status.NextRestartTime = &nextRestart
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "Restarted", fmt.Sprintf("Restarted %d pod(s)", restarted))
}
} else {
nextRestart := podRestarter.Status.LastRestartTime.Time.Add(interval)
log.V(1).Info("Not time to restart yet", "nextRestartTime", nextRestart)
podRestarter.Status.NextRestartTime = &metav1.Time{Time: nextRestart}
r.setCondition(podRestarter, ConditionTypeReady, metav1.ConditionTrue, "Waiting", "Waiting for next restart interval")
}
// Update status
if err := r.Status().Update(ctx, podRestarter); err != nil {
log.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
// Requeue after interval
return ctrl.Result{RequeueAfter: interval}, nil
}
// restartPods restarts pods based on the strategy
func (r *PodRestarterReconciler) restartPods(ctx context.Context, podRestarter *resiliencev1.PodRestarter, pods []corev1.Pod) (int, error) {
log := log.FromContext(ctx)
restarted := 0
strategy := podRestarter.Spec.Strategy
if strategy == "" {
strategy = "all"
}
maxConcurrent := int(podRestarter.Spec.MaxConcurrent)
if maxConcurrent == 0 {
maxConcurrent = len(pods)
}
switch strategy {
case "random-one":
if len(pods) > 0 {
randomIndex := rand.Intn(len(pods))
pod := pods[randomIndex]
log.Info("Restarting pod (random-one strategy)", "pod", pod.Name)
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
return restarted, fmt.Errorf("failed to delete pod %s: %w", pod.Name, err)
}
}
restarted++
}
case "rolling":
for i, pod := range pods {
if restarted >= maxConcurrent {
log.Info("Reached maxConcurrent limit", "restarted", restarted, "maxConcurrent", maxConcurrent)
break
}
log.Info("Restarting pod (rolling strategy)", "pod", pod.Name, "index", i+1, "total", len(pods))
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
log.Error(err, "Failed to delete pod", "pod", pod.Name)
continue
}
}
restarted++
}
case "all":
fallthrough
default:
for _, pod := range pods {
if restarted >= maxConcurrent && maxConcurrent != len(pods) {
log.Info("Reached maxConcurrent limit", "restarted", restarted, "maxConcurrent", maxConcurrent)
break
}
log.Info("Restarting pod (all strategy)", "pod", pod.Name)
if err := r.Delete(ctx, &pod); err != nil {
if !errors.IsNotFound(err) {
log.Error(err, "Failed to delete pod", "pod", pod.Name)
continue
}
}
restarted++
}
}
return restarted, nil
}
// getInterval returns the restart interval duration
func (r *PodRestarterReconciler) getInterval(podRestarter *resiliencev1.PodRestarter) time.Duration {
minutes := podRestarter.Spec.IntervalMinutes
if minutes == 0 {
minutes = 5
}
return time.Duration(minutes) * time.Minute
}
// setCondition sets a condition on the PodRestarter status
func (r *PodRestarterReconciler) setCondition(podRestarter *resiliencev1.PodRestarter, conditionType string, status metav1.ConditionStatus, reason, message string) {
condition := metav1.Condition{
Type: conditionType,
Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
ObservedGeneration: podRestarter.Generation,
}
found := false
for i, existingCondition := range podRestarter.Status.Conditions {
if existingCondition.Type == conditionType {
if existingCondition.Status != status {
podRestarter.Status.Conditions[i] = condition
} else {
podRestarter.Status.Conditions[i].Message = message
podRestarter.Status.Conditions[i].Reason = reason
podRestarter.Status.Conditions[i].LastTransitionTime = metav1.Now()
}
found = true
break
}
}
if !found {
podRestarter.Status.Conditions = append(podRestarter.Status.Conditions, condition)
}
}
// SetupWithManager sets up the controller with the Manager
func (r *PodRestarterReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&resiliencev1.PodRestarter{}).
Complete(r)
}
The controller contains several functions, of which Reconcile() is the most important. It is triggered whenever:
- A PodRestarter resource is created, updated, or deleted
- The configured interval has elapsed (the controller requeues itself with RequeueAfter)
- Any other resource the controller is set up to watch changes (the scaffolded SetupWithManager only watches PodRestarter objects; see the sketch after this list for how pod events could be watched as well)
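If the operator should also react immediately when matching pods change (for example, when a pod is deleted manually), SetupWithManager could be extended with an additional watch. The snippet below is a hypothetical sketch rather than part of the controller above: mapPodToRestarters is a helper you would have to write yourself, and it needs the handler, reconcile, and types packages as extra imports.

// Hypothetical extension: also watch Pods and map pod events back to the
// PodRestarter objects in the same namespace (not part of the scaffold above).
func (r *PodRestarterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&resiliencev1.PodRestarter{}).
		Watches(&corev1.Pod{}, handler.EnqueueRequestsFromMapFunc(r.mapPodToRestarters)).
		Complete(r)
}

// mapPodToRestarters enqueues a reconcile request for every PodRestarter in
// the namespace of the pod that triggered the event.
func (r *PodRestarterReconciler) mapPodToRestarters(ctx context.Context, obj client.Object) []reconcile.Request {
	restarters := &resiliencev1.PodRestarterList{}
	if err := r.List(ctx, restarters, client.InNamespace(obj.GetNamespace())); err != nil {
		return nil
	}
	requests := make([]reconcile.Request, 0, len(restarters.Items))
	for _, pr := range restarters.Items {
		requests = append(requests, reconcile.Request{
			NamespacedName: types.NamespacedName{Name: pr.Name, Namespace: pr.Namespace},
		})
	}
	return requests
}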
After the API and Controller are configured, the next step is to regenerate the manifests, test and build the operator locally:
Generate manifests
# Generate CRD and RBAC manifests
make manifests
# Generates:
# - config/crd/bases/resilience.demo.com_podrestarters.yaml
# - config/rbac/role.yaml (RBAC permissions)
# Generate DeepCopy code
make generate
# Generates:
# - zz_generated.deepcopy.go (DeepCopy methods)
make manifests reads the Kubebuilder markers
in the code and generates Kubernetes manifests from Go types.
Validate and test
# Format all Go code according to standard style
make fmt
# Static analysis to find common mistakes
make vet
# Run all tests
make test
Expected output:
ok github.com/AshwinSarimin/pod-restarter-operator/api/v1 0.234s
ok github.com/AshwinSarimin/pod-restarter-operator/internal/controller 2.456s
Build the operator locally
# Build
make build
# Compiles the operator binary
# Output: bin/manager
If the build is successful, you'll see:
go build -o bin/manager cmd/main.go
The binary is now in bin/manager, but it doesn't do anything useful yet.
To summarize, we have the following configurations:
| Concept | Info |
|---|---|
| Kubebuilder Project | The entire Go application |
| API (Custom Resource) | New resource type definition (PodRestarter struct in podrestarter_types.go) |
| Controller | The reconciliation logic (Reconcile() function in podrestarter_controller.go) |
| CRD | Kubernetes manifest describing the API |
KubeBuilder markers #
KubeBuilder markers are special comments in Go code that the Kubebuilder tool reads to generate Kubernetes manifests and configuration. They follow a specific format using Go comment syntax and act as annotations that provide instructions to the code generator.
For example in the internal/controller/podrestarter_controller.go file there are markers defined to generate RBAC rules for the Operator:
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=resilience.demo.com,resources=podrestarters/finalizers,verbs=update
// +kubebuilder:rbac:groups="",resources=pods,verbs=get;list;watch;delete
These markers specify the API group the controller needs permissions for (resilience.demo.com), define the resource types in that group (podrestarters and its subresources), and list the verbs the controller needs on each resource.
The markers generate ClusterRole rules in the controller's RBAC manifests, for example:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-editor-role
rules:
- apiGroups:
- resilience.demo.com
resources:
- podrestarters
verbs:
- create
- delete
- get
- list
- patch
- update
- watch
- apiGroups:
- resilience.demo.com
resources:
- podrestarters/status
verbs:
- get
Other frequently used markers include:
Resource Definition Markers:
// +kubebuilder:resource:scope=Namespaced,shortName=pr
Defines properties of the CRD, including scope (Namespaced or Cluster) and shortName for kubectl.
Subresource Markers:
// +kubebuilder:subresource:status
Enables the status subresource, allowing separate status updates.
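This is why the controller above writes status through the status client. As an illustrative snippet in the context of the reconciler (using the project's types): spec and status changes go through different calls, and with the subresource enabled a regular Update ignores changes made to .status.

// Spec/metadata changes go through the regular client ...
podRestarter.Spec.Suspend = true
if err := r.Update(ctx, podRestarter); err != nil {
	return ctrl.Result{}, err
}

// ... while status changes must go through the status subresource client.
podRestarter.Status.TotalRestarts++
if err := r.Status().Update(ctx, podRestarter); err != nil {
	return ctrl.Result{}, err
}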
Validation Markers:
// +kubebuilder:validation:Minimum=0
// +kubebuilder:validation:Maximum=100
Adds OpenAPI validation rules to the CRD.
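For example, the IntervalMinutes field defined earlier combines validation and default markers; the API server rejects values outside 1-1440 and fills in 5 when the field is omitted:

// IntervalMinutes is how often to restart pods (in minutes)
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=1440
// +kubebuilder:default=5
IntervalMinutes int32 `json:"intervalMinutes,omitempty"`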
Printcolumn Markers:
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
Defines additional columns for kubectl get command output.
Webhook Markers:
// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=fail
Configures webhook settings for validation or mutation.
These markers are converted to YAML when executing make manifests, which generates all the necessary Kubernetes resources for the operator to function with the correct permissions and configurations.
After implementing the controller logic, we need to test our operator in a real Kubernetes environment.
Test the Operator #
The operator can be tested locally in a KinD cluster:
- Create KinD cluster
# Create cluster
kind create cluster --name test-cluster
# Verify
kubectl cluster-info --context test-cluster
kubectl get nodes
- Install CRDs
# Install the PodRestarter CRD
make install
# Verify
kubectl get crd podrestarters.resilience.demo.com
kubectl describe crd podrestarters.resilience.demo.com
- Run operator locally
This optional step lets you run the operator locally and follow the PodRestarter operator's activity against the KinD cluster from the local terminal. What it does:
- Compiles and runs the operator on the local machine
- Connects to the Kind cluster via kubeconfig
- Watches for PodRestarter resources
- Logs all activity to the terminal
Open a terminal and run:
make run
Expected output:
2025-01-07T10:30:00Z INFO setup starting manager
2025-01-07T10:30:01Z INFO Starting EventSource {"controller": "podrestarter"}
2025-01-07T10:30:01Z INFO Starting Controller {"controller": "podrestarter"}
2025-01-07T10:30:01Z INFO Starting workers {"controller": "podrestarter", "worker count": 1}
Leave this terminal running as a log viewer.
- Deploy demo application
- Open a new terminal.
- Deploy a demo app with 3 replicas
kubectl create deployment demo-app --image=nginx:alpine --replicas=3
- Update the sample CR
Update config/samples/resilience_v1_podrestarter.yaml:
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-sample
spec:
selector:
matchLabels:
app: demo-app # Must match the demo application's pod label
intervalMinutes: 2
strategy: rolling
maxConcurrent: 1
suspend: false
- Create the PodRestarter CR
kubectl apply -f config/samples/resilience_v1_podrestarter.yaml
- Check status
In the operator logs terminal, you will see that 1 pod has been restarted by the operator:
2025-10-17T17:01:15+02:00 INFO Restarting pod (rolling strategy) {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "pod": "demo-app-56bb8cfcd7-mprqn", "index": 1, "total": 3}
2025-10-17T17:01:15+02:00 INFO Reached maxConcurrent limit {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "restarted": 1, "maxConcurrent": 1}
2025-10-17T17:01:15+02:00 INFO Successfully restarted pods {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "06f6ad6e-33db-43fb-a156-8bb4efb778fd", "count": 1, "strategy": "rolling"}
2025-10-17T17:01:15+02:00 DEBUG Checking restart interval {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "22db838a-ea40-482e-a990-36a791924d0f", "timeSinceLastRestart": "453.161346ms", "interval": "2m0s", "shouldRestart": false}
2025-10-17T17:01:15+02:00 DEBUG Not time to restart yet {"controller": "podrestarter", "controllerGroup": "resilience.demo.com", "controllerKind": "PodRestarter", "PodRestarter": {"name":"podrestarter-sample","namespace":"default"}, "namespace": "default", "name": "podrestarter-sample", "reconcileID": "22db838a-ea40-482e-a990-36a791924d0f", "nextRestartTime": "2025-10-17T17:03:15+02:00"}
In the cluster you will see that one pod is terminated and recreated by the Deployment.
NAME READY STATUS RESTARTS AGE
demo-app-56bb8cfcd7-c4zbc 1/1 Running 0 3m17s
demo-app-56bb8cfcd7-hkpmt 1/1 Running 0 9m31s
demo-app-56bb8cfcd7-lm4m5 1/1 Running 0 9m31s
demo-app-56bb8cfcd7-lm4m5 1/1 Terminating 0 10m
demo-app-56bb8cfcd7-whq5p 0/1 Pending 0 0s
demo-app-56bb8cfcd7-whq5p 0/1 Pending 0 0s
demo-app-56bb8cfcd7-whq5p 0/1 ContainerCreating 0 0s
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-lm4m5 0/1 Terminating 0 10m
demo-app-56bb8cfcd7-whq5p 1/1 Running 0 0s
The status of the PodRestarter CR shows information about matching pods, total restarts, and the last restart.
# Check status:
kubectl get podrestarter
# Check detailed status:
kubectl describe podrestarter podrestarter-sample
- Test different strategies
The operator also supports other restart strategies to try out:
All strategy (restart all at once):
kubectl apply -f - <<EOF
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
name: restarter-all
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: all
EOF
Random-one strategy (one random pod):
kubectl apply -f - <<EOF
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
name: restarter-random
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: random-one
EOF
- Test suspend feature
The operator includes a suspend feature, handled in the controller code. It can be enabled by redeploying the PodRestarter CR:
apiVersion: resilience.demo.com/v1
kind: PodRestarter
metadata:
labels:
app.kubernetes.io/name: pod-restarter-operator
app.kubernetes.io/managed-by: kustomize
name: podrestarter-sample
spec:
selector:
matchLabels:
app: demo-app
intervalMinutes: 2
strategy: rolling
maxConcurrent: 1
suspend: true # Change this to true
kubectl apply -f config/samples/resilience_v1_podrestarter.yaml
Or by patching the current PodRestarter CR:
kubectl patch podrestarter podrestarter-sample -p '{"spec":{"suspend":true}}' --type=merge
# Check status
kubectl get podrestarter podrestarter-sample
The pods should now stop restarting and the operator logs would show:
2025-10-17T17:35:00Z INFO PodRestarter is suspended, skipping reconciliation
Resuming restarts can be done by changing the suspend value to false:
kubectl patch podrestarter podrestarter-sample -p '{"spec":{"suspend":false}}' --type=merge
Run the Operator #
Now that the operator is ready, it can be used in other clusters. For this I use my homelab cluster to run the operator and an Azure Container Registry to store the controller image.
Kubebuilder comes with built-in commands for building and pushing images to a registry, and configuring the cluster.
- Build image
make docker-build IMG=<REGISTRY_NAME>/pod-restarter:v0.1.0
#For example
make docker-build IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
The command uses the Dockerfile in the Kubebuilder project, a multi-stage build that compiles the Go binary and then copies it into a minimal, distroless runtime image. The runtime stage contains only the binary, which keeps the image small, speeds up deployment, and minimizes the attack surface.
- Push to registry
To push an image to a private registry (like an Azure Container Registry), make sure you have the proper permissions.
az login --tenant xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx
az acr login --name teknologieur1acr
make docker-push IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
- Install CRDs on cluster
# Switch to the cluster context
kubectl config get-contexts
kubectl config use-context homelab-admin@homelab
# Install CRDs
make install
- Deploy Operator
# Deploy operator
make deploy IMG=teknologieur1acr.azurecr.io/pod-restarter:v0.1.0
# Verify
kubectl get pods -n pod-restarter-operator-system
kubectl logs -n pod-restarter-operator-system deployment/pod-restarter-operator-controller-manager -f
- Cleanup (optional)
make undeploy
make uninstall
KubeBuilder common commands #
# Initialize project
kubebuilder init --domain demo.com --repo github.com/you/pod-restarter-operator
# Create API
kubebuilder create api --group resilience --version v1 --kind PodRestarter
# Generate CRD manifests from Go structs
make manifests
# Install CRDs in cluster
make install
# Run operator locally (for development)
make run
# Build and push Docker image
make docker-build docker-push IMG=<REGISTRY_NAME>/podrestarter-operator:v1
# Deploy operator to cluster
make deploy IMG=<REGISTRY_NAME>/podrestarter:v1
# Uninstall CRDs
make uninstall
# Undeploy operator
make undeploy
Conclusion #
In this post, we’ve explored how to build a Kubernetes operator using KubeBuilder. We’ve seen how to scaffold a new project, define custom resources, implement reconciliation logic, and deploy the operator to a Kubernetes cluster.
The Pod Restarter operator we’ve built demonstrates several key concepts:
- Creating custom resources with validation
- Implementing a controller with reconciliation logic
- Working with Kubernetes resources programmatically
- Using status subresources to report on operator activities
- Testing and deploying operators
While this example is relatively simple, it shows the power of the operator pattern for extending Kubernetes with custom behavior. The same principles can be applied to build more complex operators that manage databases, handle application-specific scaling, or automate infrastructure provisioning.
Additional resources #
- KubeBuilder Book - The official documentation for KubeBuilder
- Operator Pattern - Kubernetes documentation on operators
- Controller Runtime - The library that powers KubeBuilder
- API Conventions