
Service Router Operator for multi-region DNS management on AKS


Introduction

In this post, I walk through the Service Router Operator, a Kubernetes operator I built to automate multi-region DNS provisioning on AKS using Istio and ExternalDNS.

The code repository for the operator can be found here: service-router-operator

Multi-region DNS on AKS

The AKS platform can operate across multiple Azure regions (in this case West Europe and North Europe) to provide high availability, disaster recovery capabilities, and geographic proximity to users. In this multi-region architecture, a critical requirement is the ability to route traffic seamlessly between regions. When an application or service becomes unavailable in one region, whether due to maintenance, failure, or regional issues, traffic must be automatically or manually redirected to a healthy instance in another region.

Managing DNS across multiple regions brings significant operational complexity: application teams must route traffic to the correct regional cluster, maintain DNS records across multiple Azure Private DNS zones, and handle both regional isolation and cross-region failover scenarios. Done by hand, this quickly becomes complex and error-prone.

Service Router Operator

To address these challenges, the Service Router Operator provides automated DNS management for multi-region deployments:

  • Automates DNS Record Creation: Automatically generates DNS records based on Custom Resources.
  • Enables Regional Control: Supports both Active (multi-region) and RegionBound (single-region) operational modes.
  • Prevents Conflicts: Uses label-based filtering to ensure each region’s ExternalDNS instance only manages its designated DNS records.
  • Simplifies Operations: Application teams declare their services and desired routing behavior using Helm charts.

Our AKS platform spans two Azure regions (North Europe and West Europe), each with its own private DNS zone. When a workload team wants to expose a service, they need DNS records in both zones. The record in the West Europe zone should point to the West Europe cluster’s Istio ingress gateway, and the one in the North Europe zone should point to the North Europe gateway. Straightforward enough, until you account for:

  • Active-Active services: the service runs in both regions, each cluster manages its own DNS
  • RegionBound services: the service only runs in one region, but DNS records still need to exist in all zones pointing to the active cluster
  • Failover scenarios: if a cluster goes down, another cluster needs to take over DNS management for the failed region.

Why an Operator?

If you’ve read my previous post on KubeBuilder, you’ll know I’m a fan of the operator pattern for exactly this kind of problem. An operator runs a continuous reconciliation loop, watching the desired state defined in custom resources and continuously working to make reality match it. For DNS management, this is ideal:

  • If a DNS record is manually deleted, the operator recreates it within seconds.
  • If a Gateway’s LoadBalancer IP changes, the operator updates all CNAME records automatically.
  • Application teams define simple, declarative resources without needing to understand the DNS internals.
  • Platform teams retain control over cluster-wide infrastructure through their own set of resources.

How It Works

The Service Router automates DNS provisioning and traffic routing for services deployed across multiple AKS clusters and regions.

Components

Service Router Operator

The operator manages the lifecycle of DNS records by continuously reconciling Custom Resource Definitions (CRDs) to ensure the desired state matches the actual state.

ExternalDNS

ExternalDNS is an AKS platform component that automatically synchronizes Kubernetes networking resources (Services, Ingresses, and DNSEndpoint CRs) with DNS providers like Azure Private DNS zones. It monitors DNSEndpoint custom resources created by the Service Router Operator and provisions the corresponding DNS records.

Each AKS cluster runs multiple ExternalDNS instances, one for each regional DNS zone. This enables seamless failover when a cluster becomes unavailable, as the healthy cluster’s ExternalDNS instance can take over DNS management for the failed region.

Label-Based Filtering

The Service Router Operator uses labels to ensure region-specific ExternalDNS instances only manage their designated DNS records. When a DNSEndpoint is created with the label app: external-dns-northeurope, only the North Europe ExternalDNS instance processes it and creates records in the North Europe Private DNS zone. This ensures:

  • Prevention of DNS conflicts between regions
  • Independent region management
  • Clear ownership and responsibility
  • Support for both Active and RegionBound operational modes

Important: The Service Router Operator does not directly create DNS records. It creates DNSEndpoint Custom Resources that ExternalDNS watches and uses to provision DNS records in Azure Private DNS.
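
For illustration, a DNSEndpoint emitted by the operator could look something like the sketch below. The apiVersion, kind, and spec.endpoints shape (dnsName, recordType, targets) come from the ExternalDNS DNSEndpoint CRD; the metadata names, namespace, and label value are hypothetical examples, not taken from the operator's code:

```shell
# Hypothetical DNSEndpoint sketch: a CNAME for a service, labeled so that
# only the westeurope ExternalDNS instance picks it up.
cat <<'EOF' | kubectl apply -f -
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: myapp-api-westeurope
  namespace: ns-service-router
  labels:
    router.io/region: westeurope
spec:
  endpoints:
  - dnsName: api-ns-p-prod-myapp.example.com
    recordType: CNAME
    targets:
    - aks-westeurope-internal.example.com
EOF
```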

By combining the Service Router Operator with ExternalDNS and Azure Private DNS, the platform achieves fully automated, conflict-free DNS management that enables seamless traffic routing between regions.

Custom Resources

The Service Router Operator defines five Custom Resource Definitions across two API groups. Platform teams manage the cluster-wide infrastructure resources, and application teams manage their own namespace-scoped resources.

cluster.router.io/v1alpha1   → ClusterIdentity, DNSConfiguration
routing.router.io/v1alpha1   → Gateway, DNSPolicy, ServiceRoute

ClusterIdentity (cluster-scoped, platform team): Defines the cluster’s metadata: region, cluster name, base domain, and environment letter. This is used to construct DNS names and to determine which ExternalDNS controllers are relevant for this cluster.

DNSConfiguration (cluster-scoped, platform team): Lists all ExternalDNS controller instances available across the platform, mapping each controller name to its region. This is the single source of truth for which ExternalDNS controllers exist.

Gateway (namespace-scoped, platform team): Wraps an Istio ingress gateway with DNS target information. The operator creates the Istio Gateway resource and keeps its hosts list synchronized with all ServiceRoutes referencing it.

DNSPolicy (namespace-scoped, workload team): Defines how DNS records should be propagated for services in that namespace — either in Active mode (this cluster manages its own region only) or RegionBound mode (one cluster manages DNS for multiple regions).

ServiceRoute (namespace-scoped, workload team): Links a Kubernetes service to a Gateway and triggers DNS record creation. When you create a ServiceRoute, the operator constructs the DNS name and creates the appropriate DNSEndpoint resources for ExternalDNS to pick up.
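
As a sketch of what a workload team's declaration might look like: the serviceName, environment, and application fields mirror the DNS naming inputs described in this post, but gatewayRef and the exact spec layout are assumptions, not the operator's confirmed schema:

```shell
# Hypothetical ServiceRoute sketch. gatewayRef (assumed field name) points
# at a platform-managed Gateway resource in the same cluster.
cat <<'EOF' | kubectl apply -f -
apiVersion: routing.router.io/v1alpha1
kind: ServiceRoute
metadata:
  name: api-route
  namespace: myapp
spec:
  serviceName: api
  environment: prod
  application: myapp
  gatewayRef: app-gateway-ingress
EOF
```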

DNS Provisioning Flow

The operator does not create DNS records directly. Instead, it creates DNSEndpoint custom resources (from the ExternalDNS CRD API), which ExternalDNS watches and uses to provision records in Azure Private DNS.

Workload team creates ServiceRoute
Service Router Operator reconciles
  ├── Reads ClusterIdentity (region, domain, cluster)
  ├── Reads DNSPolicy (mode, active controllers)
  ├── Reads Gateway (target postfix, Istio controller)
  └── Creates DNSEndpoint CRDs (one per active ExternalDNS controller)
ExternalDNS controller watches DNSEndpoints (filtered by label)
ExternalDNS provisions CNAME record in Azure Private DNS:
  api-ns-p-prod-myapp.example.com → aks-westeurope-internal.example.com
A separate IngressDNS controller (part of the operator) watches
the Gateway's LoadBalancer Service and creates an A record:
  aks-westeurope-internal.example.com → 10.123.45.67

The two-step CNAME + A record design means that if the gateway IP ever changes (for example, after a cluster recreation), only the A record needs to update — all service CNAME records automatically follow without any changes.

DNS names are constructed deterministically from ServiceRoute fields:

{serviceName}-ns-{envLetter}-{environment}-{application}.{domain}

For example, a ServiceRoute with serviceName: api, environment: prod, application: myapp on a cluster with environmentLetter: p and domain: example.com produces api-ns-p-prod-myapp.example.com.
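
The naming rule can be sketched as a small shell function — a toy illustration of the construction, not the operator's actual (Go) code:

```shell
#!/bin/sh
# Compose the deterministic DNS name from ServiceRoute fields:
# {serviceName}-ns-{envLetter}-{environment}-{application}.{domain}
build_dns_name() {
  printf '%s-ns-%s-%s-%s.%s\n' "$1" "$2" "$3" "$4" "$5"
}

build_dns_name api p prod myapp example.com
# → api-ns-p-prod-myapp.example.com
```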

Operational Modes

The DNSPolicy controls how DNS records are spread across regions.

Active Mode — use this when your service runs in multiple regions. Each cluster manages DNS only for its own region. A client in West Europe queries the West Europe DNS zone and routes to the West Europe cluster; a client in North Europe does the same for its cluster.

RegionBound Mode — use this when your service only runs in one region but you still want DNS records in all regional zones. You specify a sourceRegion, and only the cluster in that region becomes active. The active cluster creates DNSEndpoint resources for all ExternalDNS controllers (West Europe and North Europe), so both zones get records pointing to the single active cluster. Clusters in other regions are automatically inactive and do not create any DNS records.
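
A DNSPolicy for this mode might look like the following sketch. The mode and sourceRegion concepts come from the description above; the exact field names in the CRD are assumptions:

```shell
# Hypothetical DNSPolicy sketch: pin DNS management for this namespace to
# the westeurope cluster, which then creates records for all regional zones.
cat <<'EOF' | kubectl apply -f -
apiVersion: routing.router.io/v1alpha1
kind: DNSPolicy
metadata:
  name: dns-policy
  namespace: myapp
spec:
  mode: RegionBound
  sourceRegion: westeurope
EOF
```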

                  Active Mode                           RegionBound Mode
DNS Scope         Each cluster manages its own region   One cluster manages all regions
Traffic Pattern   Regional routing (low latency)        Centralized routing (cross-region)
Best For          High availability, data residency     Single-region services, cost optimization

Label-based conflict prevention

To prevent ExternalDNS instances from interfering with each other, the operator sets a router.io/region label on every DNSEndpoint. Each ExternalDNS deployment is configured with --label-filter=router.io/region=westeurope (or northeurope), so it only processes records intended for it. This ensures the West Europe ExternalDNS only writes to the West Europe DNS zone, and the North Europe ExternalDNS only writes to the North Europe DNS zone — no conflicts, even in complex multi-cluster failover scenarios.
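
As an illustration, a per-region ExternalDNS instance could be started with flags along these lines. The flags themselves are real ExternalDNS options; the label value follows the convention described above, and the owner ID is an assumption:

```shell
# Sketch: args for the westeurope ExternalDNS instance. --source=crd makes
# it watch DNSEndpoint resources, --label-filter restricts it to records
# labeled for this region, and --txt-owner-id scopes its ownership records.
external-dns \
  --source=crd \
  --provider=azure-private-dns \
  --azure-resource-group="$RESOURCE_GROUP" \
  --label-filter=router.io/region=westeurope \
  --txt-owner-id=external-dns-westeurope
```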

Implementation

The following guide walks through setting up a test environment on AKS with the Istio addon and ExternalDNS, installing the operator, and deploying a service with automated DNS. Flux GitOps handles all cluster and workload configuration.

[Diagram: AKS implementation overview]

Two Flux kustomizations set up the test environment:

Name        Path                   Purpose
clusters    gitops/clusters/base   Installs ExternalDNS, the Service Router and the platform-specific CRDs
workloads   gitops/workloads       Configures the workload-specific CRDs

Prerequisites

  • Azure CLI (az) with an active subscription
    • The following az extensions:
      • az extension add -n k8s-configuration
      • az extension add -n k8s-extension

Create Flux managed identity

# Create managed identity for Flux
# (assumes $RESOURCE_GROUP is already set; see the next section)
az identity create \
  --resource-group $RESOURCE_GROUP \
  --name "id-flux"

FLUX_CLIENT_ID=$(az identity show \
  --resource-group $RESOURCE_GROUP \
  --name "id-flux" \
  --query clientId -o tsv)

FLUX_TENANT_ID=$(az identity show \
  --resource-group $RESOURCE_GROUP \
  --name "id-flux" \
  --query tenantId -o tsv)

Create an AKS Cluster

RESOURCE_GROUP="service-router-test-rg"
CLUSTER_NAME="aks-test"
LOCATION="westeurope"

az group create --name $RESOURCE_GROUP --location $LOCATION

az aks create \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --location $LOCATION \
  --node-count 2 \
  --node-vm-size Standard_D2s_v3 \
  --enable-asm \
  --enable-workload-identity \
  --enable-oidc-issuer \
  --generate-ssh-keys

# Enable the Istio Ingress gateway
az aks mesh enable-ingress-gateway \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --ingress-gateway-type internal

# Install the Flux extension
az k8s-extension create \
  --resource-group $RESOURCE_GROUP \
  --cluster-name $CLUSTER_NAME \
  --cluster-type managedClusters \
  --name flux \
  --extension-type microsoft.flux \
  --config workloadIdentity.enable=true workloadIdentity.azureClientId=$FLUX_CLIENT_ID workloadIdentity.azureTenantId=$FLUX_TENANT_ID

Verify the Istio ingress gateway is running and has an IP address:

az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME

kubectl get svc -n aks-istio-ingress
# NAME                                TYPE           CLUSTER-IP     EXTERNAL-IP    PORT(S)
# aks-istio-ingressgateway-internal   LoadBalancer   10.0.XXX.XXX   10.X.X.X      ...

Create ACR

ACR_NAME="serviceroutertestacr"

# Create the registry
az acr create \
  --resource-group $RESOURCE_GROUP \
  --name $ACR_NAME \
  --sku Basic

# Get the ACR ID
ACR_ID=$(az acr show \
  --resource-group $RESOURCE_GROUP \
  --name $ACR_NAME \
  --query id -o tsv)

# Attach the ACR to the AKS cluster
az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --attach-acr $ACR_NAME

Configure Flux identity

# Get the AKS OIDC issuer
OIDC_ISSUER=$(az aks show \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --query oidcIssuerProfile.issuerUrl -o tsv)

# Create federated credential for source-controller
az identity federated-credential create \
  --name id-flux-source \
  --identity-name id-flux \
  --resource-group $RESOURCE_GROUP \
  --issuer $OIDC_ISSUER \
  --subject "system:serviceaccount:flux-system:source-controller" \
  --audience api://AzureADTokenExchange

# Create federated credential for kustomize-controller
az identity federated-credential create \
  --name id-flux-kustomize \
  --identity-name id-flux \
  --resource-group $RESOURCE_GROUP \
  --issuer $OIDC_ISSUER \
  --subject "system:serviceaccount:flux-system:kustomize-controller" \
  --audience api://AzureADTokenExchange

# Grant ACR Pull permissions to Flux identity
az role assignment create \
  --assignee $FLUX_CLIENT_ID \
  --role "AcrPull" \
  --scope $ACR_ID

Create an Azure Private DNS Zone

DNS_ZONE="test-aks.nl"

az network private-dns zone create \
  --resource-group $RESOURCE_GROUP \
  --name $DNS_ZONE

# Link the zone to the AKS VNet
NODE_RESOURCE_GROUP=$(az aks show \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --query "nodeResourceGroup" -o tsv)

VNET_ID=$(az network vnet list \
  --resource-group $NODE_RESOURCE_GROUP \
  --query "[].id" -o tsv)

az network private-dns link vnet create \
  --resource-group $RESOURCE_GROUP \
  --zone-name $DNS_ZONE \
  --name aks-vnet-link \
  --virtual-network $VNET_ID \
  --registration-enabled false

Configure Workload Identity for ExternalDNS

# Get the AKS OIDC issuer
OIDC_ISSUER=$(az aks show \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER_NAME \
  --query oidcIssuerProfile.issuerUrl -o tsv)

SUBSCRIPTION_ID=$(az account show --query id -o tsv)
DNS_ZONE_ID="/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Network/privateDnsZones/$DNS_ZONE"

# Create managed identity for ExternalDNS
az identity create \
  --resource-group $RESOURCE_GROUP \
  --name "id-external-dns-westeurope"

CLIENT_ID=$(az identity show \
  --resource-group $RESOURCE_GROUP \
  --name "id-external-dns-westeurope" \
  --query clientId -o tsv)

# Grant DNS Zone Contributor on the private DNS zone
az role assignment create \
  --assignee $CLIENT_ID \
  --role "Private DNS Zone Contributor" \
  --scope $DNS_ZONE_ID

# Create federated credential for Workload Identity
az identity federated-credential create \
  --name external-dns-westeurope \
  --identity-name id-external-dns-westeurope \
  --resource-group $RESOURCE_GROUP \
  --issuer $OIDC_ISSUER \
  --subject "system:serviceaccount:ns-external-dns:external-dns-westeurope" \
  --audience api://AzureADTokenExchange

Build and push the operator helm chart

The Helm chart for the operator can be found here: service-router-operator

# Set chart version (update as needed)
CHART_VERSION="0.2.0"

# Package the Helm chart from the charts directory
make helm-package

# Push the chart to ACR
helm push dist/service-router-operator-${CHART_VERSION}.tgz oci://${ACR_NAME}.azurecr.io/helm

# Verify the chart was pushed
az acr repository show \
  --name $ACR_NAME \
  --repository helm/service-router-operator

# Clean up the local package
rm dist/service-router-operator-${CHART_VERSION}.tgz

Build and push the operator image

az acr build \
  --registry $ACR_NAME \
  --image service-router-operator:latest \
  --file Dockerfile .

# Verify the image is available in the registry:
az acr repository show-tags \
  --name $ACR_NAME \
  --repository service-router-operator \
  --output table

Configure cluster with Flux

The GitOps folders for the cluster can be found here: gitops

az k8s-configuration flux create \
  --cluster-name $CLUSTER_NAME \
  --resource-group $RESOURCE_GROUP \
  --cluster-type managedClusters \
  --name cluster-config \
  --namespace flux-system \
  --scope cluster \
  --url https://github.com/AshwinSarimin/service-router-operator \
  --branch main \
  --kustomization name=clusters path=./gitops/clusters/base prune=true \
  --kustomization name=workloads path=./gitops/workloads prune=true

Verify

# Check ClusterIdentity (cluster-scoped)
kubectl get clusteridentity clusteridentity -o yaml | grep -A 10 status

# Check DNSConfiguration (cluster-scoped)
kubectl get dnsconfiguration dnsconfiguration -o yaml | grep -A 10 status

# Check Gateway
kubectl get gateway.routing.router.io -n ns-service-router app-gateway-ingress -o yaml | grep -A 10 status

# Check ServiceRoute
kubectl get serviceroute -n ns-service-router test-workload-route -o yaml | grep -A 10 status

# Check that the operator created DNSEndpoint resources
kubectl get dnsendpoints -n ns-service-router

# Check what the DNSEndpoint contains
kubectl get dnsendpoint -n ns-service-router -o yaml

# Check ExternalDNS picked it up
kubectl logs -n ns-external-dns -l app.kubernetes.io/name=external-dns --tail=20

Within about a minute, you should see the CNAME record in Azure Private DNS pointing to the Istio ingress gateway hostname, and the A record for the gateway hostname pointing to the LoadBalancer IP. Traffic from within the VNet to test-workload-ns-service-router-workload-a.test-aks.nl will now route through Istio to the test service.
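
To confirm resolution end to end, you can run a quick lookup from inside the cluster. The throwaway busybox pod and image tag are arbitrary choices, not part of the setup above:

```shell
# One-off DNS check from inside the VNet; the pod is deleted on exit.
kubectl run dns-check --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup test-workload-ns-service-router-workload-a.test-aks.nl
```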

Copilot

Building a production-grade Kubernetes operator requires expertise across multiple domains: Go, Kubernetes, DNS, Istio and more. By creating a persistent knowledge base that GitHub Copilot can reference throughout development, I transformed it into an expert pair programmer. This AI-assisted approach, combined with custom instructions and specialized agents, significantly accelerated the development process.

GitHub Copilot’s custom instructions feature makes this possible, and I took it further by creating three specialized AI agents, each serving as an expert consultant with distinct expertise:

  • Kubernetes Operator Agent: My primary development agent, used during active coding and code reviews. Lives and breathes Kubernetes operators.
  • Devil’s Advocate Agent: Consulted before major design decisions to stress-test ideas by challenging assumptions, focusing deeply on each concern, and summarizing both strong defenses and remaining vulnerabilities.
  • Technical Writer Agent: Engaged when documenting features or writing tutorials. Provides step-by-step guidance and adapts to different writing styles (Docs, Tutorials, Architecture).

The instructions are the knowledge base for domain-specific practices. These markdown files contain detailed guidelines that Copilot references automatically when working with specific file types:

  • go.instructions.md: Ensures idiomatic error handling, proper context usage, and consistent code structure
  • go-operator.instructions.md: Defines controller standards and reconciliation patterns
  • helm.instructions.md: Enforces security best practices and comprehensive health checks
  • code-review.instructions.md: Provides a checklist that catches issues before they reach production

Don’t create agents until you have solid instruction files: instructions are the knowledge base, agents are the interface. The key is to start with one good instruction file and build from there. Even a single go.instructions.md will dramatically improve AI code generation quality.

Order of development:

  1. Create basic instructions (go.instructions.md)
  2. Add domain-specific instructions (go-operator.instructions.md)
  3. Build specialized agents for specific workflows
  4. Iterate based on actual usage patterns

Conclusion

This solution modernized our DNS networking layer for the AKS Fleet, eliminating the operational burden of manual DNS management across regions. What once required careful coordination between platform and application teams is now a simple declarative resource.

With AI as a force multiplier, the project was completed faster than expected, though every line of code was reviewed and thoroughly tested before committing. The Devil’s Advocate agent proved particularly valuable for validating design decisions before implementation.

If you’re running AKS across multiple regions and DNS management is still a manual or scripted process, I hope this gives you some ideas.