Automating AKS maintenance notifications with Bicep

Azure AKS Azure Resource Graph KQL

AKS Communication Manager enables proactive monitoring of AKS maintenance tasks through Azure Resource Notifications and Azure Resource Graph. Unlike the manual approach described in Microsoft’s documentation, this post shows how to use Bicep for automated, repeatable deployment of maintenance notifications. Instead of configuring notifications for each cluster by hand in the Azure Portal, you can version-control your monitoring infrastructure and deploy it consistently across multiple clusters.

Info

An AKS cluster uses Azure Resource Notifications to publish the following scheduled maintenance events:

  • K8sVersionUpgrade: Whenever a Kubernetes version upgrade is planned, started or finished.
  • NodeOSUpgrade: Whenever a Kubernetes node OS upgrade is planned, started or finished.

These notifications help you prepare for potential disruptions and tell you when scheduled maintenance has failed. A failed event includes the reason for the failure, which reduces manual troubleshooting. The available event status values are Scheduled, Started, Completed, Canceled, and Failed.

[Image: akscommunicationmanager]

The solution consists of two main components:

User-assigned managed identity: A single user-assigned managed identity with Reader access to the AKS cluster retrieves maintenance event data from Azure Resource Graph. This scales better than the system-assigned identity pattern in Microsoft’s documentation, because one identity can serve multiple alert rules, which reduces overhead when managing multiple notification scenarios.

Log search alert rules: The log search alert rules execute KQL queries against Azure Resource Graph’s containerserviceeventresources table and trigger notifications through action groups when maintenance events occur. Each rule monitors one upgrade type (Kubernetes version or Node OS) and tracks event status changes.

The containerserviceeventresources table in Azure Resource Graph is the data source for all AKS maintenance notifications. This table stores records for scheduled maintenance events.
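
The Reader access mentioned for the managed identity is not created by the templates in this post. As a minimal sketch, assuming the identity already exists and you know its principal ID, a role assignment on cluster scope could look like the following. The module itself, its parameter names, and the managedClusters API version are assumptions for illustration; the GUID is the built-in Reader role definition ID.

```bicep
// Hypothetical module: grants an identity the Reader role on one AKS cluster.
// Deploy it to the resource group that contains the cluster.

@description('Name of the existing AKS cluster.')
param clusterName string

@description('Principal (object) ID of the user-assigned managed identity.')
param principalId string

// Built-in Reader role definition
var readerRoleDefinitionId = subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7')

// API version is an assumption; use the one you already deploy the cluster with
resource cluster 'Microsoft.ContainerService/managedClusters@2024-02-01' existing = {
  name: clusterName
}

resource readerAssignment 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  // Deterministic name so repeated deployments update the same assignment
  name: guid(cluster.id, principalId, readerRoleDefinitionId)
  scope: cluster
  properties: {
    roleDefinitionId: readerRoleDefinitionId
    principalId: principalId
    principalType: 'ServicePrincipal'
  }
}
```

Scoping the assignment to the cluster rather than the resource group keeps the identity's access as narrow as the alert rules need.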

Event Lifecycle:

  1. Scheduled: Event created 7 days before maintenance (first notification)
  2. Scheduled: Updated 24 hours before maintenance (second notification)
  3. Started: Maintenance begins
  4. Completed/Failed: Final status with details

Prerequisites

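Before deploying this solution, ensure you have:

  • An AKS cluster with a maintenance window configured.
  • A user-assigned managed identity named <clusterName>-alertrules with the Reader role on the cluster.
  • Azure CLI with Bicep support.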

Configure solution

The solution consists of the following folder structure and files:

bicep/
├── deployments/
│   └── monitoringCluster/
│       └── main.bicep                           # Bicep deployment file
├── templates/
│   ├── actionGroup.bicep                        # Bicep template for action groups
│   └── alertRule.bicep                          # Bicep template for alert rules
└── kql/
    ├── alertrules-platform-cluster-upgrade.kql  # KQL query for Kubernetes version upgrades
    └── alertrules-platform-os-upgrade.kql       # KQL query for Node OS upgrades

KQL

Working with KQL in Bicep

My first idea was to define the KQL query in Bicep as a multiline string, but I ran into issues because expression interpolation inside multiline strings isn’t supported yet. Each query needs a dynamic cluster ID, so I worked around it by keeping the interpolated part as a separate string variable and concatenating the pieces into a single query string, like this:

var scheduledEventsBaseQuery = '''
arg("").containerserviceeventresources
| where type == "microsoft.containerservice/managedclusters/scheduledevents"'''

var scheduledEventsClusterIdQuery = '| where id contains "${resourceId('Microsoft.ContainerService/managedClusters', clusterName)}"'

var upgradeNotificationSelectQuery = '''
| where properties has "eventStatus"
| extend status = substring(properties, indexof(properties, "eventStatus") + strlen("eventStatus") + 3, 50)
| extend status = substring(status, 0, indexof(status, ",") - 1)
| where status != ""
| where properties has "eventDetails"
| extend upgradeType = case(
                            properties has "K8sVersionUpgrade",
                            "K8sVersionUpgrade",
                            properties has "NodeOSUpgrade",
                            "NodeOSUpgrade",
                            ""
                        )
| extend details = parse_json(tostring(properties.eventDetails))
| where properties has "lastUpdateTime"
| extend eventTime = substring(properties, indexof(properties, "lastUpdateTime") + strlen("lastUpdateTime") + 3, 50)
| extend eventTime = substring(eventTime, 0, indexof(eventTime, ",") - 1)
| extend eventTime = todatetime(tostring(eventTime))
| where eventTime >= ago(2h)
| where upgradeType == "K8sVersionUpgrade"
| project    
    eventTime,
    upgradeType,
    status,
    properties,
    name,
    details
| order by eventTime asc
'''

var upgradeNotificationQuery = '${scheduledEventsBaseQuery}\r\n${scheduledEventsClusterIdQuery}\r\n${upgradeNotificationSelectQuery}'

The query is assembled from four variables:

  • scheduledEventsBaseQuery: a multiline string containing the first part of the KQL query.
  • scheduledEventsClusterIdQuery: a single-line string that filters on the AKS cluster’s resource ID, which resourceId() resolves from clusterName.
  • upgradeNotificationSelectQuery: a multiline string that contains the rest of the query.
  • upgradeNotificationQuery: a string expression that interpolates the three variables into one final string, inserting "\r\n" between them to preserve newlines.

This approach works but complicates debugging the queries. That’s why I decided to store the KQL queries in separate files, which has the following benefits:

  • Run queries directly in Azure Resource Graph Explorer during development
  • Get syntax highlighting and autocompletion in VS Code (with the KQL extension)
  • Update queries without touching Bicep code

Each query uses a $(CLUSTER_ID) placeholder that gets replaced during deployment. This pattern keeps the queries reusable while allowing cluster-specific targeting.

KQL files

The KQL files contain the queries that find scheduled upgrade events for a specific AKS cluster from the last 2 hours: one file for Kubernetes version upgrades and one for Node OS upgrades.

kql/alertrules-platform-cluster-upgrade.kql:
arg("").containerserviceeventresources
| where type == "microsoft.containerservice/managedclusters/scheduledevents"
| where id contains "$(CLUSTER_ID)"
| where properties has "eventStatus"
| extend status = substring(properties, indexof(properties, "eventStatus") + strlen("eventStatus") + 3, 50)
| extend status = substring(status, 0, indexof(status, ",") - 1)
| where status != ""
| where properties has "eventDetails"
| extend upgradeType = case(
                            properties has "K8sVersionUpgrade",
                            "K8sVersionUpgrade",
                            properties has "NodeOSUpgrade",
                            "NodeOSUpgrade",
                            ""
                        )
| extend details = parse_json(tostring(properties.eventDetails))
| where properties has "lastUpdateTime"
| extend eventTime = substring(properties, indexof(properties, "lastUpdateTime") + strlen("lastUpdateTime") + 3, 50)
| extend eventTime = substring(eventTime, 0, indexof(eventTime, ",") - 1)
| extend eventTime = todatetime(tostring(eventTime))
| where eventTime >= ago(2h)
| where upgradeType == "K8sVersionUpgrade"
| project    
    eventTime,
    upgradeType,
    status,
    properties,
    name,
    details
| order by eventTime asc

kql/alertrules-platform-os-upgrade.kql:
arg("").containerserviceeventresources
| where type == "microsoft.containerservice/managedclusters/scheduledevents"
| where id contains "$(CLUSTER_ID)"
| where properties has "eventStatus"
| extend status = substring(properties, indexof(properties, "eventStatus") + strlen("eventStatus") + 3, 50)
| extend status = substring(status, 0, indexof(status, ",") - 1)
| where status != ""
| where properties has "eventDetails"
| extend upgradeType = case(
                            properties has "K8sVersionUpgrade",
                            "K8sVersionUpgrade",
                            properties has "NodeOSUpgrade",
                            "NodeOSUpgrade",
                            ""
                        )
| extend details = parse_json(tostring(properties.eventDetails))
| where properties has "lastUpdateTime"
| extend eventTime = substring(properties, indexof(properties, "lastUpdateTime") + strlen("lastUpdateTime") + 3, 50)
| extend eventTime = substring(eventTime, 0, indexof(eventTime, ",") - 1)
| extend eventTime = todatetime(tostring(eventTime))
| where eventTime >= ago(2h)
| where upgradeType == "NodeOSUpgrade"
| project    
    eventTime,
    upgradeType,
    status,
    properties,
    name,
    details
| order by eventTime asc

Bicep

Main deployment

The deployments/monitoringCluster/main.bicep file deploys AKS Communication Manager in the following steps:

  1. Loads KQL files from the kql/ directory
  2. Injects the target cluster’s resource ID into each query
  3. References the user-assigned managed identity
  4. Defines alert rule configurations as objects
  5. Deploys action groups and alert rules through modules

deployments/monitoringCluster/main.bicep:
// =========== //
// Parameters  //
// =========== //

param location string
param clusterName string
param monitoringResourceGroupName string
param actionGroups array
param managedIdentitiesResourceGroupName string

// =================== //
// Variables           //
// =================== //

var clusterUpgradeNotificationKql = loadTextContent('../../kql/alertrules-platform-cluster-upgrade.kql')
var osUpgradeNotificationKql = loadTextContent('../../kql/alertrules-platform-os-upgrade.kql')
var clusterUpgradeNotificationQuery = replace(clusterUpgradeNotificationKql, '$(CLUSTER_ID)', resourceId('Microsoft.ContainerService/managedClusters', clusterName))
var osUpgradeNotificationQuery = replace(osUpgradeNotificationKql, '$(CLUSTER_ID)', resourceId('Microsoft.ContainerService/managedClusters', clusterName))
var alertRulesManagedIdentityName = '${clusterName}-alertrules'

// =================== //
// Existing resources  //
// =================== //

resource alertRulesManagedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2024-11-30' existing = {
  name: alertRulesManagedIdentityName
  scope: resourceGroup(managedIdentitiesResourceGroupName)
}

var clusterUpgradeAlertRules = [
  {
    name: '${clusterName}-alertrules-platform-cluster-upgrade'
    displayName: '${clusterName}-alertrules-platform-cluster-upgrade'
    location: location
    severity: 3
    kind: 'LogAlert'
    identity: {
      type: 'UserAssigned'
      userAssignedIdentities: {
        '${alertRulesManagedIdentity.id}': {}
      }
    }
    evaluationFrequency: 'PT2H'
    scopes: [
      resourceId('Microsoft.ContainerService/managedClusters', clusterName)
    ]
    targetResourceTypes: [
      'Microsoft.ContainerService/managedClusters'
    ]
    windowSize: 'PT2H'
    criteria: {
      allOf: [
        {
          query: clusterUpgradeNotificationQuery
          timeAggregation: 'Count'
          dimensions: [
            {
              name: 'status'
              operator: 'Include'
              values: [
                '*'
              ]
            }
          ]
          operator: 'GreaterThan'
          threshold: json('0')
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
  }
  {
    name: '${clusterName}-alertrules-platform-os-upgrade'
    displayName: '${clusterName}-alertrules-platform-os-upgrade'
    location: location
    severity: 3
    kind: 'LogAlert'
    identity: {
      type: 'UserAssigned'
      userAssignedIdentities: {
        '${alertRulesManagedIdentity.id}': {}
      }
    }
    evaluationFrequency: 'PT2H'
    scopes: [
      resourceId('Microsoft.ContainerService/managedClusters', clusterName)
    ]
    targetResourceTypes: [
      'Microsoft.ContainerService/managedClusters'
    ]
    windowSize: 'PT2H'
    criteria: {
      allOf: [
        {
          query: osUpgradeNotificationQuery
          timeAggregation: 'Count'
          dimensions: [
            {
              name: 'status'
              operator: 'Include'
              values: [
                '*'
              ]
            }
          ]
          operator: 'GreaterThan'
          threshold: json('0')
          failingPeriods: {
            numberOfEvaluationPeriods: 1
            minFailingPeriodsToAlert: 1
          }
        }
      ]
    }
  }
]

// =================== //
// Action Group        //
// =================== //

module monitoringActionGroups '../../templates/actionGroup.bicep' = [for (actionGroup, index) in actionGroups: {
  name: 'monitoringCluster-${clusterName}-actionGroup-${index}'
  scope: resourceGroup(monitoringResourceGroupName)
  params: {
    name: actionGroup.name
    // Action group short names are limited to 12 characters
    groupShortName: take(actionGroup.name, 12)
    emailAddress: (contains(actionGroup, 'emailAddress')) ? actionGroup.emailAddress : ''
    webhookUrl: (contains(actionGroup, 'teamsWebhookUrl')) ? actionGroup.teamsWebhookUrl : ''
    exists: (contains(actionGroup, 'exists')) ? actionGroup.exists : false
  }
}]

// =================== //
// Alert Rules         //
// =================== //

module alertRule '../../templates/alertRule.bicep' = [for (alertRule, index) in clusterUpgradeAlertRules: {
  name: 'monitoringCluster-${clusterName}-alertRule-${index}'
  scope: resourceGroup(monitoringResourceGroupName)
  params: {
    name: alertRule.name
    displayName: alertRule.displayName
    severity: alertRule.severity ?? 3
    location: alertRule.location
    kind: alertRule.kind
    identity: alertRule.identity
    evaluationFrequency: alertRule.evaluationFrequency
    scopes: alertRule.scopes
    targetResourceTypes: alertRule.targetResourceTypes
    windowSize: alertRule.windowSize
    criteria: alertRule.criteria
    actions: {
      actionGroups: [for i in range(0, length(actionGroups)): monitoringActionGroups[i].outputs.id]
    }
  }
}]
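
The two alert rule objects in main.bicep differ only in their name suffix and query. As a hedged alternative, not the structure used in this post, both objects could be generated from a short list with a variable for-expression. The names upgradeRules and clusterId are hypothetical; everything else mirrors the values in main.bicep, and the file is assumed to sit in the same folder so the relative KQL paths resolve.

```bicep
// Sketch only: builds both alert rule objects from a short list.
// 'upgradeRules' and 'clusterId' are hypothetical names; the rest mirrors main.bicep.

param location string
param clusterName string
param managedIdentitiesResourceGroupName string

var clusterId = resourceId('Microsoft.ContainerService/managedClusters', clusterName)

resource alertRulesManagedIdentity 'Microsoft.ManagedIdentity/userAssignedIdentities@2024-11-30' existing = {
  name: '${clusterName}-alertrules'
  scope: resourceGroup(managedIdentitiesResourceGroupName)
}

// One entry per upgrade type: only the name suffix and the query differ
var upgradeRules = [
  {
    suffix: 'cluster-upgrade'
    query: replace(loadTextContent('../../kql/alertrules-platform-cluster-upgrade.kql'), '$(CLUSTER_ID)', clusterId)
  }
  {
    suffix: 'os-upgrade'
    query: replace(loadTextContent('../../kql/alertrules-platform-os-upgrade.kql'), '$(CLUSTER_ID)', clusterId)
  }
]

var clusterUpgradeAlertRules = [for rule in upgradeRules: {
  name: '${clusterName}-alertrules-platform-${rule.suffix}'
  displayName: '${clusterName}-alertrules-platform-${rule.suffix}'
  location: location
  severity: 3
  kind: 'LogAlert'
  identity: {
    type: 'UserAssigned'
    userAssignedIdentities: {
      '${alertRulesManagedIdentity.id}': {}
    }
  }
  evaluationFrequency: 'PT2H'
  scopes: [
    clusterId
  ]
  targetResourceTypes: [
    'Microsoft.ContainerService/managedClusters'
  ]
  windowSize: 'PT2H'
  criteria: {
    allOf: [
      {
        query: rule.query
        timeAggregation: 'Count'
        dimensions: [
          {
            name: 'status'
            operator: 'Include'
            values: [
              '*'
            ]
          }
        ]
        operator: 'GreaterThan'
        threshold: json('0')
        failingPeriods: {
          numberOfEvaluationPeriods: 1
          minFailingPeriodsToAlert: 1
        }
      }
    ]
  }
}]

// Exposed only so the sketch can be checked in isolation
output alertRuleNames array = [for rule in clusterUpgradeAlertRules: rule.name]
```

The trade-off is readability: the explicit objects are longer but easier to tweak per rule, while the loop keeps shared settings such as the evaluation frequency in one place.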

Action Group module

The templates/actionGroup.bicep file deploys (or references) an Azure Monitor action group that alert rules use to notify via email, SMS, and webhook (Teams). Set exists to true to skip creation when the action group already exists.

templates/actionGroup.bicep:
// Parameters
@description('Name of the Action Group resource.')
param name string

@description('Short name used in SMS messages.')
param groupShortName string

@description('Email receiver address.')
param emailAddress string

@description('Webhook url for Teams integration.')
param webhookUrl string

@description('Set to true if the Action Group already exists; template will not create it.')
param exists bool

@description('Use Common Alert Schema for supported receivers.')
param useCommonAlertSchema bool = true

@description('Country code for SMS receiver.')
param countryCode string = '31'

@description('Phone number for SMS receiver.')
param phoneNumber string = ''

@description('Resource tags.')
param tags object = {}

var location = 'Global'

// Resources
resource actionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = if(!exists) {
  name: name
  location: location
  tags: tags
  properties: {
    groupShortName: groupShortName
    enabled: true
    emailReceivers: !empty(emailAddress) ? [
      {
        name: 'EmailAndTextMessageOthers_-EmailAction-'
        emailAddress: emailAddress
        useCommonAlertSchema: useCommonAlertSchema
      }
    ] : []
    smsReceivers: !empty(countryCode) && !empty(phoneNumber) ? [
      {
        name: 'EmailAndTextMessageOthers_-SMSAction-'
        countryCode: countryCode
        phoneNumber: phoneNumber
      }
    ] : []
    webhookReceivers: !empty(webhookUrl) ? [
      {
        name: groupShortName
        serviceUri: webhookUrl
        useCommonAlertSchema: useCommonAlertSchema
        useAadAuth: false
      }
    ] : []
  }
}

resource actionGroupExisting 'Microsoft.Insights/actionGroups@2023-01-01' existing = if (exists) {
  name: name
}

output id string = exists ? actionGroupExisting.id : actionGroup.id
output name string = exists ? actionGroupExisting.name : actionGroup.name

Alert rule module

The templates/alertRule.bicep file deploys a Log Alert (scheduled query rule) in Azure Monitor with configurable identity, query criteria, scopes and actions.

templates/alertRule.bicep:
param name string
param displayName string
param location string = 'westeurope'
param severity int
param kind string
param identity object
param evaluationFrequency string
param scopes array
param targetResourceTypes array
param windowSize string
param criteria object
param actions object

resource scheduledQueryRule 'microsoft.insights/scheduledqueryrules@2025-01-01-preview' = if (kind == 'LogAlert') {
  name: name
  location: location
  kind: 'LogAlert'
  identity: identity
  properties: {
    displayName: displayName
    severity: severity
    enabled: true
    evaluationFrequency: evaluationFrequency
    scopes: scopes
    targetResourceTypes: targetResourceTypes
    windowSize: windowSize
    criteria: criteria
    autoMitigate: false
    actions: actions
  }
}

With all files in place, it is time to deploy with Bicep.

Deployment

Deploy the Bicep template using Azure CLI. The example uses PowerShell, where the backtick continues a line:

az deployment group create `
  --name main-monitoringCluster-teknologi-eur1-prd-aks01 `
  --resource-group teknologi-eur1-prd-aks-rg `
  --template-file "bicep/deployments/monitoringCluster/main.bicep" `
  --verbose `
  --parameters `
    location="westeurope" `
    clusterName="teknologi-eur1-prd-aks01" `
    monitoringResourceGroupName="teknologi-eur1-prd-monitor-rg" `
    actionGroups='[{"name":"TeknologiActionGroup","emailAddress":"ashwin.sarimin@teknologi.nl"}]' `
    managedIdentitiesResourceGroupName="teknologi-eur1-prd-d-mi-rg"

The template creates a new action group, TeknologiActionGroup, and then provisions two log search alert rules scoped to the AKS cluster that evaluate every 2 hours over a 2-hour window. The existing managed identity teknologi-eur1-prd-aks01-alertrules from the teknologi-eur1-prd-d-mi-rg resource group is attached to the alert rules so they can read the AKS cluster teknologi-eur1-prd-aks01.
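
Passing a JSON array inline is sensitive to shell quoting. As a sketch of an alternative, the same values could live in a Bicep parameters file next to main.bicep. The file name main.bicepparam is an assumption; the values are the ones from the command above.

```bicep
// Hypothetical file: bicep/deployments/monitoringCluster/main.bicepparam
using './main.bicep'

param location = 'westeurope'
param clusterName = 'teknologi-eur1-prd-aks01'
param monitoringResourceGroupName = 'teknologi-eur1-prd-monitor-rg'
param managedIdentitiesResourceGroupName = 'teknologi-eur1-prd-d-mi-rg'

// Same single action group as in the inline JSON
param actionGroups = [
  {
    name: 'TeknologiActionGroup'
    emailAddress: 'ashwin.sarimin@teknologi.nl'
  }
]
```

With a recent Azure CLI, the file can be passed through --parameters in place of the inline key=value pairs, which also puts the parameter values under version control.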

To verify the alert, wait for the next automatic upgrade to start. If everything is set up correctly, an email is sent to the receivers in the action group.

If you don’t receive notifications, check that:

  • The AKS cluster has a maintenance window configured.
  • The managed identity has the correct permissions (Reader role assignment on cluster scope).
  • The KQL queries are returning results (this can be tested in the Resource Graph Explorer).

Conclusion

AKS Communication Manager addresses a critical operational blind spot: maintenance visibility. Without it, cluster upgrades and node updates happen with minimal warning and limited failure diagnostics. By combining Azure Resource Notifications, Azure Resource Graph, and automated alert rules, you gain visibility into maintenance events at every stage with full context about what happened and why.

The Bicep approach transforms manual portal configuration into repeatable, version-controlled deployments across multiple AKS clusters.