How to Set Up Private Managed Grafana and Prometheus for AKS with Terraform
Apr 23, 2025

In most Azure Kubernetes Service (AKS) monitoring setups, resources like Prometheus and Grafana are configured by default with public endpoints. While this might be acceptable for development environments, production-grade clusters demand privacy and isolation, especially for organizations concerned about compliance, data sovereignty, and internal-only visibility.
This guide demonstrates how to set up a fully private monitoring solution using:
- Azure Monitor Managed Prometheus
- Managed Grafana
- Private AKS cluster
- Point-to-Site VPN with Private DNS
All communication is routed via private endpoints, and the solution uses Terraform for infrastructure provisioning.
By the end of this article, you will have a fully functional Terraform-based setup for deploying a private monitoring solution using Azure Managed Grafana and Prometheus.
Key Benefits:
- Complete elimination of public internet exposure
- Enterprise-grade security with private endpoints
- Managed services reduce operational overhead
- Terraform automation for reproducible deployments
Architecture & Plan
The solution includes:
- A private AKS cluster (no public API access)
- Azure Monitor Workspace with Prometheus metrics enabled
- Azure Monitor Private Link Scope (AMPLS) for private ingestion
- Managed Grafana with public access disabled
- Private Endpoints for Prometheus, Grafana, and AMPLS
- Point-to-Site VPN for secure developer access (configured as per this VPN guide)
- DNS Inbound Endpoint for private name resolution
Figure 1: Complete private monitoring architecture showing all components and connections
In this article, we’ll walk through the following in order:
- Terraform setup and code overview
- Deploying private AKS and monitoring stack
- Approving the Grafana private endpoint
- Retrieving DNS inbound IP
- Connecting to the VPN
- Verifying logs in the Monitor Workspace
- Demonstrating VPN-only access to Grafana
- Accessing dashboards securely
Prerequisites
Before beginning, ensure you have:
- Azure CLI installed and authenticated
- Terraform installed
- Proper Azure permissions (Owner or Contributor + User Access Administrator)
- VPN client software (Azure VPN Client recommended)
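Before starting, a quick, optional sanity check of the tooling and the active subscription:
# Verify tooling versions and the active Azure subscription
az --version
terraform version
az account show --query "{subscription:name, id:id}" -o table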
Terraform Code Structure
Clone the project and inspect the following files:
git clone https://github.com/rceraline/devops.git
cd devops/2025-04-aks-private-monitoring
Folder Contents:
Figure 2: Git repository with the Terraform configuration files
Step 1: Terraform Code Breakdown
Here’s a detailed overview of what each file configures:
main.tf
locals {
aks_name = "aks-01"
all_private_dns_zone_names = [
"privatelink.canadacentral.azmk8s.io",
"privatelink.canadacentral.prometheus.monitor.azure.com",
"privatelink.grafana.azure.com",
"privatelink.blob.core.windows.net", ## require for AMPLS
"privatelink.monitor.azure.com", ## require for AMPLS
"privatelink.oms.opinsights.azure.com", ## require for AMPLS
"privatelink.ods.opinsights.azure.com", ## require for AMPLS
"privatelink.agentsvc.azure-automation.net" ## require for AMPLS
]
}
resource "azurerm_resource_group" "rg" {
name = "rg-aks-private-monitoring-01"
location = "Canada Central"
}
## Network
resource "azurerm_virtual_network" "vnet" {
name = "vnet-01"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
address_space = ["10.0.0.0/16"]
}
resource "azurerm_subnet" "snet_01" {
name = "snet-01"
address_prefixes = ["10.0.0.0/24"]
virtual_network_name = azurerm_virtual_network.vnet.name
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_subnet" "snet_02" {
name = "snet-02"
address_prefixes = ["10.0.2.0/24"]
virtual_network_name = azurerm_virtual_network.vnet.name
resource_group_name = azurerm_resource_group.rg.name
}
## Dns
resource "azurerm_private_dns_zone" "zones" {
for_each = toset(local.all_private_dns_zone_names)
name = each.key
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_private_dns_zone_virtual_network_link" "zone_links" {
for_each = toset(local.all_private_dns_zone_names)
name = each.key
resource_group_name = azurerm_resource_group.rg.name
private_dns_zone_name = each.value
virtual_network_id = azurerm_virtual_network.vnet.id
depends_on = [azurerm_private_dns_zone.zones]
}
## Identities
resource "azurerm_user_assigned_identity" "controlplane" {
location = azurerm_resource_group.rg.location
name = "id-controlplane-01"
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_user_assigned_identity" "kubelet" {
location = azurerm_resource_group.rg.location
name = "id-kubelet-01"
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_role_assignment" "controlplane_identity_contributor" {
scope = azurerm_user_assigned_identity.kubelet.id
role_definition_name = "Managed Identity Contributor"
principal_id = azurerm_user_assigned_identity.controlplane.principal_id
}
resource "azurerm_role_assignment" "controlplane_resourcegroup_contributor" {
scope = azurerm_resource_group.rg.id
role_definition_name = "Contributor"
principal_id = azurerm_user_assigned_identity.controlplane.principal_id
}
# AKS
resource "azurerm_kubernetes_cluster" "aks" {
name = local.aks_name
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
private_cluster_enabled = true
dns_prefix_private_cluster = local.aks_name
private_dns_zone_id = azurerm_private_dns_zone.zones["privatelink.canadacentral.azmk8s.io"].id
sku_tier = "Standard"
default_node_pool {
name = "system"
node_count = 3
vm_size = "Standard_D2s_v3"
vnet_subnet_id = azurerm_subnet.snet_01.id
upgrade_settings {
drain_timeout_in_minutes = 0
max_surge = "10%"
node_soak_duration_in_minutes = 0
}
}
identity {
type = "UserAssigned"
identity_ids = [azurerm_user_assigned_identity.controlplane.id]
}
kubelet_identity {
client_id = azurerm_user_assigned_identity.kubelet.client_id
object_id = azurerm_user_assigned_identity.kubelet.principal_id
user_assigned_identity_id = azurerm_user_assigned_identity.kubelet.id
}
network_profile {
network_plugin = "azure"
dns_service_ip = "10.0.1.4"
service_cidr = "10.0.1.0/24"
}
monitor_metrics {
annotations_allowed = null
labels_allowed = null
}
depends_on = [
azurerm_role_assignment.controlplane_identity_contributor,
azurerm_role_assignment.controlplane_resourcegroup_contributor,
]
}
- Declares AKS cluster and required infrastructure components:
- Resource group
- Virtual Network with 2 subnets (one for AKS and one for endpoints)
- Private DNS zones (8 in total, including the Grafana and Azure Monitor zones). Among these, 5 zones are mandatory for AMPLS to function correctly:
  - privatelink.blob.core.windows.net
  - privatelink.monitor.azure.com
  - privatelink.oms.opinsights.azure.com
  - privatelink.ods.opinsights.azure.com
  - privatelink.agentsvc.azure-automation.net
- User-assigned managed identities for control plane and kubelet
- Deploys a private AKS cluster (with private_cluster_enabled = true)
- Links the private DNS zones to the VNet for name resolution
This foundational block ensures a private, production-ready Kubernetes environment with DNS wired up for private endpoint resolution.
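Because the API server is private, kubectl from your workstation cannot reach it until the VPN and DNS pieces are in place. In the meantime, one way to run ad-hoc commands against the cluster is az aks command invoke, which proxies the command through Azure; a quick sketch using the names from the Terraform locals above:
# Run kubectl inside the private cluster without direct network access
az aks command invoke \
  --resource-group rg-aks-private-monitoring-01 \
  --name aks-01 \
  --command "kubectl get nodes -o wide"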
monitoring.tf
locals {
monitor_workspace_name = "amw-01"
monitor_private_link_scope_name = "ampls-01"
monitor_data_collection_rule_name = "dcr-01"
data_collection_endpoint_name = "dce-01"
grafana_dashboard_name = "amg-01"
grafana_version = "11"
ampls_private_dns_zones = [
"privatelink.blob.core.windows.net", ## require for AMPLS
"privatelink.monitor.azure.com", ## require for AMPLS
"privatelink.oms.opinsights.azure.com", ## require for AMPLS
"privatelink.ods.opinsights.azure.com", ## require for AMPLS
"privatelink.agentsvc.azure-automation.net" ## require for AMPLS
]
}
##### Azure Monitor Workspace #######
resource "azurerm_monitor_workspace" "amw" {
name = local.monitor_workspace_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
public_network_access_enabled = false
}
resource "azurerm_monitor_data_collection_endpoint" "dce" {
name = local.data_collection_endpoint_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
kind = "Linux"
public_network_access_enabled = false
}
resource "azurerm_private_endpoint" "amw" {
name = "pe-${local.monitor_workspace_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
subnet_id = azurerm_subnet.snet_02.id
private_service_connection {
name = "psc-${local.monitor_workspace_name}"
private_connection_resource_id = azurerm_monitor_workspace.amw.id
subresource_names = ["prometheusMetrics"]
is_manual_connection = false
}
private_dns_zone_group {
name = "pdzg-${local.monitor_workspace_name}"
private_dns_zone_ids = [azurerm_private_dns_zone.zones["privatelink.canadacentral.prometheus.monitor.azure.com"].id]
}
}
resource "azurerm_dashboard_grafana_managed_private_endpoint" "amw" {
grafana_id = azurerm_dashboard_grafana.grafana.id
name = "mpe-${local.monitor_workspace_name}"
location = azurerm_resource_group.rg.location
private_link_resource_id = azurerm_monitor_workspace.amw.id
group_ids = ["prometheusMetrics"]
private_link_resource_region = azurerm_resource_group.rg.location
}
######### AMPLS ###############
resource "azurerm_monitor_private_link_scope" "ampls" {
name = local.monitor_private_link_scope_name
resource_group_name = azurerm_resource_group.rg.name
ingestion_access_mode = "PrivateOnly"
query_access_mode = "PrivateOnly"
}
resource "azurerm_monitor_private_link_scoped_service" "dce" {
name = "link-${local.monitor_private_link_scope_name}-dce"
resource_group_name = azurerm_resource_group.rg.name
scope_name = azurerm_monitor_private_link_scope.ampls.name
linked_resource_id = azurerm_monitor_data_collection_endpoint.dce.id
}
## WORKAROUND: wait a couple of seconds before creating the AMPLS private endpoint
resource "time_sleep" "ampls_wait" {
create_duration = "15s"
depends_on = [azurerm_monitor_private_link_scope.ampls]
}
resource "azurerm_private_endpoint" "ampls" {
name = "pe-${local.monitor_private_link_scope_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
subnet_id = azurerm_subnet.snet_02.id
private_service_connection {
name = "psc-${local.monitor_private_link_scope_name}"
private_connection_resource_id = azurerm_monitor_private_link_scope.ampls.id
subresource_names = ["azuremonitor"]
is_manual_connection = false
}
private_dns_zone_group {
name = "pdzg-${local.monitor_private_link_scope_name}"
private_dns_zone_ids = [for zone in local.ampls_private_dns_zones : azurerm_private_dns_zone.zones[zone].id]
}
depends_on = [
time_sleep.ampls_wait
]
}
#### Data Collection Rule and more... ####
resource "azurerm_monitor_data_collection_rule" "dcr" {
name = local.monitor_data_collection_rule_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
data_collection_endpoint_id = azurerm_monitor_data_collection_endpoint.dce.id
kind = "Linux"
destinations {
monitor_account {
monitor_account_id = azurerm_monitor_workspace.amw.id
name = "MonitoringAccount1"
}
}
data_flow {
streams = ["Microsoft-PrometheusMetrics"]
destinations = ["MonitoringAccount1"]
}
data_sources {
prometheus_forwarder {
streams = ["Microsoft-PrometheusMetrics"]
name = "PrometheusDataSource"
}
}
description = "DCR for Azure Monitor Metrics Profile (Managed Prometheus)"
depends_on = [azurerm_private_endpoint.ampls]
}
resource "azurerm_monitor_data_collection_rule_association" "dcra_rule" {
name = "MSProm-${azurerm_resource_group.rg.location}-${local.aks_name}"
target_resource_id = azurerm_kubernetes_cluster.aks.id
data_collection_rule_id = azurerm_monitor_data_collection_rule.dcr.id
description = "Association of data collection rule. Deleting this association will break the data collection for this AKS Cluster."
depends_on = [
azurerm_monitor_data_collection_rule.dcr
]
}
resource "azurerm_monitor_data_collection_rule_association" "dcra_endpoint" {
name = "configurationAccessEndpoint"
target_resource_id = azurerm_kubernetes_cluster.aks.id
data_collection_endpoint_id = azurerm_monitor_data_collection_endpoint.dce.id
description = "Association of data collection endpoint. Deleting this association will break the data collection for this AKS Cluster."
}
######### Grafana ###########
resource "azurerm_dashboard_grafana" "grafana" {
name = local.grafana_dashboard_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
grafana_major_version = local.grafana_version
public_network_access_enabled = false
identity {
type = "SystemAssigned"
}
azure_monitor_workspace_integrations {
resource_id = azurerm_monitor_workspace.amw.id
}
}
resource "azurerm_private_endpoint" "grafana" {
name = "pe-${local.grafana_dashboard_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
subnet_id = azurerm_subnet.snet_02.id
private_service_connection {
name = "psc-${local.grafana_dashboard_name}"
private_connection_resource_id = azurerm_dashboard_grafana.grafana.id
subresource_names = ["grafana"]
is_manual_connection = false
}
private_dns_zone_group {
name = "pdzg-${local.grafana_dashboard_name}"
private_dns_zone_ids = [azurerm_private_dns_zone.zones["privatelink.grafana.azure.com"].id]
}
}
resource "azurerm_role_assignment" "monitoring_reader" {
scope = azurerm_monitor_workspace.amw.id
role_definition_name = "Monitoring Reader"
principal_id = azurerm_dashboard_grafana.grafana.identity.0.principal_id
}
resource "azurerm_role_assignment" "datareaderrole" {
scope = azurerm_monitor_workspace.amw.id
role_definition_id = "/subscriptions/${split("/", azurerm_monitor_workspace.amw.id)[2]}/providers/Microsoft.Authorization/roleDefinitions/b0d8363b-8ddd-447d-831f-62ca05bff136"
principal_id = azurerm_dashboard_grafana.grafana.identity.0.principal_id
}
resource "azurerm_role_assignment" "grafana_admin" {
scope = azurerm_dashboard_grafana.grafana.id
role_definition_name = "Grafana Admin"
principal_id = var.current_user_principal_id
}
######## Prometheus rules ###########
resource "azurerm_monitor_alert_prometheus_rule_group" "node_recording_rules_rule_group" {
name = "NodeRecordingRulesRuleGroup-${local.aks_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
cluster_name = local.aks_name
description = "Node Recording Rules Rule Group"
rule_group_enabled = true
interval = "PT1M"
scopes = [azurerm_monitor_workspace.amw.id, azurerm_kubernetes_cluster.aks.id]
rule {
enabled = true
record = "instance:node_num_cpu:sum"
expression = <<EOF
count without (cpu, mode) ( node_cpu_seconds_total{job="node",mode="idle"})
EOF
}
rule {
enabled = true
record = "instance:node_cpu_utilisation:rate5m"
expression = <<EOF
1 - avg without (cpu) ( sum without (mode) (rate(node_cpu_seconds_total{job="node", mode=~"idle|iowait|steal"}[5m])))
EOF
}
rule {
enabled = true
record = "instance:node_load1_per_cpu:ratio"
expression = <<EOF
( node_load1{job="node"}/ instance:node_num_cpu:sum{job="node"})
EOF
}
rule {
enabled = true
record = "instance:node_memory_utilisation:ratio"
expression = <<EOF
1 - ( ( node_memory_MemAvailable_bytes{job="node"} or ( node_memory_Buffers_bytes{job="node"} + node_memory_Cached_bytes{job="node"} + node_memory_MemFree_bytes{job="node"} + node_memory_Slab_bytes{job="node"} ) )/ node_memory_MemTotal_bytes{job="node"})
EOF
}
rule {
enabled = true
record = "instance:node_vmstat_pgmajfault:rate5m"
expression = <<EOF
rate(node_vmstat_pgmajfault{job="node"}[5m])
EOF
}
rule {
enabled = true
record = "instance_device:node_disk_io_time_seconds:rate5m"
expression = <<EOF
rate(node_disk_io_time_seconds_total{job="node", device!=""}[5m])
EOF
}
rule {
enabled = true
record = "instance_device:node_disk_io_time_weighted_seconds:rate5m"
expression = <<EOF
rate(node_disk_io_time_weighted_seconds_total{job="node", device!=""}[5m])
EOF
}
rule {
enabled = true
record = "instance:node_network_receive_bytes_excluding_lo:rate5m"
expression = <<EOF
sum without (device) ( rate(node_network_receive_bytes_total{job="node", device!="lo"}[5m]))
EOF
}
rule {
enabled = true
record = "instance:node_network_transmit_bytes_excluding_lo:rate5m"
expression = <<EOF
sum without (device) ( rate(node_network_transmit_bytes_total{job="node", device!="lo"}[5m]))
EOF
}
rule {
enabled = true
record = "instance:node_network_receive_drop_excluding_lo:rate5m"
expression = <<EOF
sum without (device) ( rate(node_network_receive_drop_total{job="node", device!="lo"}[5m]))
EOF
}
rule {
enabled = true
record = "instance:node_network_transmit_drop_excluding_lo:rate5m"
expression = <<EOF
sum without (device) ( rate(node_network_transmit_drop_total{job="node", device!="lo"}[5m]))
EOF
}
}
resource "azurerm_monitor_alert_prometheus_rule_group" "kubernetes_recording_rules_rule_group" {
name = "KubernetesRecordingRulesRuleGroup-${local.aks_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
cluster_name = local.aks_name
description = "Kubernetes Recording Rules Rule Group"
rule_group_enabled = true
interval = "PT1M"
scopes = [azurerm_monitor_workspace.amw.id, azurerm_kubernetes_cluster.aks.id]
rule {
enabled = true
record = "node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate"
expression = <<EOF
sum by (cluster, namespace, pod, container) ( irate(container_cpu_usage_seconds_total{job="cadvisor", image!=""}[5m])) * on (cluster, namespace, pod) group_left(node) topk by (cluster, namespace, pod) ( 1, max by(cluster, namespace, pod, node) (kube_pod_info{node!=""}))
EOF
}
rule {
enabled = true
record = "node_namespace_pod_container:container_memory_working_set_bytes"
expression = <<EOF
container_memory_working_set_bytes{job="cadvisor", image!=""}* on (namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))
EOF
}
rule {
enabled = true
record = "node_namespace_pod_container:container_memory_rss"
expression = <<EOF
container_memory_rss{job="cadvisor", image!=""}* on (namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))
EOF
}
rule {
enabled = true
record = "node_namespace_pod_container:container_memory_cache"
expression = <<EOF
container_memory_cache{job="cadvisor", image!=""}* on (namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))
EOF
}
rule {
enabled = true
record = "node_namespace_pod_container:container_memory_swap"
expression = <<EOF
container_memory_swap{job="cadvisor", image!=""}* on (namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info{node!=""}))
EOF
}
rule {
enabled = true
record = "cluster:namespace:pod_memory:active:kube_pod_container_resource_requests"
expression = <<EOF
kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"} * on (namespace, pod, cluster)group_left() max by (namespace, pod, cluster) ( (kube_pod_status_phase{phase=~"Pending|Running"} == 1))
EOF
}
rule {
enabled = true
record = "namespace_memory:kube_pod_container_resource_requests:sum"
expression = <<EOF
sum by (namespace, cluster) ( sum by (namespace, pod, cluster) ( max by (namespace, pod, container, cluster) ( kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"} ) * on(namespace, pod, cluster) group_left() max by (namespace, pod, cluster) ( kube_pod_status_phase{phase=~"Pending|Running"} == 1 ) ))
EOF
}
rule {
enabled = true
record = "cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests"
expression = <<EOF
kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"} * on (namespace, pod, cluster)group_left() max by (namespace, pod, cluster) ( (kube_pod_status_phase{phase=~"Pending|Running"} == 1))
EOF
}
rule {
enabled = true
record = "namespace_cpu:kube_pod_container_resource_requests:sum"
expression = <<EOF
sum by (namespace, cluster) ( sum by (namespace, pod, cluster) ( max by (namespace, pod, container, cluster) ( kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"} ) * on(namespace, pod, cluster) group_left() max by (namespace, pod, cluster) ( kube_pod_status_phase{phase=~"Pending|Running"} == 1 ) ))
EOF
}
rule {
enabled = true
record = "cluster:namespace:pod_memory:active:kube_pod_container_resource_limits"
expression = <<EOF
kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"} * on (namespace, pod, cluster)group_left() max by (namespace, pod, cluster) ( (kube_pod_status_phase{phase=~"Pending|Running"} == 1))
EOF
}
rule {
enabled = true
record = "namespace_memory:kube_pod_container_resource_limits:sum"
expression = <<EOF
sum by (namespace, cluster) ( sum by (namespace, pod, cluster) ( max by (namespace, pod, container, cluster) ( kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"} ) * on(namespace, pod, cluster) group_left() max by (namespace, pod, cluster) ( kube_pod_status_phase{phase=~"Pending|Running"} == 1 ) ))
EOF
}
rule {
enabled = true
record = "cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits"
expression = <<EOF
kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"} * on (namespace, pod, cluster)group_left() max by (namespace, pod, cluster) ( (kube_pod_status_phase{phase=~"Pending|Running"} == 1) )
EOF
}
rule {
enabled = true
record = "namespace_cpu:kube_pod_container_resource_limits:sum"
expression = <<EOF
sum by (namespace, cluster) ( sum by (namespace, pod, cluster) ( max by (namespace, pod, container, cluster) ( kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"} ) * on(namespace, pod, cluster) group_left() max by (namespace, pod, cluster) ( kube_pod_status_phase{phase=~"Pending|Running"} == 1 ) ))
EOF
}
rule {
enabled = true
record = "namespace_workload_pod:kube_pod_owner:relabel"
expression = <<EOF
max by (cluster, namespace, workload, pod) ( label_replace( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)" ) * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) ( 1, max by (replicaset, namespace, owner_name) ( kube_replicaset_owner{job="kube-state-metrics"} ) ), "workload", "$1", "owner_name", "(.*)" ))
EOF
labels = {
workload_type = "deployment"
}
}
rule {
enabled = true
record = "namespace_workload_pod:kube_pod_owner:relabel"
expression = <<EOF
max by (cluster, namespace, workload, pod) ( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)" ))
EOF
labels = {
workload_type = "daemonset"
}
}
rule {
enabled = true
record = "namespace_workload_pod:kube_pod_owner:relabel"
expression = <<EOF
max by (cluster, namespace, workload, pod) ( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)" ))
EOF
labels = {
workload_type = "statefulset"
}
}
rule {
enabled = true
record = "namespace_workload_pod:kube_pod_owner:relabel"
expression = <<EOF
max by (cluster, namespace, workload, pod) ( label_replace( kube_pod_owner{job="kube-state-metrics", owner_kind="Job"}, "workload", "$1", "owner_name", "(.*)" ))
EOF
labels = {
workload_type = "job"
}
}
rule {
enabled = true
record = ":node_memory_MemAvailable_bytes:sum"
expression = <<EOF
sum( node_memory_MemAvailable_bytes{job="node"} or ( node_memory_Buffers_bytes{job="node"} + node_memory_Cached_bytes{job="node"} + node_memory_MemFree_bytes{job="node"} + node_memory_Slab_bytes{job="node"} )) by (cluster)
EOF
}
rule {
enabled = true
record = "cluster:node_cpu:ratio_rate5m"
expression = <<EOF
sum(rate(node_cpu_seconds_total{job="node",mode!="idle",mode!="iowait",mode!="steal"}[5m])) by (cluster) /count(sum(node_cpu_seconds_total{job="node"}) by (cluster, instance, cpu)) by (cluster)
EOF
}
}
resource "azurerm_monitor_alert_prometheus_rule_group" "node_and_kubernetes_recording_rules_rule_group_win" {
name = "NodeAndKubernetesRecordingRulesRuleGroup-Win-${local.aks_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
cluster_name = local.aks_name
description = "Node and Kubernetes Recording Rules Rule Group for Windows Nodes"
rule_group_enabled = true
interval = "PT1M"
scopes = [azurerm_monitor_workspace.amw.id, azurerm_kubernetes_cluster.aks.id]
rule {
enabled = true
record = "node:windows_node_filesystem_usage:"
expression = <<EOF
max by (instance,volume)((windows_logical_disk_size_bytes{job="windows-exporter"} - windows_logical_disk_free_bytes{job="windows-exporter"}) / windows_logical_disk_size_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = "node:windows_node_filesystem_avail:"
expression = <<EOF
max by (instance, volume) (windows_logical_disk_free_bytes{job="windows-exporter"} / windows_logical_disk_size_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = ":windows_node_net_utilisation:sum_irate"
expression = <<EOF
sum(irate(windows_net_bytes_total{job="windows-exporter"}[5m]))
EOF
}
rule {
enabled = true
record = "node:windows_node_net_utilisation:sum_irate"
expression = <<EOF
sum by (instance) ((irate(windows_net_bytes_total{job="windows-exporter"}[5m])))
EOF
}
rule {
enabled = true
record = ":windows_node_net_saturation:sum_irate"
expression = <<EOF
sum(irate(windows_net_packets_received_discarded_total{job="windows-exporter"}[5m])) + sum(irate(windows_net_packets_outbound_discarded_total{job="windows-exporter"}[5m]))
EOF
}
rule {
enabled = true
record = "node:windows_node_net_saturation:sum_irate"
expression = <<EOF
sum by (instance) ((irate(windows_net_packets_received_discarded_total{job="windows-exporter"}[5m]) + irate(windows_net_packets_outbound_discarded_total{job="windows-exporter"}[5m])))
EOF
}
rule {
enabled = true
record = "windows_pod_container_available"
expression = <<EOF
windows_container_available{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "windows_container_total_runtime"
expression = <<EOF
windows_container_cpu_usage_seconds_total{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "windows_container_memory_usage"
expression = <<EOF
windows_container_memory_usage_commit_bytes{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "windows_container_private_working_set_usage"
expression = <<EOF
windows_container_memory_usage_private_working_set_bytes{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "windows_container_network_received_bytes_total"
expression = <<EOF
windows_container_network_receive_bytes_total{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "windows_container_network_transmitted_bytes_total"
expression = <<EOF
windows_container_network_transmit_bytes_total{job="windows-exporter", container_id != ""} * on(container_id) group_left(container, pod, namespace) max(kube_pod_container_info{job="kube-state-metrics", container_id != ""}) by(container, container_id, pod, namespace)
EOF
}
rule {
enabled = true
record = "kube_pod_windows_container_resource_memory_request"
expression = <<EOF
max by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"}) * on(container,pod,namespace) (windows_pod_container_available)
EOF
}
rule {
enabled = true
record = "kube_pod_windows_container_resource_memory_limit"
expression = <<EOF
kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"} * on(container,pod,namespace) (windows_pod_container_available)
EOF
}
rule {
enabled = true
record = "kube_pod_windows_container_resource_cpu_cores_request"
expression = <<EOF
max by (namespace, pod, container) ( kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"}) * on(container,pod,namespace) (windows_pod_container_available)
EOF
}
rule {
enabled = true
record = "kube_pod_windows_container_resource_cpu_cores_limit"
expression = <<EOF
kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"} * on(container,pod,namespace) (windows_pod_container_available)
EOF
}
rule {
enabled = true
record = "namespace_pod_container:windows_container_cpu_usage_seconds_total:sum_rate"
expression = <<EOF
sum by (namespace, pod, container) (rate(windows_container_total_runtime{}[5m]))
EOF
}
}
resource "azurerm_monitor_alert_prometheus_rule_group" "node_recording_rules_rule_group_win" {
name = "NodeRecordingRulesRuleGroup-Win-${local.aks_name}"
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
cluster_name = local.aks_name
description = "Node and Kubernetes Recording Rules Rule Group for Windows Nodes"
rule_group_enabled = true
interval = "PT1M"
scopes = [azurerm_monitor_workspace.amw.id, azurerm_kubernetes_cluster.aks.id]
rule {
enabled = true
record = "node:windows_node:sum"
expression = <<EOF
count (windows_system_system_up_time{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = "node:windows_node_num_cpu:sum"
expression = <<EOF
count by (instance) (sum by (instance, core) (windows_cpu_time_total{job="windows-exporter"}))
EOF
}
rule {
enabled = true
record = ":windows_node_cpu_utilisation:avg5m"
expression = <<EOF
1 - avg(rate(windows_cpu_time_total{job="windows-exporter",mode="idle"}[5m]))
EOF
}
rule {
enabled = true
record = "node:windows_node_cpu_utilisation:avg5m"
expression = <<EOF
1 - avg by (instance) (rate(windows_cpu_time_total{job="windows-exporter",mode="idle"}[5m]))
EOF
}
rule {
enabled = true
record = ":windows_node_memory_utilisation:"
expression = <<EOF
1 -sum(windows_memory_available_bytes{job="windows-exporter"})/sum(windows_os_visible_memory_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = ":windows_node_memory_MemFreeCached_bytes:sum"
expression = <<EOF
sum(windows_memory_available_bytes{job="windows-exporter"} + windows_memory_cache_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_totalCached_bytes:sum"
expression = <<EOF
(windows_memory_cache_bytes{job="windows-exporter"} + windows_memory_modified_page_list_bytes{job="windows-exporter"} + windows_memory_standby_cache_core_bytes{job="windows-exporter"} + windows_memory_standby_cache_normal_priority_bytes{job="windows-exporter"} + windows_memory_standby_cache_reserve_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = ":windows_node_memory_MemTotal_bytes:sum"
expression = <<EOF
sum(windows_os_visible_memory_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_bytes_available:sum"
expression = <<EOF
sum by (instance) ((windows_memory_available_bytes{job="windows-exporter"}))
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_bytes_total:sum"
expression = <<EOF
sum by (instance) (windows_os_visible_memory_bytes{job="windows-exporter"})
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_utilisation:ratio"
expression = <<EOF
(node:windows_node_memory_bytes_total:sum - node:windows_node_memory_bytes_available:sum) / scalar(sum(node:windows_node_memory_bytes_total:sum))
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_utilisation:"
expression = <<EOF
1 - (node:windows_node_memory_bytes_available:sum / node:windows_node_memory_bytes_total:sum)
EOF
}
rule {
enabled = true
record = "node:windows_node_memory_swap_io_pages:irate"
expression = <<EOF
irate(windows_memory_swap_page_operations_total{job="windows-exporter"}[5m])
EOF
}
rule {
enabled = true
record = ":windows_node_disk_utilisation:avg_irate"
expression = <<EOF
avg(irate(windows_logical_disk_read_seconds_total{job="windows-exporter"}[5m]) + irate(windows_logical_disk_write_seconds_total{job="windows-exporter"}[5m]))
EOF
}
rule {
enabled = true
record = "node:windows_node_disk_utilisation:avg_irate"
expression = <<EOF
avg by (instance) ((irate(windows_logical_disk_read_seconds_total{job="windows-exporter"}[5m]) + irate(windows_logical_disk_write_seconds_total{job="windows-exporter"}[5m])))
EOF
}
}
This is the central component of the private monitoring setup, responsible for configuring the full end-to-end telemetry flow between AKS, Azure Monitor, and Managed Grafana — entirely via private connectivity.
It performs the following:
- Creates an Azure Monitor Workspace with public_network_access_enabled = false to store Prometheus metrics.
- Deploys a Data Collection Endpoint (DCE) to receive metrics from the AKS cluster.
- Creates a Private Endpoint for the Prometheus ingestion surface of the Monitor Workspace.
- Deploys Managed Grafana with public access disabled, including a private endpoint for VPN-only access.
- Establishes an Azure Monitor Private Link Scope (AMPLS) to route DCE ingestion privately, and links the DCE to it.
- Adds a required 15-second delay (time_sleep) to avoid race conditions when creating the AMPLS private endpoint.
- Configures a Data Collection Rule (DCR) and associates it, along with the DCE, with the AKS cluster.
- Creates a Managed Private Endpoint from Grafana to the Monitor Workspace, which must be manually approved post-deployment.
- Assigns the necessary RBAC roles:
  - Monitoring Reader
  - Azure Monitor Data Reader (assigned by its role definition ID)
  - Grafana Admin (to the current user)
Additionally, this file declares Prometheus recording rule groups (azurerm_monitor_alert_prometheus_rule_group) to enrich observability across both Linux and Windows nodes, covering:
- Node metrics: CPU, memory, disk I/O, network I/O
- Kubernetes metrics: container usage, limits, resource requests
- Windows-specific metrics: logical disk usage, swap, networking, and container stats
These rules are organized into dedicated rule groups for:
- Linux Nodes
- Kubernetes Workloads
- Windows Nodes
Together, monitoring.tf orchestrates the full private observability backend, enabling AKS to push metrics into Azure Monitor and Grafana to query them securely, all without public exposure.
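Once the cluster is up and the DCR/DCE associations are applied, the Azure Monitor metrics agent (ama-metrics) should be running inside the cluster. A quick sanity check (pod names can vary slightly between addon versions; az aks command invoke is used because the API server is private):
# Check that the Managed Prometheus addon pods are running in kube-system
az aks command invoke \
  --resource-group rg-aks-private-monitoring-01 \
  --name aks-01 \
  --command "kubectl get pods -n kube-system | grep ama-metrics"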
vpn.tf
locals {
dns_subnet_name = "snet-dnsinbound-01"
dns_subnet_address_prefixes = ["10.0.101.0/24"]
private_dns_resolver_name = "dnspr-01"
dns_inbound_endpoint_name = "dnsinbound-endpoint-01"
vpn_subnet_name = "GatewaySubnet"
vpn_subnet_address_prefixes = ["10.0.100.0/24"]
vpn_pip_name = "pip-vpn-01"
vpn_gateway_name = "vgw-vpn-01"
vpn_client_address_space = ["10.200.0.0/24"]
vpn_public_cert_data = <<EOF
MIIC9TCCAd2gAwIBAgIQHAzVMIc/aY5JxYQjgH0iXTANBgkqhkiG9w0BAQsFADAd
MRswGQYDVQQDDBJWcG5Sb290Q2VydGlmaWNhdGUwHhcNMjUwMTIwMjExNDE4WhcN
MjYwMTIwMjEzNDE4WjAdMRswGQYDVQQDDBJWcG5Sb290Q2VydGlmaWNhdGUwggEi
MA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQC62jOML+5NCw7O+U6PMc/zNMDV
5F+7qaJK08SnKy4iiD1rPq8TI5HGXJg+6wrhmgFYTLLzY8VpRhfJblrXNs5vgc2m
v+H9LjFBRc0QtMI8+2j+gmpYiaaQk3KNESP1JNvz6uOvPhAjwjCWE7wzJzUiBzL+
JShAhwG6e2DKYt8zyYWsy51Myeqx7hMCqTb/NZRSxKKUTqC4qU/NlGW+7oIkbsl1
2YZ0stgWnxx21uK6flG29C6sxfDPyENoRiCrw8t+YH0iJPKm9i98GM6qu/gALA2G
6wRtKGjFf5ZDpDMycUO6LekggF+h8Ry/PAOILFypTQpfZ6ybHXukHh2SdVh9AgMB
AAGjMTAvMA4GA1UdDwEB/wQEAwICBDAdBgNVHQ4EFgQUlfJPhTGCjK4x3q4iAi0Y
naKJNMowDQYJKoZIhvcNAQELBQADggEBAJyjEwGTzv0/6+UtesgPg0G+YIvgP7vj
ZMKANqeaypTdru3acsTcChJCmjtSA7ufBlZOhIwCWnzxY5b8Ugiuqnqv7oPWkr1u
Wl937ZLR+lHhywVQRGBzGKOo7JFlB/SiGr4F90Oq6fgZvBU0DlwPb8jo//t576R/
pt+616/9TcmWNu5Hoptb208e5x8sjMpCzalimxFx6xicSzJ2IDwEW5aaVKwJGho/
x4EkA6x5CAmBNbzSM4Xcptijkc125DNPzD8tn1PGgqu4xknNCW8XNbDhyi9WGsV8
t2XTIPZtX8fjUMlxxt72/FWRdm7ErQnz186/P+5zwAcIqvQjBSDfyzs=
EOF
}
resource "azurerm_subnet" "gateway" {
name = local.vpn_subnet_name
address_prefixes = local.vpn_subnet_address_prefixes
virtual_network_name = azurerm_virtual_network.vnet.name
resource_group_name = azurerm_resource_group.rg.name
}
resource "azurerm_public_ip" "vpn" {
name = local.vpn_pip_name
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
allocation_method = "Static"
}
resource "azurerm_virtual_network_gateway" "vpn" {
name = local.vpn_gateway_name
location = azurerm_resource_group.rg.location
resource_group_name = azurerm_resource_group.rg.name
type = "Vpn"
vpn_type = "RouteBased"
active_active = false
enable_bgp = false
sku = "VpnGw1"
ip_configuration {
name = "vnetGatewayConfig"
public_ip_address_id = azurerm_public_ip.vpn.id
private_ip_address_allocation = "Dynamic"
subnet_id = azurerm_subnet.gateway.id
}
vpn_client_configuration {
address_space = local.vpn_client_address_space
vpn_client_protocols = ["OpenVPN"]
root_certificate {
name = "VpnRoot"
public_cert_data = local.vpn_public_cert_data
}
}
}
## DNS
resource "azurerm_subnet" "dns_inbound" {
name = local.dns_subnet_name
address_prefixes = local.dns_subnet_address_prefixes
virtual_network_name = azurerm_virtual_network.vnet.name
resource_group_name = azurerm_resource_group.rg.name
delegation {
name = "Microsoft.Network.dnsResolvers"
service_delegation {
actions = ["Microsoft.Network/virtualNetworks/subnets/join/action"]
name = "Microsoft.Network/dnsResolvers"
}
}
}
resource "azurerm_private_dns_resolver" "dnspr" {
name = local.private_dns_resolver_name
resource_group_name = azurerm_resource_group.rg.name
location = azurerm_resource_group.rg.location
virtual_network_id = azurerm_virtual_network.vnet.id
}
resource "azurerm_private_dns_resolver_inbound_endpoint" "inbound" {
name = local.dns_inbound_endpoint_name
private_dns_resolver_id = azurerm_private_dns_resolver.dnspr.id
location = azurerm_private_dns_resolver.dnspr.location
ip_configurations {
subnet_id = azurerm_subnet.dns_inbound.id
}
}
- Adds secure developer access via a Point-to-Site VPN
- Creates:
- Gateway subnet and VPN gateway with OpenVPN config
- Static public IP for VPN gateway
- DNS subnet and inbound endpoint using Private DNS Resolver
- The VPN gateway's point-to-site configuration uses certificate authentication; the root certificate's public data is embedded in the script, and clients connect with certificates issued from that root
- The DNS Inbound Endpoint provides name resolution for the private zones from your machine once the VPN is connected
This setup ensures that external access to Grafana and the metrics endpoints is possible only through the VPN.
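The later steps need the private IP of the DNS inbound endpoint so the VPN client can use it as its DNS server. One convenient way to surface it is a Terraform output; this is a sketch that assumes the azurerm provider exports private_ip_address on the inbound endpoint's ip_configurations block:
## outputs.tf (sketch): expose the DNS inbound endpoint IP for the VPN client DNS configuration
## assumption: ip_configurations[0].private_ip_address is exported by this resource
output "dns_inbound_endpoint_ip" {
  value = azurerm_private_dns_resolver_inbound_endpoint.inbound.ip_configurations[0].private_ip_address
}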
providers.tf
Defines the required Terraform provider (azurerm >= 4.26.0) and sets up the Azure subscription context.
terraform {
required_providers {
azurerm = {
source = "hashicorp/azurerm"
version = ">=4.26.0"
}
}
}
provider "azurerm" {
subscription_id = "{subscription ID}" // update with your subscription ID
features {
}
}
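If you are unsure which subscription ID to use, you can read it from your current Azure CLI session:
# Show the ID of the subscription the Azure CLI is currently logged into
az account show --query id -o tsv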
variables.tf
Includes one required input:
variable "current_user_principal_id" {
type = string
description = "Principal ID of the current user."
}
You can retrieve current_user_principal_id using (the -o tsv flag strips the surrounding quotes from the returned GUID):
az ad signed-in-user show --query id -o tsv
Set the current_user_principal_id variable through an environment variable in PowerShell, or add a terraform.tfvars file to the solution (a sketch follows):
$env:TF_VAR_current_user_principal_id = $(az ad signed-in-user show --query id -o tsv)
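A minimal terraform.tfvars sketch, with a placeholder GUID to replace with your own object ID:
## terraform.tfvars (sketch) -- replace the placeholder with your own object ID
current_user_principal_id = "00000000-0000-0000-0000-000000000000"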
Step 2: Deploy Infrastructure
Once everything is configured, initialize and deploy:
terraform init
terraform plan
terraform apply
Note: VPN Gateway creation may take 30–45 minutes to complete.
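Rather than watching the portal, you can poll the gateway's provisioning state from the CLI (resource names match the Terraform locals above):
# The VPN gateway is ready once provisioningState reports "Succeeded"
az network vnet-gateway show \
  --resource-group rg-aks-private-monitoring-01 \
  --name vgw-vpn-01 \
  --query provisioningState -o tsv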
Step 3: Approve Grafana Managed Private Endpoint
After deployment, go to:
Azure Monitor Workspace → Networking → Private Access
You’ll see a Pending connection from Grafana. Click Approve as shown in Figure 3.
Figure 3: Grafana private endpoints to be approved
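If you prefer the CLI, the pending connection can also be listed and approved with the generic private-endpoint-connection commands; this is a sketch, and the placeholder IDs must be taken from your own deployment and the list output:
# List pending private endpoint connections on the Azure Monitor workspace
az network private-endpoint-connection list --id <monitor-workspace-resource-id>
# Approve the Grafana managed private endpoint connection by its ID
az network private-endpoint-connection approve --id <connection-resource-id> --description "Approved for Managed Grafana"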
Step 4: Verify Prometheus Metrics
In Azure Portal:
- Go to Monitor Workspace
- Search for Prometheus Explorer
- Confirm that metrics are visible
Figure 4: Prometheus connected successfully and data is coming
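A couple of simple queries in Prometheus Explorer are enough to confirm data is flowing; for example, up returns all scrape targets currently reporting as healthy, and kube_node_info (from the kube-state-metrics exporter shipped with the addon) returns one series per node registered in the cluster:
up
kube_node_info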
Step 5: Access Grafana
Before proceeding, ensure you are connected to the VPN, as access to Grafana is restricted to private network traffic. If you haven’t set up the VPN yet, follow this detailed VPN setup guide, which walks through configuring the client certificate and DNS resolution.
- Once connected to the VPN, the Grafana endpoint opens successfully
Figure 5: Azure VPN Client is connected
- Go to Data Sources → Managed_prometheus_amw-01 → Save & Test
Figure 6: Grafana can successfully query Prometheus
Then, access the AKS Dashboards to view insights.
Step 6: View Prometheus Metrics in Grafana Dashboards
In the Grafana portal:
- Go to Dashboards
- Search for Azure Managed Prometheus
- Go to Kubernetes / Compute Resources / Cluster
Figure 7: Grafana Dashboard
- Confirm that the data coming from Prometheus is visible
Figure 8: Data coming from Prometheus
Troubleshooting
| Problem | Solution |
|---|---|
| Private endpoint creation fails | Ensure the time_sleep delay is applied after the AMPLS is created |
| No DNS resolution | Verify all private DNS zones are linked to the VNet |
| Grafana can't query metrics | Make sure the managed private endpoint is approved |
| No Prometheus data | Confirm the DCR/DCE associations are in place |
| VPN doesn't connect | Re-download or reconfigure the VPN client certificates |
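For the DNS issues in particular, a quick check from a VPN-connected workstation is to resolve the Grafana endpoint hostname (shown on the Grafana resource's overview page) and confirm it returns a private IP from the endpoint subnet (10.0.2.0/24):
# Should return a 10.0.2.x address when the VPN client uses the DNS inbound endpoint as its DNS server
nslookup <your-grafana-endpoint-hostname>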
Conclusion
By following this guide, you now have a fully private and secure monitoring stack for Azure Kubernetes Service using Terraform. The setup leverages Azure Managed Grafana and Azure Monitor Metrics (Prometheus) with private endpoints, which are further protected with a VPN-based access layer. This solution eliminates any public exposure, ensuring compliance and enhancing security for production-grade environments. With modular Terraform code, proper DNS setup, and VPN-secured access, you are fully equipped to monitor AKS workloads in a private cloud-native way.