AKS Lessons Learned 1 of 2

In general, troubleshooting Kubernetes is tricky because one has to get in and out of pods. It took me two days to troubleshoot some networking issues with a private AKS cluster. Given the number of tricks I had to employ, I need to take some notes.

The issue

After writing the Terraform code, I used the following dummy service to test the private AKS cluster:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aks-helloworld-one
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aks-helloworld-one
  template:
    metadata:
      labels:
        app: aks-helloworld-one
    spec:
      containers:
      - name: aks-helloworld-one
        image: neilpeterson/aks-helloworld:v1
        ports:
        - containerPort: 80
        env:
        - name: TITLE
          value: "Welcome to Azure Kubernetes Service (AKS)"
---
apiVersion: v1
kind: Service
metadata:
  name: aks-helloworld-one
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: aks-helloworld-one

The expected behaviour is that the Service object tells the cloud API to provision a load balancer with a public IP listening on port 80. I should be able to curl the IP address and reach the site in the Pod. However, I was not able to. From the bastion host, I could curl the nodePort on the node address, but nothing worked against the public IP, no matter where I ran curl from. This feels like a basic issue, but it is quite annoying because the native troubleshooting tooling for Azure Load Balancer is horrible. After going in and out of a bunch of components named “insights”, “diagnostic log”, or “Metrics”, I simply couldn’t find a way to trace whether the load balancer had received an HTTP request. Most of the information I was able to see was irrelevant or useless.
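The curl test described above can be sketched as a small shell helper. This is only a sketch: the function name check_lb is hypothetical, and it assumes kubectl is already pointed at the cluster and that the load balancer reports an IP address rather than a hostname.

```shell
# Hypothetical helper: look up the Service's external IP and probe it over HTTP.
check_lb() {
  svc="${1:-aks-helloworld-one}"
  # Read the first ingress IP that the cloud load balancer assigned to the Service.
  ip=$(kubectl get service "$svc" -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  if [ -z "$ip" ]; then
    echo "no external IP assigned to $svc yet"
    return 1
  fi
  # -s: no progress output, -o /dev/null: discard the body,
  # -m 10: give up after 10 seconds, -w: print only the HTTP status code.
  curl -s -o /dev/null -m 10 -w "%{http_code}\n" "http://$ip/"
}
```

Running check_lb aks-helloworld-one should print an HTTP status code when the path through the load balancer works. In the failure described here, curl would be expected to time out instead.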

The approach

The hard way to troubleshoot infrastructure as code is the configuration comparison approach: revert to a baseline configuration and see if the expected function works. Then, from the baseline, change one setting at a time and see where it starts to break. This approach is very time consuming, and an AKS cluster, as a relatively large resource with numerous attributes, takes the effort to an extreme.

The baseline configuration I started with is:

az aks create -g AutomationTest -n orthCluster --generate-ssh-keys --node-count 1 --tags Owner=MyOwner

With this baseline, I simply used kubectl to apply the YAML file above, and I could tell that the port was working. With a good starting point, I began to apply one change at a time and repeat the test. I ran into a snag when using the following configuration:

az aks create -g AutomationTest -n orthCluster --generate-ssh-keys --node-count 1 --tags Owner=MyOwner --enable-private-cluster --network-plugin azure --network-policy calico

With the cluster created from the command above, the key variable introduced is --enable-private-cluster. This puts the cluster on a private network. I can no longer connect to the cluster via a public endpoint, and thus had to figure out some tricks to run kubectl commands. I had to play with the Command Run feature of the AKS cluster because I don’t have a bastion host when creating the cluster with the AZ CLI. The Command Run feature would not allow me to use any file from the bastion host, so I had to create my test objects, the Deployment and the Service, entirely with imperative commands. The equivalent commands I worked out are:

kubectl create deployment aks-helloworld-one --image=neilpeterson/aks-helloworld:v1 --replicas=1 --port=80
kubectl expose deploy aks-helloworld-one --port 80 --target-port 80 --type='LoadBalancer'

Then I realized a limitation of the Command Run feature: it only supports basic command switches and doesn’t like switches such as --replicas. So I used the following commands:

az aks command invoke -g AutomationTest -n orthCluster -c "kubectl get no"
az aks command invoke -g AutomationTest -n orthCluster -c "kubectl create deployment aks-helloworld-one --image=neilpeterson/aks-helloworld:v1"
az aks command invoke -g AutomationTest -n orthCluster -c "kubectl get deploy"
az aks command invoke -g AutomationTest -n orthCluster -c "kubectl expose deploy aks-helloworld-one --port 80 --target-port 80 --type=LoadBalancer"
az aks command invoke -g AutomationTest -n orthCluster -c "kubectl get svc"

This trick allowed me to continue testing and eliminate Azure CNI and the Calico network policy as the cause. Testing after each cluster creation is painful because cluster creation can take 10 minutes. I had to temporarily minimize the size of the cluster to speed up provisioning.
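Because every test round repeats the same az aks command invoke prefix, a thin wrapper can cut down on typing. The following is a sketch only: the function name akc is hypothetical, and the resource group and cluster names are the ones used in the commands above.

```shell
# Names taken from the commands above; adjust for your environment.
RG="AutomationTest"
CLUSTER="orthCluster"

# Hypothetical wrapper: run one kubectl command inside the private cluster
# through the Command Run feature.
akc() {
  az aks command invoke -g "$RG" -n "$CLUSTER" -c "kubectl $*"
}
```

With this in place, akc get no and akc get svc stand in for the longer invocations. Arguments are joined into a single command string, so anything that needs nested quoting is better passed to az aks command invoke directly.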

I finally got to the point where I could reproduce the issue using the Terraform template. I realized that when I set the vnet_subnet_id attribute in the default_node_pool block of azurerm_kubernetes_cluster, the problem came back. That was the smoking gun: the node subnet was the issue.

The Network Security Group on Node Subnet

The node subnet has an associated network security group. I discovered that once I added an allow rule for port 80 to the security group, the curl test worked. I also noticed that a security group rule change takes about 60 seconds to take effect, and the load balancer also takes about 60 seconds to warm up.
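For reference, an allow rule like the one described above could be added with the AZ CLI roughly as follows. This is a sketch under assumptions: the function name allow_http, the rule name AllowHttpInbound, the priority 200, and the Internet source tag are all illustrative choices, not the values from my environment.

```shell
# Hypothetical helper: add an inbound allow rule for TCP port 80 to an NSG.
# $1 = resource group of the NSG, $2 = name of the NSG on the node subnet.
allow_http() {
  rg="$1"
  nsg="$2"
  # Rule name, priority, and source prefix below are placeholder choices.
  az network nsg rule create \
    --resource-group "$rg" \
    --nsg-name "$nsg" \
    --name AllowHttpInbound \
    --priority 200 \
    --direction Inbound \
    --access Allow \
    --protocol Tcp \
    --source-address-prefixes Internet \
    --destination-port-ranges 80
}
```

Given the delays noted above, it is worth waiting a minute or so after the rule is created before repeating the curl test.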

This confused me because only the load balancer listens on port 80, not any of the nodes. Most likely, when a public load balancer is used, the load balancer is placed on the node subnet. According to this note:

Inbound, external traffic flows from the load balancer to the virtual network for your AKS cluster. The virtual network has a Network Security Group (NSG) which allows all inbound traffic from the load balancer. This NSG uses a service tag of type LoadBalancer to allow traffic from the load balancer.

The external traffic can travel to the VNet, but it was blocked at the NSG of the node subnet.

DNS

When an AKS cluster integrates with an external node network, it is likely to create weird issues that are hard to troubleshoot. Another example is DNS. If the VNet has a predefined DNS server (which is common for enterprises with a hybrid network), cluster creation will fail with a timeout and misleading error messages (for example, this comment). This is because the DNS name of the newly created cluster is not resolvable within the VNet. We need to understand where the cluster publishes the A record, what the DNS zone options are, and whether to consider the BYO DNS zone option.
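One way to check the resolution problem from a host inside the VNet is to ask Azure for the cluster's private API server name and try to resolve it. The sketch below assumes the cluster resource already exists and that az aks show reports a privateFqdn field for it. The function name check_api_dns is hypothetical.

```shell
# Hypothetical helper: check whether the private API server name resolves
# from the host this is run on.
# $1 = resource group, $2 = cluster name.
check_api_dns() {
  rg="$1"
  cluster="$2"
  # Assumption: privateFqdn holds the API server name of a private cluster.
  fqdn=$(az aks show -g "$rg" -n "$cluster" --query privateFqdn -o tsv)
  if [ -z "$fqdn" ]; then
    echo "no private FQDN reported for $cluster"
    return 1
  fi
  # Resolve the name using the DNS servers configured on this host.
  nslookup "$fqdn"
}
```

If the lookup fails on a host that uses the VNet's custom DNS server, that points at the resolution gap described above rather than at the cluster itself.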

Lessons Learned

We always need to have a dummy service ready to test what we need. We can use an nginx dummy service like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 2
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
      - name: my-nginx
        image: nginx
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  type: LoadBalancer
  ports:
  - port: 80
    protocol: TCP
  selector:
    run: my-nginx

As discussed above, it’s also important to have a bastion host that can access the control plane when the AKS cluster is private. Azure touts Cloud Shell (and its ability to run in a specified VNet), but it’s pretty useless for troubleshooting. Cloud Shell sessions run inside a Kubernetes cluster and lack common network troubleshooting tools such as nc. Azure has a managed Bastion service, but it requires a subnet with the exact name AzureBastionSubnet.
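Creating the dedicated subnet for the managed Bastion service could look like the sketch below. The function name make_bastion_subnet is hypothetical and the address range is passed in as a placeholder. As far as I know, the service expects a /26 or larger range, but check the current requirement before relying on that.

```shell
# Hypothetical helper: create the subnet that the managed Bastion service needs.
# $1 = resource group, $2 = VNet name, $3 = address prefix (for example 10.0.250.0/26).
make_bastion_subnet() {
  rg="$1"
  vnet="$2"
  prefix="$3"
  # The subnet name must be exactly AzureBastionSubnet.
  az network vnet subnet create \
    --resource-group "$rg" \
    --vnet-name "$vnet" \
    --name AzureBastionSubnet \
    --address-prefixes "$prefix"
}
```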