Monday, 17 May 2021

Calico Node, more like Calico No

I spent a happy few hours over the weekend trying to work out why my Kubernetes 1.21 cluster wasn't behaving as expected.

I was seeing a bunch o' weirdness whereby certain pods weren't able to access certain services, which manifested specifically when I was trying (and failing) to create a DataVolume using the KubeVirt Containerised Data Importer (CDI) capability.

I was seeing exceptions such as: -

Error from server (InternalError): error when creating "create_volume.yaml": Internal error occurred: failed calling webhook "datavolume-mutate.cdi.kubevirt.io": Post "https://cdi-api.cdi.svc:443/datavolume-mutate?timeout=30s": dial tcp 10.102.58.243:443: i/o timeout

from: -

kubectl apply -f create_volume.yaml

After much digging and DNS debugging, including using BusyBox to resolve various K8s services: -

kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup cdi-api.cdi

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      cdi-api.cdi
Address 1: 10.102.58.243 cdi-api.cdi.svc.cluster.local
pod "busybox" deleted

kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nslookup kubernetes.default

Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
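Name resolution was clearly fine, so I also tried a raw TCP probe against the CDI service IP from the earlier error, just to confirm that it was connectivity, not DNS, that was broken. This is a sketch - it assumes the BusyBox build in that image includes nc with the -z (probe only) and -w (timeout) options: -

kubectl run -it --rm --restart=Never busybox --image=gcr.io/google-containers/busybox -- nc -z -v -w 5 10.102.58.243 443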

With DNS ruled out, something inspired me to look at Calico Node, which provides the networking layer overlaying my cluster: -

kubectl get pods -A|grep calico

kube-system   calico-kube-controllers-bf965bfd8-hg82b          1/1     Running   0          58m
kube-system   calico-node-8zkvt                                0/1     Running   0          7m24s
kube-system   calico-node-srmj6                                0/1     Running   0          7m47s

Noticing that both calico-node pods were showing 0/1 rather than 1/1, meaning that the calico-node container was failing its readiness probe on both nodes, I dug further: -

kubectl describe pod `kubectl get pods -A|grep calico-node|awk '{print $2}'` --namespace kube-system

which, in part, showed: -

  Warning  Unhealthy  9m44s  kubelet            Readiness probe failed: calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp 127.0.0.1:9099: connect: connection refused
  Warning  Unhealthy  9m42s  kubelet            Readiness probe failed: calico/node is not ready: BIRD is not ready: Error querying BIRD: unable to connect to BIRDv4 socket: dial unix /var/run/calico/bird.ctl: connect: connection refused
  Warning  Unhealthy  9m32s  kubelet            Readiness probe failed: 2021-05-17 11:33:08.203 [INFO][197] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m22s  kubelet  Readiness probe failed: 2021-05-17 11:33:18.195 [INFO][231] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m12s  kubelet  Readiness probe failed: 2021-05-17 11:33:28.278 [INFO][268] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  9m2s  kubelet  Readiness probe failed: 2021-05-17 11:33:38.334 [INFO][301] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m52s  kubelet  Readiness probe failed: 2021-05-17 11:33:48.182 [INFO][319] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m42s  kubelet  Readiness probe failed: 2021-05-17 11:33:58.266 [INFO][356] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  8m32s  kubelet  Readiness probe failed: 2021-05-17 11:34:08.185 [INFO][377] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
  Warning  Unhealthy  4m42s (x23 over 8m22s)  kubelet  (combined from similar events): Readiness probe failed: 2021-05-17 11:37:58.234 [INFO][1014] confd/health.go 180: Number of node(s) with BGP peering established = 0
calico/node is not ready: BIRD is not ready: BGP not established with 10.51.16.137
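BGP peering is node-to-node, so the two calico-node pods couldn't establish a session with each other over port 179. A handy check at this point is to see which address each calico-node auto-detected at startup; assuming the standard k8s-app=calico-node label and container name from the Calico manifest, the startup logs show it: -

kubectl logs -n kube-system -l k8s-app=calico-node -c calico-node --tail=-1 | grep -i "autodetected"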

Knowing that my firewall configuration - iptables - was clean n' green, in that I'd opened up the Border Gateway Protocol (BGP) port 179 on both the Control Plane and Compute nodes: -

iptables -A INPUT -p tcp -m tcp --dport 179 -j ACCEPT
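As a belt n' braces check, iptables -C ( check ) with the same arguments confirms the rule really is present on each node - it exits zero if the rule exists: -

iptables -C INPUT -p tcp -m tcp --dport 179 -j ACCEPT && echo "BGP port 179 is open"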

I looked back through my notes, and remembered the issue with IP_AUTODETECTION_METHOD and the Calico Node daemonset.
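When IP_AUTODETECTION_METHOD isn't set, Calico falls back to its first-found method, which grabs the first valid address on the first valid interface - not ideal on a multi-homed VM. A quick look at the adapters on each node (interface names as per my setup) shows why that can go wrong: -

ip -4 addr show eth0
ip -4 addr show eth1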

I checked the daemonset: -

kubectl get daemonset -A

NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         0       2            0           kubernetes.io/os=linux   64m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   112m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   30m

and noticed that the calico-node daemonset was, like the pods, showing as unready ( 0 rather than 2 in the READY column ).

I inspected the offending daemonset: -

kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'

{
  "name": "IP",
  "value": "autodetect"
}

noting that the IP_AUTODETECTION_METHOD environment variable wasn't specifically set.

Given that the VMs that host my K8s nodes have TWO network adapters, eth0 and eth1, and that I want Calico Node to use eth0, which carries the private IP, I explicitly set the detection method: -

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD=interface=eth0

daemonset.apps/calico-node env updated
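Setting the environment variable makes the DaemonSet roll out replacement calico-node pods, which can be watched with: -

kubectl rollout status daemonset/calico-node -n kube-system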

I then validated the change: -

kubectl get daemonset/calico-node -n kube-system --output json | jq '.spec.template.spec.containers[].env[] | select(.name | startswith("IP"))'

{
  "name": "IP",
  "value": "autodetect"
}
{
  "name": "IP_AUTODETECTION_METHOD",
  "value": "interface=eth0"
}
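As an aside, interface= takes a regular expression, so if the NIC names differ from node to node something like this should also do the trick ( a sketch - the interface names here are purely illustrative ): -

kubectl set env daemonset/calico-node -n kube-system IP_AUTODETECTION_METHOD='interface=eth0|ens192'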

More importantly, the Calico Node pods are happy: -

kubectl get pods -A

NAMESPACE     NAME                                             READY   STATUS    RESTARTS   AGE
cdi           cdi-apiserver-6b87945b8d-dww25                   1/1     Running   0          120m
cdi           cdi-deployment-86c6d76d98-7cxlv                  1/1     Running   0          120m
cdi           cdi-operator-5757c84894-xhw6r                    1/1     Running   0          120m
cdi           cdi-uploadproxy-79dd97b4d5-lvd72                 1/1     Running   0          120m
kube-system   calico-kube-controllers-bf965bfd8-hg82b          1/1     Running   0          154m
kube-system   calico-node-8llqm                                1/1     Running   0          75m
kube-system   calico-node-j9rdb                                1/1     Running   0          75m
kube-system   coredns-558bd4d5db-fml9w                         1/1     Running   0          3h21m
kube-system   coredns-558bd4d5db-gg8dm                         1/1     Running   0          3h21m
kube-system   etcd-grouched1.fyre.ibm.com                      1/1     Running   0          3h22m
kube-system   kube-apiserver-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kube-system   kube-controller-manager-grouched1.fyre.ibm.com   1/1     Running   1          3h22m
kube-system   kube-proxy-47txj                                 1/1     Running   0          3h19m
kube-system   kube-proxy-hg7f8                                 1/1     Running   0          3h21m
kube-system   kube-scheduler-grouched1.fyre.ibm.com            1/1     Running   0          3h22m
kubevirt      virt-api-58999dff54-c8mch                        1/1     Running   0          120m
kubevirt      virt-api-58999dff54-gs8pm                        1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-l2rp7                 1/1     Running   0          120m
kubevirt      virt-controller-5c68c56896-phrt9                 1/1     Running   0          120m
kubevirt      virt-handler-85dhc                               1/1     Running   0          120m
kubevirt      virt-operator-78f65c88d4-ldtgj                   1/1     Running   0          123m
kubevirt      virt-operator-78f65c88d4-tmxhs                   1/1     Running   0          123m

as is the daemonset: -

kubectl get daemonset -A

NAMESPACE     NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   calico-node    2         2         2       2            2           kubernetes.io/os=linux   154m
kube-system   kube-proxy     2         2         2       2            2           kubernetes.io/os=linux   3h23m
kubevirt      virt-handler   1         1         1       1            1           kubernetes.io/os=linux   120m

and I can now create my DataVolume: -

kubectl apply -f create_volume.yaml

datavolume.cdi.kubevirt.io/registry-image-datavolume created
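The import itself takes a little while; the DataVolume's phase can be watched as it progresses: -

kubectl get datavolume registry-image-datavolume -w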

2 comments:

Unknown said...

Hello Dave

Thank you very much for this blog post, I had been stuck on this problem for several weeks now, I was unable to span my k8s cluster across multiple networks.

Thanks again

Dave Hay said...

Awesome news, thanks for letting us know
