Sunday, 4 July 2021

Fun n' games breaking, and then fixing, containerd

Whilst hacking around with my Kubernetes 1.21 / containerd 1.4.4 / Kata 2.0 environment today, I managed to rather impressively break the containerd runtime.

In even better news, I managed to fix it again, albeit with a lot of help from my friends - well, to be specific, from GitHub ....

I think I managed to break things by replacing a pair of binaries underlying containerd and Kata : -

containerd-shim-kata-v2

kata-runtime

whilst I had a pod running using the Kata runtime ...
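With hindsight, the safer approach would have been to drain the node before swapping binaries out from underneath a running pod. A rough sketch, using the node name that appears in the logs further down: -

kubectl drain sideling2.fyre.ibm.com --ignore-daemonsets

# ... replace containerd-shim-kata-v2 and kata-runtime here ...

kubectl uncordon sideling2.fyre.ibm.com

Needless to say, I didn't do that ...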

By the time I realised I'd broken things, it was kinda too late ....

The containerd runtime refused to ... well, run, and things weren't looking too good for me - I was even considering cluster rebuild time .....

I'd tried: -

systemctl stop containerd.service

systemctl start containerd.service

systemctl stop kubelet.service

systemctl start kubelet.service

and even a reboot, but no dice....
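For what it's worth, the ordering that usually matters is kubelet last, given that it depends upon the runtime's socket, so a saner sequence would look something like: -

systemctl stop kubelet.service
systemctl restart containerd.service
systemctl start kubelet.service

but, with containerd itself refusing to start, no amount of restarting in any order was ever going to help.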

Whilst debugging, I did check the containerd service: -

systemctl status containerd.service --no-pager --full

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: activating (start-pre) since Sun 2021-07-04 01:04:39 PDT; 1ms ago
       Docs: https://containerd.io
Cntrl PID: 6922 ((modprobe))
      Tasks: 0
     Memory: 0B
     CGroup: /system.slice/containerd.service
             └─6922 (modprobe)
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Stopped containerd container runtime.
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Starting containerd container runtime...

and, a couple of seconds later: -

systemctl status containerd.service --no-pager --full

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:04:39 PDT; 2s ago
       Docs: https://containerd.io
    Process: 6922 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 6929 ExecStart=/usr/bin/containerd (code=exited, status=1/FAILURE)
   Main PID: 6929 (code=exited, status=1/FAILURE)

as well as the kubelet service: -

systemctl status kubelet.service --no-pager --full

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Sun 2021-07-04 01:18:53 PDT; 4s ago
       Docs: https://kubernetes.io/docs/home/
   Main PID: 11243 (kubelet)
      Tasks: 7 (limit: 4616)
     Memory: 15.0M
     CGroup: /system.slice/kubelet.service
             └─11243 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --pod-infra-container-image=k8s.gcr.io/pause:3.4.1

Jul 04 01:18:53 sideling2.fyre.ibm.com systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.014459   11243 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044083   11243 server.go:440] "Kubelet version" kubeletVersion="v1.21.0"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044870   11243 server.go:851] "Client rotation is on, will bootstrap in background"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.062425   11243 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.064001   11243 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt

and then, a few seconds later: -

systemctl status kubelet.service --no-pager --full

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:18:59 PDT; 4s ago
       Docs: https://kubernetes.io/docs/home/
    Process: 11243 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
   Main PID: 11243 (code=exited, status=1/FAILURE)

Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Failed with result 'exit-code'.

but didn't draw too many conclusions - the kubelet falling over was almost certainly just a knock-on effect of the container runtime being unavailable ...
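With hindsight, the journal for the containerd unit would almost certainly have shown the same fatal error that I eventually saw when running the daemon interactively; something like this would've surfaced it: -

journalctl -u containerd.service --no-pager --since "10 minutes ago"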

However, I got more insight when I tried / failed to run containerd interactively: -

/usr/bin/containerd

which, in part, failed with: -

FATA[2021-07-04T01:04:22.466954937-07:00] Failed to run CRI service                     error="failed to recover state: failed to reserve sandbox name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\": name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\" is reserved for \"3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2\""

Now, given that the pod that had been running when I decided to replace two of the in-use binaries was called nginx-kata, I suspected that the pod's sandbox was still registered, despite containerd not working, effectively reserving the sandbox name - as per the error above.
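The pod itself was, presumably, still visible from the control plane, even with this node's runtime out of action; from a machine with a working kubectl, something like this should show it: -

kubectl get pods --all-namespaces -o wide | grep nginx-kata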

Thankfully, this GH issue in the containerd project came to my rescue : -

'failed to reserve sandbox name' error after hard reboot #1014

Following the advice therein, I tried / failed to see what was going on using ctr : -

ctr --namespace k8s.io containers list

ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"

I even tried the Hail Mary approach of nuking the errant container: -

ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

but to no avail: -

ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
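which made sense; nothing was listening on the containerd socket. A couple of quick checks confirm as much - the socket file can happily exist on disk even when the daemon is down: -

ls -l /run/containerd/containerd.sock

ss -xl | grep containerd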

I even got very brutal: -

find /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

/var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

rm -Rf /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

but again to no avail - presumably because the container record lives in containerd's metadata database (meta.db under /var/lib/containerd/io.containerd.metadata.v1.bolt) rather than just in the task directory, so deleting the directory alone doesn't release the reserved name.

Reading and re-reading the aforementioned GH issue, I noted that one of the responders said, in part: -

you can start containerd with disable_plugins = [ cri ] in the
/etc/containerd/config.toml

to which someone else responded: -

PS: the config option for the workaround is disabled_plugins, not disable_plugins

I checked out the file: -

vi /etc/containerd/config.toml

and noted: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...


so I amended it: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["cri"]
required_plugins = []
oom_score = 0
...

and tried again to manually start containerd: -

/usr/bin/containerd

which borked with: -

containerd: invalid disabled plugin URI "cri" expect io.containerd.x.vx

I then remembered that, with version 2 of the containerd configuration, the plugin URI had changed from plain old cri to io.containerd.grpc.v1.cri, so I changed it again: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["io.containerd.grpc.v1.cri"]
required_plugins = []
oom_score = 0
...
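Before trying the daemon again, the change can be sanity-checked without actually starting the service, as containerd will render its effective configuration from the file; something like this should echo back the amended setting: -

containerd config dump | grep disabled_plugins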

This time, I was able to start containerd manually: -

/usr/bin/containerd

INFO[2021-07-04T09:31:50.507911132-07:00] starting containerd                           revision= version="1.4.4-0ubuntu1~20.04.2"
INFO[2021-07-04T09:31:50.556285374-07:00] loading plugin "io.containerd.content.v1.content"...  type=io.containerd.content.v1
INFO[2021-07-04T09:31:50.556487705-07:00] loading plugin "io.containerd.snapshotter.v1.aufs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.568920159-07:00] loading plugin "io.containerd.snapshotter.v1.btrfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570026092-07:00] skip loading plugin "io.containerd.snapshotter.v1.btrfs"...  error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570139702-07:00] loading plugin "io.containerd.snapshotter.v1.devmapper"...  type=io.containerd.snapshotter.v1
WARN[2021-07-04T09:31:50.570350493-07:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
INFO[2021-07-04T09:31:50.570398653-07:00] loading plugin "io.containerd.snapshotter.v1.native"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570478494-07:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571028515-07:00] loading plugin "io.containerd.snapshotter.v1.zfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571796367-07:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"...  error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571859377-07:00] loading plugin "io.containerd.metadata.v1.bolt"...  type=io.containerd.metadata.v1
WARN[2021-07-04T09:31:50.571936777-07:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2021-07-04T09:31:50.571969917-07:00] metadata content store policy set             policy=shared
INFO[2021-07-04T09:31:50.572636609-07:00] loading plugin "io.containerd.differ.v1.walking"...  type=io.containerd.differ.v1
INFO[2021-07-04T09:31:50.572707090-07:00] loading plugin "io.containerd.gc.v1.scheduler"...  type=io.containerd.gc.v1
INFO[2021-07-04T09:31:50.573169651-07:00] loading plugin "io.containerd.service.v1.introspection-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573381772-07:00] loading plugin "io.containerd.service.v1.containers-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573488782-07:00] loading plugin "io.containerd.service.v1.content-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573642362-07:00] loading plugin "io.containerd.service.v1.diff-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573717172-07:00] loading plugin "io.containerd.service.v1.images-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573778242-07:00] loading plugin "io.containerd.service.v1.leases-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574107503-07:00] loading plugin "io.containerd.service.v1.namespaces-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574211764-07:00] loading plugin "io.containerd.service.v1.snapshots-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574268314-07:00] loading plugin "io.containerd.runtime.v1.linux"...  type=io.containerd.runtime.v1
INFO[2021-07-04T09:31:50.574727045-07:00] loading plugin "io.containerd.runtime.v2.task"...  type=io.containerd.runtime.v2
DEBU[2021-07-04T09:31:50.575140177-07:00] loading tasks in namespace                    namespace=k8s.io
INFO[2021-07-04T09:31:50.585638375-07:00] loading plugin "io.containerd.monitor.v1.cgroups"...  type=io.containerd.monitor.v1
INFO[2021-07-04T09:31:50.586886318-07:00] loading plugin "io.containerd.service.v1.tasks-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.587187039-07:00] loading plugin "io.containerd.internal.v1.restart"...  type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.587704810-07:00] loading plugin "io.containerd.grpc.v1.containers"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588049332-07:00] loading plugin "io.containerd.grpc.v1.content"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588408763-07:00] loading plugin "io.containerd.grpc.v1.diff"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588663644-07:00] loading plugin "io.containerd.grpc.v1.events"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588885984-07:00] loading plugin "io.containerd.grpc.v1.healthcheck"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589144744-07:00] loading plugin "io.containerd.grpc.v1.images"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589440975-07:00] loading plugin "io.containerd.grpc.v1.leases"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589764946-07:00] loading plugin "io.containerd.grpc.v1.namespaces"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590026197-07:00] loading plugin "io.containerd.internal.v1.opt"...  type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.590392338-07:00] loading plugin "io.containerd.grpc.v1.snapshots"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590632959-07:00] loading plugin "io.containerd.grpc.v1.tasks"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590854780-07:00] loading plugin "io.containerd.grpc.v1.version"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591069010-07:00] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591965292-07:00] serving...                                    address=/run/containerd/containerd.sock.ttrpc
INFO[2021-07-04T09:31:50.592328083-07:00] serving...                                    address=/run/containerd/containerd.sock
DEBU[2021-07-04T09:31:50.592592984-07:00] sd notification                               error="<nil>" notified=false state="READY=1"
INFO[2021-07-04T09:31:50.592869195-07:00] containerd successfully booted in 0.088386s  

Having "proved" that this worked, I quit the interactive containerd process, and instead started the service: -

systemctl start containerd.service

which started OK.

I was then able to use ctr : -

ctr --namespace k8s.io containers list

and: -

ctr --namespace k8s.io containers list|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

before using the nuke option: -

ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
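Had the errant container still had a task attached to it, I suspect the rm would have failed; in that case, something along these lines would likely have been needed first: -

ctr --namespace k8s.io tasks list

ctr --namespace k8s.io tasks kill 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

ctr --namespace k8s.io tasks delete 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2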

Having nuked the errant container, I backed out the changes to the configuration file: -

vi /etc/containerd/config.toml 

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...

and restarted containerd : -

systemctl restart containerd.service

and things were back to normal.
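As a final sanity check, something along these lines confirms that the node, and its pods, are behaving themselves again: -

kubectl get nodes -o wide

kubectl get pods --all-namespaces -o wide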

Can you say "Yay!" ? I bet you can ....
