Sunday, 4 July 2021

Fun n' games breaking, and then fixing, containerd

By virtue of the fact that I was hacking around with my Kubernetes 1.21 / containerd 1.44 / Kata 2.0 environment today, I managed to rather impressively break the containerd runtime.

In even better news, I managed to fix it again, albeit with a lot of help from my friends - well, to be specific, from GitHub ....

I think I managed to break things by replacing a pair of binaries underlying containerd and Kata : -

containerd-shim-kata-v2

kata-runtime

whilst I had a pod running using the Kata runtime ...

By the time I realised I'd broken things, it was kinda too late ....

The containerd runtime refused to ... well, run, and things weren't looking too good for me - I was even considering cluster rebuild time .....

I'd tried: -

systemctl stop containerd.service

systemctl start containerd.service

systemctl stop kubelet.service

systemctl start kubelet.service

and even a reboot, but no dice....

Whilst debugging, I did check the containerd service: -

systemctl status containerd.service --no-pager --full

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: activating (start-pre) since Sun 2021-07-04 01:04:39 PDT; 1ms ago
       Docs: https://containerd.io
Cntrl PID: 6922 ((modprobe))
      Tasks: 0
     Memory: 0B
     CGroup: /system.slice/containerd.service
             └─6922 (modprobe)
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Stopped containerd container runtime.
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Starting containerd container runtime...

systemctl status containerd.service --no-pager --full

● containerd.service - containerd container runtime
     Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:04:39 PDT; 2s ago
       Docs: https://containerd.io
    Process: 6922 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 6929 ExecStart=/usr/bin/containerd (code=exited, status=1/FAILURE)
   Main PID: 6929 (code=exited, status=1/FAILURE)

as well as the kubelet service: -

systemctl status kubelet.service --no-pager --full

   ● kubelet.service - kubelet: The Kubernetes Node Agent
        Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
       Drop-In: /etc/systemd/system/kubelet.service.d
                └─10-kubeadm.conf
        Active: active (running) since Sun 2021-07-04 01:18:53 PDT; 4s ago
          Docs: https://kubernetes.io/docs/home/
      Main PID: 11243 (kubelet)
         Tasks: 7 (limit: 4616)
        Memory: 15.0M
        CGroup: /system.slice/kubelet.service
                └─11243 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --pod-infra-container-image=k8s.gcr.io/pause:3.4.1

   Jul 04 01:18:53 sideling2.fyre.ibm.com systemd[1]: Started kubelet: The Kubernetes Node Agent.
   Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.014459   11243 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
   Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044083   11243 server.go:440] "Kubelet version" kubeletVersion="v1.21.0"
   Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044870   11243 server.go:851] "Client rotation is on, will bootstrap in background"
   Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.062425   11243 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
   Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.064001   11243 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
   root@sideling2:~# systemctl status kubelet.service --no-pager --full
   ● kubelet.service - kubelet: The Kubernetes Node Agent
        Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
       Drop-In: /etc/systemd/system/kubelet.service.d
                └─10-kubeadm.conf
        Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:18:59 PDT; 4s ago
          Docs: https://kubernetes.io/docs/home/
       Process: 11243 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
      Main PID: 11243 (code=exited, status=1/FAILURE)

   Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
   Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Failed with result 'exit-code'.

but didn't draw too many conclusions ...

However, I got more insight when I tried / failed to run containerd interactively: -

/usr/bin/containerd

which, in part, failed with: -

FATA[2021-07-04T01:04:22.466954937-07:00] Failed to run CRI service                     error="failed to recover state: failed to reserve sandbox name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\": name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\" is reserved for \"3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2\""

Now, given that the pod that had been running when I disabled to replace two of the in-use binaries was called nginx-kata I suspect that the problem might be that the pod was still running, despite containerd not working, and effectively locking the pod sandbox - as per the above.

Thankfully, this GH issue in the containerd project came to my rescue : -

'failed to reserve sandbox name' error after hard reboot #1014

Following the advice therein, I tried / failed to see what was going on using ctr : -

ctr --namespace k8s.io containers list

ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"

I even tried the Hail Mary approach of nuking the errant container: -

ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

but to no avail: -

ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"

I even got very brutal: -

find /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

/var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

rm -Rf /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

but again to no avail.

Reading and re-reading the aforementioned GH issues, I noted that one of the responders said, in part: -

you can start containerd with disable_plugins = [ cri ] in the
/etc/containerd/config.toml

to which someone else responded: -

PS: the config option for the workaround is disabled_plugins, not disable_plugins

I checked out the file: -

vi /etc/containerd/config.toml

and noted: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...


so I amended it: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["cri"]
required_plugins = []
oom_score = 0
...

and tried again to manually start containerd: -

/usr/bin/containerd

which borked with: -

containerd: invalid disabled plugin URI "cri" expect io.containerd.x.vx

Remembering that the URI had changed from plain old cri to io.containerd.grpc.v1.cri so I changed it again: -

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["io.containerd.grpc.v1.cri"]
required_plugins = []
oom_score = 0
...

and was able to start containerd manually: -

/usr/bin/containerd

INFO[2021-07-04T09:31:50.507911132-07:00] starting containerd                           revision= version="1.4.4-0ubuntu1~20.04.2"
INFO[2021-07-04T09:31:50.556285374-07:00] loading plugin "io.containerd.content.v1.content"...  type=io.containerd.content.v1
INFO[2021-07-04T09:31:50.556487705-07:00] loading plugin "io.containerd.snapshotter.v1.aufs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.568920159-07:00] loading plugin "io.containerd.snapshotter.v1.btrfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570026092-07:00] skip loading plugin "io.containerd.snapshotter.v1.btrfs"...  error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570139702-07:00] loading plugin "io.containerd.snapshotter.v1.devmapper"...  type=io.containerd.snapshotter.v1
WARN[2021-07-04T09:31:50.570350493-07:00] failed to load plugin io.containerd.snapshotter.v1.devmapper  error="devmapper not configured"
INFO[2021-07-04T09:31:50.570398653-07:00] loading plugin "io.containerd.snapshotter.v1.native"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570478494-07:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571028515-07:00] loading plugin "io.containerd.snapshotter.v1.zfs"...  type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571796367-07:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"...  error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571859377-07:00] loading plugin "io.containerd.metadata.v1.bolt"...  type=io.containerd.metadata.v1
WARN[2021-07-04T09:31:50.571936777-07:00] could not use snapshotter devmapper in metadata plugin  error="devmapper not configured"
INFO[2021-07-04T09:31:50.571969917-07:00] metadata content store policy set             policy=shared
INFO[2021-07-04T09:31:50.572636609-07:00] loading plugin "io.containerd.differ.v1.walking"...  type=io.containerd.differ.v1
INFO[2021-07-04T09:31:50.572707090-07:00] loading plugin "io.containerd.gc.v1.scheduler"...  type=io.containerd.gc.v1
INFO[2021-07-04T09:31:50.573169651-07:00] loading plugin "io.containerd.service.v1.introspection-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573381772-07:00] loading plugin "io.containerd.service.v1.containers-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573488782-07:00] loading plugin "io.containerd.service.v1.content-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573642362-07:00] loading plugin "io.containerd.service.v1.diff-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573717172-07:00] loading plugin "io.containerd.service.v1.images-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573778242-07:00] loading plugin "io.containerd.service.v1.leases-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574107503-07:00] loading plugin "io.containerd.service.v1.namespaces-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574211764-07:00] loading plugin "io.containerd.service.v1.snapshots-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574268314-07:00] loading plugin "io.containerd.runtime.v1.linux"...  type=io.containerd.runtime.v1
INFO[2021-07-04T09:31:50.574727045-07:00] loading plugin "io.containerd.runtime.v2.task"...  type=io.containerd.runtime.v2
DEBU[2021-07-04T09:31:50.575140177-07:00] loading tasks in namespace                    namespace=k8s.io
INFO[2021-07-04T09:31:50.585638375-07:00] loading plugin "io.containerd.monitor.v1.cgroups"...  type=io.containerd.monitor.v1
INFO[2021-07-04T09:31:50.586886318-07:00] loading plugin "io.containerd.service.v1.tasks-service"...  type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.587187039-07:00] loading plugin "io.containerd.internal.v1.restart"...  type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.587704810-07:00] loading plugin "io.containerd.grpc.v1.containers"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588049332-07:00] loading plugin "io.containerd.grpc.v1.content"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588408763-07:00] loading plugin "io.containerd.grpc.v1.diff"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588663644-07:00] loading plugin "io.containerd.grpc.v1.events"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588885984-07:00] loading plugin "io.containerd.grpc.v1.healthcheck"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589144744-07:00] loading plugin "io.containerd.grpc.v1.images"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589440975-07:00] loading plugin "io.containerd.grpc.v1.leases"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589764946-07:00] loading plugin "io.containerd.grpc.v1.namespaces"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590026197-07:00] loading plugin "io.containerd.internal.v1.opt"...  type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.590392338-07:00] loading plugin "io.containerd.grpc.v1.snapshots"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590632959-07:00] loading plugin "io.containerd.grpc.v1.tasks"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590854780-07:00] loading plugin "io.containerd.grpc.v1.version"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591069010-07:00] loading plugin "io.containerd.grpc.v1.introspection"...  type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591965292-07:00] serving...                                    address=/run/containerd/containerd.sock.ttrpc
INFO[2021-07-04T09:31:50.592328083-07:00] serving...                                    address=/run/containerd/containerd.sock
DEBU[2021-07-04T09:31:50.592592984-07:00] sd notification                               error="<nil>" notified=false state="READY=1"
INFO[2021-07-04T09:31:50.592869195-07:00] containerd successfully booted in 0.088386s  

Having "proved" that this worked, I quit the interactive containerd process, and instead started the service: -

systemctl start containerd.service

which started OK.

I was then able to use ctr : -

ctr --namespace k8s.io containers list

and: -

ctr --namespace k8s.io containers list|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

before using the nuke option: -

ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2

Having nuked the errant container, I backed out the changes to the configuration file: -

vi /etc/containerd/config.toml 

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...

and restarted containerd : -

systemctl restart containerd.service

and things were back to normal.

Can you say "Yay!" ? I bet you can ....

No comments:

Today I Learned - how to grep for two strings

So I do a lot of work with IBM Container Registry (ICR) via the IBM Cloud command-line tool. Having logged into my IBM Cloud account: - ic l...