Whilst hacking around with my Kubernetes 1.21 / containerd 1.4.4 / Kata 2.0 environment today, I managed to rather impressively break the containerd runtime.
In even better news, I managed to fix it again, albeit with a lot of help from my friends - well, to be specific, from GitHub ....
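For the record, the versions quoted above can be checked with something along these lines (assuming the binaries are on the PATH): -

# Confirm the component versions in play
kubectl version --short
containerd --version
kata-runtime --version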
I think I managed to break things by replacing a pair of binaries underlying containerd and Kata : -
containerd-shim-kata-v2
kata-runtime
whilst I had a pod running using the Kata runtime ...
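For context, the pod in question was a plain nginx pod pinned to the Kata runtime via a RuntimeClass - something along these lines (a minimal sketch, assuming a RuntimeClass named kata already exists; my actual manifest may well have differed): -

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nginx-kata
spec:
  # Assumption: a RuntimeClass named "kata" pointing at the Kata shim
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx
EOF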
By the time I realised I'd broken things, it was kinda too late ....
The containerd runtime refused to ... well, run, and things weren't looking too good for me - I was even considering cluster rebuild time .....
I'd tried: -
systemctl stop containerd.service
systemctl start containerd.service
systemctl stop kubelet.service
systemctl start kubelet.service
and even a reboot, but no dice....
Whilst debugging, I did check the containerd service: -
systemctl status containerd.service --no-pager --full
● containerd.service - containerd container runtime
Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
Active: activating (start-pre) since Sun 2021-07-04 01:04:39 PDT; 1ms ago
Docs: https://containerd.io
Cntrl PID: 6922 ((modprobe))
Tasks: 0
Memory: 0B
CGroup: /system.slice/containerd.service
└─6922 (modprobe)
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Stopped containerd container runtime.
Jul 04 01:04:39 sideling2.fyre.ibm.com systemd[1]: Starting containerd container runtime...
A couple of seconds later, the same command showed that the start had failed, with the service dropping into auto-restart: -

systemctl status containerd.service --no-pager --full
● containerd.service - containerd container runtime
Loaded: loaded (/lib/systemd/system/containerd.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:04:39 PDT; 2s ago
Docs: https://containerd.io
Process: 6922 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Process: 6929 ExecStart=/usr/bin/containerd (code=exited, status=1/FAILURE)
Main PID: 6929 (code=exited, status=1/FAILURE)
as well as the kubelet service: -
systemctl status kubelet.service --no-pager --full
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Sun 2021-07-04 01:18:53 PDT; 4s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 11243 (kubelet)
Tasks: 7 (limit: 4616)
Memory: 15.0M
CGroup: /system.slice/kubelet.service
└─11243 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --pod-infra-container-image=k8s.gcr.io/pause:3.4.1
Jul 04 01:18:53 sideling2.fyre.ibm.com systemd[1]: Started kubelet: The Kubernetes Node Agent.
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.014459 11243 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044083 11243 server.go:440] "Kubelet version" kubeletVersion="v1.21.0"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.044870 11243 server.go:851] "Client rotation is on, will bootstrap in background"
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.062425 11243 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Jul 04 01:18:54 sideling2.fyre.ibm.com kubelet[11243]: I0704 01:18:54.064001 11243 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
A few seconds later, the same command told a rather different story, with the kubelet also failing and auto-restarting: -

systemctl status kubelet.service --no-pager --full
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: activating (auto-restart) (Result: exit-code) since Sun 2021-07-04 01:18:59 PDT; 4s ago
Docs: https://kubernetes.io/docs/home/
Process: 11243 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=1/FAILURE)
Main PID: 11243 (code=exited, status=1/FAILURE)
Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Jul 04 01:18:59 sideling2.fyre.ibm.com systemd[1]: kubelet.service: Failed with result 'exit-code'.
but didn't draw too many conclusions ...
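With hindsight, digging into the journal for each unit would probably have surfaced the underlying error rather sooner - something like: -

# Pull the recent logs for both units
journalctl --unit containerd.service --no-pager --since "15 minutes ago"
journalctl --unit kubelet.service --no-pager --since "15 minutes ago"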
However, I got more insight when I tried / failed to run containerd interactively: -
/usr/bin/containerd
which, in part, failed with: -
FATA[2021-07-04T01:04:22.466954937-07:00] Failed to run CRI service error="failed to recover state: failed to reserve sandbox name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\": name \"nginx-kata_default_ed1e725f-8bfc-4c8f-89e0-3477c6a4c451_0\" is reserved for \"3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2\""
Now, given that the pod that had been running when I decided to replace two of the in-use binaries was called nginx-kata, I suspected that the pod was still running, despite containerd not working, and was effectively locking the pod sandbox - as per the above.
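One way to sanity-check that theory would have been to look for the Kata shim and its guest VM still running outside containerd's control - roughly: -

# Look for a leftover Kata shim and/or QEMU process for the old sandbox
ps -ef | grep -E 'containerd-shim-kata-v2|qemu' | grep -v grep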
Following the advice in the GitHub issues I'd dug up, I tried / failed to see what was going on using ctr : -
ctr --namespace k8s.io containers list
ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
I even tried the Hail Mary approach of nuking the errant container: -
ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
but to no avail: -
ctr: failed to dial "/run/containerd/containerd.sock": connection error: desc = "transport: error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused"
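Not a huge surprise, in retrospect - with the daemon down, nothing is listening on the socket, which is easy enough to confirm: -

# The socket file may still exist, but nothing should be listening on it
ls -l /run/containerd/containerd.sock
ss -xl | grep containerd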
I even got very brutal: -
find /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
/var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
rm -Rf /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
but again to no avail.
Reading and re-reading the aforementioned GH issues, I noted that one of the responders said, in part: -
you can start containerd with disable_plugins = [ cri ] in the
/etc/containerd/config.toml
to which someone else responded: -
PS: the config option for the workaround is disabled_plugins, not disable_plugins
I checked out the file: -
vi /etc/containerd/config.toml
and noted: -
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...
so I amended it: -
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["cri"]
required_plugins = []
oom_score = 0
...
and tried again to manually start containerd: -
/usr/bin/containerd
which borked with: -
containerd: invalid disabled plugin URI "cri" expect io.containerd.x.vx
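As an aside, the fully-qualified plugin IDs that containerd expects can be seen by dumping the default configuration, e.g.: -

# Show the default config, and pick out the CRI plugin's full ID
containerd config default | grep -i cri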
Remembering that the plugin URI had changed from plain old cri to io.containerd.grpc.v1.cri, I changed it again: -
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = ["io.containerd.grpc.v1.cri"]
required_plugins = []
oom_score = 0
...
and was able to start containerd manually: -
/usr/bin/containerd
INFO[2021-07-04T09:31:50.507911132-07:00] starting containerd revision= version="1.4.4-0ubuntu1~20.04.2"
INFO[2021-07-04T09:31:50.556285374-07:00] loading plugin "io.containerd.content.v1.content"... type=io.containerd.content.v1
INFO[2021-07-04T09:31:50.556487705-07:00] loading plugin "io.containerd.snapshotter.v1.aufs"... type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.568920159-07:00] loading plugin "io.containerd.snapshotter.v1.btrfs"... type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570026092-07:00] skip loading plugin "io.containerd.snapshotter.v1.btrfs"... error="path /var/lib/containerd/io.containerd.snapshotter.v1.btrfs (xfs) must be a btrfs filesystem to be used with the btrfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570139702-07:00] loading plugin "io.containerd.snapshotter.v1.devmapper"... type=io.containerd.snapshotter.v1
WARN[2021-07-04T09:31:50.570350493-07:00] failed to load plugin io.containerd.snapshotter.v1.devmapper error="devmapper not configured"
INFO[2021-07-04T09:31:50.570398653-07:00] loading plugin "io.containerd.snapshotter.v1.native"... type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.570478494-07:00] loading plugin "io.containerd.snapshotter.v1.overlayfs"... type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571028515-07:00] loading plugin "io.containerd.snapshotter.v1.zfs"... type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571796367-07:00] skip loading plugin "io.containerd.snapshotter.v1.zfs"... error="path /var/lib/containerd/io.containerd.snapshotter.v1.zfs must be a zfs filesystem to be used with the zfs snapshotter: skip plugin" type=io.containerd.snapshotter.v1
INFO[2021-07-04T09:31:50.571859377-07:00] loading plugin "io.containerd.metadata.v1.bolt"... type=io.containerd.metadata.v1
WARN[2021-07-04T09:31:50.571936777-07:00] could not use snapshotter devmapper in metadata plugin error="devmapper not configured"
INFO[2021-07-04T09:31:50.571969917-07:00] metadata content store policy set policy=shared
INFO[2021-07-04T09:31:50.572636609-07:00] loading plugin "io.containerd.differ.v1.walking"... type=io.containerd.differ.v1
INFO[2021-07-04T09:31:50.572707090-07:00] loading plugin "io.containerd.gc.v1.scheduler"... type=io.containerd.gc.v1
INFO[2021-07-04T09:31:50.573169651-07:00] loading plugin "io.containerd.service.v1.introspection-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573381772-07:00] loading plugin "io.containerd.service.v1.containers-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573488782-07:00] loading plugin "io.containerd.service.v1.content-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573642362-07:00] loading plugin "io.containerd.service.v1.diff-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573717172-07:00] loading plugin "io.containerd.service.v1.images-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.573778242-07:00] loading plugin "io.containerd.service.v1.leases-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574107503-07:00] loading plugin "io.containerd.service.v1.namespaces-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574211764-07:00] loading plugin "io.containerd.service.v1.snapshots-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.574268314-07:00] loading plugin "io.containerd.runtime.v1.linux"... type=io.containerd.runtime.v1
INFO[2021-07-04T09:31:50.574727045-07:00] loading plugin "io.containerd.runtime.v2.task"... type=io.containerd.runtime.v2
DEBU[2021-07-04T09:31:50.575140177-07:00] loading tasks in namespace namespace=k8s.io
INFO[2021-07-04T09:31:50.585638375-07:00] loading plugin "io.containerd.monitor.v1.cgroups"... type=io.containerd.monitor.v1
INFO[2021-07-04T09:31:50.586886318-07:00] loading plugin "io.containerd.service.v1.tasks-service"... type=io.containerd.service.v1
INFO[2021-07-04T09:31:50.587187039-07:00] loading plugin "io.containerd.internal.v1.restart"... type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.587704810-07:00] loading plugin "io.containerd.grpc.v1.containers"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588049332-07:00] loading plugin "io.containerd.grpc.v1.content"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588408763-07:00] loading plugin "io.containerd.grpc.v1.diff"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588663644-07:00] loading plugin "io.containerd.grpc.v1.events"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.588885984-07:00] loading plugin "io.containerd.grpc.v1.healthcheck"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589144744-07:00] loading plugin "io.containerd.grpc.v1.images"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589440975-07:00] loading plugin "io.containerd.grpc.v1.leases"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.589764946-07:00] loading plugin "io.containerd.grpc.v1.namespaces"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590026197-07:00] loading plugin "io.containerd.internal.v1.opt"... type=io.containerd.internal.v1
INFO[2021-07-04T09:31:50.590392338-07:00] loading plugin "io.containerd.grpc.v1.snapshots"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590632959-07:00] loading plugin "io.containerd.grpc.v1.tasks"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.590854780-07:00] loading plugin "io.containerd.grpc.v1.version"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591069010-07:00] loading plugin "io.containerd.grpc.v1.introspection"... type=io.containerd.grpc.v1
INFO[2021-07-04T09:31:50.591965292-07:00] serving... address=/run/containerd/containerd.sock.ttrpc
INFO[2021-07-04T09:31:50.592328083-07:00] serving... address=/run/containerd/containerd.sock
DEBU[2021-07-04T09:31:50.592592984-07:00] sd notification error="<nil>" notified=false state="READY=1"
INFO[2021-07-04T09:31:50.592869195-07:00] containerd successfully booted in 0.088386s
Having "proved" that this worked, I quit the interactive containerd process, and instead started the service: -
systemctl start containerd.service
which started OK.
I was then able to use ctr : -
ctr --namespace k8s.io containers list
and: -
ctr --namespace k8s.io containers list|grep 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
before using the nuke option: -
ctr --namespace k8s.io containers rm 3e08cb56351c79ec0e854f6fa0ac1e272c251a6dea29602a63a4cd994abf54a2
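For good measure, it's probably also worth checking for any leftover tasks in the same namespace - something like: -

# List any tasks still hanging around in the k8s.io namespace
ctr --namespace k8s.io tasks list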
Having nuked the errant container, I backed out the changes to the configuration file: -
vi /etc/containerd/config.toml
version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
plugin_dir = ""
disabled_plugins = []
required_plugins = []
oom_score = 0
...
and restarted containerd : -
systemctl restart containerd.service
and things were back to normal.
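For a final sanity check, something along these lines confirms that both services are happy and the node is back in business (the kubectl command assumes a working kubeconfig, e.g. on the control plane node): -

systemctl is-active containerd.service kubelet.service
kubectl get nodes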
Can you say "Yay!" ? I bet you can ....