Specifically, they were able to build but NOT push the images to Docker Hub.
The main symptom that they were seeing was: -
you are not authorized to perform this operation: server returned 401.
which equates to HTTP 401.
Now we went down all sorts of paths to resolve this, including trying different Docker Hub credentials and changing the parent image (as specified in the FROM instruction within the Dockerfile) to the most generic, and easily available, image: -
FROM alpine:3.9
but to no avail.
Long story very, very short: it transpired that the DNS configuration was ... misconfigured.
But in a very, very subtle way ...
For this particular environment, the /etc/resolv.conf file was being used for DNS resolution, rather than, say, dnsmasq or systemd-resolved.
Which is absolutely fine.
This is what they had: -
nameserver 8.8.8.8
nameserver 4.4.4.4
which, at first glance, looks absolutely fine ....
Given that the issues were with docker push, and that the error was ALWAYS HTTP 401 Unauthorized, we came to the conclusion that the issue arose when the Docker CLI was trying to connect to the Docker Notary service, as we'd enforced Docker Content Trust via DOCKER_CONTENT_TRUST=1.
We further worked out that there was some kind of timeout / latency issue, and were able to see that the connection FROM the Ubuntu box upon which we were running TO https://notary.docker.io, specifically, was taking the longest time to resolve/connect. The traceroute command was definitely our friend here.
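For anyone wanting to put numbers on that resolve/connect delay, here is a minimal sketch in Python rather than traceroute. It is not what we ran at the time; the helper names timed and probe are hypothetical, and the 30-second default simply mirrors the timeout discussed in this post.

```python
import socket
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (elapsed_seconds, result)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return time.monotonic() - start, result


def probe(host, port=443, timeout=30.0):
    """Time DNS resolution and the TCP connect to host separately.

    Hypothetical helper: uses whatever resolver the system is
    configured with (for example, the entries in /etc/resolv.conf).
    """
    resolve_secs, infos = timed(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    sockaddr = infos[0][4]  # (ip, port, ...) of the first result
    connect_secs, conn = timed(
        socket.create_connection, sockaddr[:2], timeout
    )
    conn.close()
    return resolve_secs, connect_secs
```

Calling probe("notary.docker.io") and comparing it with probe("hub.docker.com") on the affected box should show whether the time is going on name resolution or on the connection itself.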
It appears that, by default, the Docker CLI has a built-in timeout of 30 seconds, which isn't user-configurable.
After much trial and a lot of error, we realised that the DNS / resolv.conf configuration was wrong ...
Specifically, the Docker CLI would try to resolve the IP address for notary.docker.io via the first DNS server, which is Google's own 8.8.8.8 service, but this would take longer than 30 seconds.
At which point, the Docker CLI would try again, but against the other DNS server, 4.4.4.4, which wouldn't respond particularly quickly.
It transpired that 4.4.4.4 actually belongs to an ISP in the USA, namely Level 3, and ... the client was not in the USA ....
Therefore, the connectivity from their network to the Level 3 network was highly latent, probably because Level 3 focus upon serving their local US customers, and treat traffic from geographically remote hosts less favourably.
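One way to see how each nameserver behaves on its own is to send it a DNS query directly and time the reply. The following is a rough sketch using only the Python standard library; build_query and time_nameserver are hypothetical names, it only speaks plain UDP on port 53, and it treats any reply at all as success.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: id, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ending in a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def time_nameserver(server, hostname, timeout=5.0):
    """Return seconds until server replies, or None on timeout/network error."""
    query = build_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(query, (server, 53))
            sock.recvfrom(512)
        except OSError:
            return None
        return time.monotonic() - start
```

Running time_nameserver for each entry in resolv.conf, for example 8.8.8.8 and 4.4.4.4 against notary.docker.io, makes a slow or silent server stand out straight away.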
Now why did we miss this?
Because 4.4.4.4 looks quite similar to 8.8.4.4, which is Google's second public DNS server, as per the Google Public DNS documentation.
All of us tech-heads looked at resolv.conf many many many times, and missed this subtlety.
Once we changed it to 8.8.4.4 all was well: -
nameserver 8.8.8.8
nameserver 8.8.4.4
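Since human eyes clearly aren't reliable here, a small script can do the comparison instead. This is a minimal sketch, assuming an allow-list that contains only Google's two public resolvers; KNOWN_RESOLVERS, parse_nameservers and unexpected_nameservers are hypothetical names, and a real environment would list its own approved servers.

```python
# Assumed allow-list: only Google's two public resolvers from this post.
KNOWN_RESOLVERS = {"8.8.8.8", "8.8.4.4"}


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses from resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def unexpected_nameservers(resolv_conf_text, known=KNOWN_RESOLVERS):
    """Return any configured nameservers that are not in the known set."""
    return [s for s in parse_nameservers(resolv_conf_text) if s not in known]
```

Fed the original file contents, unexpected_nameservers flags 4.4.4.4; fed the corrected contents, it returns an empty list. To check a live box, pass it the text read from /etc/resolv.conf.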
The morals of the story ....
- Never assume
- Check everything
- Check everything AGAIN
What fun!