Bitnami warned us; we were still affected
It is no secret that here at FancyWhale we run Kubernetes. As a good Kubernetes user, you sometimes rely on others to define a few of your resources, using their well-designed Helm charts or even their hardened container images. This is a cautionary tale about why you should be wary of such practices.
Imagine yourself with the following requirement: your company needs a new environment to host its files, one that does not rely on any of the major cloud providers. You have an easy Kubernetes setup where all you need is a container image and a definition of its dependencies to get something running. You have used NextCloud before and know how to administer it, along with all its tools and plugins, so you decide your best course of action is to host your own NextCloud instance.
Here comes the next step: how do you deploy it? That part is also easy; they already provide their own Helm chart, which is great - the company that creates the product should be the one you trust to run it, even if the chart is community maintained in their own GitHub. You work through their Helm chart values and tweak them to point the connection at a separate DB you already have running for this. You trust that their Helm chart will be kept up to date, but you know you should pin its version so you don't automatically receive breaking changes:
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: nextcloud
spec:
  interval: 1m
  provider: generic
  timeout: 60s
  url: https://nextcloud.github.io/helm/
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: nextcloud
spec:
  chart:
    spec:
      chart: nextcloud
      interval: 1m
      reconcileStrategy: ChartVersion
      sourceRef:
        kind: HelmRepository
        name: nextcloud
      version: <ChartVersionHere>
  interval: 5m
  values:
    ...

This worked great for years. The chart got upgraded, we followed the latest versions by keeping it up to date and ensuring that all migration steps were taken; everything worked. Until it didn't.
Broadcom had notified everyone that this was coming, and we trusted that our charts were all up to date and that things were completely fine. Most of our Redis dependencies had already been moved to Valkey, and the NextCloud Helm chart had prepared for the move by switching to the bitnamilegacy images while planning its next steps. We should not have had any downtime from Broadcom's change - or at least that is what I thought.
Then comes the day all Bitnami images are deleted: Kured reboots our freshly patched nodes - as it should - and lo and behold, NextCloud is down. But how?
Well, here is where things get interesting. Remember when I said that we took our time ensuring an upgraded chart version would not break everything before upgrading the images? That is standard practice, which is great. But the problem became the issues we had hit with the latest versions of their chart: we had downgraded the chart version, intending to troubleshoot at a later date. And yes, our downgraded version of the chart was still using the Bitnami images that were deleted.
Here we are! NextCloud down, the chart has to be upgraded, and we need to fix the images of the Redis dependencies to get our uptime back - but now the upgrade path is broken. We did eventually solve the issue by migrating NextCloud to Valkey as well, managed as an external instance, but we were down for a while until we could fix it. Here is what we could have done to prevent this issue:
Keep an Inventory
While it seems like overkill - especially for companies that, like us, maintain their environments with GitOps - an inventory helps you decide and act on the most important pieces of your infrastructure. Keeping the environment well documented could have stopped me from lowering the priority of fixing NextCloud's dependency on Bitnami's Redis image.
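One lightweight way to keep such an inventory in a GitOps repository - a sketch with hypothetical names and example entries, not our exact setup - is a version-controlled ConfigMap listing every external image we depend on. The pre-flight check described further down can then consume the same list.

apiVersion: v1
kind: ConfigMap
metadata:
  name: image-inventory   # hypothetical name, referenced again in the pre-flight section
data:
  images.txt: |
    # one external image reference per line (example entries)
    docker.io/nextcloud:stable
    docker.io/bitnamilegacy/redis:7.2
  owners.txt: |
    # image prefix, owning team, criticality (example entries)
    docker.io/nextcloud        platform-team  critical
    docker.io/bitnamilegacy    platform-team  critical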
Mirrors are your friends
There is a reason you are using images from third parties: they know how to maintain their images and they are good at it. Most of the time you won't have the manpower to keep everything up to date, scanning and fixing all the issues in the images you need to keep the lights on. But here is the thing: you should still - at the very least - retain copies of those images.
Here is the irony: we already run a Docker Hub mirror, with retention, in our own local container registry. It ensures we are not rate limited when restarting nodes and migrating pods during node upgrades. The problem was that these third-party chart images were not behind that cache layer; we simply trusted the upstream registry.
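Had we routed those images through the mirror, the chart would have kept working from our local copies even after the upstream deletion. A minimal sketch of what that could look like in the HelmRelease values - assuming a hypothetical internal registry at registry.internal.example.com, and noting that the exact value keys depend on the chart and subchart versions - is to override the image references so everything resolves against the mirror:

values:
  image:
    # main NextCloud image, pulled through the local mirror
    repository: registry.internal.example.com/mirror/nextcloud
  redis:
    image:
      # Bitnami-style subcharts typically expose registry/repository fields
      registry: registry.internal.example.com
      repository: mirror/bitnamilegacy/redis

That way a deleted upstream tag only affects the mirror's ability to refresh, not what is already cached and serving our workloads.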
Admission policies to enforce invariants
We should have enforced the previous two points with Kyverno or Gatekeeper policies that block non-digest images and block registries outside an allow list. That would have prevented us from trusting a registry we do not control.
This is not only good practice to avoid issues with images being retagged, removed, or moved; it also helps protect you from supply-chain attacks. You must control the images that run in your environments.
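For illustration, here is a minimal sketch of such a policy with Kyverno. The registry hostname is a hypothetical placeholder, and a production policy would also need rules for initContainers and exclusions for system namespaces:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-external-images
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: allowed-registries-only
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from the internal mirror."
        pattern:
          spec:
            containers:
              # every container image must come from the internal registry
              - image: "registry.internal.example.com/*"
    - name: require-image-digests
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pinned by digest."
        pattern:
          spec:
            containers:
              # every container image must be referenced by digest, not tag
              - image: "*@sha256:*"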
Automated pre-flight pulls
We have also introduced a CronJob in our infrastructure that regularly verifies that our images are available. Using our container image inventory, we check that all of them are present locally and usable. This gives us an early signal of trouble if the remote container registries go out of commission.
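A sketch of such a check, assuming the hypothetical image-inventory ConfigMap from the inventory section and using skopeo to verify that each reference still resolves, without pulling the full image:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: image-preflight-check
spec:
  schedule: "0 6 * * *"   # once a day, early in the morning
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: preflight
              image: quay.io/skopeo/stable:latest
              command:
                - /bin/sh
                - -c
                - |
                  # fail the Job as soon as one image reference cannot be resolved
                  set -e
                  while read -r img; do
                    # skip comments and blank lines in the inventory
                    case "$img" in ''|'#'*) continue ;; esac
                    echo "checking ${img}"
                    skopeo inspect "docker://${img}" > /dev/null
                  done < /inventory/images.txt
              volumeMounts:
                - name: inventory
                  mountPath: /inventory
          volumes:
            - name: inventory
              configMap:
                name: image-inventory

A failing Job then surfaces through our normal alerting, well before a node reboot tries - and fails - to pull a missing image.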
All of this goes to show that there are always places where our playbook can improve. We can always refine our operational guidelines and processes to prevent incidents like this. It is easy to get complacent and loosen your policies to get things done, but we should always strive for better.
If you’ve built your own guardrails, I’d love to hear how you detect dependency risks early.