Ubuntu Noble ships with an enforcing rsyslogd apparmor profile. This
profile prevents our haproxy container from opening the syslog socket we
bind mount into the container. I discussed this in #ubuntu-security
which resulted in this issue:
https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/2098148
which includes many details on what is going on. This change implements
the suggested workaround for our haproxy nodes. I believe this is the
only place we are currently attempting to directly access rsyslog
sockets from within containers.
The tl;dr on the fix is that we have to set the attach_disconnected
flag on the rsyslogd profile because the container runs in a different
filesystem namespace, which disconnects the path for the socket.
Unfortunately sarnold indicates that we have to edit the primary profile
configuration file, as this flag applies to the top level of the
profile; we cannot use one of the files this profile #includes.
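For reference, the workaround amounts to adding the attach_disconnected
flag at the top level of the rsyslogd profile. A sketch (the exact path
and any preexisting flags on Noble may differ; see the launchpad bug for
authoritative details):

```
# /etc/apparmor.d/usr.sbin.rsyslogd (sketch)
profile rsyslogd /usr/sbin/rsyslogd flags=(attach_disconnected) {
  # ... existing rules and #include lines unchanged ...
}
```

After editing, the profile can be reloaded with
`apparmor_parser -r /etc/apparmor.d/usr.sbin.rsyslogd`.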
Change-Id: I4e09211a1bdc4dfbf3012a66e79c181c6fb957a4
The old install-docker upstream.yaml tasks installed apparmor for
docker (it was originally a dependency, but docker later dropped the
explicit dependency while still relying on apparmor, so we manually
installed it). When we started deploying Noble nodes with
podman via the install-docker role we didn't get apparmor because podman
doesn't appear to depend on it. However when we got to production the
production images already come with apparmor which includes profiles for
things like podman and rsyslog which have caused problems for us
deploying services with podman.
Attempt to catch these issues in CI by explicitly installing apparmor.
This should be a noop for production because apparmor is already
installed. This should help us catch problems with podman in CI before
we ever get to production.
To ensure that apparmor is working properly we capture apparmor_status
output as part of our system-config-run job log collection.
Note we remove the zuul lb test for haproxy.log being present as current
apparmor problems with the rsyslogd profile prevent that from occurring
on noble. The next change will correct that issue and reinstate the
test case.
Change-Id: Iea5966dbb2dcfbe1e51d9c00bad67a9d37e1b7e1
Rsyslog on Noble has apparmor rules that restrict rsyslog socket
creation to /var/lib/*/dev/log. Previously we were configuring haproxy
hosts to create an rsyslog socket for haproxy at /var/haproxy/dev/log
which doesn't match the apparmor rule so gets denied.
To address this we move all the host side haproxy config from
/var/haproxy to /var/lib/haproxy. This allows rsyslog to create the
socket. To avoid needing to update docker images (for haproxy statsd)
and to keep the haproxy container itself happy, we don't adjust paths
on the target side of our bind mounts. This means some
things still refer to /var/haproxy but they should all be within
containers.
I don't believe this will impact existing load balancer servers. We
should deploy new content to /var/lib/haproxy then
automatically restart services (rsyslog and haproxy container) because
their configs are updating. One potential problem with this is rsyslog
will restart before the containers do and its log path will have moved.
If we are concerned about this we can configure rsyslog to continue to
attempt to create the old path in addition to the new path (this will
fail on Noble).
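A sketch of the host-side rsyslog input after the move (the exact
statement lives in our role templates; parameter values here are
illustrative):

```
# rsyslog imuxsock input creating the socket that gets bind mounted
# into the container as /var/haproxy/dev/log
module(load="imuxsock")
input(type="imuxsock" Socket="/var/lib/haproxy/dev/log" CreatePath="on")
```

The container-side path stays /var/haproxy/dev/log; only the host side
of the bind mount moves, which is what satisfies the
/var/lib/*/dev/log apparmor rule.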
Change-Id: I4582e6b2dda188583f76265ab78bcb00a302e375
Podman on Ubuntu Noble has apparmor config that prevents SIGHUP from
being delivered via `podman kill -s HUP` or `docker compose kill -s
HUP`. Attempting to do so results in:
kernel: audit: type=1400 audit(1739232042.996:129): apparmor="DENIED" operation="signal" class="signal" profile="containers-default-0.57.4-apparmor1" pid=17067 comm="runc" requested_mask="receive" denied_mask="receive" signal=hup peer="podman"
This appears to be due to issues with the apparmor configuration that
was edited to make other signals work:
https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483
We work around that by using kill to issue the signal instead which
seems to work based on some manual testing.
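The workaround is roughly the following (the container name and the use
of podman inspect to resolve the pid are illustrative; the actual tasks
may do this differently):

```shell
# Deliver SIGHUP directly to the container's init process, bypassing
# `podman kill -s HUP` which apparmor denies on Noble.
# "haproxy" is a hypothetical container name.
pid=$(podman inspect --format '{{.State.Pid}}' haproxy)
kill -HUP "$pid"
```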
Change-Id: I49435fdda662e25c7192faf24e0ae4b527e943b9
This is a new Noble server to replace the existing zuul-lb01 server. As
part of this transition we switch to podman as the container runtime
and docker compose replaces docker-compose. This requires a
small update to testing to check the new container name.
The Depends-On isn't strictly necessary but seems like good hygiene to
deploy a server with DNS records in place.
Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/941146
Change-Id: I2bb74809b00d4a554a26601c46a2aa4c3c75d4f1
Currently codesearch uses syslog logging with docker but podman
doesn't support syslog. Podman does support journald which is basically
equivalent for us since we have journald log to syslog too. Update for
podman compatibility in preparation for upgrades to Noble.
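In docker-compose terms the change is just the logging driver (service
layout here is illustrative):

```yaml
services:
  codesearch:
    logging:
      # journald works under both docker and podman; the syslog driver
      # is not supported by podman
      driver: journald
```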
Change-Id: Id7da6b70faad9521da6a39eaa9543b97c0136d58
We are upgrading from 1.23.1 to 1.23.3. Both 1.23.2 and 1.23.3 are
bugfix releases but 1.23.2 includes a breaking change to webhooks. We
don't use webhooks so this shouldn't affect us. Complete changelog can
be found here:
https://github.com/go-gitea/gitea/blob/v1.23.3/CHANGELOG.md
There is also a minor update to one of the templates we override which I
have synced over.
Change-Id: I97ba30309da63ecb4fb4fc301209c60ea8dc8504
This handler used an incorrect path to the docker-compose file and
failed with no such file or directory errors. Update the handler to use
the correct path to the docker-compose file.
I also add a note that the check to avoid restarts when we just
restarted containers may not be working, as we did restart at least the
mariadb container, which is how I discovered this issue.
Change-Id: If004b72e3efc0d0d4665c6fd56e514a5cb6191c5
This sphinx internal ref was missing ``'s surrounding the token
identifier. Add them which should fix the reference.
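For illustration, the fix is of this shape (the label name here is
hypothetical):

```rst
Broken:  :ref:some-token-label
Fixed:   :ref:`some-token-label`
```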
Change-Id: I6261ab3a96cecbf63d0934441650d9d91baac798
Now that we have grafana02 up and running we need to remove grafana01
from management so that it can be deleted. A followup change will clean
up DNS for us.
Change-Id: Ib90dadf404eb24aed5673d2611584bd00a278d45
We just deployed grafana02 with 10.4.14 which was latest when I started
poking at this. Since then 10.4.15 has been released. Update to this
latest release.
Changelog can be seen here:
58a279e109/CHANGELOG.md (10415-2025-01-28)
Change-Id: I7a8fd7bc273e628475df8bfc492e8a0fdf480457
We have a testinfra test case for checking the launch tooling is
installed properly. Unfortunately, we weren't running that test case
when we made updates to the launch tooling. Fix that.
Change-Id: Ie497d60aaf1842a7478a8550d45608daeec4625a
This adds a grafana02 server to our inventory with associated LE host
vars. This should deploy grafana on our newly created noble grafana02
server.
Note we switch the system-config-run-grafana job over to interact with
02 to match production. To simplify this effort in the future we convert
the old grafana01 testing host var to a group var file. This change was
already done on bridge.
We will need to followup with at least one change to clean out grafana01
when we are happy with the new server.
Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/940653
Change-Id: Ifd7f83185fbd59935a63973642e9d165bd8105a2
This is what I get for not testing it before pushing. I've made this
minor edit in place in the venv contents on bridge and launching
grafana02 appears to have worked. This should be the only fixup needed.
Change-Id: Ief32094fb0b216dac99879a285a7dbd0fd005b49
This is the latest release of the 10.x series and we're currently stuck
on 10.2.2. We can update to 11.x after we're up to date on 10.x.
We bundle this change up with an update to run on Noble. The plan is
we'll put the old 10.2 focal server in the emergency file, land this
change, then add a new Noble server to inventory. This should allow us
to easily roll back to 10.2 if Noble with Grafana 10.4.14 doesn't work
for some reason. Basically we're killing two birds with one stone here
and getting a safer upgrade process out of it.
Depends-On: https://review.opendev.org/c/openstack/project-config/+/940276
Change-Id: Icc5e02d4b80cb1f8524ab3dde888aba7db430ffe
We've seen Noble nodes booting in rax legacy come up with a single vcpu
even when we've requested 8. Avoid unexpected reduction in CPU counts
when booting new noble nodes by explicitly checking for at least 2 CPUs.
We don't want to discover a month after replacing a server that we need
to replace it again because it booted on the wrong hypervisor.
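The check is conceptually a comparison against the online CPU count; a
minimal sketch (the real check lives in the launch tooling and may
gather the count differently):

```shell
# Refuse to proceed if the new node came up with fewer CPUs than
# requested; rax legacy has handed us 1-vcpu Noble nodes when we
# asked for 8.
check_min_cpus() {
    min=$1
    actual=$(nproc)
    if [ "$actual" -lt "$min" ]; then
        echo "ERROR: expected at least $min CPUs, found $actual" >&2
        return 1
    fi
    echo "CPU count OK: $actual"
}

# For new noble nodes we require at least 2 CPUs:
check_min_cpus 2 || echo "refusing this node"
```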
Change-Id: I043dc8d6eb1131d0fec49734c7959e6c123f8f8f
This will pull the haproxy:lts image from the mirror we have at
quay.io/opendevmirror/haproxy rather than docker hub directly. This
should improve reliability in CI in particular when pulling that image.
One fewer image to pull from docker hub also means more rate limit to
spend where we are still pulling from docker hub.
Note this will affect the gitea and zuul web front ends as they are both
fronted by haproxy. Expect a minor blip while the container "updates"
(hashes should match) and is restarted.
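The change itself is just the image reference in the compose file (the
surrounding structure is illustrative):

```yaml
services:
  haproxy:
    # mirrored copy of docker.io/library/haproxy:lts; the image content
    # is identical, so hashes should match
    image: quay.io/opendevmirror/haproxy:lts
```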
Change-Id: Ic242ea3975ada1c7a698be8e41b9c5c8f8d07ed3
This adds a new daily job to mirror haproxy to our quay.io hosted
opendevmirror set of images. We'll be able to use this to update the
location we pull haproxy from for zuul and gitea once the image is
mirrored.
Change-Id: Iba17aacdfbfede00ac09aea7c57325a09c7da9f2
One fewer image to pull from docker hub eating into our rate limits.
Note that Gerrit and its db container are not automatically updated by
ansible. This change will need manual intervention to get reflected in
production.
Change-Id: Ibbfbf2ecfb7f972720bfc0f7b97831231d217633
One fewer image to pull from docker hub eating into our rate limits.
Note that this will restart the standalone mariadb container when
deployed. This may impact Zuul's ability to record jobs temporarily.
Change-Id: I4f46c63f3002740c2246f11d1ad69bd43e61036c
One fewer image to pull from docker hub eating into our rate limits.
Note that this update will restart at least the mariadb container on the
mailman list server.
Change-Id: I8f90956d945baa1826783ed8a6de6b1ce24a84d2
One fewer image to pull from docker hub that eats into our rate limits.
Note that deployment of this change will restart at least the mariadb
container on the server.
Change-Id: I21e7f707f0876aeb348af14efe57fe327ab594a9
One fewer image to pull from docker hub that eats into our rate limits.
Note this will restart at least the mariadb service on the refstack
server when it deploys.
Change-Id: I15eb36bc570fe22e2e2b85b3bf321bb254636410
This appears to be a very minor update from 2.2.6 (as far as I can tell
the dockerfile and settings haven't changed). The changelog indicates
that the important changes were a rewrite to use React 19 and React
Router v7. Other than that, only dependency updates were made.
https://github.com/ether/etherpad-lite/blob/v2.2.7/CHANGELOG.md
Change-Id: I48e8914ffa7026e35b6341628a709301c6a61c26
This is all in an effort to reduce our total dependency on docker hub as
rate limits there are quite low. Every image we can pull from somewhere
else is more rate limit bandwidth we can use for images still on docker
hub.
Change-Id: I3566383acf43e556fcd5854f6dfb70af8ffa1ba2
Newer grafana sends an OPTIONS request that graphite responds to with a
400 response. This response did not include allowed origin headers
because it is a failure case. Update this header and the allowed methods
header to always be included, even on 400 or other error responses.
This should ideally address the CORS errors we see with updated grafana.
An alternative is to update grafana to proxy the requests for us, but
this is less flexible as other tools may not have built in proxies.
The suggestion comes from this stackoverflow question and answer:
https://stackoverflow.com/questions/20414669/nginx-add-headers-when-returning-400-codes
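Assuming an nginx front end as in the referenced answer, the fix hinges
on the `always` parameter, since plain add_header directives are skipped
on error responses (location and header values here are illustrative):

```nginx
location /render {
    # "always" makes nginx emit these headers on 4xx/5xx responses too;
    # without it, the 400 reply to the OPTIONS preflight lacks them.
    add_header Access-Control-Allow-Origin "*" always;
    add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
}
```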
Change-Id: Icf1179d35e420384da72af839ca329548226ee63
Exim 4.95 on Ubuntu Jammy started enforcing an outbound line length
limit of 998 bytes, easily exceeded by some badly-behaved MUAs.
Unfortunately, because Exim only checks this in its remote_smtp
transport, it results in mass bounces back for Mailman mailing
lists, incrementing all subscribers' bounce scores on lists where
bounce processing is enabled. The telltale indicator is that the
messages are returned to Mailman citing a delivery error of "message
has lines too long for transport".
Ubuntu added a workaround in later versions of their packages, but
did not backport that to Jammy. Regardless, it's overridden by a
config option and we replace the default Exim config entirely, so
need to incorporate it into ours directly anyway. Because this
message_linelength_limit option to the remote_smtp transport is only
supported by exim versions on Jammy and newer, exclude it for our
older platforms so that it won't result in a configuration loading
error.
This copies the override value used in Ubuntu Noble's
exim4.conf.template file.
Change-Id: I38e169dc14e7fc3c5c1d43b5f147e6b35b718bb2
The zuul-web component now needs to read the cloud config in order
to fully parse the cloud provider information.
Change-Id: I4b1356bb118afa317e49898b5cf40191e5f0955d