19411 Commits

Author SHA1 Message Date
Clark Boylan
8cd8784825 Fix haproxy access to rsyslogd on Noble
Ubuntu Noble ships with an enforcing rsyslogd apparmor profile. This
profile prevents our haproxy container from opening the syslog socket we
bind mount into the container. I discussed this in #ubuntu-security
which resulted in this issue:

  https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/2098148

which includes many details on what is going on. This change implements
the suggested workaround for our haproxy nodes. I believe this is the
only place we are currently attempting to directly access rsyslog
sockets from within containers.

The tl;dr on the fix is that we have to tell rsyslogd to attach
disconnected connections as the container runs in a different filesystem
namespace which disconnects the paths for the socket. Unfortunately
sarnold indicates that we have to edit the primary profile configuration
file as this flag applies to the top level of the profile. We cannot use
one of the files this profile #includes.
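
As a rough illustration of the edit described above, the sketch below adds the AppArmor attach_disconnected flag to a profile header line. The sample header text and the helper name are assumptions made for this example; the real rsyslogd profile header and the way this change edits it are not shown in this log.

```python
import re

# Hypothetical sample header; the real rsyslogd profile header may differ.
SAMPLE_HEADER = "profile rsyslogd /usr/sbin/rsyslogd {"


def add_attach_disconnected(line: str) -> str:
    """Return a profile header line with the attach_disconnected flag set.

    Expects a single line without a trailing newline. Lines that do not
    end in an opening brace are returned unchanged.
    """
    stripped = line.rstrip()
    if not stripped.endswith("{"):
        return line
    if "attach_disconnected" in stripped:
        # Already present, so the edit is idempotent.
        return line
    match = re.search(r"flags=\(([^)]*)\)", stripped)
    if match:
        # Extend an existing flags=(...) list rather than adding a second one.
        existing = match.group(1).strip()
        flags = f"{existing},attach_disconnected" if existing else "attach_disconnected"
        return stripped[:match.start()] + f"flags=({flags})" + stripped[match.end():]
    # No flags yet: insert them just before the opening brace.
    return stripped[:-1].rstrip() + " flags=(attach_disconnected) {"
```

The caller is expected to pass only the top-level profile header line, since the flag applies to the top level of the profile.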

Change-Id: I4e09211a1bdc4dfbf3012a66e79c181c6fb957a4
2025-02-13 08:30:37 -08:00
Clark Boylan
170c003bc7 Install apparmor when installing podman
The old install-docker upstream.yaml tasks installed apparmor for docker
(it was originally a dependency but then docker removed it as an
explicit dependency while still implicitly depending on it so we
manually installed it). When we started deploying Noble nodes with
podman via the install-docker role we didn't get apparmor because podman
doesn't appear to depend on it. However when we got to production the
production images already come with apparmor which includes profiles for
things like podman and rsyslog which have caused problems for us
deploying services with podman.

Attempt to catch these issues in CI by explicitly installing apparmor.
This should be a noop for production because apparmor is already
installed. This should help us catch problems with podman in CI before
we ever get to production.

To ensure that apparmor is working properly we capture apparmor_status
output as part of our system-config-run job log collection.

Note we remove the zuul lb test for haproxy.log being present as current
apparmor problems with the rsyslogd profile prevent that from occurring
on noble. The next change will correct that issue and reinstate the
test case.

Change-Id: Iea5966dbb2dcfbe1e51d9c00bad67a9d37e1b7e1
2025-02-13 08:12:55 -08:00
Clark Boylan
15e0d6c7df Move haproxy config into /var/lib/haproxy
Rsyslog on Noble has apparmor rules that restrict rsyslog socket
creation to /var/lib/*/dev/log. Previously we were configuring haproxy
hosts to create an rsyslog socket for haproxy at /var/haproxy/dev/log
which doesn't match the apparmor rule so gets denied.
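
The path mismatch can be sketched as below. This is a simplified model that assumes only the single `*` wildcard, which in AppArmor does not match `/`. The function names are made up for this example, and real AppArmor globbing supports more syntax than is handled here.

```python
import re


def apparmor_glob_to_regex(glob: str) -> "re.Pattern[str]":
    """Translate a glob containing only single '*' wildcards into a regex.

    Simplification: '*' matches any run of characters except '/'.
    Other AppArmor glob syntax ('**', '?', '{a,b}') is not handled.
    """
    parts = []
    for ch in glob:
        if ch == "*":
            parts.append("[^/]*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def path_allowed(path: str, rule: str = "/var/lib/*/dev/log") -> bool:
    """Return True if the whole path matches the rule."""
    return apparmor_glob_to_regex(rule).fullmatch(path) is not None


# The old socket location does not match the rule; the new one does.
old_ok = path_allowed("/var/haproxy/dev/log")
new_ok = path_allowed("/var/lib/haproxy/dev/log")
```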

To address this we move all the host side haproxy config from
/var/haproxy to /var/lib/haproxy. This allows rsyslog to create the
socket. To avoid needing to update docker images (for haproxy statsd)
and to continue to make the haproxy container itself happy we don't
adjust paths on the target side of our bind mounts. This means some
things still refer to /var/haproxy but they should all be within
containers.

I don't believe this will be impactful to existing load balancer
servers. We should deploy new content to /var/lib/haproxy then
automatically restart services (rsyslog and haproxy container) because
their configs are updating. One potential problem with this is rsyslog
will restart before the containers do and its log path will have moved.
If we are concerned about this we can configure rsyslog to continue to
attempt to create the old path in addition to the new path (this will
fail on Noble).

Change-Id: I4582e6b2dda188583f76265ab78bcb00a302e375
2025-02-12 09:10:08 -08:00
Clark Boylan
681088951b Perform haproxy HUP signals with kill
Podman on Ubuntu Noble has apparmor config that prevents SIGHUP from
being delivered via `podman kill -s HUP` or `docker compose kill -s
HUP`. Attempting to do so results in:

  kernel: audit: type=1400 audit(1739232042.996:129): apparmor="DENIED" operation="signal" class="signal" profile="containers-default-0.57.4-apparmor1" pid=17067 comm="runc" requested_mask="receive" denied_mask="receive" signal=hup peer="podman"

This appears to be due to issues with the apparmor configuration that
was edited to make other signals work:

  https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483

We work around that by using kill to issue the signal instead which
seems to work based on some manual testing.
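
For illustration, sending the signal directly to a process looks roughly like this in Python. How the haproxy process ID inside the container is found is not covered in this message, so the `pid` argument is left to the caller; the function name is an assumption for this sketch.

```python
import os
import signal


def send_hup(pid: int) -> None:
    """Send SIGHUP straight to a process, bypassing the container runtime.

    Raises ProcessLookupError if no such process exists and
    PermissionError if the caller may not signal it.
    """
    os.kill(pid, signal.SIGHUP)
```

This mirrors what the kill command does: the signal is issued by the calling process rather than by the container runtime.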

Change-Id: I49435fdda662e25c7192faf24e0ae4b527e943b9
2025-02-11 08:04:55 -08:00
Zuul
fe75c3b194 Merge "Switch codesearch to journald logging" 2025-02-10 23:26:14 +00:00
Clark Boylan
61fd8dd59d Deploy zuul-lb02
This is a new Noble server to replace the existing zuul-lb01 server. As
part of this transition we switch to podman as the container runtime
and docker compose replaces docker-compose. This requires a
small update to testing to check the new container name.

The Depends-On isn't strictly necessary but seems like good hygiene to
deploy a server with DNS records in place.

Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/941146
Change-Id: I2bb74809b00d4a554a26601c46a2aa4c3c75d4f1
2025-02-10 10:24:15 -08:00
Clark Boylan
db409d6f75 Switch codesearch to journald logging
Currently codesearch uses syslog logging with docker but podman
doesn't support syslog. Podman does support journald which is basically
equivalent for us since we have journald log to syslog too. Update for
podman compatibility in preparation for upgrades to Noble.

Change-Id: Id7da6b70faad9521da6a39eaa9543b97c0136d58
2025-02-10 08:29:29 -08:00
Zuul
81bdbb3ded Merge "Switch standalone mariadb to opendevmirror hosted mariadb image" 2025-02-07 17:25:04 +00:00
Zuul
ac8c76c831 Merge "Update to gitea 1.23.3" 2025-02-07 17:25:02 +00:00
Zuul
a176217c26 Merge "Switch Gerrit to opendevmirror hosted mariadb image" 2025-02-06 17:20:45 +00:00
Clark Boylan
97cc7520da Update to gitea 1.23.3
We are upgrading from 1.23.1 to 1.23.3. Both 1.23.2 and 1.23.3 are
bugfix releases but 1.23.2 includes a breaking change to webhooks. We
don't use webhooks so this shouldn't affect us. Complete changelog can
be found here:

  https://github.com/go-gitea/gitea/blob/v1.23.3/CHANGELOG.md

There is also a minor update to one of the templates we override which I
have synced over.

Change-Id: I97ba30309da63ecb4fb4fc301209c60ea8dc8504
2025-02-06 08:02:54 -08:00
Zuul
4764268d29 Merge "Remove grafana01 from config management" 2025-02-05 21:38:16 +00:00
Zuul
fa22a0b1ce Merge "Add AI openstack working group list to lists.openinfra.org" 2025-02-05 20:37:47 +00:00
Clark Boylan
dc47b469b7 Add AI openstack working group list to lists.openinfra.org
This has been requested by Jimmy at the foundation.

Change-Id: I7bcbca594f42287b6219704e1797a2e2c5d2b1d5
2025-02-05 10:13:09 -08:00
Clark Boylan
4ed1d2c63a Fix keycloak container restart handler
This handler used an incorrect path to the docker-compose file and
failed with no such file or directory errors. Update the handler to use
the correct path to the docker-compose file.

I also add a note that the check to avoid restarts when we just
restarted containers may not be working as we did restart at least the
mariadb container which is how I discovered this issue.

Change-Id: If004b72e3efc0d0d4665c6fd56e514a5cb6191c5
2025-02-05 09:29:24 -08:00
Zuul
6648760059 Merge "Switch keycloak to opendevmirror hosted mariadb image" 2025-02-05 17:20:30 +00:00
Clark Boylan
fb0674d402 Fix zuul-admins doc ref
This sphinx internal ref was missing ``'s surrounding the token
identifier. Add them which should fix the reference.

Change-Id: I6261ab3a96cecbf63d0934441650d9d91baac798
2025-02-05 07:56:43 -08:00
Clark Boylan
eade0b35b6 Remove grafana01 from config management
Now that we have grafana02 up and running we need to remove grafana01
from management so that it can be deleted. A followup change will clean
up DNS for us.

Change-Id: Ib90dadf404eb24aed5673d2611584bd00a278d45
2025-02-04 14:44:28 -08:00
Clark Boylan
d92410106e Update to the even more recent grafana 10.4.15
We just deployed grafana02 with 10.4.14 which was latest when I started
poking at this. Since then 10.4.15 has been released. Update to this
latest release.

Changelog can be seen here:

  58a279e109/CHANGELOG.md (10415-2025-01-28)

Change-Id: I7a8fd7bc273e628475df8bfc492e8a0fdf480457
2025-02-03 16:22:34 -08:00
Zuul
2a04501e45 Merge "Deploy grafana02" 2025-02-03 23:02:52 +00:00
Zuul
db987a3d09 Merge "Update grafana to 10.4.14" 2025-02-03 22:19:55 +00:00
Clark Boylan
193151adc9 Test launch installation on launch edits
We have a testinfra test case for checking the launch tooling is
installed properly. Unfortunately, we weren't running that test case
when we made updates to the launch tooling. Fix that.

Change-Id: Ie497d60aaf1842a7478a8550d45608daeec4625a
2025-02-03 13:34:53 -08:00
Clark Boylan
b0e396dc21 Deploy grafana02
This adds a grafana02 server to our inventory with associated LE host
vars. This should deploy grafana on our newly created noble grafana02
server.

Note we switch the system-config-run-grafana job over to interact with
02 to match production. To simplify this effort in the future we convert
the old grafana01 testing host var to a group var file. This change was
already done on bridge.

We will need to followup with at least one change to clean out grafana01
when we are happy with the new server.

Depends-On: https://review.opendev.org/c/opendev/zone-opendev.org/+/940653
Change-Id: Ifd7f83185fbd59935a63973642e9d165bd8105a2
2025-02-03 11:47:02 -08:00
Clark Boylan
42445986e7 Fix launch node string quoting
This is what I get for not testing it before pushing. I've made this
minor edit in place in the venv contents on bridge and launching
grafana02 appears to have worked. This should be the only fixup needed.

Change-Id: Ief32094fb0b216dac99879a285a7dbd0fd005b49
2025-02-03 11:08:18 -08:00
Clark Boylan
680a120bf1 Update grafana to 10.4.14
This is the latest release of the 10.x series and we're currently stuck
on 10.2.2. We can update to 11.x after we're up to date on 10.x.

We bundle this change up with an update to run on Noble. The plan is
we'll put the old 10.2 focal server in the emergency file, land this
change, then add a new Noble server to inventory. This should allow us
to easily rollback to 10.2 if Noble with Grafana 10.4.14 doesn't work for
some reason. Basically killing two birds with one stone here and getting
a safer upgrade process out of it.

Depends-On: https://review.opendev.org/c/openstack/project-config/+/940276
Change-Id: Icc5e02d4b80cb1f8524ab3dde888aba7db430ffe
2025-02-03 10:10:14 -08:00
Clark Boylan
59ca749779 Add cpu count check to launch node
We've seen Noble nodes booting in rax legacy come up with a single vcpu
even when we've requested 8. Avoid unexpected reduction in CPU counts
when booting new noble nodes by explicitly checking for at least 2 CPUs.
We don't want to discover a month after replacing a server that we need
to replace it again because it booted on the wrong hypervisor.
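
A minimal sketch of such a check follows. The threshold of 2 comes from the message above; the function name and error text are assumptions, and the real launch tooling presumably inspects the newly booted server rather than the local machine.

```python
import os
from typing import Optional

MIN_CPUS = 2  # "at least 2 CPUs" per the commit message


def check_cpu_count(found: Optional[int], minimum: int = MIN_CPUS) -> int:
    """Return the CPU count, or raise if it is unknown or below the minimum."""
    if found is None or found < minimum:
        raise RuntimeError(
            f"expected at least {minimum} CPUs but found {found}"
        )
    return found


if __name__ == "__main__":
    # Local demonstration only; os.cpu_count() may return None.
    print(check_cpu_count(os.cpu_count(), minimum=1))
```

Failing loudly at launch time is the point of the check: a node on the wrong hypervisor is rejected before it is put into service.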

Change-Id: I043dc8d6eb1131d0fec49734c7959e6c123f8f8f
2025-02-03 09:55:44 -08:00
Zuul
dbb13be886 Merge "Switch our haproxy image to quay opendevmirror location" 2025-02-03 17:32:55 +00:00
Zuul
6ce41794a9 Merge "Switch refstack to opendevmirror hosted mariadb image" 2025-01-31 17:19:37 +00:00
Clark Boylan
fa115e59dd Switch our haproxy image to quay opendevmirror location
This will pull the haproxy:lts image from the mirror we have at
quay.io/opendevmirror/haproxy rather than docker hub directly. This
should improve reliability in CI in particular when pulling that image.
One fewer image to pull from docker hub also means more rate limit to
spend where we are still pulling from docker hub.

Note this will affect the gitea and zuul web front ends as they are both
fronted by haproxy. Expect a minor blip while the container "updates"
(hashes should match) and is restarted.

Change-Id: Ic242ea3975ada1c7a698be8e41b9c5c8f8d07ed3
2025-01-31 08:25:26 -08:00
Zuul
9de8bf3489 Merge "Mirror haproxy container image to opendevmirror on quay.io" 2025-01-30 19:29:48 +00:00
Zuul
d0725f9927 Merge "Mirror node 23 container image" 2025-01-30 00:19:58 +00:00
James E. Blair
595400dfe6 Mirror node 23 container image
This is the current version of node; mirror it so that Zuul can
consider upgrading.

Change-Id: I666f91cfac755839cbfa3cc1034dea99d9e964e0
2025-01-29 16:10:48 -08:00
Zuul
39c86d6c36 Merge "Switch mailman3 to opendevmirror hosted mariadb image" 2025-01-29 22:59:15 +00:00
Zuul
94d061788d Merge "Increase message_linelength_limit to 1G" 2025-01-29 22:01:27 +00:00
Clark Boylan
66aea07548 Mirror haproxy container image to opendevmirror on quay.io
This adds a new daily job to mirror haproxy to our quay.io hosted
opendevmirror set of images. We'll be able to use this to update the
location we pull haproxy from for zuul and gitea once the image is
mirrored.

Change-Id: Iba17aacdfbfede00ac09aea7c57325a09c7da9f2
2025-01-29 11:03:05 -08:00
Zuul
e992e5a747 Merge "Upgrade etherpad to v2.2.7" 2025-01-29 18:20:29 +00:00
Zuul
883faa1325 Merge "Deploy mariadb for etherpad from opendev's quay mirror" 2025-01-29 18:18:54 +00:00
Clark Boylan
d2bc4fdb9c Switch Gerrit to opendevmirror hosted mariadb image
One fewer image to pull from docker hub eating into our rate limits.
Note that Gerrit and its db container are not automatically updated by
ansible. This change will need manual intervention to get reflected in
production.

Change-Id: Ibbfbf2ecfb7f972720bfc0f7b97831231d217633
2025-01-28 15:48:27 -08:00
Clark Boylan
b0fff9ccbd Switch standalone mariadb to opendevmirror hosted mariadb image
One fewer image to pull from docker hub eating into our rate limits.
Note that this will restart the standalone mariadb container when
deployed. This may impact Zuul's ability to record jobs temporarily.

Change-Id: I4f46c63f3002740c2246f11d1ad69bd43e61036c
2025-01-28 15:46:51 -08:00
Clark Boylan
15ee7d46e6 Switch mailman3 to opendevmirror hosted mariadb image
One fewer image to pull from docker hub eating into our rate limits.
Note that this update will restart at least the mariadb container on the
mailman list server.

Change-Id: I8f90956d945baa1826783ed8a6de6b1ce24a84d2
2025-01-28 15:45:11 -08:00
Clark Boylan
35e7b10f23 Switch keycloak to opendevmirror hosted mariadb image
One fewer image to pull from docker hub that eats into our rate limits.
Note that deployment of this change will restart at least the mariadb
container on the server.

Change-Id: I21e7f707f0876aeb348af14efe57fe327ab594a9
2025-01-28 15:41:31 -08:00
Clark Boylan
545b91ec52 Switch refstack to opendevmirror hosted mariadb image
One fewer image to pull from docker hub that eats into our rate limits.
Note this will restart at least the mariadb service on the refstack
server when it deploys.

Change-Id: I15eb36bc570fe22e2e2b85b3bf321bb254636410
2025-01-28 15:39:43 -08:00
Clark Boylan
20d61cd8b9 Upgrade etherpad to v2.2.7
This appears to be a very minor update from 2.2.6 (as far as I can tell
dockerfile and settings haven't changed). The changelog indicates that
important changes were rewritten to use react 19 and react router v7.
Other than that only dependency updates were made.

  https://github.com/ether/etherpad-lite/blob/v2.2.7/CHANGELOG.md

Change-Id: I48e8914ffa7026e35b6341628a709301c6a61c26
2025-01-28 12:59:37 -08:00
Clark Boylan
1d951ccf1d Deploy mariadb for etherpad from opendev's quay mirror
This is all in an effort to reduce our total dependency on docker hub as
rate limits there are quite low. Every image we can pull from somewhere
else is more rate limit bandwidth we can use for images still on docker
hub.

Change-Id: I3566383acf43e556fcd5854f6dfb70af8ffa1ba2
2025-01-28 12:56:36 -08:00
Clark Boylan
088df00edc Update graphite to send CORS headers even on 400 responses
Newer grafana sends an options request that graphite responds to with a
400 response. This response did not include allowed origin headers
because it is a failure case. Update this header and the allowed methods
header to always be included even on 400 or other error responses.
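
The idea of attaching the headers regardless of status can be sketched as a small WSGI wrapper. This is only a Python analogue: the change itself is made in the server configuration in front of graphite, which is not shown here, and the origin and methods values below are placeholders.

```python
def always_cors(app, origin="*", methods="GET, OPTIONS"):
    """Wrap a WSGI app so CORS headers are added to every response.

    The headers are appended for error statuses such as 400 as well as
    for successful responses. origin and methods are placeholder values.
    """
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            headers = list(headers) + [
                ("Access-Control-Allow-Origin", origin),
                ("Access-Control-Allow-Methods", methods),
            ]
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped


def bad_request_app(environ, start_response):
    """Toy app standing in for a backend that answers OPTIONS with a 400."""
    start_response("400 Bad Request", [("Content-Type", "text/plain")])
    return [b"bad request"]
```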

This should ideally address the CORS errors we see with updated grafana.
An alternative is to update grafana to proxy the requests for us, but
this is less flexible as other tools may not have built in proxies.

The suggestion comes from this stackoverflow question and answer:

  https://stackoverflow.com/questions/20414669/nginx-add-headers-when-returning-400-codes

Change-Id: Icf1179d35e420384da72af839ca329548226ee63
2025-01-28 12:47:22 -08:00
Zuul
c0d930ca25 Merge "Retire paste01 backups on the smaller backup server" 2025-01-27 21:59:01 +00:00
Zuul
86403ab6d2 Merge "Log grafana to /var/log/containers with journald/syslog" 2025-01-27 17:45:04 +00:00
Zuul
95b3a0aa97 Merge "Take screenshots of all grafana dashboards" 2025-01-27 17:35:16 +00:00
Jeremy Stanley
1b38a18473 Increase message_linelength_limit to 1G
Exim 4.95 on Ubuntu Jammy started enforcing an outbound line length
limit of 998 bytes, easily exceeded by some badly-behaved MUAs.
Unfortunately, because Exim only checks this in its remote_smtp
transport, it results in mass bounces back for Mailman mailing
lists, incrementing all subscribers' bounce scores on lists where
bounce processing is enabled. The telltale indicator is that the
messages are returned to Mailman citing a delivery error of "message
has lines too long for transport".
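
To see whether a given message would trip such a limit, a check along these lines can help. It assumes the 998 byte limit is measured per line without the line terminator; exactly how Exim counts is not stated in this message, and the function name is made up for this sketch.

```python
MAX_LINE_BYTES = 998  # limit cited in the commit message


def overlong_lines(message: bytes, limit: int = MAX_LINE_BYTES) -> list:
    """Return 1-based numbers of lines longer than limit bytes.

    Assumption: the line terminator (CRLF or bare LF) is not counted.
    """
    numbers = []
    for number, line in enumerate(message.split(b"\n"), start=1):
        if len(line.rstrip(b"\r")) > limit:
            numbers.append(number)
    return numbers
```

An empty result means no line exceeds the limit under this counting assumption.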

Ubuntu added a workaround in later versions of their packages, but
did not backport that to Jammy. Regardless, it's overridden by a
config option and we replace the default Exim config entirely, so
need to incorporate it into ours directly anyway. Because this
message_linelength_limit option to the remote_smtp transport is only
supported by exim versions on Jammy and newer, exclude it for our
older platforms so that it won't result in a configuration loading
error.

This copies the override value used in Ubuntu Noble's
exim4.conf.template file.

Change-Id: I38e169dc14e7fc3c5c1d43b5f147e6b35b718bb2
2025-01-27 17:15:53 +00:00
James E. Blair
24d300347a Mount /etc/openstack in zuul-web
The zuul-web component now needs to read the cloud config in order
to fully parse the cloud provider information.

Change-Id: I4b1356bb118afa317e49898b5cf40191e5f0955d
2025-01-26 06:46:22 -08:00