bootc deploy interface - for bootable containers

Adds a ``bootc`` deployment interface which can be enabled to
perform deployment of bootable containers. This enables a streamlined
workflow where an operator/user can push container updates and does not
need to build intermediate disk images and then post those disk images
to facilitate the deployment of a bare metal node.

Closes-Bug: 2085801
Change-Id: Iedb93fe47162abe0bd9391921792203301bfc456
Julia Kreger 2024-12-17 09:10:04 -08:00
parent db4412d570
commit c7fa447ab6
6 changed files with 352 additions and 32 deletions
doc/source/admin/interfaces
ironic
drivers
tests/unit/drivers/modules
releasenotes/notes
setup.cfg

@ -190,3 +190,116 @@ completely orchestrate writing the instance image using
responsible to provide all necessary deploy steps with priorities between
61 and 99 (see :ref:`node-deployment-core-steps` for information on
priorities).
Bootc Agent Deploy
==================
The ``bootc`` deploy interface is designed to enable operators to deploy
containers directly from a container image registry without intermediate
conversion steps, such as creating custom disk images for modifications.
This deployment interface utilizes the
`bootc project <https://containers.github.io/bootc/>`_.
This enables a streamlined flow in which a user of the deployment
interface *can* rapidly create updated containers, and the interface
deploys the container image without the need to create intermediate
disk images and post them to a location where they can be accessed
for deployment.
In exchange for this streamlined flow, the interface offers limited
flexibility in the model of deployment. It consumes the entire target
disk on the host being deployed and offers no customization of
partitioning. This is largely because the overall security model of a
bootc deployment, which leverages ostree, is fundamentally different
from a model which relies upon partition separation.
.. NOTE::
This interface should be considered experimental and may evolve
to include additional features as the Ironic project maintainers
receive additional feedback.
.. NOTE::
This interface is dependent upon the existence of ``bootc`` within the
container image, along with sufficient memory on the baremetal
node being deployed to enable a complete download and extraction of image
contents within system memory. Because of this memory constraint, this
interface is not actively tested in upstream CI.
Failures of this interface are most likely to arise from the ramdisk's
ability to download the container image, launch it, and run bootc to
trigger the installation, which isolates most of the risk to the
execution of bootc itself.
Features
--------
While this ``deploy_interface`` supports deploying configuration drives
like most other Ironic-supplied deploy interfaces, some additional
parameters can be supplied via ``instance_info`` to tune deploy-time
behavior. These settings cannot be modified post-deployment.
* ``bootc_authorized_keys`` - This option allows injection of a
root user authorized keys file which is preserved inside of the deployed
container on the host. The value is the actual key file content, and may
contain one or more keys separated by newline characters.
* ``bootc_tpm2_luks`` - A boolean option, default ``False``, which enables
bootc to attempt auto-encryption of the host filesystem
upon which the container is deployed. This is not enabled by default
due to a lack of software TPMs in Ironic CI. Operators who would like
the default of this setting changed should discuss it with Ironic developers.
Additionally, this interface supports passing a pull secret
to enable download from the remote image registry, as part of the
support for retrieval of artifacts from OCI container registries.
This parameter is ``image_pull_secret``.
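As a minimal sketch of how the parameters above fit together, the following
builds an ``instance_info`` dictionary for a bootc deployment. The helper name
``build_bootc_instance_info``, the registry URL, the key, and the secret are
illustrative placeholders and not part of Ironic. Only the dictionary keys and
the ``oci://`` requirement come from this change.

```python
def build_bootc_instance_info(image_source, authorized_keys=None,
                              tpm2_luks=False, pull_secret=None):
    """Hypothetical helper: assemble instance_info for a bootc deploy."""
    # The bootc deploy interface only accepts oci:// image references.
    if not image_source or not image_source.startswith('oci://'):
        raise ValueError('image_source must be an oci:// URL')
    info = {'image_source': image_source}
    if authorized_keys:
        # One or more public keys, separated by newline characters.
        info['bootc_authorized_keys'] = '\n'.join(authorized_keys)
    if tpm2_luks:
        # Left unset otherwise, since the documented default is False.
        info['bootc_tpm2_luks'] = True
    if pull_secret:
        info['image_pull_secret'] = pull_secret
    return info


# Placeholder values, for illustration only.
example = build_bootc_instance_info(
    'oci://registry.example.com/team/bootc-image:latest',
    authorized_keys=['ssh-ed25519 AAAA... operator@example'],
    pull_secret='example-secret')
```

The resulting dictionary is what a user would set as the node's
``instance_info`` before requesting deployment.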
Caveats
-------
* This deployment interface was not designed to be compatible with the
OpenStack Compute service. This is because OpenStack focuses on
disk images from Glance as the source of what to deploy, whereas this
interface is modeled to utilize a container image registry.
* Performance-wise, this deployment interface performs many smaller actions,
some of which need to be performed in a specific sequence, such as
when unpacking layers. As a result, when comparing similarly sized
containers to disk images, this interface is slower than the ``direct``
deploy interface.
* Container Images *must* have the bootc command present along with
the applicable bootloader and artifacts required for whatever platform
is being deployed.
* Because of how `bootc <https://containers.github.io/bootc/>`_ works,
there is no concept of "image streaming" directly to disk. Instead,
`podman <https://podman.io/>`_ is used to download all container image
layer artifacts and extract the layers, at which point ``bootc`` is
executed and begins to set up the disk for the host. As a result, for most
of the time a deploy is in progress, the node will be observable in
``deploy wait`` while ``bootc`` executes.
* Due to the way this interface works, the ramdisk requires enough memory
to download a container image, copy it, and ultimately extract all layers
into the in-memory filesystem. Due to the way the kernel launches and
allocates ramdisk memory for filesystem usage, a 600MB container image may
require upwards of 10GB of RAM to be available on the overall host.
* This deployment interface explicitly signals to ``bootc`` that it should
not execute its internal post-deployment "fetch check" to ensure upgrades
are working. This is because this action may require authentication
to succeed, **and** thus require credentials in the container to
work. Configuration of credentials for **day-2** operations
such as the execution of ``bootc upgrade``, must be addressed
post-deployment.
* If you intend SELinux to be enabled on the deployed host, it must also
be enabled inside of the ironic-python-agent ramdisk. This is a design
limitation of bootc outside of Ironic's control.
Limitations
-----------
* At present, this interface does not support use of caching proxies. This
may be addressed in the future.
* This deployment interface directly downloads artifacts from the requested
Container Registry. Caching the container artifacts on the
``ironic-conductor`` host is not available. If you need the container
content localized to the conductor, consider utilizing your own container
registry.

@ -51,7 +51,7 @@ class GenericHardware(hardware_type.AbstractHardwareType):
"""List of supported deploy interfaces."""
return [agent.AgentDeploy, ansible_deploy.AnsibleDeploy,
ramdisk.RamdiskDeploy, pxe.PXEAnacondaDeploy,
agent.CustomAgentDeploy]
agent.BootcAgentDeploy, agent.CustomAgentDeploy]
@property
def supported_inspect_interfaces(self):

@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import base64
from urllib import parse as urlparse
from oslo_log import log
@ -21,12 +22,14 @@ import tenacity
from ironic.common import async_steps
from ironic.common import boot_devices
from ironic.common import boot_modes
from ironic.common import exception
from ironic.common.glance_service import service_utils
from ironic.common.i18n import _
from ironic.common import image_service
from ironic.common import images
from ironic.common import metrics_utils
from ironic.common import oci_registry as oci
from ironic.common import raid
from ironic.common import states
from ironic.common import utils
@ -248,6 +251,51 @@ def soft_power_off(task, client=None):
manager_utils.node_power_action(task, states.POWER_OFF)
def set_boot_to_disk(task, target_boot_mode=None):
"""Boot a node to disk.
This is a helper method to reduce duplication of code around
handling vendor specifics for setting boot modes between multiple
deployment interfaces inside of Ironic.
:param task: A Taskmanager object.
:param target_boot_mode: The target boot_mode, defaults to UEFI.
"""
if not target_boot_mode:
target_boot_mode = boot_modes.UEFI
node = task.node
try:
persistent = True
# NOTE(TheJulia): We *really* only should be doing this in bios
# boot mode. In UEFI this might just get disregarded, or cause
# issues/failures.
if node.driver_info.get('force_persistent_boot_device',
'Default') == 'Never':
persistent = False
vendor = task.node.properties.get('vendor', None)
if not (vendor and vendor.lower() == 'lenovo'
and target_boot_mode == 'uefi'):
# Lenovo hardware is modeled on a "just update"
# UEFI nvram model of use, and if multiple actions
# get requested, you can end up in cases where NVRAM
# changes are deleted as the host "restores" to the
# backup. For more information see
# https://bugs.launchpad.net/ironic/+bug/2053064
# NOTE(TheJulia): We likely just need to do this with
# all hosts in uefi mode, but libvirt VMs don't handle
# nvram only changes *and* this pattern is known to generally
# work for Ironic operators.
deploy_utils.try_set_boot_device(task, boot_devices.DISK,
persistent=persistent)
except Exception as e:
msg = (_("Failed to change the boot device to %(boot_dev)s "
"when deploying node %(node)s: %(error)s") %
{'boot_dev': boot_devices.DISK, 'node': node.uuid,
'error': e})
agent_base.log_and_raise_deployment_error(task, msg, exc=e)
class CustomAgentDeploy(agent_base.AgentBaseMixin,
agent_base.HeartbeatMixin,
agent_base.AgentOobStepsMixin,
@ -910,40 +958,94 @@ class AgentDeploy(CustomAgentDeploy):
'error': agent_client.get_command_error(result)})
agent_base.log_and_raise_deployment_error(task, msg)
try:
persistent = True
# NOTE(TheJulia): We *really* only should be doing this in bios
# boot mode. In UEFI this might just get disregarded, or cause
# issues/failures.
if node.driver_info.get('force_persistent_boot_device',
'Default') == 'Never':
persistent = False
vendor = task.node.properties.get('vendor', None)
if not (vendor and vendor.lower() == 'lenovo'
and target_boot_mode == 'uefi'):
# Lenovo hardware is modeled on a "just update"
# UEFI nvram model of use, and if multiple actions
# get requested, you can end up in cases where NVRAM
# changes are deleted as the host "restores" to the
# backup. For more information see
# https://bugs.launchpad.net/ironic/+bug/2053064
# NOTE(TheJulia): We likely just need to do this with
# all hosts in uefi mode, but libvirt VMs don't handle
# nvram only changes *and* this pattern is known to generally
# work for Ironic operators.
deploy_utils.try_set_boot_device(task, boot_devices.DISK,
persistent=persistent)
except Exception as e:
msg = (_("Failed to change the boot device to %(boot_dev)s "
"when deploying node %(node)s: %(error)s") %
{'boot_dev': boot_devices.DISK, 'node': node.uuid,
'error': e})
agent_base.log_and_raise_deployment_error(task, msg, exc=e)
set_boot_to_disk(task, target_boot_mode)
LOG.info('Local boot successfully configured for node %s', node.uuid)
class BootcAgentDeploy(CustomAgentDeploy):
"""Interface for deploy-related actions."""
@METRICS.timer('AgentBootcDeploy.validate')
def validate(self, task):
"""Validate the driver-specific Node deployment info.
This method validates whether the properties of the supplied node
contain the required information for this driver to deploy images to
the node.
:param task: a TaskManager instance
:raises: MissingParameterValue, if any of the required parameters are
missing.
:raises: InvalidParameterValue, if any of the parameters have invalid
value.
"""
super().validate(task)
node = task.node
image_source = node.instance_info.get('image_source')
if not image_source or not image_source.startswith('oci://'):
raise exception.InvalidImageRef(image_href=image_source)
@METRICS.timer('AgentBootcDeploy.execute_bootc_install')
@base.deploy_step(priority=80)
@task_manager.require_exclusive_lock
def execute_bootc_install(self, task):
node = task.node
image_source = node.instance_info.get('image_source')
# FIXME(TheJulia): We likely, either need to grab/collect creds
# and pass them along in the step call, or initialize the client.
# bootc runs in the target container as well, so ... hmmm
configdrive = manager_utils.get_configdrive_image(node)
img_auth = image_service.get_image_service_auth_override(task.node)
if not img_auth:
fqdn = urlparse.urlparse(image_source).netloc
img_auth = oci.RegistrySessionHelper.get_token_from_config(
fqdn)
else:
# Internally, image data is a username and password, and we
# only currently support pull secrets which are just transmitted
# via the password value.
img_auth = img_auth.get('password')
if img_auth:
# This is not encryption, but obfuscation.
img_auth = base64.standard_b64encode(img_auth.encode())
# Now switch into the corresponding in-band deploy step and let the
# result be polled normally.
new_step = {'interface': 'deploy',
'step': 'execute_bootc_install',
'args': {'image_source': image_source,
'configdrive': configdrive,
'oci_pull_secret': img_auth}}
client = agent_client.get_client(task)
return agent_base.execute_step(task, new_step, 'deploy',
client=client)
@METRICS.timer('AgentBootcDeploy.set_boot_to_disk')
@base.deploy_step(priority=60)
@task_manager.require_exclusive_lock
def set_boot_to_disk(self, task):
"""Sets the node to boot from disk.
In some cases, other steps may handle aspects like bootloaders
and UEFI NVRAM entries required to boot. That leaves one last
aspect, resetting the node to boot from disk.
This primarily exists for compatibility reasons of flow
for Ironic, but we know some BMCs *really* need to be
still told to boot from disk. The exception to this is
Lenovo hardware, where we skip the action because it
can create a UEFI NVRAM update failure case, which
reverts the NVRAM state to "last known good configuration".
:param task: A Taskmanager object.
"""
# Call the helper to de-duplicate code.
set_boot_to_disk(task)
class AgentRAID(base.RAIDInterface):
"""Implementation of RAIDInterface which uses agent ramdisk."""
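
The image reference check in ``validate`` and the pull secret handling in
``execute_bootc_install`` can be sketched on their own, outside of Ironic.
The function name ``build_bootc_step`` is hypothetical, and ``ValueError``
stands in for Ironic's ``InvalidImageRef``. The step layout and the base64
encoding mirror the code in this change. The registry host name is returned
only to show what ``urlparse`` extracts for the token lookup.

```python
import base64
from urllib import parse as urlparse


def build_bootc_step(image_source, pull_secret=None, configdrive=None):
    """Hypothetical stand-in for the conductor-side step assembly."""
    if not image_source or not image_source.startswith('oci://'):
        raise ValueError('unsupported image reference: %r' % image_source)
    # Host portion of the URL, used in the change to look up a
    # configured registry token when no override is supplied.
    registry = urlparse.urlparse(image_source).netloc
    encoded = None
    if pull_secret:
        # This is obfuscation, not encryption.
        encoded = base64.standard_b64encode(pull_secret.encode())
    step = {'interface': 'deploy',
            'step': 'execute_bootc_install',
            'args': {'image_source': image_source,
                     'configdrive': configdrive,
                     'oci_pull_secret': encoded}}
    return registry, step


registry, step = build_bootc_step('oci://localhost/user/container:tag',
                                  pull_secret='f00')
print(registry)                         # localhost
print(step['args']['oci_pull_secret'])  # b'ZjAw'
```

The encoded value matches the ``b'ZjAw'`` expected by
``test_execute_bootc_install`` for the ``f00`` pull secret.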

@ -548,6 +548,100 @@ class TestCustomAgentDeploy(CommonTestsMixin, db_base.DbTestCase):
node_power_action_mock.assert_not_called()
class TestBootcAgentDeploy(db_base.DbTestCase):
def setUp(self):
super().setUp()
self.deploy = agent.BootcAgentDeploy()
self.node = object_utils.create_test_node(
self.context,
instance_info={
'image_source': 'oci://localhost/user/container:tag',
'image_pull_secret': 'f00'})
def test_validate(self):
with task_manager.acquire(self.context, self.node['uuid'],
shared=False) as task:
self.deploy.validate(task)
def test_validate_fails_with_non_oci(self):
i_info = self.node.instance_info
i_info['image_source'] = 'http://foo/bar'
self.node.instance_info = i_info
self.node.save()
with task_manager.acquire(self.context, self.node['uuid'],
shared=False) as task:
self.assertRaises(exception.InvalidImageRef,
self.deploy.validate, task)
def test_validate_fails_image_source_not_set(self):
i_info = self.node.instance_info
i_info.pop('image_source')
self.node.instance_info = i_info
self.node.save()
with task_manager.acquire(self.context, self.node['uuid'],
shared=False) as task:
self.assertRaises(exception.InvalidImageRef,
self.deploy.validate, task)
@mock.patch.object(agent_base, 'execute_step', autospec=True)
def test_execute_bootc_install(self, execute_mock):
src = self.node.instance_info.get('image_source')
expected_step = {
'interface': 'deploy',
'step': 'execute_bootc_install',
'args': {'image_source': src,
'configdrive': None,
'oci_pull_secret': b'ZjAw'}
}
with task_manager.acquire(self.context, self.node.uuid) as task:
execute_mock.return_value = states.DEPLOYWAIT
res = self.deploy.execute_bootc_install(task)
self.assertEqual(states.DEPLOYWAIT, res)
execute_mock.assert_called_once_with(task, expected_step,
'deploy', client=mock.ANY)
@mock.patch.object(agent_client.AgentClient, 'install_bootloader',
autospec=True)
@mock.patch.object(deploy_utils, 'try_set_boot_device', autospec=True)
@mock.patch.object(boot_mode_utils, 'get_boot_mode', autospec=True,
return_value='whatever')
def test_set_boot_to_disk(self, boot_mode_mock,
try_set_boot_device_mock,
install_bootloader_mock):
with task_manager.acquire(self.context, self.node['uuid'],
shared=False) as task:
self.deploy.set_boot_to_disk(task)
try_set_boot_device_mock.assert_called_once_with(
task, boot_devices.DISK, persistent=True)
boot_mode_mock.assert_not_called()
# While not referenced, just want to make sure somehow
# we don't again wire this together, since it is not needed
# in the bootc case as it does it for us as part of deploy.
install_bootloader_mock.assert_not_called()
@mock.patch.object(agent_client.AgentClient, 'install_bootloader',
autospec=True)
@mock.patch.object(deploy_utils, 'try_set_boot_device', autospec=True)
@mock.patch.object(boot_mode_utils, 'get_boot_mode', autospec=True,
return_value='uefi')
def test_set_boot_to_disk_lenovo(self, boot_mode_mock,
try_set_boot_device_mock,
install_bootloader_mock):
props = self.node.properties
props['vendor'] = 'Lenovo'
props['capabilities'] = 'boot_mode:uefi'
self.node.properties = props
self.node.save()
with task_manager.acquire(self.context, self.node['uuid'],
shared=False) as task:
self.deploy.set_boot_to_disk(task)
try_set_boot_device_mock.assert_not_called()
boot_mode_mock.assert_not_called()
install_bootloader_mock.assert_not_called()
class TestAgentDeploy(CommonTestsMixin, db_base.DbTestCase):
def setUp(self):
super(TestAgentDeploy, self).setUp()
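
The behavior exercised by ``test_set_boot_to_disk`` and
``test_set_boot_to_disk_lenovo`` comes down to two decisions in the
``set_boot_to_disk`` helper: whether the boot device change is persistent,
and whether the call is skipped for Lenovo hardware in UEFI mode. A minimal
sketch of that decision as a pure function follows. The name
``plan_boot_to_disk`` is hypothetical, and plain dictionaries stand in for
the node's ``properties`` and ``driver_info``.

```python
def plan_boot_to_disk(properties, driver_info, target_boot_mode=None):
    """Hypothetical sketch returning (should_set_device, persistent)."""
    # The helper in this change defaults the boot mode to UEFI.
    if not target_boot_mode:
        target_boot_mode = 'uefi'
    persistent = driver_info.get('force_persistent_boot_device',
                                 'Default') != 'Never'
    vendor = properties.get('vendor')
    # Lenovo hardware in UEFI mode is skipped; see
    # https://bugs.launchpad.net/ironic/+bug/2053064
    skip = bool(vendor and vendor.lower() == 'lenovo'
                and target_boot_mode == 'uefi')
    return (not skip, persistent)


print(plan_boot_to_disk({}, {}))                    # (True, True)
print(plan_boot_to_disk({'vendor': 'Lenovo'}, {}))  # (False, True)
```

Setting ``force_persistent_boot_device`` to ``Never`` in ``driver_info``
yields a non-persistent request.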

@ -0,0 +1,10 @@
---
features:
- |
Adds a ``bootc`` deploy interface which can be enabled by an Ironic
deployment administrator. Once enabled, users of the ``bootc``
deploy interface have a streamlined path for deploying
bootc-supporting container images directly to a host,
without additional intermediate steps. More information about
bootc can be found on the
`bootc website <https://containers.github.io/bootc/>`_.

@ -94,6 +94,7 @@ ironic.hardware.interfaces.console =
ironic.hardware.interfaces.deploy =
anaconda = ironic.drivers.modules.pxe:PXEAnacondaDeploy
ansible = ironic.drivers.modules.ansible.deploy:AnsibleDeploy
bootc = ironic.drivers.modules.agent:BootcAgentDeploy
custom-agent = ironic.drivers.modules.agent:CustomAgentDeploy
direct = ironic.drivers.modules.agent:AgentDeploy
fake = ironic.drivers.modules.fake:FakeDeploy