
.. Greg updates required for -High Security Vulnerability Document Updates

.. uzk1552923967458

.. _restoring-starlingx-system-data-and-storage:

========================================
Restore Platform System Data and Storage
========================================

You can perform a system restore (controllers, workers, including or excluding
storage nodes) of a |prod| cluster from a previous system backup and bring it
back to the operational state it was in when the backup procedure took place.

There are two restore modes: optimized restore and legacy restore. Optimized
restore must be used on |AIO-SX| systems, and legacy restore must be used on
systems that are not |AIO-SX|.

.. rubric:: |context|

The Kubernetes configuration will be restored, and pods that are started from
repositories accessible from the internet or from external repositories will
start immediately. |prod|-specific applications must be re-applied once a
storage cluster is configured.

Everything is restored as it was when the backup was created, except for
optional data that was not defined.

See :ref:`Back Up System Data <backing-up-starlingx-system-data>` for more
details on the backup.

.. warning::

   The system backup file can only be used to restore the system from which
   the backup was made. You cannot use this backup file to restore the system
   to different hardware.

   To restore the backup, use the same version of the boot image (ISO) and
   patches that were installed at the time of the backup.

The |prod| restore supports the following optional modes:

.. _restoring-starlingx-system-data-and-storage-ol-tw4-kvc-4jb:

- To keep the Ceph cluster data intact (false, the default option), use the
  following parameter when passing the extra arguments to the Ansible Restore
  playbook command:

  .. code-block:: none

     wipe_ceph_osds=false

- To wipe the Ceph cluster entirely (true), where the Ceph cluster will need
  to be recreated, or if the Ceph partition was previously wiped, such as
  during a fresh install between backup and restore or during a reinstall, use
  the following parameter:

  .. code-block:: none

     wipe_ceph_osds=true

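For example, a representative invocation of the platform restore playbook that
keeps the Ceph data intact might look like the following; the backup file name
and passwords shown here are placeholders that you must replace with values
for your system:

.. code-block:: none

   $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_platform_backup_2023_07_15_21_24_22.tgz ansible_become_pass=<sysadmin-password> admin_password=<admin-password> wipe_ceph_osds=false"
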
Restoring a |prod| cluster from a backup file is done by reinstalling the
ISO on controller-0, applying updates (patches), running the Ansible Restore
Playbook, unlocking controller-0, and then powering on and unlocking the
remaining hosts, one host at a time: first the controllers, then the storage
hosts (ONLY if required), and finally the compute (worker) hosts. Lastly, run
the :command:`system restore-complete` command.

.. rubric:: |prereq|

Before you start the restore procedure you must ensure the following
conditions are in place:

.. _restoring-starlingx-system-data-and-storage-ul-rfq-qfg-mp:

- All cluster hosts must be prepared for network boot and then powered
  down. You can prepare a host for network boot.

  .. note::
     If you are restoring system data only, do not lock, power off or
     prepare the storage hosts to be reinstalled.

- The backup file must be accessible locally if the restore is done by
  running the Ansible Restore playbook locally on the controller, or
  accessible remotely if the restore is done by running the Ansible Restore
  playbook remotely.

- You have the original |prod| ISO installation image available on a USB
  flash drive. It is mandatory that you use the exact same version of the
  software used during the original installation, otherwise the restore
  procedure will fail.

- The restore procedure requires all hosts but controller-0 to boot
  over the internal management network using the |PXE| protocol. Ideally, the
  old boot images are no longer present, so that the hosts boot from the
  network when powered on. If this is not the case, you must configure each
  host manually for network boot immediately after powering it on.

- If you are restoring a |prod-dc| subcloud, first ensure it is in
  an **unmanaged** state on the Central Cloud (SystemController) by using
  the following commands:

  .. code-block:: none

     $ source /etc/platform/openrc
     ~(keystone_admin)]$ dcmanager subcloud unmanage <subcloud-name>

  where ``<subcloud-name>`` is the name of the subcloud to be unmanaged.

  For more information, see:

  - :ref:`Backup a Subcloud/Group of Subclouds using DCManager CLI <backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42>`

  - :ref:`Restore a Subcloud/Group of Subclouds from Backup Data Using DCManager CLI <restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e>`

.. rubric:: |proc|

#. Power down all hosts.

   If you have a storage host and want to retain Ceph data, then power down
   all the nodes except the storage hosts; the cluster has to be functional
   during a restore operation.

   .. caution::
      Do not use :command:`wipedisk` before a restore operation. This will
      lead to data loss on your Ceph cluster. It is safe to use
      :command:`wipedisk` during an initial installation, while reinstalling
      a host, or during an upgrade.

#. Install the |prod| ISO software on controller-0 from the USB flash
   drive.

   You can now log in using the host's console.

#. Log in to the console as user **sysadmin** with password **sysadmin**.

#. Install network connectivity required for the subcloud.

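   The following is a minimal sketch of adding temporary OAM connectivity from
   the console; the interface name, addresses, and gateway are placeholders
   that you must replace with values for your site:

   .. code-block:: none

      $ sudo ip address add <oam-ip-address>/<prefix-length> dev <oam-interface>
      $ sudo ip link set up dev <oam-interface>
      $ sudo ip route add default via <oam-gateway-ip> dev <oam-interface>
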
#. Ensure that the system is at the same patch level as it was when the backup
   was taken. **You must manually reinstall any previous patches and reboot the
   system** (for reboot-required patches) to prevent restore failures due to
   mismatched patch levels.

   .. note::

      This is mandatory for |AIO-SX| (optimized) deployments. For legacy
      restores it is only mandatory if either the ``exclude_patches``
      (backup) or ``skip_patches_restore`` (restore) flags are used.

      It is recommended to restore subclouds only when there is an existing
      backup taken at the same patch level as the system controller.

   For steps on how to install patches using the :command:`sw-patch install-local`
   command, see the ``Install Software on Controller-0`` section of
   :ref:`aio_simplex_install_kubernetes_r7`.

   After the reboot, you can verify that the updates were applied.

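   For example, on systems that use the legacy :command:`sw-patch` tooling,
   you can list the applied patches to confirm the patch level; the output
   varies by system:

   .. code-block:: none

      $ sudo sw-patch query
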
   .. only:: partner

      .. include:: /_includes/restore-platform-system-data-and-storage-b92b8bdaf16d.rest
         :start-after: sw-patch-query-begin
         :end-before: sw-patch-query-end

   .. note::

      On systems that are not |AIO-SX|, you can skip this step if
      ``skip_patching=true`` is not used. Patches are automatically
      reinstalled from the backup by default.

#. Ensure that the backup files are available on the controller. Run both
   Ansible Restore playbooks, ``restore_platform.yml`` and
   ``restore_user_images.yml``. For more information on restoring the backup
   file, see :ref:`Run Restore Playbook Locally on the Controller
   <running-restore-playbook-locally-on-the-controller>`, and :ref:`Run
   Ansible Restore Playbook Remotely
   <system-backup-running-ansible-restore-playbook-remotely>`.

   .. note::

      The backup files contain the system data and updates.

   The restore operation will pull missing images from the upstream registries.

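   For example, to confirm that the platform backup archive is present on the
   controller before running the playbooks (the directory and file name shown
   here are placeholders based on the examples in this procedure):

   .. code-block:: none

      $ ls -lh /home/sysadmin/localhost_platform_backup_*.tgz
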
#. Restore the local registry using the file ``restore_user_images.yml``.

   Example:

   .. code-block:: none

      ~(keystone_admin)]$ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_user_images.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_user_images_backup_2023_07_15_21_24_22.tgz ansible_become_pass=St8rlingXCloud*"

   .. note::

      - This step applies only if a user images backup was created during the
        backup operation.

      - The ``user_images_backup*.tgz`` file is created during the backup only
        if ``backup_user_images`` is true.

   This must be done before unlocking controller-0.

#. Unlock controller-0.

   .. code-block:: none

      ~(keystone_admin)]$ system host-unlock controller-0

   After you unlock controller-0, storage nodes become available and Ceph
   becomes operational.

#. If the system is a Distributed Cloud system controller, restore the
   **dc-vault** using the ``restore_dc_vault.yml`` playbook. Perform this step
   after unlocking controller-0:

   .. code-block:: none

      $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_dc_vault.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_dc_vault_backup_2020_07_15_21_24_22.tgz ansible_become_pass=St0rlingX*"

   .. note::
      The dc-vault backup archive is created by the ``backup.yml`` playbook.

#. Authenticate the system as Keystone user **admin**.

   Source the **admin** user environment as follows:

   .. code-block:: none

      $ source /etc/platform/openrc

#. Monitor the applications as they transition from the 'restore-requested'
   state to the 'applying' state, and from the 'applying' state to the
   'applied' state.

   If applications transition back from 'applying' to 'restore-requested',
   ensure there is network access and access to the Docker registry.

   The process is repeated once per minute until all applications have
   transitioned to 'applied'.

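   You can watch the application states with the
   :command:`system application-list` command; the applications listed will
   vary by system:

   .. code-block:: none

      ~(keystone_admin)]$ system application-list
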
#. If you have a Duplex system, restore the **controller-1** host.

   #. List the current state of the hosts.

      .. code-block:: none

         ~(keystone_admin)]$ system host-list
         +----+--------------+-------------+----------------+-------------+--------------+
         | id | hostname     | personality | administrative | operational | availability |
         +----+--------------+-------------+----------------+-------------+--------------+
         | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
         | 2  | controller-1 | controller  | locked         | disabled    | offline      |
         | 3  | storage-0    | storage     | locked         | disabled    | offline      |
         | 4  | storage-1    | storage     | locked         | disabled    | offline      |
         | 5  | compute-0    | worker      | locked         | disabled    | offline      |
         | 6  | compute-1    | worker      | locked         | disabled    | offline      |
         +----+--------------+-------------+----------------+-------------+--------------+

   #. Power on the host.

      Ensure that the host boots from the network, and not from any disk
      image that may be present.

      The software is installed on the host, and then the host is
      rebooted. Wait for the host to be reported as **locked**, **disabled**,
      and **online**.

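      If the host's board management controller (BMC) is provisioned, you can
      optionally power the host on from controller-0 instead of doing it
      manually; this sketch assumes BMC access is configured:

      .. code-block:: none

         ~(keystone_admin)]$ system host-power-on controller-1
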
   #. Unlock controller-1.

      .. code-block:: none

         ~(keystone_admin)]$ system host-unlock controller-1
         +-----------------+--------------------------------------+
         | Property        | Value                                |
         +-----------------+--------------------------------------+
         | action          | none                                 |
         | administrative  | locked                               |
         | availability    | online                               |
         | ...             | ...                                  |
         | uuid            | 5fc4904a-d7f0-42f0-991d-0c00b4b74ed0 |
         +-----------------+--------------------------------------+

   #. Verify the state of the hosts.

      .. code-block:: none

         ~(keystone_admin)]$ system host-list
         +----+--------------+-------------+----------------+-------------+--------------+
         | id | hostname     | personality | administrative | operational | availability |
         +----+--------------+-------------+----------------+-------------+--------------+
         | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
         | 2  | controller-1 | controller  | unlocked       | enabled     | available    |
         | 3  | storage-0    | storage     | locked         | disabled    | offline      |
         | 4  | storage-1    | storage     | locked         | disabled    | offline      |
         | 5  | compute-0    | worker      | locked         | disabled    | offline      |
         | 6  | compute-1    | worker      | locked         | disabled    | offline      |
         +----+--------------+-------------+----------------+-------------+--------------+

#. Restore storage configuration. If :command:`wipe_ceph_osds` is set to
   **True**, follow the same procedure used to restore **controller-1**,
   beginning with host **storage-0** and proceeding in sequence.

   .. note::
      This step should be performed ONLY if you are restoring storage hosts.

   #. For storage hosts, there are two options:

      With the controller software installed and updated to the same level
      that was in effect when the backup was performed, you can perform
      the restore procedure without interruption.

      For Standard with Controller Storage systems, whether the storage hosts
      are installed or reinstalled depends on the :command:`wipe_ceph_osds`
      configuration:

      #. If :command:`wipe_ceph_osds` is set to **true**, reinstall the
         storage hosts.

      #. If :command:`wipe_ceph_osds` is set to **false** (the default
         option), do not reinstall the storage hosts.

      .. caution::
         Do not reinstall or power off the storage hosts if you want to
         keep previous Ceph cluster data. A reinstall of storage hosts
         will lead to data loss.

   #. Ensure that the Ceph cluster is healthy. Verify that the three Ceph
      monitors (controller-0, controller-1, storage-0) are running in
      quorum.

      .. code-block:: none

         ~(keystone_admin)]$ ceph -s
           cluster:
             id:     3361e4ef-b0b3-4f94-97c6-b384f416768d
             health: HEALTH_OK

           services:
             mon: 3 daemons, quorum controller-0,controller-1,storage-0
             mgr: controller-0(active), standbys: controller-1
             osd: 10 osds: 10 up, 10 in

           data:
             pools:   5 pools, 600 pgs
             objects: 636 objects, 2.7 GiB
             usage:   6.5 GiB used, 2.7 TiB / 2.7 TiB avail
             pgs:     600 active+clean

           io:
             client: 85 B/s rd, 336 KiB/s wr, 0 op/s rd, 67 op/s wr

      .. caution::
         Do not proceed until the Ceph cluster is healthy and the message
         HEALTH_OK appears.

         If the message HEALTH_WARN appears, wait a few minutes and then try
         again. If the warning condition persists, consult the public
         documentation for troubleshooting Ceph monitors (for example,
         http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/).

#. Restore the compute (worker) hosts, one at a time, following the same
   procedure used to restore controller-1.

#. Allow the Calico and CoreDNS pods to be recovered by Kubernetes. They
   should all be in the 'N/N Running' state.

   The state of the pods when the restore operation is complete is as
   follows:

   .. code-block:: none

      ~(keystone_admin)]$ kubectl get pods -n kube-system | grep -e calico -e coredns
      calico-kube-controllers-5cd4695574-d7zwt   1/1   Running
      calico-node-6km72                          1/1   Running
      calico-node-c7xnd                          1/1   Running
      coredns-6d64d47ff4-99nhq                   1/1   Running
      coredns-6d64d47ff4-nhh95                   1/1   Running

#. If **wipe_ceph_osds** is set to true and all the system hosts are in an
   unlocked/enabled/available state, do the following:

   #. Remove and reapply **platform-integ-apps**. This step will re-create
      the default Ceph pools (they were deleted):

      .. code-block:: none

         $ system application-remove platform-integ-apps
         $ system application-apply platform-integ-apps

   #. Completely delete and reapply all the applications that have
      persistent volumes (OpenStack or custom apps). For example, for
      OpenStack, run the following commands:

      .. parsed-literal::

         $ system application-remove |prefix|-openstack
         $ system application-delete |prefix|-openstack
         $ system application-upload |prefix|-openstack-20.12-0.tgz
         $ system application-apply |prefix|-openstack

#. Run the :command:`system restore-complete` command.

   .. code-block:: none

      ~(keystone_admin)]$ system restore-complete

#. Wait for the 750.006 alarms to disappear. They clear one at a time as the
   applications are automatically re-applied.

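   You can check the remaining alarms with the :command:`fm alarm-list`
   command; the output will vary by system:

   .. code-block:: none

      ~(keystone_admin)]$ fm alarm-list
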
.. rubric:: |postreq|

.. _restoring-starlingx-system-data-and-storage-ul-b2b-shg-plb:

- Passwords for local user accounts must be restored manually since they
  are not included as part of the backup and restore procedures.

- After restoring a |prod-dc| subcloud, you need to bring it back
  to the **managed** state on the Central Cloud (SystemController) by
  using the following commands:

  .. code-block:: none

     $ source /etc/platform/openrc
     ~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

  where ``<subcloud-name>`` is the name of the subcloud to be managed.

.. comments in steps seem to throw numbering off.

.. xreflink removed from step 'Install the |prod| ISO software on controller-0 from the USB flash
   drive.':
   For details, refer to the |inst-doc|: :ref:`Installing Software on
   controller-0 <installing-software-on-controller-0>`. Perform the
   installation procedure for your system and *stop* at the step that
   requires you to configure the host as a controller.

.. xreflink removed from step 'Install network connectivity required for the subcloud.':
   For details, refer to the |distcloud-doc|: :ref:`Installing and
   Provisioning a Subcloud <installing-and-provisioning-a-subcloud>`.