
.. uzk1552923967458

.. _restoring-starlingx-system-data-and-storage:

========================================
Restore Platform System Data and Storage
========================================

You can perform a system restore \(controllers, workers, including or
excluding storage nodes\) of a |prod| cluster from available system data and
bring it back to the operational state it was in when the backup procedure
took place.

.. rubric:: |context|

This procedure takes a snapshot of the etcd database at the time of backup,
stores it in the system data backup, and then uses it to initialize the
Kubernetes cluster during a restore. Kubernetes configuration will be
restored, and pods that are started from repositories accessible from the
internet or from external repositories will start immediately.
StarlingX-specific applications must be re-applied once a storage cluster is
configured.

.. warning::
    The system data backup file can only be used to restore the system from
    which the backup was made. You cannot use this backup file to restore
    the system to different hardware.

    To restore the data, use the same version of the boot image \(ISO\) that
    was used at the time of the original installation.

The |prod| restore supports two modes:

.. _restoring-starlingx-system-data-and-storage-ol-tw4-kvc-4jb:

#. To keep the Ceph cluster data intact \(false - default option\), use the
   following syntax when passing the extra arguments to the Ansible Restore
   playbook command:

   .. code-block:: none

       wipe_ceph_osds=false

#. To wipe the Ceph cluster entirely \(true\), where the Ceph cluster will
   need to be recreated, use the following syntax:

   .. code-block:: none

       wipe_ceph_osds=true
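
As an illustration only, these extra arguments are typically supplied as
Ansible extra variables when the platform restore playbook is run. In the
following sketch, the playbook path, backup directory, and archive name are
placeholders that may differ on your system:

.. code-block:: none

    # Example only: the backup location and archive name are placeholders.
    $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml \
      -e "initial_backup_dir=/home/sysadmin backup_filename=<platform-backup>.tgz wipe_ceph_osds=false"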

Restoring a |prod| cluster from a backup file is done by reinstalling the
ISO on controller-0, running the Ansible Restore playbook, applying updates
\(patches\), unlocking controller-0, and then powering on and unlocking the
remaining hosts one at a time: first the controllers, then the storage hosts
\(ONLY if required\), and lastly the compute \(worker\) hosts.

.. rubric:: |prereq|

Before you start the restore procedure, you must ensure the following
conditions are in place:

.. _restoring-starlingx-system-data-and-storage-ul-rfq-qfg-mp:

- All cluster hosts must be prepared for network boot and then powered
  down. You can prepare a host for network boot.

  .. note::
      If you are restoring system data only, do not lock, power off or
      prepare the storage hosts to be reinstalled.

- The backup file is accessible locally if the restore is done by running
  the Ansible Restore playbook locally on the controller. The backup file is
  accessible remotely if the restore is done by running the Ansible Restore
  playbook remotely.

- You have the original |prod| ISO installation image available on a USB
  flash drive. It is mandatory that you use the exact same version of the
  software used during the original installation; otherwise, the restore
  procedure will fail.

- The restore procedure requires all hosts except controller-0 to boot
  over the internal management network using the |PXE| protocol. Ideally,
  the old boot images are no longer present, so that the hosts boot from
  the network when powered on. If this is not the case, you must configure
  each host manually for network boot immediately after powering it on.

- If you are restoring a |prod-dc| subcloud, first ensure it is in
  an **unmanaged** state on the Central Cloud \(SystemController\) by using
  the following commands:

  .. code-block:: none

      $ source /etc/platform/openrc
      ~(keystone_admin)]$ dcmanager subcloud unmanage <subcloud-name>

  where <subcloud-name> is the name of the subcloud to be unmanaged.

.. rubric:: |proc|

#. Power down all hosts.

   If you have a storage host and want to retain Ceph data, power down
   all the nodes except the storage hosts; the Ceph cluster has to remain
   functional during the restore operation.

   .. caution::
       Do not use :command:`wipedisk` before a restore operation. This will
       lead to data loss on your Ceph cluster. It is safe to use
       :command:`wipedisk` during an initial installation, while reinstalling
       a host, or during an upgrade.

#. Install the |prod| ISO software on controller-0 from the USB flash
   drive.

   You can now log in using the host's console.

#. Log in to the console as user **sysadmin** with password **sysadmin**.

#. Install network connectivity required for the subcloud.

#. Ensure that the backup file is available on the controller, and run both
   Ansible Restore playbooks, restore\_platform.yml and restore\_user\_images.yml.
   For more information on restoring the backup file, see :ref:`Run Restore
   Playbook Locally on the Controller
   <running-restore-playbook-locally-on-the-controller>`, and :ref:`Run
   Ansible Restore Playbook Remotely
   <system-backup-running-ansible-restore-playbook-remotely>`.

   .. note::
       The backup file contains the system data and updates.
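
   As a quick sanity check before running the playbooks, you can confirm that
   the backup archive is present on the controller. The directory and file
   name below are placeholders; use the location where you copied your backup:

   .. code-block:: none

       # Example only: confirm the backup archive is present.
       $ ls -lh /home/sysadmin/<backup-file>.tgz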

#. If the backup file contains patches, the Ansible Restore playbook
   restore\_platform.yml will apply the patches and prompt you to reboot the
   system; in that case, you will need to re-run the Ansible Restore playbook.

   The current software version on the controller is compared against the
   version available in the backup file. If the backed-up version includes
   updates, the restore process automatically applies the updates and
   forces an additional reboot of the controller to make them effective.

   After the reboot, you can verify that the updates were applied, as
   illustrated in the following example:

   .. code-block:: none

       $ sudo sw-patch query
       Patch ID                  RR         Release  Patch State
       ========================  =========  =======  ===========
       COMPUTECONFIG             Available  20.06    n/a
       LIBCUNIT_CONTROLLER_ONLY  Applied    20.06    n/a
       STORAGECONFIG             Applied    20.06    n/a

   Rerun the Ansible Playbook if there were patches applied and you were
   prompted to reboot the system.

#. Restore the local registry using the file restore\_user\_images.yml.

   This must be done before unlocking controller-0.

   Then unlock controller-0:

   .. code-block:: none

       ~(keystone_admin)]$ system host-unlock controller-0

   After you unlock controller-0, storage nodes become available and Ceph
   becomes operational.
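
   For reference, the registry restore mentioned at the start of this step,
   run locally before unlocking controller-0, might look like the following.
   The backup directory and archive name are placeholders for your own backup
   location:

   .. code-block:: none

       # Example only: restore the local registry images from the backup archive.
       $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_user_images.yml \
         -e "initial_backup_dir=/home/sysadmin backup_filename=<user-images-backup>.tgz"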

#. Authenticate the system as Keystone user **admin**.

   Source the **admin** user environment as follows:

   .. code-block:: none

       $ source /etc/platform/openrc

#. Wait for all applications to transition from the 'restore-requested'
   state to 'applying', and then from 'applying' to 'applied'.

   If applications transition from 'applying' back to the 'restore-requested'
   state, ensure there is network access and access to the Docker registry.

   The process is repeated once per minute until all applications have
   transitioned to 'applied'.
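
   One way to follow this progress \(shown here as an illustrative check; the
   applications and their versions vary by system\) is to list the
   applications and their current status:

   .. code-block:: none

       # Example only: check application status while the platform re-applies them.
       ~(keystone_admin)]$ system application-list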

#. If you have a Duplex system, restore the **controller-1** host.

   #. List the current state of the hosts.

      .. code-block:: none

          ~(keystone_admin)]$ system host-list
          +----+--------------+-------------+----------------+-------------+--------------+
          | id | hostname     | personality | administrative | operational | availability |
          +----+--------------+-------------+----------------+-------------+--------------+
          | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
          | 2  | controller-1 | controller  | locked         | disabled    | offline      |
          | 3  | storage-0    | storage     | locked         | disabled    | offline      |
          | 4  | storage-1    | storage     | locked         | disabled    | offline      |
          | 5  | compute-0    | worker      | locked         | disabled    | offline      |
          | 6  | compute-1    | worker      | locked         | disabled    | offline      |
          +----+--------------+-------------+----------------+-------------+--------------+

   #. Power on the host.

      Ensure that the host boots from the network, and not from any disk
      image that may be present.

      The software is installed on the host, and then the host is
      rebooted. Wait for the host to be reported as **locked**, **disabled**,
      and **offline**.
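
      If board management \(BMC\) connectivity has already been provisioned
      for the host, it may optionally be powered on from controller-0 instead
      of manually; this is only a convenience and assumes BMC access is
      configured:

      .. code-block:: none

          # Example only: power on a host through its provisioned board management controller.
          ~(keystone_admin)]$ system host-power-on controller-1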

   #. Unlock controller-1.

      .. code-block:: none

          ~(keystone_admin)]$ system host-unlock controller-1
          +-----------------+--------------------------------------+
          | Property        | Value                                |
          +-----------------+--------------------------------------+
          | action          | none                                 |
          | administrative  | locked                               |
          | availability    | online                               |
          | ...             | ...                                  |
          | uuid            | 5fc4904a-d7f0-42f0-991d-0c00b4b74ed0 |
          +-----------------+--------------------------------------+

   #. Verify the state of the hosts.

      .. code-block:: none

          ~(keystone_admin)]$ system host-list
          +----+--------------+-------------+----------------+-------------+--------------+
          | id | hostname     | personality | administrative | operational | availability |
          +----+--------------+-------------+----------------+-------------+--------------+
          | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
          | 2  | controller-1 | controller  | unlocked       | enabled     | available    |
          | 3  | storage-0    | storage     | locked         | disabled    | offline      |
          | 4  | storage-1    | storage     | locked         | disabled    | offline      |
          | 5  | compute-0    | worker      | locked         | disabled    | offline      |
          | 6  | compute-1    | worker      | locked         | disabled    | offline      |
          +----+--------------+-------------+----------------+-------------+--------------+

#. Restore storage configuration. If :command:`wipe\_ceph\_osds` is set to
   **True**, follow the same procedure used to restore **controller-1**,
   beginning with host **storage-0** and proceeding in sequence.

   .. note::
       This step should be performed ONLY if you are restoring storage hosts.

   #. For storage hosts, there are two options:

      With the controller software installed and updated to the same level
      that was in effect when the backup was performed, you can perform
      the restore procedure without interruption.

      A Standard with Controller Storage install or reinstall depends on the
      :command:`wipe\_ceph\_osds` configuration:

      #. If :command:`wipe\_ceph\_osds` is set to **true**, reinstall the
         storage hosts.

      #. If :command:`wipe\_ceph\_osds` is set to **false** \(default
         option\), do not reinstall the storage hosts.

      .. caution::
          Do not reinstall or power off the storage hosts if you want to
          keep previous Ceph cluster data. A reinstall of storage hosts
          will lead to data loss.

   #. Ensure that the Ceph cluster is healthy. Verify that the three Ceph
      monitors \(controller-0, controller-1, storage-0\) are running in
      quorum.

      .. code-block:: none

          ~(keystone_admin)]$ ceph -s
            cluster:
              id:     3361e4ef-b0b3-4f94-97c6-b384f416768d
              health: HEALTH_OK

            services:
              mon: 3 daemons, quorum controller-0,controller-1,storage-0
              mgr: controller-0(active), standbys: controller-1
              osd: 10 osds: 10 up, 10 in

            data:
              pools:   5 pools, 600 pgs
              objects: 636 objects, 2.7 GiB
              usage:   6.5 GiB used, 2.7 TiB / 2.7 TiB avail
              pgs:     600 active+clean

            io:
              client: 85 B/s rd, 336 KiB/s wr, 0 op/s rd, 67 op/s wr

      .. caution::
          Do not proceed until the Ceph cluster is healthy and the message
          HEALTH\_OK appears.

          If the message HEALTH\_WARN appears, wait a few minutes and then
          try again. If the warning condition persists, consult the public
          documentation for troubleshooting Ceph monitors \(for example,
          `http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/
          <http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/>`__\).

#. Restore the compute \(worker\) hosts, one at a time.

   Restore the compute \(worker\) hosts following the same procedure used to
   restore controller-1.
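
   For example, once a worker host has been reinstalled over the network and
   is reported as **locked** and **disabled**, it can be unlocked. The host
   name compute-0 below is only an illustration:

   .. code-block:: none

       # Example only: unlock a reinstalled worker host.
       ~(keystone_admin)]$ system host-unlock compute-0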

#. Allow the Calico and CoreDNS pods to be recovered by Kubernetes. They
   should all be in the 'N/N Running' state.

   When the restore operation is complete, the pods are in a state similar
   to the following:

   .. code-block:: none

       ~(keystone_admin)]$ kubectl get pods -n kube-system | grep -e calico -e coredns
       calico-kube-controllers-5cd4695574-d7zwt   1/1   Running
       calico-node-6km72                          1/1   Running
       calico-node-c7xnd                          1/1   Running
       coredns-6d64d47ff4-99nhq                   1/1   Running
       coredns-6d64d47ff4-nhh95                   1/1   Running

#. Run the :command:`system restore-complete` command.

   .. code-block:: none

       ~(keystone_admin)]$ system restore-complete

#. Wait for the 750.006 alarms to disappear, one at a time, as the apps are
   auto applied.
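
   You can watch the remaining alarms clear with the following check \(shown
   as an illustrative example; alarm content varies by system\):

   .. code-block:: none

       # Example only: list active alarms while the applications are re-applied.
       ~(keystone_admin)]$ fm alarm-list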

.. rubric:: |postreq|

.. _restoring-starlingx-system-data-and-storage-ul-b2b-shg-plb:

- Passwords for local user accounts must be restored manually since they
  are not included as part of the backup and restore procedures.

- After restoring a |prod-dc| subcloud, you need to bring it back
  to the **managed** state on the Central Cloud \(SystemController\), by
  using the following commands:

  .. code-block:: none

      $ source /etc/platform/openrc
      ~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

  where <subcloud-name> is the name of the subcloud to be managed.

.. comments in steps seem to throw numbering off.

.. xreflink removed from step 'Install the |prod| ISO software on controller-0 from the USB flash
   drive.':
   For details, refer to the |inst-doc|: :ref:`Installing Software on
   controller-0 <installing-software-on-controller-0>`. Perform the
   installation procedure for your system and *stop* at the step that
   requires you to configure the host as a controller.

.. xreflink removed from step 'Install network connectivity required for the subcloud.':
   For details, refer to the |distcloud-doc|: :ref:`Installing and
   Provisioning a Subcloud <installing-and-provisioning-a-subcloud>`.