
.. Greg updates required for -High Security Vulnerability Document Updates

.. uzk1552923967458

.. _restoring-starlingx-system-data-and-storage:

========================================
Restore Platform System Data and Storage
========================================

You can perform a system restore (controllers, workers, including or excluding
storage nodes) of a |prod| cluster from a previous system backup and bring it
back to the operational state it was in when the backup procedure took place.

There are two restore modes: optimized restore and legacy restore. Optimized
restore must be used on |AIO-SX| systems, and legacy restore must be used on
systems that are not |AIO-SX|.

.. rubric:: |context|

The Kubernetes configuration will be restored, and pods that are started from
repositories accessible from the internet or from external repositories will
start immediately. |prod|-specific applications must be re-applied once a
storage cluster is configured.

Everything is restored as it was when the backup was created, except for
optional data that was not defined.

See :ref:`Back Up System Data <backing-up-starlingx-system-data>` for more
details on the backup.

.. warning::

   The system backup file can only be used to restore the system from which
   the backup was made. You cannot use this backup file to restore the system
   to different hardware.

   To restore the backup, use the same version of the boot image (ISO) and
   patches that were installed at the time of the backup.

The |prod| restore supports the following optional modes:

.. _restoring-starlingx-system-data-and-storage-ol-tw4-kvc-4jb:

- To keep the Ceph cluster data intact (false, the default option), use the
  following parameter when passing the extra arguments to the Ansible Restore
  playbook command:

  .. code-block:: none

     wipe_ceph_osds=false

- To wipe the Ceph cluster entirely (true), where the Ceph cluster will need
  to be recreated, or if the Ceph partition was previously wiped, such as
  during a fresh install between backup and restore or during a reinstall, use
  the following parameter:

  .. code-block:: none

     wipe_ceph_osds=true

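For example, a representative invocation of the platform restore playbook that
keeps the Ceph data intact might look like the following; the backup file name
and passwords shown here are placeholders that you must replace with values
for your system:

.. code-block:: none

   $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_platform.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_platform_backup_2023_07_15_21_24_22.tgz ansible_become_pass=<sysadmin-password> admin_password=<admin-password> wipe_ceph_osds=false"
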
Restoring a |prod| cluster from a backup file is done by reinstalling the
ISO on controller-0, applying updates (patches), running the Ansible Restore
Playbook, unlocking controller-0, and then powering on and unlocking the
remaining hosts, one host at a time: first the controllers, then the storage
hosts (ONLY if required), and finally the compute (worker) hosts. Lastly, run
the :command:`system restore-complete` command.

.. rubric:: |prereq|

Before you start the restore procedure you must ensure the following
conditions are in place:

.. _restoring-starlingx-system-data-and-storage-ul-rfq-qfg-mp:

- All cluster hosts must be prepared for network boot and then powered
  down. You can prepare a host for network boot.

  .. note::
     If you are restoring system data only, do not lock, power off or
     prepare the storage hosts to be reinstalled.

- The backup file must be accessible locally if the restore is done by
  running the Ansible Restore playbook locally on the controller, or
  accessible remotely if the restore is done by running the Ansible Restore
  playbook remotely.

- You have the original |prod| ISO installation image available on a USB
  flash drive. It is mandatory that you use the exact same version of the
  software used during the original installation, otherwise the restore
  procedure will fail.

- The restore procedure requires all hosts but controller-0 to boot
  over the internal management network using the |PXE| protocol. Ideally, the
  old boot images are no longer present, so that the hosts boot from the
  network when powered on. If this is not the case, you must configure each
  host manually for network boot immediately after powering it on.

- If you are restoring a |prod-dc| subcloud, first ensure it is in
  an **unmanaged** state on the Central Cloud (SystemController) by using
  the following commands:

  .. code-block:: none

     $ source /etc/platform/openrc
     ~(keystone_admin)]$ dcmanager subcloud unmanage <subcloud-name>

  where ``<subcloud-name>`` is the name of the subcloud to be unmanaged.

  For more information, see:

  - :ref:`Backup a Subcloud/Group of Subclouds using DCManager CLI <backup-a-subcloud-group-of-subclouds-using-dcmanager-cli-f12020a8fc42>`

  - :ref:`Restore a Subcloud/Group of Subclouds from Backup Data Using DCManager CLI <restore-a-subcloud-group-of-subclouds-from-backup-data-using-dcmanager-cli-f10c1b63a95e>`

.. rubric:: |proc|

#. Power down all hosts.

   If you have a storage host and want to retain Ceph data, then power down
   all the nodes except the storage hosts; the cluster has to be functional
   during a restore operation.

   .. caution::
      Do not use :command:`wipedisk` before a restore operation. This will
      lead to data loss on your Ceph cluster. It is safe to use
      :command:`wipedisk` during an initial installation, while reinstalling
      a host, or during an upgrade.

#. Install the |prod| ISO software on controller-0 from the USB flash
   drive.

   You can now log in using the host's console.

#. Log in to the console as user **sysadmin** with password **sysadmin**.

#. Install network connectivity required for the subcloud.

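   The following is a minimal sketch of adding temporary OAM connectivity from
   the console; the interface name, addresses, and gateway are placeholders
   that you must replace with values for your site:

   .. code-block:: none

      $ sudo ip address add <oam-ip-address>/<prefix-length> dev <oam-interface>
      $ sudo ip link set up dev <oam-interface>
      $ sudo ip route add default via <oam-gateway-ip> dev <oam-interface>
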
#. Ensure that the system is at the same patch level as it was when the backup
   was taken. **You must manually reinstall any previous patches and reboot the
   system** (for reboot-required patches) to prevent restore failures due to
   mismatched patch levels.

   .. note::

      This is mandatory for |AIO-SX| (optimized) deployments. For legacy
      restores it is only mandatory if either the ``exclude_patches``
      (backup) or ``skip_patches_restore`` (restore) flags are used.

      It is recommended to restore subclouds only when there is an existing
      backup taken at the same patch level as the system controller.

   For steps on how to install patches using the :command:`sw-patch install-local`
   command, see the ``Install Software on Controller-0`` section of
   :ref:`aio_simplex_install_kubernetes_r7`.

   After the reboot, you can verify that the updates were applied.

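   For example, on systems that use the legacy :command:`sw-patch` tooling,
   you can list the applied patches to confirm the patch level; the output
   varies by system:

   .. code-block:: none

      $ sudo sw-patch query
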
   .. only:: partner

      .. include:: /_includes/restore-platform-system-data-and-storage-b92b8bdaf16d.rest
         :start-after: sw-patch-query-begin
         :end-before: sw-patch-query-end

   .. note::

      On systems that are not |AIO-SX|, you can skip this step if
      ``skip_patching=true`` is not used. Patches are automatically
      reinstalled from the backup by default.

#. Ensure that the backup files are available on the controller. Run both
   Ansible Restore playbooks, ``restore_platform.yml`` and
   ``restore_user_images.yml``. For more information on restoring the backup
   file, see :ref:`Run Restore Playbook Locally on the Controller
   <running-restore-playbook-locally-on-the-controller>`, and :ref:`Run
   Ansible Restore Playbook Remotely
   <system-backup-running-ansible-restore-playbook-remotely>`.

   .. note::

      The backup files contain the system data and updates.

   The restore operation will pull missing images from the upstream registries.

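   For example, to confirm that the platform backup archive is present on the
   controller before running the playbooks (the directory and file name shown
   here are placeholders based on the examples in this procedure):

   .. code-block:: none

      $ ls -lh /home/sysadmin/localhost_platform_backup_*.tgz
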
#. Restore the local registry using the file ``restore_user_images.yml``.

   Example:

   .. code-block:: none

      ~(keystone_admin)]$ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_user_images.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_user_images_backup_2023_07_15_21_24_22.tgz ansible_become_pass=St8rlingXCloud*"

   .. note::

      - This step applies only if a user images backup was created during the
        backup operation.

      - The ``user_images_backup*.tgz`` file is created during the backup only
        if ``backup_user_images`` is true.

   This must be done before unlocking controller-0.

#. Unlock controller-0.

   .. code-block:: none

      ~(keystone_admin)]$ system host-unlock controller-0

   After you unlock controller-0, storage nodes become available and Ceph
   becomes operational.

#. If the system is a Distributed Cloud system controller, restore the
   **dc-vault** using the ``restore_dc_vault.yml`` playbook. Perform this step
   after unlocking controller-0:

   .. code-block:: none

      $ ansible-playbook /usr/share/ansible/stx-ansible/playbooks/restore_dc_vault.yml -e "initial_backup_dir=/home/sysadmin backup_filename=localhost_dc_vault_backup_2020_07_15_21_24_22.tgz ansible_become_pass=St0rlingX*"

   .. note::
      The dc-vault backup archive is created by the ``backup.yml`` playbook.

#. Authenticate the system as Keystone user **admin**.

   Source the **admin** user environment as follows:

   .. code-block:: none

      $ source /etc/platform/openrc

#. Monitor the applications as they transition from the 'restore-requested'
   state to the 'applying' state, and from the 'applying' state to the
   'applied' state.

   If applications transition back from 'applying' to 'restore-requested',
   ensure there is network access and access to the Docker registry.

   The process is repeated once per minute until all applications have
   transitioned to 'applied'.

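   You can watch the application states with the
   :command:`system application-list` command; the applications listed will
   vary by system:

   .. code-block:: none

      ~(keystone_admin)]$ system application-list
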
#. If you have a Duplex system, restore the **controller-1** host.

   #. List the current state of the hosts.

      .. code-block:: none

         ~(keystone_admin)]$ system host-list
         +----+--------------+-------------+----------------+-------------+--------------+
         | id | hostname     | personality | administrative | operational | availability |
         +----+--------------+-------------+----------------+-------------+--------------+
         | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
         | 2  | controller-1 | controller  | locked         | disabled    | offline      |
         | 3  | storage-0    | storage     | locked         | disabled    | offline      |
         | 4  | storage-1    | storage     | locked         | disabled    | offline      |
         | 5  | compute-0    | worker      | locked         | disabled    | offline      |
         | 6  | compute-1    | worker      | locked         | disabled    | offline      |
         +----+--------------+-------------+----------------+-------------+--------------+

   #. Power on the host.

      Ensure that the host boots from the network, and not from any disk
      image that may be present.

      The software is installed on the host, and then the host is
      rebooted. Wait for the host to be reported as **locked**, **disabled**,
      and **online**.

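      If the host's board management controller (BMC) is provisioned, you can
      optionally power the host on from controller-0 instead of doing it
      manually; this sketch assumes BMC access is configured:

      .. code-block:: none

         ~(keystone_admin)]$ system host-power-on controller-1
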
   #. Unlock controller-1.

      .. code-block:: none

         ~(keystone_admin)]$ system host-unlock controller-1
         +-----------------+--------------------------------------+
         | Property        | Value                                |
         +-----------------+--------------------------------------+
         | action          | none                                 |
         | administrative  | locked                               |
         | availability    | online                               |
         | ...             | ...                                  |
         | uuid            | 5fc4904a-d7f0-42f0-991d-0c00b4b74ed0 |
         +-----------------+--------------------------------------+

   #. Verify the state of the hosts.

      .. code-block:: none

         ~(keystone_admin)]$ system host-list
         +----+--------------+-------------+----------------+-------------+--------------+
         | id | hostname     | personality | administrative | operational | availability |
         +----+--------------+-------------+----------------+-------------+--------------+
         | 1  | controller-0 | controller  | unlocked       | enabled     | available    |
         | 2  | controller-1 | controller  | unlocked       | enabled     | available    |
         | 3  | storage-0    | storage     | locked         | disabled    | offline      |
         | 4  | storage-1    | storage     | locked         | disabled    | offline      |
         | 5  | compute-0    | worker      | locked         | disabled    | offline      |
         | 6  | compute-1    | worker      | locked         | disabled    | offline      |
         +----+--------------+-------------+----------------+-------------+--------------+

#. Restore storage configuration. If :command:`wipe_ceph_osds` is set to
   **True**, follow the same procedure used to restore **controller-1**,
   beginning with host **storage-0** and proceeding in sequence.

   .. note::
      This step should be performed ONLY if you are restoring storage hosts.

   #. For storage hosts, there are two options:

      With the controller software installed and updated to the same level
      that was in effect when the backup was performed, you can perform
      the restore procedure without interruption.

      For Standard with Controller Storage systems, whether the storage hosts
      are installed or reinstalled depends on the :command:`wipe_ceph_osds`
      configuration:

      #. If :command:`wipe_ceph_osds` is set to **true**, reinstall the
         storage hosts.

      #. If :command:`wipe_ceph_osds` is set to **false** (the default
         option), do not reinstall the storage hosts.

      .. caution::
         Do not reinstall or power off the storage hosts if you want to
         keep previous Ceph cluster data. A reinstall of storage hosts
         will lead to data loss.

   #. Ensure that the Ceph cluster is healthy. Verify that the three Ceph
      monitors (controller-0, controller-1, storage-0) are running in
      quorum.

      .. code-block:: none

         ~(keystone_admin)]$ ceph -s
           cluster:
             id:     3361e4ef-b0b3-4f94-97c6-b384f416768d
             health: HEALTH_OK

           services:
             mon: 3 daemons, quorum controller-0,controller-1,storage-0
             mgr: controller-0(active), standbys: controller-1
             osd: 10 osds: 10 up, 10 in

           data:
             pools:   5 pools, 600 pgs
             objects: 636 objects, 2.7 GiB
             usage:   6.5 GiB used, 2.7 TiB / 2.7 TiB avail
             pgs:     600 active+clean

           io:
             client: 85 B/s rd, 336 KiB/s wr, 0 op/s rd, 67 op/s wr

      .. caution::
         Do not proceed until the Ceph cluster is healthy and the message
         HEALTH_OK appears.

         If the message HEALTH_WARN appears, wait a few minutes and then try
         again. If the warning condition persists, consult the public
         documentation for troubleshooting Ceph monitors (for example,
         http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-mon/).

#. Restore the compute (worker) hosts, one at a time, following the same
   procedure used to restore controller-1.

#. Allow the Calico and CoreDNS pods to be recovered by Kubernetes. They
   should all be in the 'N/N Running' state.

   The state of the pods when the restore operation is complete is as
   follows:

   .. code-block:: none

      ~(keystone_admin)]$ kubectl get pods -n kube-system | grep -e calico -e coredns
      calico-kube-controllers-5cd4695574-d7zwt   1/1   Running
      calico-node-6km72                          1/1   Running
      calico-node-c7xnd                          1/1   Running
      coredns-6d64d47ff4-99nhq                   1/1   Running
      coredns-6d64d47ff4-nhh95                   1/1   Running

#. If **wipe_ceph_osds** is set to true and all the system hosts are in an
   unlocked/enabled/available state, do the following:

   #. Remove and reapply **platform-integ-apps**. This step will re-create
      the default Ceph pools (they were deleted):

      .. code-block:: none

         $ system application-remove platform-integ-apps
         $ system application-apply platform-integ-apps

   #. Completely delete and reapply all the applications that have
      persistent volumes (OpenStack or custom apps). For example, for
      OpenStack, run the following commands:

      .. parsed-literal::

         $ system application-remove |prefix|-openstack
         $ system application-delete |prefix|-openstack
         $ system application-upload |prefix|-openstack-20.12-0.tgz
         $ system application-apply |prefix|-openstack

#. Run the :command:`system restore-complete` command.

   .. code-block:: none

      ~(keystone_admin)]$ system restore-complete

#. Wait for the 750.006 alarms to disappear. They clear one at a time as the
   applications are automatically re-applied.

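   You can check the remaining alarms with the :command:`fm alarm-list`
   command; the output will vary by system:

   .. code-block:: none

      ~(keystone_admin)]$ fm alarm-list
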
.. rubric:: |postreq|

.. _restoring-starlingx-system-data-and-storage-ul-b2b-shg-plb:

- Passwords for local user accounts must be restored manually since they
  are not included as part of the backup and restore procedures.

- After restoring a |prod-dc| subcloud, you need to bring it back
  to the **managed** state on the Central Cloud (SystemController) by
  using the following commands:

  .. code-block:: none

     $ source /etc/platform/openrc
     ~(keystone_admin)]$ dcmanager subcloud manage <subcloud-name>

  where ``<subcloud-name>`` is the name of the subcloud to be managed.

.. comments in steps seem to throw numbering off.

.. xreflink removed from step 'Install the |prod| ISO software on controller-0 from the USB flash
   drive.':
   For details, refer to the |inst-doc|: :ref:`Installing Software on
   controller-0 <installing-software-on-controller-0>`. Perform the
   installation procedure for your system and *stop* at the step that
   requires you to configure the host as a controller.

.. xreflink removed from step 'Install network connectivity required for the subcloud.':
   For details, refer to the |distcloud-doc|: :ref:`Installing and
   Provisioning a Subcloud <installing-and-provisioning-a-subcloud>`.