Eric MacDonald 53719fe07f Report tool support for subcloud collect bundles
Subcloud collect bundles have an extra level of directory hierarchy.

This update refactors the report.py bundle search and extraction
handling to support both single and multi host and subcloud collect
bundles.

Typical use is

    report.py <bundle pointer option> /path/to/bundle

Bundle pointer options

--bundle    Use this option to point to a 'directory' that 'contains'
            host tarball files.

--directory Use this option when a collect bundle 'tar file' is in a
            specific 'directory'.

--file      Use this option to point to a specific collect bundle
            tar file to analyze.

The following additional changes / improvements were made:

- improved report.py code structure
- improved management of the input and output dirs
- improved debug and error logging (new --state option)
- removed --clean option that can fail due to bundle file permissions
- added --bundle option to support pointing to a directory
  containing a set of host tarballs.
- modified collect to use the new --bundle option when --report
  option is used.
- implement tool logfile migration from /tmp to bundle output_dir
- create report_analysis dir in final output_dir only
- fix file permissions to allow execution from git
- order plugin analysis output based on size
- added additional error checking and handling

Test Plan:

PASS: Verify collect --report (std system, AIO and subcloud)
PASS: Verify report analysis
PASS: Verify report run on-system, git and cached copy

PASS: Verify on and off system analysis of
PASS: ... single-host collect bundle with --file option
PASS: ... multi-host collect bundle with --file option
PASS: ... single-subcloud collect bundle with --file option
PASS: ... multi-subcloud collect bundle with --file option
PASS: ... single-host collect bundle with --directory option
PASS: ... multi-host collect bundle with --directory option
PASS: ... single-subcloud collect bundle with --directory option
PASS: ... multi-subcloud collect bundle with --directory option
PASS: ... single-host collect bundle with --bundle option
PASS: ... multi-host collect bundle with --bundle option

PASS: Verify --directory option handling when
PASS: ... there are multiple bundles to select from (pass)
PASS: ... there is a bundle without the date_time (prompt)
PASS: ... there are extra non-bundle files in target dir (ignore)
PASS: ... the target dir only contains host tarballs (fail)
PASS: ... the target dir has no tar files or extracted bundle (fail)
PASS: ... the target dir does not exist (fail)

PASS: Verify --bundle option handling when
PASS: ... there are host tarballs in the target directory (pass)
PASS: ... there are only extracted host dirs in target dir (pass)
PASS: ... there are no host tarballs or dirs in target dir (fail)
PASS: ... the target dir does not have a dated host dir (fail)
PASS: ... the target dir does not exist (fail)
PASS: ... the target is a file rather than a dir (fail)

PASS: Verify --file option handling when
PASS: ... the target tar file is found (pass)
PASS: ... the target tar file is not date_time named (prompt)
PASS: ... the target tar file does not exist (fail)
PASS: ... the target tar is not a collect bundle (fail)

PASS: Verify tar file(s) in a single and multi-subcloud collect
      with the --report option each include a report analysis.
PASS: Verify logging with and without --debug and --state options
PASS: Verify error handling when no -b, -f or -d option is specified

Story: 2010533
Task: 48187
Change-Id: I4924034aa27577f94e97928265c752c204a447c7
Signed-off-by: Eric MacDonald <eric.macdonald@windriver.com>
2023-06-13 21:59:03 +00:00

The Report tool is used to gather relevant log, events
and information about the system from a collect bundle
and present that data for quick / easy issue analysis.

Report can run directly from a cloned StarlingX utilities git repository

    ${MY_REPO}/stx/utilities/tools/collector/debian-scripts/report/report.py {options}

Report is installed and can be run on any 22.12 POSTGA system node.

    /usr/local/bin/report/report.py --directory /scratch

Report can also be commanded to automatically run during a collect operation

    collect all --report

See Report's --help option for additional optional command line arguments.

    report.py --help

Selecting the right command option for your collect bundle:

   Report is designed to analyze a host or subcloud 'collect bundle'.
   Report needs to be told where to find the collect bundle to analyze
   using one of three options:

   Analyze Host Bundle: --bundle or -b option
   -------------------

      Use this option to point to a 'directory' that 'contains'
      host tarball files.

          report.py --bundle /scratch/ALL_NODES_YYYYMMDD.hhmmss

      Such a directory contains one tarball per host, each ending in .tgz:

          /scratch/ALL_NODES_YYYYMMDD.hhmmss
          ├── controller-0_YYYYMMDD.hhmmss.tgz
          └── controller-1_YYYYMMDD.hhmmss.tgz

      This is the option collect uses to auto analyze a just
      collected bundle with the collect --report option.
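
As a rough illustration of what "a directory that contains host tarballs" means, the sketch below lists the .tgz files found directly inside a directory. The function name find_host_tarballs is hypothetical and is not taken from report.py; the real tool may apply additional checks.

```python
from pathlib import Path


def find_host_tarballs(bundle_dir):
    """Return the sorted host tarballs (*.tgz) directly inside bundle_dir.

    Hypothetical helper for illustration only; not part of report.py.
    """
    path = Path(bundle_dir)
    if not path.is_dir():
        # A missing path or a plain file cannot be a host bundle directory.
        raise NotADirectoryError(f"{bundle_dir} is not a directory")
    # Only regular files ending in .tgz count; extracted host dirs are skipped.
    return sorted(p for p in path.iterdir() if p.is_file() and p.suffix == ".tgz")
```

An empty result would correspond to a directory with no host tarballs in it.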

   Analyze Directory: --directory or -d option
   -----------------

      Use this option when a collect bundle 'tar file' is in a
      specific 'directory'. If there are multiple collect
      bundles in that directory then the tool will prompt the
      user to select one from a list.

          report.py --directory /scratch

          0 - exit
          1 - ALL_NODES_20230608.235225
          2 - ALL_NODES_20230609.004604
          Please select bundle to analyze:

      Analysis proceeds automatically if there is only a
      single collect bundle found.
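
The selection behaviour described above can be sketched as follows. This is a minimal illustration, not the report.py implementation: the names BUNDLE_RE, list_bundles and choose_bundle are hypothetical, and the sketch assumes bundles are named <name>_YYYYMMDD.hhmmss with an optional .tar suffix, so it does not cover bundles without a date_time in their name.

```python
import re
from pathlib import Path

# Assumed naming convention: <name>_YYYYMMDD.hhmmss, optionally ending in .tar.
BUNDLE_RE = re.compile(r"^(?P<name>.+_\d{8}\.\d{6})(\.tar)?$")


def list_bundles(directory):
    """Return the sorted, de-duplicated bundle names found in directory."""
    names = set()
    for entry in Path(directory).iterdir():
        match = BUNDLE_RE.match(entry.name)
        if match:
            # A tar file and its extracted directory map to the same name.
            names.add(match.group("name"))
    return sorted(names)


def choose_bundle(bundles, ask=input):
    """Return the chosen bundle name, or None if there is nothing to analyze."""
    if not bundles:
        return None
    if len(bundles) == 1:
        # A single bundle needs no prompt.
        return bundles[0]
    print("0 - exit")
    for index, name in enumerate(bundles, 1):
        print(f"{index} - {name}")
    while True:
        reply = ask("Please select bundle to analyze: ").strip()
        if reply.isdecimal() and int(reply) <= len(bundles):
            choice = int(reply)
            return None if choice == 0 else bundles[choice - 1]
```

Passing the prompt function as the ask parameter keeps the selection logic testable without a terminal.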

   Analyze Specific Collect Bundle tar file: --file or -f option
   ----------------------------------------

      Use this option to point to a specific collect bundle
      tar file to analyze.

          report.py --file /scratch/ALL_NODES_YYYYMMDD.hhmmss.tar
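
For readers who want to see how three such pointer options might be declared, here is a minimal argparse sketch. The long and short option names come from this document; treating them as mutually exclusive and required is an assumption made for illustration, and build_parser is a hypothetical name rather than a function from report.py.

```python
import argparse


def build_parser():
    """Build a parser with the three bundle pointer options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="report.py")
    # Assumption: exactly one bundle pointer option must be given.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("-b", "--bundle", metavar="DIR",
                       help="directory that contains host tarball files")
    group.add_argument("-d", "--directory", metavar="DIR",
                       help="directory that holds a collect bundle tar file")
    group.add_argument("-f", "--file", metavar="TAR",
                       help="specific collect bundle tar file to analyze")
    return parser
```

With this sketch, build_parser().parse_args(["-f", "/scratch/bundle.tar"]) sets the file attribute and leaves bundle and directory as None, while an empty argument list makes argparse exit with a usage error.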

Host vs Subcloud Collect Bundles:

Expected Host Bundle Format:

    ├── SELECT_NODES_YYYYMMDD.hhmmss.tar
    └── SELECT_NODES_YYYYMMDD.hhmmss
        ├── controller-0_YYYYMMDD.hhmmss
        ├── controller-0_YYYYMMDD.hhmmss.tgz
        ├── controller-1_YYYYMMDD.hhmmss
        ├── controller-1_YYYYMMDD.hhmmss.tgz
        ├── worker-0_YYYYMMDD.hhmmss
        └── worker-1_YYYYMMDD.hhmmss.tgz

Expected Subcloud Bundle Format:

    ├── SELECT_SUBCLOUDS_YYYYMMDD.hhmmss.tar
    └── SELECT_SUBCLOUDS_YYYYMMDD.hhmmss
        ├── subcloudX_YYYYMMDD.hhmmss.tar
        ├── subcloudX_YYYYMMDD.hhmmss
        │   ├── controller-0_YYYYMMDD.hhmmss
        │   ├── controller-0_YYYYMMDD.hhmmss.tgz
        │   ├── report_analysis
        │   └── report_tool.tgz
        ├── subcloudY_YYYYMMDD.hhmmss.tar
        ├── subcloudY_YYYYMMDD.hhmmss
        │   ├── controller-0_YYYYMMDD.hhmmss
        │   ├── controller-0_YYYYMMDD.hhmmss.tgz
        │   ├── report_analysis
        │   └── report_tool.tgz
        ├── subcloudZ_YYYYMMDD.hhmmss.tar
        └── subcloudZ_YYYYMMDD.hhmmss
            ├── controller-0_YYYYMMDD.hhmmss
            └── controller-0_YYYYMMDD.hhmmss.tgz
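
The two layouts differ in what sits at the top level of the extracted bundle: host bundles hold .tgz host tarballs, while subcloud bundles hold per-subcloud .tar files. The sketch below turns that observation into a simple heuristic. It is based only on the layouts shown here, classify_bundle is a hypothetical name, and report.py may distinguish the two cases differently.

```python
from pathlib import Path


def classify_bundle(extracted_dir):
    """Guess the bundle type from the top-level entries of extracted_dir.

    Returns 'host', 'subcloud' or 'unknown'. Heuristic for illustration only.
    """
    names = [entry.name for entry in Path(extracted_dir).iterdir()]
    if any(name.endswith(".tgz") for name in names):
        # Host tarballs sit directly inside a host bundle.
        return "host"
    if any(name.endswith(".tar") for name in names):
        # Subcloud bundles add one level: a tar file per subcloud.
        return "subcloud"
    return "unknown"
```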

If there are multiple bundles found at the specified --directory
then the list is displayed and the user is prompted to select a
bundle from the list.

This would be typical when analyzing a selected subcloud collect
bundle, as in the example below:

        $ report -d /localdisk/issues/SELECT_SUBCLOUDS_YYYYMMDD.hhmmss.tar

Report will extract the subcloud tar file and, if it sees more
than one tar file, it will prompt the user to select which one
to analyze:

        0 - exit
        1 - subcloudX_YYYYMMDD.hhmmss
        2 - subcloudY_YYYYMMDD.hhmmss
        3 - subcloudZ_YYYYMMDD.hhmmss
        Please select the bundle to analyze:

Refer to the report.py file header for a description of the tool.

Report places the report analysis in the bundle itself.
Consider the following collect bundle structure and notice
the 'report_analysis' folder which contains the Report analysis.

    SELECT_NODES_20220527.193605
    ├── controller-0_20220527.193605
    │   ├── etc
    │   ├── root
    │   └── var
    ├── controller-1_20220527.193605
    │   ├── etc
    │   ├── root
    │   └── var
    └── report_analysis (where the output files will be placed)

Pass a collect bundle to Report's CLI for two phases of processing ...

    Phase 1: Process algorithm specific plugins to collect plugin
             specific 'report logs'. Basically fault, event,
             alarm and state change logs.

    Phase 2: Run the correlator against the plugin found 'report logs'
             to produce descriptive strings that represent failures
             that were found in the collect bundle and to summarize
             the events, alarms and state change data.
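
The two phases can be pictured with a toy example. Everything in this sketch is a hypothetical stand-in: the sample log lines are invented, and run_plugin and correlate are illustrative names, not Report's plugin or correlator code. It only shows the shape of the flow: a plugin filters raw lines into 'report logs', and a correlator turns those into descriptive failure strings.

```python
# Invented sample lines in the form "<timestamp> <host> <message>".
SAMPLE_LINES = [
    "2023-03-07T05:01:52 controller-0 sample: heartbeat loss detected",
    "2023-03-07T05:02:10 controller-0 sample: service started",
    "2023-03-07T05:03:00 controller-1 sample: audit complete",
]


def run_plugin(lines, substring):
    """Phase 1: collect the 'report logs' that contain substring."""
    return [line for line in lines if substring in line]


def correlate(report_logs, description):
    """Phase 2: turn each report log into '<timestamp> <host> <description>'."""
    results = []
    for line in report_logs:
        # Split off the timestamp and host; the rest of the line is not needed.
        timestamp, host, _rest = line.split(" ", 2)
        results.append(f"{timestamp} {host} {description}")
    return results


report_logs = run_plugin(SAMPLE_LINES, "heartbeat loss")
failures = correlate(report_logs, "heartbeat loss failure")
```

Here failures holds a single string for controller-0, in the same style as the failure lines shown in the example analysis.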

Report then produces a report analysis that gets stored with
the original bundle.

Example Analysis:

$ report -d /localdisk/CGTS-44887

extracting /localdisk/CGTS-44887/ALL_NODES_20230307.183540.tar

Report: /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis

extracting : /localdisk/CGTS-44887/ALL_NODES_20230307.183540/controller-1_20230307.183540.tgz
extracting : /localdisk/CGTS-44887/ALL_NODES_20230307.183540/compute-0_20230307.183540.tgz
extracting : /localdisk/CGTS-44887/ALL_NODES_20230307.183540/controller-0_20230307.183540.tgz
extracting : /localdisk/CGTS-44887/ALL_NODES_20230307.183540/compute-1_20230307.183540.tgz

Active Ctrl: controller-1
System Type: All-in-one
S/W Version: 22.12
System Mode: duplex
DC Role    : systemcontroller
Node Type  : controller
subfunction: controller,worker
Mgmt Iface : vlan809
Clstr Iface: vlan909
OAM Iface  : eno8403
OS Release : Debian GNU/Linux 11 (bullseye)
Build Type : Formal
Build Date : 2023-03-01 23:00:06 +0000
controllers: controller-1,controller-0
workers    : compute-1,compute-0

Plugin Results:

  621 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/log
  221 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/swact_activity
  132 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/alarm
   85 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/substring_controller-0
   60 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/system_info
   54 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/maintenance_errors
   36 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/heartbeat_loss
   26 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/process_failures
   16 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/state_changes
   13 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/substring_controller-1
    2 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/plugins/puppet_errors

... nothing found by plugins: daemon_failures

Correlated Results:

Events       : 8  /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/events
Alarms       : 26 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/alarms
State Changes: 16 /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/state_changes
Failures     : 4  /localdisk/CGTS-44887/ALL_NODES_20230307.183540/report_analysis/failures
2023-03-07T05:00:11 controller-0 uncontrolled swact
2023-03-07T05:01:52 controller-0 heartbeat loss failure
2023-03-07T17:42:35 controller-0 configuration failure
2023-03-07T17:58:06 controller-0 goenabled failure

Inspect the Correlated and Plugin results files for failures,
alarms, events and state changes.
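
The Plugin Results listing in the example analysis is ordered largest first. A minimal sketch of producing such a listing is shown below. It assumes the leading number is a line count, which this document does not confirm, and plugin_results_by_size is a hypothetical name rather than a report.py function.

```python
from pathlib import Path


def plugin_results_by_size(plugin_dir):
    """Return (line_count, path) pairs for non-empty files, largest first.

    Assumption: 'size' means the number of lines in each plugin output file.
    """
    results = []
    for path in Path(plugin_dir).iterdir():
        if not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            count = sum(1 for _ in handle)
        if count:
            # Empty files correspond to plugins that found nothing.
            results.append((count, path))
    return sorted(results, key=lambda item: item[0], reverse=True)
```

Each pair can then be printed as a count followed by the file path, for example with print(f"{count:5} {path}").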