
At now, an official name for Telemetry uses the word "service". Change-Id: Ic768c8ee2bc45c97421a6be04dafa59a617cd7d9
778 lines
30 KiB
ReStructuredText
778 lines
30 KiB
ReStructuredText
======================
|
||
Logging and Monitoring
|
||
======================
|
||
|
||
As an OpenStack cloud is composed of so many different services, there
|
||
are a large number of log files. This chapter aims to assist you in
|
||
locating and working with them and describes other ways to track the
|
||
status of your deployment.
|
||
|
||
Where Are the Logs?
|
||
~~~~~~~~~~~~~~~~~~~
|
||
|
||
Most services use the convention of writing their log files to
|
||
subdirectories of the ``/var/log directory``, as listed in the
|
||
below table.
|
||
|
||
.. list-table:: OpenStack log locations
|
||
:widths: 33 33 33
|
||
:header-rows: 1
|
||
|
||
* - Node type
|
||
- Service
|
||
- Log location
|
||
* - Cloud controller
|
||
- ``nova-*``
|
||
- ``/var/log/nova``
|
||
* - Cloud controller
|
||
- ``glance-*``
|
||
- ``/var/log/glance``
|
||
* - Cloud controller
|
||
- ``cinder-*``
|
||
- ``/var/log/cinder``
|
||
* - Cloud controller
|
||
- ``keystone-*``
|
||
- ``/var/log/keystone``
|
||
* - Cloud controller
|
||
- ``neutron-*``
|
||
- ``/var/log/neutron``
|
||
* - Cloud controller
|
||
- horizon
|
||
- ``/var/log/apache2/``
|
||
* - All nodes
|
||
- misc (swift, dnsmasq)
|
||
- ``/var/log/syslog``
|
||
* - Compute nodes
|
||
- libvirt
|
||
- ``/var/log/libvirt/libvirtd.log``
|
||
* - Compute nodes
|
||
- Console (boot up messages) for VM instances:
|
||
- ``/var/lib/nova/instances/instance-<instance id>/console.log``
|
||
* - Block Storage nodes
|
||
- cinder-volume
|
||
- ``/var/log/cinder/cinder-volume.log``
|
||
|
||
|
||
Reading the Logs
|
||
~~~~~~~~~~~~~~~~
|
||
|
||
OpenStack services use the standard logging levels, at increasing
|
||
severity: DEBUG, INFO, AUDIT, WARNING, ERROR, CRITICAL, and TRACE. That
|
||
is, messages only appear in the logs if they are more "severe" than the
|
||
particular log level, with DEBUG allowing all log statements through.
|
||
For example, TRACE is logged only if the software has a stack trace,
|
||
while INFO is logged for every message including those that are only for
|
||
information.
|
||
|
||
To disable DEBUG-level logging, edit ``/etc/nova/nova.conf`` file as follows:
|
||
|
||
.. code-block:: ini
|
||
|
||
debug=false
|
||
|
||
Keystone is handled a little differently. To modify the logging level,
|
||
edit the ``/etc/keystone/logging.conf`` file and look at the
|
||
``logger_root`` and ``handler_file`` sections.
|
||
|
||
Logging for horizon is configured in
|
||
``/etc/openstack_dashboard/local_settings.py``. Because horizon is
|
||
a Django web application, it follows the `Django Logging framework
|
||
conventions <https://docs.djangoproject.com/en/dev/topics/logging/>`_.
|
||
|
||
The first step in finding the source of an error is typically to search
|
||
for a CRITICAL, TRACE, or ERROR message in the log starting at the
|
||
bottom of the log file.
|
||
|
||
Here is an example of a CRITICAL log message, with the corresponding
|
||
TRACE (Python traceback) immediately following:
|
||
|
||
.. code-block:: console
|
||
|
||
2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
|
||
cinder-volumes doesn't exist
|
||
2013-02-25 21:05:51 17409 TRACE cinder Traceback (most recent call last):
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/bin/cinder-volume", line 48, in <module>
|
||
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 422, in wait
|
||
2013-02-25 21:05:51 17409 TRACE cinder _launcher.wait()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 127, in wait
|
||
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
|
||
2013-02-25 21:05:51 17409 TRACE cinder return self._exit_event.wait()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
|
||
2013-02-25 21:05:51 17409 TRACE cinder return hubs.get_hub().switch()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
|
||
2013-02-25 21:05:51 17409 TRACE cinder return self.greenlet.switch()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
|
||
2013-02-25 21:05:51 17409 TRACE cinder result = function(*args, **kwargs)
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 88, in run_server
|
||
2013-02-25 21:05:51 17409 TRACE cinder server.start()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 159, in start
|
||
2013-02-25 21:05:51 17409 TRACE cinder self.manager.init_host()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 95,
|
||
in init_host
|
||
2013-02-25 21:05:51 17409 TRACE cinder self.driver.check_for_setup_error()
|
||
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 116,
|
||
in check_for_setup_error
|
||
2013-02-25 21:05:51 17409 TRACE cinder raise exception.VolumeBackendAPIException(data=exception_message)
|
||
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume
|
||
backend API: volume group cinder-volumes doesn't exist
|
||
2013-02-25 21:05:51 17409 TRACE cinder
|
||
|
||
In this example, ``cinder-volumes`` failed to start and has provided a
|
||
stack trace, since its volume back end has been unable to set up the
|
||
storage volume—probably because the LVM volume that is expected from the
|
||
configuration does not exist.
|
||
|
||
Here is an example error log:
|
||
|
||
.. code-block:: console
|
||
|
||
2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
|
||
[Errno 111] ECONNREFUSED. Trying again in 23 seconds.
|
||
|
||
In this error, a nova service has failed to connect to the RabbitMQ
|
||
server because it got a connection refused error.
|
||
|
||
Tracing Instance Requests
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
When an instance fails to behave properly, you will often have to trace
|
||
activity associated with that instance across the log files of various
|
||
``nova-*`` services and across both the cloud controller and compute
|
||
nodes.
|
||
|
||
The typical way is to trace the UUID associated with an instance across
|
||
the service logs.
|
||
|
||
Consider the following example:
|
||
|
||
.. code-block:: console
|
||
|
||
$ nova list
|
||
+--------------------------------+--------+--------+--------------------------+
|
||
| ID | Name | Status | Networks |
|
||
+--------------------------------+--------+--------+--------------------------+
|
||
| fafed8-4a46-413b-b113-f1959ffe | cirros | ACTIVE | novanetwork=192.168.100.3|
|
||
+--------------------------------------+--------+--------+--------------------+
|
||
|
||
Here, the ID associated with the instance is
|
||
``faf7ded8-4a46-413b-b113-f19590746ffe``. If you search for this string
|
||
on the cloud controller in the ``/var/log/nova-*.log`` files, it appears
|
||
in ``nova-api.log`` and ``nova-scheduler.log``. If you search for this
|
||
on the compute nodes in ``/var/log/nova-*.log``, it appears in
|
||
``nova-network.log`` and ``nova-compute.log``. If no ERROR or CRITICAL
|
||
messages appear, the most recent log entry that reports this may provide
|
||
a hint about what has gone wrong.
|
||
|
||
Adding Custom Logging Statements
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
If there is not enough information in the existing logs, you may need to
|
||
add your own custom logging statements to the ``nova-*``
|
||
services.
|
||
|
||
The source files are located in
|
||
``/usr/lib/python2.7/dist-packages/nova``.
|
||
|
||
To add logging statements, the following line should be near the top of
|
||
the file. For most files, these should already be there:
|
||
|
||
.. code-block:: python
|
||
|
||
from nova.openstack.common import log as logging
|
||
LOG = logging.getLogger(__name__)
|
||
|
||
To add a DEBUG logging statement, you would do:
|
||
|
||
.. code-block:: python
|
||
|
||
LOG.debug("This is a custom debugging statement")
|
||
|
||
You may notice that all the existing logging messages are preceded by an
|
||
underscore and surrounded by parentheses, for example:
|
||
|
||
.. code-block:: python
|
||
|
||
LOG.debug(_("Logging statement appears here"))
|
||
|
||
This formatting is used to support translation of logging messages into
|
||
different languages using the
|
||
`gettext <https://docs.python.org/2/library/gettext.html>`_
|
||
internationalization library. You don't need to do this for your own
|
||
custom log messages. However, if you want to contribute the code back to
|
||
the OpenStack project that includes logging statements, you must
|
||
surround your log messages with underscores and parentheses.
|
||
|
||
RabbitMQ Web Management Interface or rabbitmqctl
|
||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Aside from connection failures, RabbitMQ log files are generally not
|
||
useful for debugging OpenStack related issues. Instead, we recommend you
|
||
use the RabbitMQ web management interface.RabbitMQlogging/monitoring
|
||
RabbitMQ web management interface Enable it on your cloud
|
||
controller:
|
||
|
||
.. code-block:: console
|
||
|
||
# /usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management
|
||
|
||
.. code-block:: console
|
||
|
||
# service rabbitmq-server restart
|
||
|
||
The RabbitMQ web management interface is accessible on your cloud
|
||
controller at *http://localhost:55672*.
|
||
|
||
.. note::
|
||
|
||
Ubuntu 12.04 installs RabbitMQ version 2.7.1, which uses port 55672.
|
||
RabbitMQ versions 3.0 and above use port 15672 instead. You can
|
||
check which version of RabbitMQ you have running on your local
|
||
Ubuntu machine by doing:
|
||
|
||
.. code-block:: console
|
||
|
||
$ dpkg -s rabbitmq-server | grep "Version:"
|
||
Version: 2.7.1-0ubuntu4
|
||
|
||
An alternative to enabling the RabbitMQ web management interface is to
|
||
use the ``rabbitmqctl`` commands. For example,
|
||
:command:`rabbitmqctl list_queues| grep cinder` displays any messages left in
|
||
the queue. If there are messages, it's a possible sign that cinder
|
||
services didn't connect properly to rabbitmq and might have to be
|
||
restarted.
|
||
|
||
Items to monitor for RabbitMQ include the number of items in each of the
|
||
queues and the processing time statistics for the server.
|
||
|
||
Centrally Managing Logs
|
||
~~~~~~~~~~~~~~~~~~~~~~~
|
||
|
||
Because your cloud is most likely composed of many servers, you must
|
||
check logs on each of those servers to properly piece an event together.
|
||
A better solution is to send the logs of all servers to a central
|
||
location so that they can all be accessed from the same
|
||
area.
|
||
|
||
Ubuntu uses rsyslog as the default logging service. Since it is natively
|
||
able to send logs to a remote location, you don't have to install
|
||
anything extra to enable this feature, just modify the configuration
|
||
file. In doing this, consider running your logging over a management
|
||
network or using an encrypted VPN to avoid interception.
|
||
|
||
rsyslog Client Configuration
|
||
----------------------------
|
||
|
||
To begin, configure all OpenStack components to log to syslog in
|
||
addition to their standard log file location. Also configure each
|
||
component to log to a different syslog facility. This makes it easier to
|
||
split the logs into individual components on the central server:
|
||
|
||
``nova.conf``:
|
||
|
||
.. code-block:: ini
|
||
|
||
use_syslog=True
|
||
syslog_log_facility=LOG_LOCAL0
|
||
|
||
``glance-api.conf`` and ``glance-registry.conf``:
|
||
|
||
.. code-block:: ini
|
||
|
||
use_syslog=True
|
||
syslog_log_facility=LOG_LOCAL1
|
||
|
||
``cinder.conf``:
|
||
|
||
.. code-block:: ini
|
||
|
||
use_syslog=True
|
||
syslog_log_facility=LOG_LOCAL2
|
||
|
||
``keystone.conf``:
|
||
|
||
.. code-block:: ini
|
||
|
||
use_syslog=True
|
||
syslog_log_facility=LOG_LOCAL3
|
||
|
||
By default, Object Storage logs to syslog.
|
||
|
||
Next, create ``/etc/rsyslog.d/client.conf`` with the following line:
|
||
|
||
.. code-block:: ini
|
||
|
||
*.* @192.168.1.10
|
||
|
||
This instructs rsyslog to send all logs to the IP listed. In this
|
||
example, the IP points to the cloud controller.
|
||
|
||
rsyslog Server Configuration
|
||
----------------------------
|
||
|
||
Designate a server as the central logging server. The best practice is
|
||
to choose a server that is solely dedicated to this purpose. Create a
|
||
file called ``/etc/rsyslog.d/server.conf`` with the following contents:
|
||
|
||
.. code-block:: ini
|
||
|
||
# Enable UDP
|
||
$ModLoad imudp
|
||
# Listen on 192.168.1.10 only
|
||
$UDPServerAddress 192.168.1.10
|
||
# Port 514
|
||
$UDPServerRun 514
|
||
|
||
# Create logging templates for nova
|
||
$template NovaFile,"/var/log/rsyslog/%HOSTNAME%/nova.log"
|
||
$template NovaAll,"/var/log/rsyslog/nova.log"
|
||
|
||
# Log everything else to syslog.log
|
||
$template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
|
||
*.* ?DynFile
|
||
|
||
# Log various openstack components to their own individual file
|
||
local0.* ?NovaFile
|
||
local0.* ?NovaAll
|
||
& ~
|
||
|
||
This example configuration handles the nova service only. It first
|
||
configures rsyslog to act as a server that runs on port 514. Next, it
|
||
creates a series of logging templates. Logging templates control where
|
||
received logs are stored. Using the last example, a nova log from
|
||
c01.example.com goes to the following locations:
|
||
|
||
- ``/var/log/rsyslog/c01.example.com/nova.log``
|
||
|
||
- ``/var/log/rsyslog/nova.log``
|
||
|
||
This is useful, as logs from c02.example.com go to:
|
||
|
||
- ``/var/log/rsyslog/c02.example.com/nova.log``
|
||
|
||
- ``/var/log/rsyslog/nova.log``
|
||
|
||
You have an individual log file for each compute node as well as an
|
||
aggregated log that contains nova logs from all nodes.
|
||
|
||
Monitoring
|
||
~~~~~~~~~~
|
||
|
||
There are two types of monitoring: watching for problems and watching
|
||
usage trends. The former ensures that all services are up and running,
|
||
creating a functional cloud. The latter involves monitoring resource
|
||
usage over time in order to make informed decisions about potential
|
||
bottlenecks and upgrades.
|
||
|
||
**Nagios** is an open source monitoring service. It's capable of executing
|
||
arbitrary commands to check the status of server and network services,
|
||
remotely executing arbitrary commands directly on servers, and allowing
|
||
servers to push notifications back in the form of passive monitoring.
|
||
Nagios has been around since 1999. Although newer monitoring services
|
||
are available, Nagios is a tried-and-true systems administration
|
||
staple.
|
||
|
||
Process Monitoring
|
||
------------------
|
||
|
||
A basic type of alert monitoring is to simply check and see whether a
|
||
required process is running.monitoring process monitoringprocess
|
||
monitoringlogging/monitoring process monitoring For example, ensure that
|
||
the ``nova-api`` service is running on the cloud controller:
|
||
|
||
.. code-block:: console
|
||
|
||
# ps aux | grep nova-api
|
||
nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api
|
||
--config-file=/etc/nova/nova.conf nova
|
||
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python
|
||
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
||
nova 12792 0.0 0.0 96052 22856 ? S Feb11 0:01 /usr/bin/python
|
||
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
||
nova 12793 0.0 0.3 290688 115516 ? S Feb11 1:23 /usr/bin/python
|
||
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
||
nova 12794 0.0 0.2 248636 77068 ? S Feb11 0:04 /usr/bin/python
|
||
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
|
||
root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api
|
||
|
||
You can create automated alerts for critical processes by using Nagios
|
||
and NRPE. For example, to ensure that the ``nova-compute`` process is
|
||
running on compute nodes, create an alert on your Nagios server that
|
||
looks like this:
|
||
|
||
.. code-block:: none
|
||
|
||
define service {
|
||
host_name c01.example.com
|
||
check_command check_nrpe_1arg!check_nova-compute
|
||
use generic-service
|
||
notification_period 24x7
|
||
contact_groups sysadmins
|
||
service_description nova-compute
|
||
}
|
||
|
||
Then on the actual compute node, create the following NRPE
|
||
configuration:
|
||
|
||
.. code-block:: none
|
||
|
||
\command[check_nova-compute]=/usr/lib/nagios/plugins/check_procs -c 1: \
|
||
-a nova-compute
|
||
|
||
Nagios checks that at least one ``nova-compute`` service is running at
|
||
all times.
|
||
|
||
Resource Alerting
|
||
-----------------
|
||
|
||
Resource alerting provides notifications when one or more resources are
|
||
critically low. While the monitoring thresholds should be tuned to your
|
||
specific OpenStack environment, monitoring resource usage is not
|
||
specific to OpenStack at all—any generic type of alert will work
|
||
fine.
|
||
|
||
Some of the resources that you want to monitor include:
|
||
|
||
- Disk usage
|
||
|
||
- Server load
|
||
|
||
- Memory usage
|
||
|
||
- Network I/O
|
||
|
||
- Available vCPUs
|
||
|
||
For example, to monitor disk capacity on a compute node with Nagios, add
|
||
the following to your Nagios configuration:
|
||
|
||
.. code-block:: none
|
||
|
||
define service {
|
||
host_name c01.example.com
|
||
check_command check_nrpe!check_all_disks!20% 10%
|
||
use generic-service
|
||
contact_groups sysadmins
|
||
service_description Disk
|
||
}
|
||
|
||
On the compute node, add the following to your NRPE configuration:
|
||
|
||
.. code-block:: none
|
||
|
||
command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c \
|
||
$ARG2$ -e
|
||
|
||
Nagios alerts you with a WARNING when any disk on the compute node is 80
|
||
percent full and CRITICAL when 90 percent is full.
|
||
|
||
StackTach
|
||
---------
|
||
|
||
StackTach is a tool that collects and reports the notifications sent by
|
||
``nova``. Notifications are essentially the same as logs but can be much
|
||
more detailed. Nearly all OpenStack components are capable of generating
|
||
notifications when significant events occur. Notifications are messages
|
||
placed on the OpenStack queue (generally RabbitMQ) for consumption by
|
||
downstream systems. An overview of notifications can be found at `System
|
||
Usage
|
||
Data <https://wiki.openstack.org/wiki/SystemUsageData>`_.
|
||
|
||
To enable ``nova`` to send notifications, add the following to
|
||
``nova.conf``:
|
||
|
||
.. code-block:: ini
|
||
|
||
notification_topics=monitor
|
||
notification_driver=messagingv2
|
||
|
||
Once ``nova`` is sending notifications, install and configure StackTach.
|
||
StackTach workers for Queue consumption and pipeling processing are
|
||
configured to read these notifications from RabbitMQ servers and store
|
||
them in a database. Users can inquire on instances, requests and servers
|
||
by using the browser interface or command line tool,
|
||
`Stacky <https://github.com/rackerlabs/stacky>`_. Since StackTach is
|
||
relatively new and constantly changing, installation instructions
|
||
quickly become outdated. Please refer to the `StackTach Git
|
||
repo <https://git.openstack.org/cgit//openstack/stacktach>`_ for
|
||
instructions as well as a demo video. Additional details on the latest
|
||
developments can be discovered at the `official
|
||
page <http://stacktach.com/>`_
|
||
|
||
Logstash
|
||
--------
|
||
|
||
Logstash is a high performance indexing and search engine for logs. Logs
|
||
from Jenkins test runs are sent to logstash where they are indexed and
|
||
stored. Logstash facilitates reviewing logs from multiple sources in a
|
||
single test run, searching for errors or particular events within a test
|
||
run, and searching for log event trends across test runs.
|
||
|
||
There are four major layers in Logstash setup which are
|
||
|
||
- Log Pusher
|
||
|
||
- Log Indexer
|
||
|
||
- ElasticSearch
|
||
|
||
- Kibana
|
||
|
||
Each layer scales horizontally. As the number of logs grows you can add
|
||
more log pushers, more Logstash indexers, and more ElasticSearch nodes.
|
||
|
||
Logpusher is a pair of Python scripts which first listens to Jenkins
|
||
build events and converts them into Gearman jobs. Gearman provides a
|
||
generic application framework to farm out work to other machines or
|
||
processes that are better suited to do the work. It allows you to do
|
||
work in parallel, to load balance processing, and to call functions
|
||
between languages.Later Logpusher performs Gearman jobs to push log
|
||
files into logstash. Logstash indexer reads these log events, filters
|
||
them to remove unwanted lines, collapse multiple events together, and
|
||
parses useful information before shipping them to ElasticSearch for
|
||
storage and indexing. Kibana is a logstash oriented web client for
|
||
ElasticSearch.
|
||
|
||
OpenStack Telemetry
|
||
-------------------
|
||
|
||
An integrated OpenStack project (code-named :term:`ceilometer`) collects
|
||
metering and event data relating to OpenStack services. Data collected
|
||
by the Telemetry service could be used for billing. Depending on
|
||
deployment configuration, collected data may be accessible to users
|
||
based on the deployment configuration. The Telemetry service provides a
|
||
REST API documented at
|
||
http://developer.openstack.org/api-ref-telemetry-v2.html. You can read
|
||
more about the module in the `OpenStack Cloud Administrator
|
||
Guide <http://docs.openstack.org/admin-guide-cloud/telemetry.html>`_ or
|
||
in the `developer
|
||
documentation <http://docs.openstack.org/developer/ceilometer>`_.
|
||
|
||
OpenStack-Specific Resources
|
||
----------------------------
|
||
|
||
Resources such as memory, disk, and CPU are generic resources that all
|
||
servers (even non-OpenStack servers) have and are important to the
|
||
overall health of the server. When dealing with OpenStack specifically,
|
||
these resources are important for a second reason: ensuring that enough
|
||
are available to launch instances. There are a few ways you can see
|
||
OpenStack resource usage.monitoring OpenStack-specific
|
||
resourcesresources generic vs. OpenStack-specificlogging/monitoring
|
||
OpenStack-specific resources The first is through the :command:`nova` command:
|
||
|
||
.. code-block:: console
|
||
|
||
# nova usage-list
|
||
|
||
This command displays a list of how many instances a tenant has running
|
||
and some light usage statistics about the combined instances. This
|
||
command is useful for a quick overview of your cloud, but it doesn't
|
||
really get into a lot of details.
|
||
|
||
Next, the ``nova`` database contains three tables that store usage
|
||
information.
|
||
|
||
The ``nova.quotas`` and ``nova.quota_usages`` tables store quota
|
||
information. If a tenant's quota is different from the default quota
|
||
settings, its quota is stored in the ``nova.quotas`` table. For example:
|
||
|
||
.. code-block:: mysql
|
||
|
||
mysql> select project_id, resource, hard_limit from quotas;
|
||
+----------------------------------+-----------------------------+------------+
|
||
| project_id | resource | hard_limit |
|
||
+----------------------------------+-----------------------------+------------+
|
||
| 628df59f091142399e0689a2696f5baa | metadata_items | 128 |
|
||
| 628df59f091142399e0689a2696f5baa | injected_file_content_bytes | 10240 |
|
||
| 628df59f091142399e0689a2696f5baa | injected_files | 5 |
|
||
| 628df59f091142399e0689a2696f5baa | gigabytes | 1000 |
|
||
| 628df59f091142399e0689a2696f5baa | ram | 51200 |
|
||
| 628df59f091142399e0689a2696f5baa | floating_ips | 10 |
|
||
| 628df59f091142399e0689a2696f5baa | instances | 10 |
|
||
| 628df59f091142399e0689a2696f5baa | volumes | 10 |
|
||
| 628df59f091142399e0689a2696f5baa | cores | 20 |
|
||
+----------------------------------+-----------------------------+------------+
|
||
|
||
The ``nova.quota_usages`` table keeps track of how many resources the
|
||
tenant currently has in use:
|
||
|
||
.. code-block:: mysql
|
||
|
||
mysql> select project_id, resource, in_use from quota_usages where project_id like '628%';
|
||
+----------------------------------+--------------+--------+
|
||
| project_id | resource | in_use |
|
||
+----------------------------------+--------------+--------+
|
||
| 628df59f091142399e0689a2696f5baa | instances | 1 |
|
||
| 628df59f091142399e0689a2696f5baa | ram | 512 |
|
||
| 628df59f091142399e0689a2696f5baa | cores | 1 |
|
||
| 628df59f091142399e0689a2696f5baa | floating_ips | 1 |
|
||
| 628df59f091142399e0689a2696f5baa | volumes | 2 |
|
||
| 628df59f091142399e0689a2696f5baa | gigabytes | 12 |
|
||
| 628df59f091142399e0689a2696f5baa | images | 1 |
|
||
+----------------------------------+--------------+--------+
|
||
|
||
By comparing a tenant's hard limit with their current resource usage,
|
||
you can see their usage percentage. For example, if this tenant is using
|
||
1 floating IP out of 10, then they are using 10 percent of their
|
||
floating IP quota. Rather than doing the calculation manually, you can
|
||
use SQL or the scripting language of your choice and create a formatted
|
||
report:
|
||
|
||
.. code-block:: mysql
|
||
|
||
+----------------------------------+------------+-------------+---------------+
|
||
| some_tenant |
|
||
+-----------------------------------+------------+------------+---------------+
|
||
| Resource | Used | Limit | |
|
||
+-----------------------------------+------------+------------+---------------+
|
||
| cores | 1 | 20 | 5 % |
|
||
| floating_ips | 1 | 10 | 10 % |
|
||
| gigabytes | 12 | 1000 | 1 % |
|
||
| images | 1 | 4 | 25 % |
|
||
| injected_file_content_bytes | 0 | 10240 | 0 % |
|
||
| injected_file_path_bytes | 0 | 255 | 0 % |
|
||
| injected_files | 0 | 5 | 0 % |
|
||
| instances | 1 | 10 | 10 % |
|
||
| key_pairs | 0 | 100 | 0 % |
|
||
| metadata_items | 0 | 128 | 0 % |
|
||
| ram | 512 | 51200 | 1 % |
|
||
| reservation_expire | 0 | 86400 | 0 % |
|
||
| security_group_rules | 0 | 20 | 0 % |
|
||
| security_groups | 0 | 10 | 0 % |
|
||
| volumes | 2 | 10 | 20 % |
|
||
+-----------------------------------+------------+------------+---------------+
|
||
|
||
The preceding information was generated by using a custom script that
|
||
can be found on
|
||
`GitHub <https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report>`_.
|
||
|
||
.. note::
|
||
|
||
This script is specific to a certain OpenStack installation and must
|
||
be modified to fit your environment. However, the logic should
|
||
easily be transferable.
|
||
|
||
Intelligent Alerting
|
||
--------------------
|
||
|
||
Intelligent alerting can be thought of as a form of continuous
|
||
integration for operations. For example, you can easily check to see
|
||
whether the Image service is up and running by ensuring that
|
||
the ``glance-api`` and ``glance-registry`` processes are running or by
|
||
seeing whether ``glace-api`` is responding on port 9292.
|
||
|
||
But how can you tell whether images are being successfully uploaded to
|
||
the Image service? Maybe the disk that Image service is storing the
|
||
images on is full or the S3 back end is down. You could naturally check
|
||
this by doing a quick image upload:
|
||
|
||
.. code-block:: bash
|
||
|
||
#!/bin/bash
|
||
#
|
||
# assumes that reasonable credentials have been stored at
|
||
# /root/auth
|
||
|
||
|
||
. /root/openrc
|
||
wget http://download.cirros-cloud.net/0.3.4/cirros-0.3.4-x86_64-disk.img
|
||
glance image-create --name='cirros image' --is-public=true
|
||
--container-format=bare --disk-format=qcow2 < cirros-0.3.4-x8
|
||
6_64-disk.img
|
||
|
||
By taking this script and rolling it into an alert for your monitoring
|
||
system (such as Nagios), you now have an automated way of ensuring that
|
||
image uploads to the Image Catalog are working.
|
||
|
||
.. note::
|
||
|
||
You must remove the image after each test. Even better, test whether
|
||
you can successfully delete an image from the Image service.
|
||
|
||
Intelligent alerting takes considerably more time to plan and implement
|
||
than the other alerts described in this chapter. A good outline to
|
||
implement intelligent alerting is:
|
||
|
||
- Review common actions in your cloud.
|
||
|
||
- Create ways to automatically test these actions.
|
||
|
||
- Roll these tests into an alerting system.
|
||
|
||
Some other examples for Intelligent Alerting include:
|
||
|
||
- Can instances launch and be destroyed?
|
||
|
||
- Can users be created?
|
||
|
||
- Can objects be stored and deleted?
|
||
|
||
- Can volumes be created and destroyed?
|
||
|
||
Trending
|
||
--------
|
||
|
||
Trending can give you great insight into how your cloud is performing
|
||
day to day. You can learn, for example, if a busy day was simply a rare
|
||
occurrence or if you should start adding new compute nodes.
|
||
|
||
Trending takes a slightly different approach than alerting. While
|
||
alerting is interested in a binary result (whether a check succeeds or
|
||
fails), trending records the current state of something at a certain
|
||
point in time. Once enough points in time have been recorded, you can
|
||
see how the value has changed over time.
|
||
|
||
All of the alert types mentioned earlier can also be used for trend
|
||
reporting. Some other trend examples include:
|
||
|
||
- The number of instances on each compute node
|
||
|
||
- The types of flavors in use
|
||
|
||
- The number of volumes in use
|
||
|
||
- The number of Object Storage requests each hour
|
||
|
||
- The number of ``nova-api`` requests each hour
|
||
|
||
- The I/O statistics of your storage services
|
||
|
||
As an example, recording ``nova-api`` usage can allow you to track the
|
||
need to scale your cloud controller. By keeping an eye on ``nova-api``
|
||
requests, you can determine whether you need to spawn more ``nova-api``
|
||
processes or go as far as introducing an entirely new server to run
|
||
``nova-api``. To get an approximate count of the requests, look for
|
||
standard INFO messages in ``/var/log/nova/nova-api.log``:
|
||
|
||
.. code-block:: console
|
||
|
||
# grep INFO /var/log/nova/nova-api.log | wc
|
||
|
||
You can obtain further statistics by looking for the number of
|
||
successful requests:
|
||
|
||
.. code-block:: console
|
||
|
||
# grep " 200 " /var/log/nova/nova-api.log | wc
|
||
|
||
By running this command periodically and keeping a record of the result,
|
||
you can create a trending report over time that shows whether your
|
||
``nova-api`` usage is increasing, decreasing, or keeping steady.
|
||
|
||
A tool such as **collectd** can be used to store this information. While
|
||
collectd is out of the scope of this book, a good starting point would
|
||
be to use collectd to store the result as a COUNTER data type. More
|
||
information can be found in `collectd's
|
||
documentation <https://collectd.org/wiki/index.php/Data_source>`_.
|
||
|
||
Summary
|
||
~~~~~~~
|
||
|
||
For stable operations, you want to detect failure promptly and determine
|
||
causes efficiently. With a distributed system, it's even more important
|
||
to track the right items to meet a service-level target. Learning where
|
||
these logs are located in the file system or API gives you an advantage.
|
||
This chapter also showed how to read, interpret, and manipulate
|
||
information from OpenStack services so that you can monitor effectively.
|