
<?xml version="1.0" encoding="UTF-8"?>
<chapter version="5.0" xml:id="logging_monitoring"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:ns5="http://www.w3.org/1999/xhtml"
         xmlns:ns4="http://www.w3.org/2000/svg"
         xmlns:ns3="http://www.w3.org/1998/Math/MathML"
         xmlns:ns="http://docbook.org/ns/docbook">
  <?dbhtml stop-chunking?>

  <title>Logging and Monitoring</title>

  <para>Because an OpenStack cloud is composed of many different services,
  there are a large number of log files. This chapter aims to assist you in
  locating and working with them, and describes other ways to track the
  status of your deployment.<indexterm class="singular">
      <primary>debugging</primary>

      <see>logging/monitoring; maintenance/debugging</see>
    </indexterm></para>

  <section xml:id="where_are_logs">
    <title>Where Are the Logs?</title>

    <para>Most services use the convention of writing their log files to
    subdirectories of the <code>/var/log</code> directory, as listed in <xref
    linkend="openstack-log-locations" />.<indexterm class="singular">
        <primary>cloud controllers</primary>

        <secondary>log information</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>log location</secondary>
      </indexterm></para>

    <table rules="all" xml:id="openstack-log-locations">
      <caption>OpenStack log locations</caption>

      <thead>
        <tr>
          <th>Node type</th>

          <th>Service</th>

          <th>Log location</th>
        </tr>
      </thead>

      <tbody>
        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>nova-*</code></para></td>

          <td><para><code>/var/log/nova</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>glance-*</code></para></td>

          <td><para><code>/var/log/glance</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>cinder-*</code></para></td>

          <td><para><code>/var/log/cinder</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>keystone-*</code></para></td>

          <td><para><code>/var/log/keystone</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>neutron-*</code></para></td>

          <td><para><code>/var/log/neutron</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para>horizon</para></td>

          <td><para><code>/var/log/apache2/</code></para></td>
        </tr>

        <tr>
          <td><para>All nodes</para></td>

          <td><para>misc (swift, dnsmasq)</para></td>

          <td><para><code>/var/log/syslog</code></para></td>
        </tr>

        <tr>
          <td><para>Compute nodes</para></td>

          <td><para>libvirt</para></td>

          <td><para><code>/var/log/libvirt/libvirtd.log</code></para></td>
        </tr>

        <tr>
          <td><para>Compute nodes</para></td>

          <td><para>Console (boot-up messages) for VM instances</para></td>

          <td><para><code>/var/lib/nova/instances/instance-</code><phrase
          role="keep-together"><code><instance id>/console.log</code>
          </phrase></para></td>
        </tr>

        <tr>
          <td><para>Block Storage nodes</para></td>

          <td><para>cinder-volume</para></td>

          <td><para><code>/var/log/cinder/cinder-volume.log</code></para></td>
        </tr>
      </tbody>
    </table>
  </section>

  <section xml:id="how_to_read_logs">
    <title>Reading the Logs</title>

    <para>OpenStack services use the standard logging levels, at increasing
    severity: DEBUG, INFO, AUDIT, WARNING, ERROR, CRITICAL, and TRACE. That
    is, messages appear in the logs only if they are more "severe" than the
    configured log level, with DEBUG allowing all log statements through. For
    example, TRACE is logged only if the software has a stack trace, while
    INFO is logged for every message, including those that are only
    informational.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>logging levels</secondary>
      </indexterm></para>

    <para>To disable DEBUG-level logging, edit
    <filename>/etc/nova/nova.conf</filename> as follows:</para>

    <programlisting language="ini">debug=false</programlisting>

    <para>Keystone is handled a little differently. To modify the logging
    level, edit the <filename>/etc/keystone/logging.conf</filename> file and
    look at the <code>logger_root</code> and <code>handler_file</code>
    sections.</para>
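
    <para>As a sketch, those sections typically look like the following; the
    handler class, log path, and formatter name here are illustrative, so
    match them to the entries already present in your own file:</para>

    <programlisting language="ini"># set the root logger to WARNING and point it at the file handler
[logger_root]
level=WARNING
handlers=file

# the file handler appends to keystone's log file
[handler_file]
class=FileHandler
level=DEBUG
formatter=normal
args=('/var/log/keystone/keystone.log', 'a')</programlisting>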

    <para><phrase role="keep-together">Logging for horizon is configured in
    <filename>/etc/openstack_dashboard/local_</filename></phrase><filename>settings.py</filename>.
    Because horizon is a Django web application, it follows the <link
    xlink:href="http://opsgui.de/NPGgww" xlink:title="Django Logging">Django
    Logging framework conventions</link>.</para>
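
    <para>For example, a minimal sketch of a <code>LOGGING</code> dictionary
    in <filename>local_settings.py</filename> might look like this; the
    handler and logger names are illustrative, and your file likely already
    defines a more complete dictionary that you should extend instead:</para>

    <programlisting language="python">LOGGING = {
    'version': 1,
    'handlers': {
        # send log output to the console (picked up by the web server log)
        'console': {
            'level': 'INFO',  # change to 'DEBUG' for verbose output
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        # route all horizon messages through the console handler
        'horizon': {
            'handlers': ['console'],
            'propagate': False,
        },
    },
}</programlisting>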

    <para>The first step in finding the source of an error is typically to
    search for a CRITICAL, TRACE, or ERROR message in the log, starting at
    the bottom of the log file.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>reading log messages</secondary>
      </indexterm></para>
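
    <para>One quick way to do this from the command line is with
    <code>grep</code>; the log path below assumes the default locations
    listed earlier:</para>

    <screen><prompt>#</prompt> <userinput>grep -E "CRITICAL|ERROR|TRACE" /var/log/nova/nova-api.log | tail -20</userinput></screen>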

    <para>Here is an example of a CRITICAL log message, with the
    corresponding TRACE (Python traceback) immediately following:</para>

    <screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
 cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder Traceback (most recent call last):
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/bin/cinder-volume", line 48, in <module>
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 422, in wait
2013-02-25 21:05:51 17409 TRACE cinder _launcher.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 127, in wait
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2013-02-25 21:05:51 17409 TRACE cinder return self._exit_event.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-02-25 21:05:51 17409 TRACE cinder return hubs.get_hub().switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2013-02-25 21:05:51 17409 TRACE cinder return self.greenlet.switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
2013-02-25 21:05:51 17409 TRACE cinder result = function(*args, **kwargs)
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 88, in run_server
2013-02-25 21:05:51 17409 TRACE cinder server.start()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 159, in start
2013-02-25 21:05:51 17409 TRACE cinder self.manager.init_host()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 95,
 in init_host
2013-02-25 21:05:51 17409 TRACE cinder self.driver.check_for_setup_error()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 116,
 in check_for_setup_error
2013-02-25 21:05:51 17409 TRACE cinder raise exception.VolumeBackendAPIException(data=exception_message)
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume
 backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>

    <para>In this example, <literal>cinder-volumes</literal> failed to start
    and has provided a stack trace, since its volume backend was unable to
    set up the storage volume—probably because the LVM volume group that the
    configuration expects does not exist.</para>

    <para>Here is an example error log:</para>

    <screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
 [Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>

    <para>In this error, a nova service has failed to connect to the RabbitMQ
    server because it received a connection refused error.</para>
  </section>

  <section xml:id="tracing_instance_request">
    <title>Tracing Instance Requests</title>

    <para>When an instance fails to behave properly, you will often have to
    trace activity associated with that instance across the log files of
    various <code>nova-*</code> services and across both the cloud controller
    and compute nodes.<indexterm class="singular">
        <primary>instances</primary>

        <secondary>tracing instance requests</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>tracing instance requests</secondary>
      </indexterm></para>

    <para>The typical way is to trace the UUID associated with an instance
    across the service logs.</para>

    <para>Consider the following example:</para>

    <screen><prompt>$</prompt> <userinput>nova list</userinput>
<computeroutput>+--------------------------------------+--------+--------+---------------------------+
| ID                                   | Name   | Status | Networks                  |
+--------------------------------------+--------+--------+---------------------------+
| faf7ded8-4a46-413b-b113-f19590746ffe | cirros | ACTIVE | novanetwork=192.168.100.3 |
+--------------------------------------+--------+--------+---------------------------+</computeroutput></screen>

    <para>Here, the ID associated with the instance is
    <code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If you search for this
    string on the cloud controller in the
    <filename>/var/log/nova-*.log</filename> files, it appears in
    <filename>nova-api.log</filename> and
    <filename>nova-scheduler.log</filename>. If you search for this on the
    compute nodes in <filename>/var/log/nova-*.log</filename>, it appears in
    <filename>nova-network.log</filename> and
    <filename>nova-compute.log</filename>. If no ERROR or CRITICAL messages
    appear, the most recent log entry that reports this UUID may provide a
    hint about what has gone wrong.</para>
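
    <para>A quick way to run this search on each node is with
    <code>grep</code>, using the UUID from the example above; adjust the path
    to wherever your deployment writes its nova logs, as listed
    earlier:</para>

    <screen><prompt>#</prompt> <userinput>grep faf7ded8-4a46-413b-b113-f19590746ffe /var/log/nova/nova-*.log</userinput></screen>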
  </section>

  <section xml:id="add_custom_logging">
    <title>Adding Custom Logging Statements</title>

    <para>If there is not enough information in the existing logs, you may
    need to add your own custom logging statements to the <code>nova-*</code>
    services.<indexterm class="singular">
        <primary>customization</primary>

        <secondary>custom log statements</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>adding custom log statements</secondary>
      </indexterm></para>

    <para>The source files are located in
    <filename>/usr/lib/python2.7/dist-packages/nova</filename>.</para>

    <para>To add logging statements, the following lines should be near the
    top of the file. For most files, they are already there:</para>

    <programlisting language="python">from nova.openstack.common import log as logging
LOG = logging.getLogger(__name__)</programlisting>

    <para>To add a DEBUG logging statement, you would add a line such
    as:</para>

    <programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>

    <para>You may notice that all the existing logging messages are preceded
    by an underscore and surrounded by parentheses, for example:</para>

    <programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>

    <para>This formatting is used to support translation of logging messages
    into different languages using the <link
    xlink:href="http://opsgui.de/1eLBlHT">gettext</link> internationalization
    library. You don't need to do this for your own custom log messages.
    However, if you want to contribute code back to the OpenStack project
    that includes logging statements, you must surround your log messages
    with underscores and parentheses.</para>
  </section>

  <section xml:id="rabbitmq">
    <title>RabbitMQ Web Management Interface or rabbitmqctl</title>

    <para>Aside from connection failures, RabbitMQ log files are generally
    not useful for debugging OpenStack-related issues. Instead, we recommend
    you use the RabbitMQ web management interface.<indexterm class="singular">
        <primary>RabbitMQ</primary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>RabbitMQ web management interface</secondary>
      </indexterm> Enable it on your cloud controller:<indexterm
      class="singular">
        <primary>cloud controllers</primary>

        <secondary>enabling RabbitMQ</secondary>
      </indexterm></para>

    <screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>

    <screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>

    <para>The RabbitMQ web management interface is accessible on your cloud
    controller at <emphasis>http://localhost:55672</emphasis>.</para>

    <note>
      <para>Ubuntu 12.04 installs RabbitMQ version 2.7.1, which uses port
      55672. RabbitMQ versions 3.0 and above use port 15672 instead. You can
      check which version of RabbitMQ you have running on your local Ubuntu
      machine by doing:</para>

      <screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"</userinput>
<computeroutput>Version: 2.7.1-0ubuntu4</computeroutput></screen>
    </note>

    <para>An alternative to enabling the RabbitMQ web management interface is
    to use the <phrase role="keep-together"><literal>rabbitmqctl</literal></phrase> commands. For example,
    <literal>rabbitmqctl list_queues | grep cinder</literal> displays any
    messages left in the queue. If there are messages, it's a possible sign
    that cinder services didn't connect properly to RabbitMQ and might have
    to be restarted.</para>

    <para>Items to monitor for RabbitMQ include the number of items in each
    of the queues and the processing time statistics for the server.</para>
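
    <para>For example, the following command lists each queue together with
    its current message and consumer counts; these column names are standard
    <literal>rabbitmqctl</literal> queue info items:</para>

    <screen><prompt>#</prompt> <userinput>rabbitmqctl list_queues name messages consumers</userinput></screen>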
  </section>

  <section xml:id="manage_logs_centrally">
    <title>Centrally Managing Logs</title>

    <para>Because your cloud is most likely composed of many servers, you
    must check logs on each of those servers to properly piece an event
    together. A better solution is to send the logs of all servers to a
    central location so that they can all be accessed from the same
    area.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>central log management</secondary>
      </indexterm></para>

    <para>Ubuntu uses rsyslog as the default logging service. Since rsyslog
    is natively able to send logs to a remote location, you don't have to
    install anything extra to enable this feature; you need only modify the
    configuration file. In doing this, consider running your logging over a
    management network or using an encrypted VPN to avoid
    interception.</para>

    <section xml:id="rsyslog_client_config">
      <title>rsyslog Client Configuration</title>

      <para>To begin, configure all OpenStack components to log to syslog in
      addition to their standard log file location. Also configure each
      component to log to a different syslog facility. This makes it easier
      to split the logs into individual components on the central
      server:<indexterm class="singular">
          <primary>rsyslog</primary>
        </indexterm></para>

      <para><filename>nova.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL0</programlisting>

      <para><filename>glance-api.conf</filename> and
      <filename>glance-registry.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL1</programlisting>

      <para><filename>cinder.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL2</programlisting>

      <para><filename>keystone.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL3</programlisting>

      <para>By default, Object Storage logs to syslog.</para>

      <para>Next, create <filename>/etc/rsyslog.d/client.conf</filename> with
      the following line:</para>

      <programlisting language="ini">*.* @192.168.1.10</programlisting>

      <para>This instructs rsyslog to send all logs to the IP listed. In this
      example, the IP points to the cloud controller.</para>
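
      <para>After creating the file, restart rsyslog on the client so the
      change takes effect:</para>

      <screen><prompt>#</prompt> <userinput>service rsyslog restart</userinput></screen>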
    </section>

    <section xml:id="rsyslog_server_config">
      <title>rsyslog Server Configuration</title>

      <para>Designate a server as the central logging server. The best
      practice is to choose a server that is solely dedicated to this
      purpose. Create a file called
      <filename>/etc/rsyslog.d/server.conf</filename> with the following
      contents:</para>

      <programlisting language="ini"># Enable UDP
$ModLoad imudp
# Listen on 192.168.1.10 only
$UDPServerAddress 192.168.1.10
# Port 514
$UDPServerRun 514

# Create logging templates for nova
$template NovaFile,"/var/log/rsyslog/%HOSTNAME%/nova.log"
$template NovaAll,"/var/log/rsyslog/nova.log"

# Log everything else to syslog.log
$template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
*.* ?DynFile

# Log various openstack components to their own individual file
local0.* ?NovaFile
local0.* ?NovaAll
& ~</programlisting>

      <para>This example configuration handles the nova service only. It
      first configures rsyslog to act as a server that runs on port 514.
      Next, it creates a series of logging templates. Logging templates
      control where received logs are stored. Using the example above, a nova
      log from c01.example.com goes to the following locations:</para>

      <itemizedlist>
        <listitem>
          <para><filename>/var/log/rsyslog/c01.example.com/nova.log</filename></para>
        </listitem>

        <listitem>
          <para><filename>/var/log/rsyslog/nova.log</filename></para>
        </listitem>
      </itemizedlist>

      <para>This is useful, as logs from c02.example.com go to:</para>

      <itemizedlist>
        <listitem>
          <para><filename>/var/log/rsyslog/c02.example.com/nova.log</filename></para>
        </listitem>

        <listitem>
          <para><filename>/var/log/rsyslog/nova.log</filename></para>
        </listitem>
      </itemizedlist>

      <para>You have an individual log file for each compute node as well as
      an aggregated log that contains nova logs from all nodes.</para>
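
      <para>To confirm that messages are arriving, you can send a test
      message from any client with the standard <literal>logger</literal>
      utility; the tag used here is arbitrary:</para>

      <screen><prompt>#</prompt> <userinput>logger -t logtest "central logging test"</userinput></screen>

      <para>On the logging server, the message should then appear in that
      client's <filename>syslog.log</filename> file under
      <filename>/var/log/rsyslog/</filename>.</para>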
    </section>
  </section>

  <section xml:id="stacktach">
    <!-- FIXME This section needs updating, especially with the advent of
    ceilometer -->

    <title>StackTach</title>

    <para>StackTach is a tool created by Rackspace to collect and report the
    notifications sent by <code>nova</code>. Notifications are essentially
    the same as logs but can be much more detailed. A good overview of
    notifications can be found at <link xlink:href="http://opsgui.de/NPGh3H"
    xlink:title="StackTach GitHub repo">System Usage Data</link>.<indexterm
    class="singular">
        <primary>StackTach</primary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>StackTach tool</secondary>
      </indexterm></para>

    <para>To enable <code>nova</code> to send notifications, add the
    following to <filename>nova.conf</filename>:</para>

    <programlisting language="ini"><?db-font-size 75%?>notification_topics=monitor
notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisting>

    <para>Once <code>nova</code> is sending notifications, install and
    configure StackTach. Since StackTach is relatively new and constantly
    changing, installation instructions would quickly become outdated. Please
    refer to the <link xlink:href="http://opsgui.de/1eLBpqQ">StackTach GitHub
    repo</link> for instructions as well as a demo video.</para>
  </section>

  <section xml:id="monitoring">
    <title>Monitoring</title>

    <para>There are two types of monitoring: watching for problems and
    watching usage trends. The former ensures that all services are up and
    running, creating a functional cloud. The latter involves monitoring
    resource usage over time in order to make informed decisions about
    potential bottlenecks and upgrades.<indexterm class="singular">
        <primary>cloud controllers</primary>

        <secondary>process monitoring and</secondary>
      </indexterm></para>

    <?hard-pagebreak ?>

    <sidebar>
      <title>Nagios</title>

      <para>Nagios is an open source monitoring service. It's capable of
      executing arbitrary commands to check the status of server and network
      services, remotely executing arbitrary commands directly on servers,
      and allowing servers to push notifications back in the form of passive
      monitoring. Nagios has been around since 1999. Although newer
      monitoring services are available, Nagios is a tried-and-true systems
      administration staple.<indexterm class="singular">
          <primary>Nagios</primary>
        </indexterm></para>
    </sidebar>

    <section xml:id="process_monitoring">
      <title>Process Monitoring</title>

      <para>A basic type of alert monitoring is to simply check whether a
      required process is running.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>process monitoring</secondary>
        </indexterm><indexterm class="singular">
          <primary>process monitoring</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>process monitoring</secondary>
        </indexterm> For example, ensure that the <code>nova-api</code>
      service is running on the cloud controller:</para>

      <screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api
--config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12792 0.0 0.0 96052 22856 ? S Feb11 0:01 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12793 0.0 0.3 290688 115516 ? S Feb11 1:23 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12794 0.0 0.2 248636 77068 ? S Feb11 0:04 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput></screen>

      <para>You can create automated alerts for critical processes by using
      Nagios and NRPE. For example, to ensure that the
      <code>nova-compute</code> process is running on compute nodes, create
      an alert on your Nagios server that looks like this:</para>

      <programlisting>define service {
    host_name c01.example.com
    check_command check_nrpe_1arg!check_nova-compute
    use generic-service
    notification_period 24x7
    contact_groups sysadmins
    service_description nova-compute
}</programlisting>

      <para>Then on the actual compute node, create the following NRPE
      configuration:</para>

      <programlisting>command[check_nova-compute]=/usr/lib/nagios/plugins/check_procs -c 1: \
-a nova-compute</programlisting>

      <para>Nagios checks that at least one <literal>nova-compute</literal>
      service is running at all times.</para>
    </section>

    <section xml:id="resource_alerting">
      <title>Resource Alerting</title>

      <para>Resource alerting provides notifications when one or more
      resources are critically low. While the monitoring thresholds should be
      tuned to your specific OpenStack environment, monitoring resource usage
      is not specific to OpenStack at all—any generic type of alert will work
      fine.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>resource alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>alerts</primary>

          <secondary>resource</secondary>
        </indexterm><indexterm class="singular">
          <primary>resources</primary>

          <secondary>resource alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>resource alerting</secondary>
        </indexterm></para>

      <para>Some of the resources that you want to monitor include:</para>

      <itemizedlist>
        <listitem>
          <para>Disk usage</para>
        </listitem>

        <listitem>
          <para>Server load</para>
        </listitem>

        <listitem>
          <para>Memory usage</para>
        </listitem>

        <listitem>
          <para>Network I/O</para>
        </listitem>

        <listitem>
          <para>Available vCPUs</para>
        </listitem>
      </itemizedlist>

      <para>For example, to monitor disk capacity on a compute node with
      Nagios, add the following to your Nagios configuration:</para>

      <programlisting><?db-font-size 75%?>define service {
    host_name c01.example.com
    check_command check_nrpe!check_all_disks!20% 10%
    use generic-service
    contact_groups sysadmins
    service_description Disk
}</programlisting>

      <para>On the compute node, add the following to your NRPE
      configuration:</para>

      <programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c \
$ARG2$ -e</programlisting>

      <para>Nagios alerts you with a WARNING when any disk on the compute
      node is 80 percent full and CRITICAL when 90 percent is full.</para>
    </section>

    <section xml:id="metering_telemetry">
      <title>Metering and Telemetry with Ceilometer</title>

      <para>An integrated OpenStack project (code-named ceilometer) collects
      metering data and provides alerts for Compute, Storage, and Networking.
      Data collected by the metering system could be used for billing.
      Depending on deployment configuration, metered data may be accessible
      to users. The Telemetry service provides a REST API documented at <link
      xlink:href="http://api.openstack.org/api-ref-telemetry.html"></link>.
      You can read more about the project at <link
      xlink:href="http://docs.openstack.org/developer/ceilometer"></link>.<indexterm
      class="singular">
          <primary>monitoring</primary>

          <secondary>metering and telemetry</secondary>
        </indexterm><indexterm class="singular">
          <primary>telemetry/metering</primary>
        </indexterm><indexterm class="singular">
          <primary>metering/telemetry</primary>
        </indexterm><indexterm class="singular">
          <primary>ceilometer</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>ceilometer project</secondary>
        </indexterm></para>
    </section>

    <section xml:id="os_resources">
      <title>OpenStack-Specific Resources</title>

      <para>Resources such as memory, disk, and CPU are generic resources
      that all servers (even non-OpenStack servers) have, and they are
      important to the overall health of the server. When dealing with
      OpenStack specifically, these resources are important for a second
      reason: ensuring that enough are available to launch instances. There
      are a few ways you can see OpenStack resource usage.<indexterm
      class="singular">
          <primary>monitoring</primary>

          <secondary>OpenStack-specific resources</secondary>
        </indexterm><indexterm class="singular">
          <primary>resources</primary>

          <secondary>generic vs. OpenStack-specific</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>OpenStack-specific resources</secondary>
        </indexterm> The first is through the <code>nova</code>
      command:</para>

      <programlisting># nova usage-list</programlisting>

      <para>This command displays a list of how many instances a tenant has
      running and some light usage statistics about the combined instances.
      This command is useful for a quick overview of your cloud, but it
      doesn't really get into a lot of details.</para>

      <para>Next, the <code>nova</code> database contains several tables that
      store usage information.</para>

      <para>The <code>nova.quotas</code> and <code>nova.quota_usages</code>
      tables store quota information. If a tenant's quota is different from
      the default quota settings, its quota is stored in the <phrase
      role="keep-together"><code>nova.quotas</code></phrase> table. For
      example:</para>

      <screen><prompt>mysql></prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
<computeroutput>+----------------------------------+-----------------------------+------------+
| project_id                       | resource                    | hard_limit |
+----------------------------------+-----------------------------+------------+
| 628df59f091142399e0689a2696f5baa | metadata_items              | 128        |
| 628df59f091142399e0689a2696f5baa | injected_file_content_bytes | 10240      |
| 628df59f091142399e0689a2696f5baa | injected_files              | 5          |
| 628df59f091142399e0689a2696f5baa | gigabytes                   | 1000       |
| 628df59f091142399e0689a2696f5baa | ram                         | 51200      |
| 628df59f091142399e0689a2696f5baa | floating_ips                | 10         |
| 628df59f091142399e0689a2696f5baa | instances                   | 10         |
| 628df59f091142399e0689a2696f5baa | volumes                     | 10         |
| 628df59f091142399e0689a2696f5baa | cores                       | 20         |
+----------------------------------+-----------------------------+------------+</computeroutput></screen>

      <para>The <code>nova.quota_usages</code> table keeps track of how many
      resources the tenant currently has in use:</para>

      <screen><prompt>mysql></prompt> <userinput>select project_id, resource, in_use from quota_usages where project_id like '628%';</userinput>
<computeroutput>+----------------------------------+--------------+--------+
| project_id                       | resource     | in_use |
+----------------------------------+--------------+--------+
| 628df59f091142399e0689a2696f5baa | instances    | 1      |
| 628df59f091142399e0689a2696f5baa | ram          | 512    |
| 628df59f091142399e0689a2696f5baa | cores        | 1      |
| 628df59f091142399e0689a2696f5baa | floating_ips | 1      |
| 628df59f091142399e0689a2696f5baa | volumes      | 2      |
| 628df59f091142399e0689a2696f5baa | gigabytes    | 12     |
| 628df59f091142399e0689a2696f5baa | images       | 1      |
+----------------------------------+--------------+--------+</computeroutput></screen>

      <para>By comparing a tenant's hard limit with their current resource
      usage, you can see their usage percentage. For example, if this tenant
      is using 1 floating IP out of 10, then they are using 10 percent of
      their floating IP quota. Rather than doing the calculation manually,
      you can use SQL or the scripting language of your choice and create a
      formatted report:</para>

      <screen><computeroutput>+-----------------------------------+------------+------------+---------------+
| some_tenant                                                                 |
+-----------------------------------+------------+------------+---------------+
| Resource                          | Used       | Limit      |               |
+-----------------------------------+------------+------------+---------------+
| cores                             | 1          | 20         | 5 %           |
| floating_ips                      | 1          | 10         | 10 %          |
| gigabytes                         | 12         | 1000       | 1 %           |
| images                            | 1          | 4          | 25 %          |
| injected_file_content_bytes       | 0          | 10240      | 0 %           |
| injected_file_path_bytes          | 0          | 255        | 0 %           |
| injected_files                    | 0          | 5          | 0 %           |
| instances                         | 1          | 10         | 10 %          |
| key_pairs                         | 0          | 100        | 0 %           |
| metadata_items                    | 0          | 128        | 0 %           |
| ram                               | 512        | 51200      | 1 %           |
| reservation_expire                | 0          | 86400      | 0 %           |
| security_group_rules              | 0          | 20         | 0 %           |
| security_groups                   | 0          | 10         | 0 %           |
| volumes                           | 2          | 10         | 20 %          |
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>

      <para>The preceding information was generated by using a custom script
      that can be found on <link
      xlink:href="http://opsgui.de/NPGjbX">GitHub</link>.</para>

      <note>
        <para>This script is specific to a certain OpenStack installation and
        must be modified to fit your environment. However, the logic should
        easily be transferable.</para>
      </note>
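
      <para>If you only need quick percentages, a single SQL join is a sketch
      of the same calculation; remember that only tenants with non-default
      quotas have rows in the <code>quotas</code> table, so this join misses
      tenants that use the defaults:</para>

      <screen><prompt>mysql></prompt> <userinput>select u.project_id, u.resource, u.in_use, q.hard_limit,
round(100 * u.in_use / q.hard_limit) as percent_used
from quota_usages u join quotas q
on u.project_id = q.project_id and u.resource = q.resource;</userinput></screen>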
    </section>

    <section xml:id="intelligent_alerting">
      <title>Intelligent Alerting</title>

      <para>Intelligent alerting can be thought of as a form of continuous
      integration for operations. For example, you can easily check to see
      whether the Image Service is up and running by ensuring that the
      <code>glance-api</code> and <code>glance-registry</code> processes are
      running or by seeing whether <code>glance-api</code> is responding on
      port 9292.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>intelligent alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>alerts</primary>

          <secondary>intelligent</secondary>

          <seealso>logging/monitoring</seealso>
        </indexterm><indexterm class="singular">
          <primary>intelligent alerting</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>intelligent alerting</secondary>
        </indexterm></para>

      <para>But how can you tell whether images are being successfully
      uploaded to the Image Service? Maybe the disk that the Image Service is
      storing the images on is full or the S3 backend is down. You could
      naturally check this by doing a quick image upload:</para>

      <?hard-pagebreak ?>

      <programlisting language="bash">#!/bin/bash
#
# assumes that reasonable credentials have been stored at
# /root/openrc

. /root/openrc
wget https://launchpad.net/cirros/trunk/0.3.0/+download/cirros-0.3.0-x86_64-disk.img
glance image-create --name='cirros image' --is-public=true \
    --container-format=bare --disk-format=qcow2 < cirros-0.3.0-x86_64-disk.img</programlisting>

      <para>By taking this script and rolling it into an alert for your
      monitoring system (such as Nagios), you now have an automated way of
      ensuring that image uploads to the Image Catalog are working.</para>

      <note>
        <para>You must remove the image after each test. Even better, test
        whether you can successfully delete an image from the Image
        Service.</para>
      </note>
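
      <para>For example, a cleanup step at the end of the script might look
      like the following; this assumes the name resolves to exactly one
      image, so a production check would capture the image ID at creation
      time and delete by ID instead:</para>

      <programlisting language="bash"># remove the test image so the check can be repeated
glance image-delete 'cirros image'</programlisting>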

      <para>Intelligent alerting takes considerably more time to plan and
      implement than the other alerts described in this chapter. A good
      outline to implement intelligent alerting is:</para>

      <itemizedlist>
        <listitem>
          <para>Review common actions in your cloud.</para>
        </listitem>

        <listitem>
          <para>Create ways to automatically test these actions.</para>
        </listitem>

        <listitem>
          <para>Roll these tests into an alerting system.</para>
        </listitem>
      </itemizedlist>

      <para>Some other examples for intelligent alerting include:</para>

      <itemizedlist>
        <listitem>
          <para>Can instances launch and be destroyed?</para>
        </listitem>

        <listitem>
          <para>Can users be created?</para>
        </listitem>

        <listitem>
          <para>Can objects be stored and deleted?</para>
        </listitem>

        <listitem>
          <para>Can volumes be created and destroyed?</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="trending">
      <title>Trending</title>

      <para>Trending can give you great insight into how your cloud is
      performing day to day. You can learn, for example, if a busy day was
      simply a rare occurrence or if you should start adding new compute
      nodes.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>trending</secondary>

          <seealso>logging/monitoring</seealso>
        </indexterm><indexterm class="singular">
          <primary>trending</primary>

          <secondary>monitoring cloud performance with</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>trending</secondary>
        </indexterm></para>

      <para>Trending takes a slightly different approach than alerting. While
      alerting is interested in a binary result (whether a check succeeds or
      fails), trending records the current state of something at a certain
      point in time. Once enough points in time have been recorded, you can
      see how the value has changed over time.<indexterm class="singular">
          <primary>trending</primary>

          <secondary>vs. alerts</secondary>
        </indexterm><indexterm class="singular">
          <primary>binary</primary>

          <secondary>binary results in trending</secondary>
        </indexterm></para>

      <para>All of the alert types mentioned earlier can also be used for
      trend reporting. Some other trend examples include:<indexterm
      class="singular">
          <primary>trending</primary>

          <secondary>report examples</secondary>
        </indexterm></para>

      <itemizedlist>
        <listitem>
          <para>The number of instances on each compute node</para>
        </listitem>

        <listitem>
          <para>The types of flavors in use</para>
        </listitem>

        <listitem>
          <para>The number of volumes in use</para>
        </listitem>

        <listitem>
          <para>The number of Object Storage requests each hour</para>
        </listitem>

        <listitem>
          <para>The number of <literal>nova-api</literal> requests each
          hour</para>
        </listitem>

        <listitem>
          <para>The I/O statistics of your storage services</para>
        </listitem>
      </itemizedlist>

      <para>As an example, recording <code>nova-api</code> usage can allow
      you to track the need to scale your cloud controller. By keeping an eye
      on <code>nova-api</code> requests, you can determine whether you need
      to spawn more <literal>nova-api</literal> processes or go as far as
      introducing an entirely new server to run <code>nova-api</code>. To get
      an approximate count of the requests, look for standard INFO messages
      in <code>/var/log/nova/nova-api.log</code>:</para>

      <programlisting># grep INFO /var/log/nova/nova-api.log | wc</programlisting>

      <para>You can obtain further statistics by looking for the number of
      successful requests:</para>

      <programlisting># grep " 200 " /var/log/nova/nova-api.log | wc</programlisting>

      <para>By running this command periodically and keeping a record of the
      result, you can create a trending report over time that shows whether
      your <code>nova-api</code> usage is increasing, decreasing, or keeping
      steady.</para>
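
      <para>A minimal sketch of such periodic recording is a cron entry (in
      <filename>/etc/cron.d</filename> style, hence the user field) that
      appends a timestamped count to a file; the output path is arbitrary,
      and the <code>%</code> in the <code>date</code> format must be escaped
      because cron treats it specially:</para>

      <programlisting># count INFO lines in the nova-api log every five minutes
*/5 * * * * root echo "$(date +\%s) $(grep -c INFO /var/log/nova/nova-api.log)" >> /var/tmp/nova-api-trend.log</programlisting>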

      <para>A tool such as collectd can be used to store this information.
      While collectd is out of the scope of this book, a good starting point
      would be to use collectd to store the result as a COUNTER data type.
      More information can be found in <link
      xlink:href="http://opsgui.de/1eLBriA">collectd's
      documentation</link>.</para>
    </section>
  </section>

  <section xml:id="ops-log-monitor-summary">
    <title>Summary</title>

    <para>For stable operations, you want to detect failure promptly and
    determine causes efficiently. With a distributed system, it's even more
    important to track the right items to meet a service-level target.
    Knowing where these logs are located in the file system and how they are
    exposed through the API gives you an advantage. This chapter also showed
    how to read, interpret, and manipulate information from OpenStack
    services so that you can monitor effectively.</para>
  </section>
</chapter>