<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter [
<!-- Some useful entities borrowed from HTML -->
<!ENTITY ndash "–">
<!ENTITY mdash "—">
<!ENTITY hellip "…">
<!ENTITY plusmn "±">
]>
<chapter xmlns="http://docbook.org/ns/docbook"
    xmlns:xi="http://www.w3.org/2001/XInclude"
    xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
    xml:id="logging_monitoring">
    <?dbhtml stop-chunking?>
    <title>Logging and Monitoring</title>
    <para>Because an OpenStack cloud is composed of so many different
        services, there are a large number of log files. This chapter
        aims to assist you in locating and working with them, and
        describes other ways to track the status of your
        deployment.</para>
    <section xml:id="where_are_logs">
        <title>Where Are the Logs?</title>
        <para>Most services use the convention of writing their log
            files to subdirectories of the <code>/var/log</code>
            directory.</para>
        <informaltable rules="all">
            <thead>
                <tr>
                    <th>Node Type</th>
                    <th>Service</th>
                    <th>Log Location</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para><code>nova-*</code></para></td>
                    <td><para><code>/var/log/nova</code></para></td>
                </tr>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para><code>glance-*</code></para></td>
                    <td><para><code>/var/log/glance</code></para></td>
                </tr>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para><code>cinder-*</code></para></td>
                    <td><para><code>/var/log/cinder</code></para></td>
                </tr>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para><code>keystone-*</code></para></td>
                    <td><para><code>/var/log/keystone</code></para></td>
                </tr>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para><code>neutron-*</code></para></td>
                    <td><para><code>/var/log/neutron</code></para></td>
                </tr>
                <tr>
                    <td><para>Cloud Controller</para></td>
                    <td><para>horizon</para></td>
                    <td><para><code>/var/log/apache2/</code></para></td>
                </tr>
                <tr>
                    <td><para>All nodes</para></td>
                    <td><para>misc (Swift, dnsmasq)</para></td>
                    <td><para><code>/var/log/syslog</code></para></td>
                </tr>
                <tr>
                    <td><para>Compute Nodes</para></td>
                    <td><para>libvirt</para></td>
                    <td><para><code>/var/log/libvirt/libvirtd.log</code></para></td>
                </tr>
                <tr>
                    <td><para>Compute Nodes</para></td>
                    <td><para>Console (boot-up messages) for VM
                        instances</para></td>
                    <td><para><code>/var/lib/nova/instances/instance-&lt;instance id&gt;/console.log</code></para></td>
                </tr>
                <tr>
                    <td><para>Block Storage Nodes</para></td>
                    <td><para>cinder-volume</para></td>
                    <td><para><code>/var/log/cinder/cinder-volume.log</code></para></td>
                </tr>
            </tbody>
        </informaltable>
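        <para>A quick way to follow several of these logs at once on a
            given node is with <command>tail</command>. For example, on
            the cloud controller (the paths assume the packaged default
            locations listed above):</para>
        <screen><prompt>#</prompt> <userinput>tail -f /var/log/nova/*.log /var/log/glance/*.log /var/log/keystone/*.log</userinput></screen>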
    </section>
<section xml:id="how_to_read_logs">
|
||
<title>Reading the Logs</title>
|
||
<para>OpenStack services use the standard logging levels, at
|
||
increasing severity: DEBUG, INFO, AUDIT, WARNING, ERROR,
|
||
CRITICAL, and TRACE. That is, messages only appear in the logs
|
||
if they are more "severe" than the particular log level
|
||
with DEBUG allowing all log statements through. For
|
||
example, TRACE is logged only if the software has a stack
|
||
trace, while INFO is logged for every message including
|
||
those that are only for information.</para>
|
||
<para>To disable DEBUG-level logging, edit
|
||
<filename>/etc/nova/nova.conf</filename>:</para>
|
||
<programlisting language="ini">debug=false</programlisting>
|
||
<para>Keystone is handled a little differently. To modify the
|
||
logging level, edit the
|
||
<filename>/etc/keystone/logging.conf</filename> file and look
|
||
at the <code>logger_root</code> and <code>handler_file</code>
|
||
sections.</para>
|
||
<para>Logging for Horizon is configured in
|
||
<filename>/etc/openstack_dashboard/local_settings.py</filename>.
|
||
As Horizon is a Django web application, it follows the
|
||
<link xlink:title="Django Logging"
|
||
xlink:href="https://docs.djangoproject.com/en/dev/topics/logging/"
|
||
>Django Logging</link>
|
||
(https://docs.djangoproject.com/en/dev/topics/logging/)
|
||
framework conventions.</para>
|
||
<para>The first step in finding the source of an error is
|
||
typically to search for a CRITICAL, TRACE, or ERROR
|
||
message in the log starting at the bottom of the log file.</para>
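        <para>From the command line, a sketch of that search might look
            like the following, which prints the most recent
            high-severity entries from one of the nova logs (the path is
            assumed from the table earlier in this chapter):</para>
        <screen><prompt>#</prompt> <userinput>grep -E "CRITICAL|ERROR|TRACE" /var/log/nova/nova-api.log | tail -20</userinput></screen>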
        <para>An example of a CRITICAL log message, with the
            corresponding TRACE (Python traceback) immediately
            following:</para>
        <screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
 cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder Traceback (most recent call last):
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/bin/cinder-volume", line 48, in &lt;module&gt;
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 422, in wait
2013-02-25 21:05:51 17409 TRACE cinder _launcher.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 127, in wait
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2013-02-25 21:05:51 17409 TRACE cinder return self._exit_event.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-02-25 21:05:51 17409 TRACE cinder return hubs.get_hub().switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2013-02-25 21:05:51 17409 TRACE cinder return self.greenlet.switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
2013-02-25 21:05:51 17409 TRACE cinder result = function(*args, **kwargs)
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 88, in run_server
2013-02-25 21:05:51 17409 TRACE cinder server.start()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 159, in start
2013-02-25 21:05:51 17409 TRACE cinder self.manager.init_host()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 95,
 in init_host
2013-02-25 21:05:51 17409 TRACE cinder self.driver.check_for_setup_error()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 116,
 in check_for_setup_error
2013-02-25 21:05:51 17409 TRACE cinder raise exception.VolumeBackendAPIException(data=exception_message)
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume
 backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>
        <para>In this example, cinder-volumes failed to start and has
            provided a stack trace, because its volume back end was
            unable to set up the storage volume, probably because the
            LVM volume that the configuration expects does not
            exist.</para>
        <para>An example error log:</para>
        <screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
 [Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>
        <para>In this error, a nova service failed to connect to the
            RabbitMQ server because it received a connection refused
            error.</para>
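        <para>A quick sanity check when you see this error, assuming
            RabbitMQ runs on the same host as the failing service, is to
            confirm that something is actually listening on the AMQP
            port:</para>
        <screen><prompt>#</prompt> <userinput>netstat -lnt | grep 5672</userinput></screen>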
    </section>
    <section xml:id="tracing_instance_request">
        <title>Tracing Instance Requests</title>
        <para>When an instance fails to behave properly, you often have
            to trace activity associated with that instance across the
            log files of the various <code>nova-*</code> services, and
            across both the cloud controller and compute nodes.</para>
        <para>The typical way is to trace the UUID associated with an
            instance across the service logs.</para>
        <para>Consider the following example:</para>
        <screen><prompt>$</prompt> <userinput>nova list</userinput>
<computeroutput>+--------------------------------------+--------+--------+--------------------------+
| ID                                   | Name   | Status | Networks                 |
+--------------------------------------+--------+--------+--------------------------+
| faf7ded8-4a46-413b-b113-f19590746ffe | cirros | ACTIVE | novanetwork=192.168.100.3|
+--------------------------------------+--------+--------+--------------------------+</computeroutput></screen>
        <para>Here, the ID associated with the instance is
            <code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If you
            search for this string on the cloud controller in the
            <filename>/var/log/nova-*.log</filename> files, it appears
            in <filename>nova-api.log</filename> and
            <filename>nova-scheduler.log</filename>. If you search for
            it on the compute nodes in
            <filename>/var/log/nova-*.log</filename>, it appears in
            <filename>nova-network.log</filename> and
            <filename>nova-compute.log</filename>. If no ERROR or
            CRITICAL messages appear, the most recent log entry that
            reports the UUID may provide a hint about what has gone
            wrong.</para>
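        <para>In practice, this tracing amounts to a single
            <command>grep</command> per node. For example, on the cloud
            controller:</para>
        <screen><prompt>#</prompt> <userinput>grep faf7ded8-4a46-413b-b113-f19590746ffe /var/log/nova/*.log</userinput></screen>
        <para>Repeating the same command on each compute node quickly
            shows which host the instance was scheduled to.</para>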
    </section>
    <section xml:id="add_custom_logging">
        <title>Adding Custom Logging Statements</title>
        <para>If there is not enough information in the existing logs,
            you may need to add your own custom logging statements to
            the <code>nova-*</code> services.</para>
        <para>The source files are located in
            <filename>/usr/lib/python2.7/dist-packages/nova</filename>.</para>
        <para>To add logging statements, the following line should be
            near the top of the file. For most files, these should
            already be there:</para>
        <programlisting language="python">from nova.openstack.common import log as logging
LOG = logging.getLogger(__name__)</programlisting>
        <para>To add a DEBUG logging statement, you would do:</para>
        <programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>
        <para>You may notice that all of the existing logging messages
            are preceded by an underscore and surrounded by parentheses,
            for example:</para>
        <programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>
        <para>This format is used to support translation of logging
            messages into different languages using the <link
            xlink:href="http://docs.python.org/2/library/gettext.html"
            >gettext</link>
            (http://docs.python.org/2/library/gettext.html)
            internationalization library. You don't need to do this for
            your own custom log messages. However, if you want to
            contribute code back to the OpenStack project that includes
            logging statements, you must surround your log messages with
            underscores and parentheses.</para>
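        <para>After adding or changing a logging statement, restart the
            affected service so that the change takes effect, and then
            confirm that your message appears. A minimal check, assuming
            you added the statement above to <code>nova-compute</code>
            on a packaged install:</para>
        <screen><prompt>#</prompt> <userinput>service nova-compute restart</userinput>
<prompt>#</prompt> <userinput>grep "This is a custom debugging statement" /var/log/nova/nova-compute.log</userinput></screen>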
    </section>
    <section xml:id="rabbitmq">
        <title>RabbitMQ Web Management Interface or
            rabbitmqctl</title>
        <para>Aside from connection failures, RabbitMQ log files are
            generally not useful for debugging OpenStack-related issues.
            Instead, we recommend you use the RabbitMQ web management
            interface. Enable it on your cloud controller:</para>
        <screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>
        <screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>
        <para>The RabbitMQ web management interface is accessible on
            your cloud controller at http://localhost:55672.</para>
        <note>
            <para>Ubuntu 12.04 installs RabbitMQ version 2.7.1, which
                uses port 55672. RabbitMQ versions 3.0 and above use
                port 15672 instead. You can check which version of
                RabbitMQ is running on your local Ubuntu machine by
                doing:</para>
            <screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"</userinput>
<computeroutput>Version: 2.7.1-0ubuntu4</computeroutput></screen>
        </note>
        <para>An alternative to enabling the RabbitMQ web management
            interface is to use the <command>rabbitmqctl</command>
            commands. For example, <command>rabbitmqctl list_queues |
            grep cinder</command> displays any messages left in the
            queue. If there are messages, it's a possible sign that
            cinder services didn't connect properly to rabbitmq and
            might have to be restarted.</para>
        <para>Items to monitor for RabbitMQ include the number of items
            in each of the queues and the processing time statistics for
            the server.</para>
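        <para>As a rough sketch, you can keep an eye on queue depths
            from a terminal with <command>watch</command>; queues whose
            counts only ever grow suggest that consumers have stopped
            processing:</para>
        <screen><prompt>#</prompt> <userinput>watch -n 10 "rabbitmqctl list_queues | sort -n -k2 | tail"</userinput></screen>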
    </section>
    <section xml:id="manage_logs_centrally">
        <title>Centrally Managing Logs</title>
        <para>Because your cloud is most likely composed of many
            servers, you must check logs on each of those servers to
            properly piece an event together. A better solution is to
            send the logs of all servers to a central location so that
            they can all be accessed from the same place.</para>
        <para>Ubuntu uses rsyslog as the default logging service. Since
            rsyslog is natively able to send logs to a remote location,
            you don't have to install anything extra to enable this
            feature; you need only modify the configuration file. In
            doing this, consider running your logging over a management
            network, or using an encrypted VPN, to avoid
            interception.</para>
        <section xml:id="rsyslog_client_config">
            <title>rsyslog Client Configuration</title>
            <para>To begin, configure all OpenStack components to log to
                syslog in addition to their standard log file location.
                Also, configure each component to log to a different
                syslog facility. This makes it easier to split the logs
                into individual components on the central server.</para>
            <para><filename>nova.conf</filename>:</para>
            <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL0</programlisting>
            <para><filename>glance-api.conf</filename> and
                <filename>glance-registry.conf</filename>:</para>
            <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL1</programlisting>
            <para><filename>cinder.conf</filename>:</para>
            <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL2</programlisting>
            <para><filename>keystone.conf</filename>:</para>
            <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL3</programlisting>
            <para>Object Storage logs to syslog by default.</para>
            <para>Next, create
                <filename>/etc/rsyslog.d/client.conf</filename> with the
                following line:</para>
            <programlisting language="ini">*.* @192.168.1.10</programlisting>
            <para>This instructs rsyslog to send all logs to the IP
                listed. In this example, the IP points to the cloud
                controller.</para>
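            <para>After modifying the configuration, restart rsyslog on
                the client. A simple way to verify forwarding, using the
                standard <command>logger</command> utility, is to emit a
                test message and then look for it on the central
                server:</para>
            <screen><prompt>#</prompt> <userinput>service rsyslog restart</userinput>
<prompt>#</prompt> <userinput>logger -t rsyslog-test "client forwarding check"</userinput></screen>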
        </section>
        <section xml:id="rsyslog_server_config">
            <title>rsyslog Server Configuration</title>
            <para>Designate a server as the central logging server. The
                best practice is to choose a server that is solely
                dedicated to this purpose. Create a file called
                <filename>/etc/rsyslog.d/server.conf</filename> with the
                following contents:</para>
            <programlisting language="ini"># Enable UDP
$ModLoad imudp
# Listen on 192.168.1.10 only
$UDPServerAddress 192.168.1.10
# Port 514
$UDPServerRun 514

# Create logging templates for nova
$template NovaFile,"/var/log/rsyslog/%HOSTNAME%/nova.log"
$template NovaAll,"/var/log/rsyslog/nova.log"

# Log everything else to syslog.log
$template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
*.* ?DynFile

# Log various OpenStack components to their own individual file
local0.* ?NovaFile
local0.* ?NovaAll
&amp; ~</programlisting>
            <para>This example configuration handles the nova service
                only. It first configures rsyslog to act as a server
                that listens on port 514. Next, it creates a series of
                logging templates. Logging templates control where
                received logs are stored. Using the example above, a
                nova log from c01.example.com goes to the following
                locations:</para>
            <itemizedlist>
                <listitem>
                    <para><filename>/var/log/rsyslog/c01.example.com/nova.log</filename></para>
                </listitem>
                <listitem>
                    <para><filename>/var/log/rsyslog/nova.log</filename></para>
                </listitem>
            </itemizedlist>
            <para>This is useful, as logs from c02.example.com go
                to:</para>
            <itemizedlist>
                <listitem>
                    <para><filename>/var/log/rsyslog/c02.example.com/nova.log</filename></para>
                </listitem>
                <listitem>
                    <para><filename>/var/log/rsyslog/nova.log</filename></para>
                </listitem>
            </itemizedlist>
            <para>You have an individual log file for each compute node
                as well as an aggregated log that contains nova logs
                from all nodes.</para>
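            <para>Remember to restart rsyslog on the server after
                creating this file. You can then confirm that entries
                are arriving from the clients, for example:</para>
            <screen><prompt>#</prompt> <userinput>service rsyslog restart</userinput>
<prompt>#</prompt> <userinput>tail -f /var/log/rsyslog/nova.log</userinput></screen>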
        </section>
    </section>
    <section xml:id="stacktach">
        <title>StackTach</title>
        <para>StackTach is a tool created by Rackspace to collect and
            report the notifications sent by <code>nova</code>.
            Notifications are essentially the same as logs but can be
            much more detailed. A good overview of notifications can be
            found at <link xlink:title="System Usage Data"
            xlink:href="https://wiki.openstack.org/wiki/SystemUsageData"
            >System Usage Data</link>
            (https://wiki.openstack.org/wiki/SystemUsageData).</para>
        <para>To enable nova to send notifications, add the following to
            <filename>nova.conf</filename>:</para>
        <programlisting language="ini"><?db-font-size 75%?>notification_topics=monitor
notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisting>
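        <para>Restart the nova services after making this change. As a
            quick check that notifications are flowing, look for the
            notification topic in the queue list; the name matches the
            <code>notification_topics</code> setting above:</para>
        <screen><prompt>#</prompt> <userinput>rabbitmqctl list_queues | grep monitor</userinput></screen>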
        <para>Once <code>nova</code> is sending notifications, install
            and configure StackTach. Because StackTach is relatively new
            and constantly changing, installation instructions would
            quickly become outdated. Please refer to the <link
            xlink:href="https://github.com/rackerlabs/stacktach"
            >StackTach GitHub repo</link>
            (https://github.com/rackerlabs/stacktach) for instructions
            as well as a demo video.</para>
    </section>
    <section xml:id="monitoring">
        <title>Monitoring</title>
        <para>There are two types of monitoring: watching for problems
            and watching usage trends. The former ensures that all
            services are up and running, creating a functional cloud.
            The latter involves monitoring resource usage over time in
            order to make informed decisions about potential bottlenecks
            and upgrades.</para>
        <sidebar>
            <title>Nagios</title>
            <para>Nagios is an open source monitoring service. It is
                capable of executing arbitrary commands to check the
                status of server and network services, remotely
                executing arbitrary commands directly on servers, and
                allowing servers to push notifications back in the form
                of passive monitoring. Nagios has been around since
                1999. Although newer monitoring services are available,
                Nagios is a tried-and-true systems administration
                staple.</para>
        </sidebar>
        <section xml:id="process_monitoring">
            <title>Process Monitoring</title>
            <para>A basic type of alert monitoring is to simply check
                whether a required process is running. For example,
                ensure that the <code>nova-api</code> service is running
                on the cloud controller:</para>
            <screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api --config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12792 0.0 0.0 96052 22856 ? S Feb11 0:01 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12793 0.0 0.3 290688 115516 ? S Feb11 1:23 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12794 0.0 0.2 248636 77068 ? S Feb11 0:04 /usr/bin/python /usr/bin/nova-api --config-file=/etc/nova/nova.conf
root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput></screen>
            <para>You can create automated alerts for critical processes
                by using Nagios and NRPE. For example, to ensure that
                the <code>nova-compute</code> process is running on
                compute nodes, create an alert on your Nagios server
                that looks like this:</para>
            <programlisting>define service {
    host_name c01.example.com
    check_command check_nrpe_1arg!check_nova-compute
    use generic-service
    notification_period 24x7
    contact_groups sysadmins
    service_description nova-compute
}</programlisting>
            <para>Then on the actual compute node, create the following
                NRPE configuration:</para>
            <programlisting>command[check_nova-compute]=/usr/lib/nagios/plugins/check_procs -c 1: -a nova-compute</programlisting>
            <para>Nagios now checks that at least one nova-compute
                process is running at all times.</para>
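            <para>Before wiring the check into Nagios, you can test it
                by hand by running the same plugin directly on the
                compute node:</para>
            <screen><prompt>#</prompt> <userinput>/usr/lib/nagios/plugins/check_procs -c 1: -a nova-compute</userinput></screen>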
        </section>
        <section xml:id="resource_alerting">
            <title>Resource Alerting</title>
            <para>Resource alerting provides notifications when one or
                more resources are critically low. While the monitoring
                thresholds should be tuned to your specific OpenStack
                environment, monitoring resource usage is not specific
                to OpenStack at all; any generic type of alert works
                fine.</para>
            <para>Some of the resources that you want to monitor
                include:</para>
            <itemizedlist>
                <listitem>
                    <para>Disk Usage</para>
                </listitem>
                <listitem>
                    <para>Server Load</para>
                </listitem>
                <listitem>
                    <para>Memory Usage</para>
                </listitem>
                <listitem>
                    <para>Network I/O</para>
                </listitem>
                <listitem>
                    <para>Available vCPUs</para>
                </listitem>
            </itemizedlist>
            <para>For example, to monitor disk capacity on a compute
                node with Nagios, add the following to your Nagios
                configuration:</para>
            <programlisting><?db-font-size 75%?>define service {
    host_name c01.example.com
    check_command check_nrpe!check_all_disks!20% 10%
    use generic-service
    contact_groups sysadmins
    service_description Disk
}</programlisting>
            <para>On the compute node, add the following to your NRPE
                configuration:</para>
            <programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c $ARG2$ -e</programlisting>
            <para>Nagios alerts you with a WARNING when any disk on the
                compute node is 80 percent full and with a CRITICAL
                alert when it is 90 percent full.</para>
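            <para>You can likewise run the plugin manually to see how
                the warning and critical thresholds map to the
                <code>$ARG1$</code> and <code>$ARG2$</code> arguments in
                the NRPE configuration:</para>
            <screen><prompt>#</prompt> <userinput>/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -e</userinput></screen>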
        </section>
        <section xml:id="metering_telemetry">
            <title>Metering and Telemetry with Ceilometer</title>
            <para>An integrated OpenStack project, code-named
                ceilometer, collects metering data and provides alerts
                for Compute, Storage, and Networking. Data collected by
                the metering system could be used for billing. Depending
                on deployment configuration, metered data may also be
                accessible to users. The Telemetry service provides a
                REST API documented at <link
                xlink:href="http://api.openstack.org/api-ref-telemetry.html"
                >http://api.openstack.org/api-ref-telemetry.html</link>.
                You can read more about the project at <link
                xlink:href="http://docs.openstack.org/developer/ceilometer/"
                >http://docs.openstack.org/developer/ceilometer/</link>.</para>
        </section>
        <section xml:id="os_resources">
            <title>OpenStack-specific Resources</title>
            <para>Resources such as memory, disk, and CPU are generic
                resources that all servers (even non-OpenStack servers)
                have, and they are important to the overall health of
                the server. When dealing with OpenStack specifically,
                these resources are important for a second reason:
                ensuring that enough are available to launch instances.
                There are a few ways you can see OpenStack resource
                usage.</para>
            <para>The first is through the <code>nova</code>
                command:</para>
            <screen><prompt>#</prompt> <userinput>nova usage-list</userinput></screen>
            <para>This command displays a list of how many instances a
                tenant has running and some light usage statistics about
                the combined instances. This command is useful for a
                quick overview of your cloud, but it doesn't really get
                into many details.</para>
            <para>Next, the <code>nova</code> database contains three
                tables that store usage information.</para>
            <para>The <code>nova.quotas</code> and
                <code>nova.quota_usages</code> tables store quota
                information. If a tenant's quota is different from the
                default quota settings, its quota is stored in the
                <code>nova.quotas</code> table. For example:</para>
            <screen><prompt>mysql></prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
<computeroutput>+----------------------------------+-----------------------------+------------+
| project_id                       | resource                    | hard_limit |
+----------------------------------+-----------------------------+------------+
| 628df59f091142399e0689a2696f5baa | metadata_items              |        128 |
| 628df59f091142399e0689a2696f5baa | injected_file_content_bytes |      10240 |
| 628df59f091142399e0689a2696f5baa | injected_files              |          5 |
| 628df59f091142399e0689a2696f5baa | gigabytes                   |       1000 |
| 628df59f091142399e0689a2696f5baa | ram                         |      51200 |
| 628df59f091142399e0689a2696f5baa | floating_ips                |         10 |
| 628df59f091142399e0689a2696f5baa | instances                   |         10 |
| 628df59f091142399e0689a2696f5baa | volumes                     |         10 |
| 628df59f091142399e0689a2696f5baa | cores                       |         20 |
+----------------------------------+-----------------------------+------------+</computeroutput></screen>
            <para>The <code>nova.quota_usages</code> table keeps track
                of how many resources the tenant currently has in
                use:</para>
            <screen><prompt>mysql></prompt> <userinput>select project_id, resource, in_use from quota_usages where project_id like '628%';</userinput>
<computeroutput>+----------------------------------+--------------+--------+
| project_id                       | resource     | in_use |
+----------------------------------+--------------+--------+
| 628df59f091142399e0689a2696f5baa | instances    |      1 |
| 628df59f091142399e0689a2696f5baa | ram          |    512 |
| 628df59f091142399e0689a2696f5baa | cores        |      1 |
| 628df59f091142399e0689a2696f5baa | floating_ips |      1 |
| 628df59f091142399e0689a2696f5baa | volumes      |      2 |
| 628df59f091142399e0689a2696f5baa | gigabytes    |     12 |
| 628df59f091142399e0689a2696f5baa | images       |      1 |
+----------------------------------+--------------+--------+</computeroutput></screen>
            <para>By comparing a tenant's hard limit with its current
                resource usage, you can see its usage percentage. For
                example, if this tenant is using 1 floating IP out of
                10, then it is using 10 percent of its floating IP
                quota. Rather than doing the calculation manually, you
                can use SQL or the scripting language of your choice and
                create a formatted report:</para>
            <screen><computeroutput>+-----------------------------------+------------+------------+---------------+
|                                 some_tenant                                 |
+-----------------------------------+------------+------------+---------------+
| Resource                          | Used       | Limit      |               |
+-----------------------------------+------------+------------+---------------+
| cores                             | 1          | 20         | 5 %           |
| floating_ips                      | 1          | 10         | 10 %          |
| gigabytes                         | 12         | 1000       | 1 %           |
| images                            | 1          | 4          | 25 %          |
| injected_file_content_bytes       | 0          | 10240      | 0 %           |
| injected_file_path_bytes          | 0          | 255        | 0 %           |
| injected_files                    | 0          | 5          | 0 %           |
| instances                         | 1          | 10         | 10 %          |
| key_pairs                         | 0          | 100        | 0 %           |
| metadata_items                    | 0          | 128        | 0 %           |
| ram                               | 512        | 51200      | 1 %           |
| reservation_expire                | 0          | 86400      | 0 %           |
| security_group_rules              | 0          | 20         | 0 %           |
| security_groups                   | 0          | 10         | 0 %           |
| volumes                           | 2          | 10         | 20 %          |
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>
            <para>The above report was generated using a custom script,
                which can be found on <link
                xlink:href="https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report"
                >GitHub</link>
                (https://github.com/cybera/novac/blob/dev/libexec/novac-quota-report).</para>
            <note>
                <para>This script is specific to a certain OpenStack
                    installation and must be modified to fit your
                    environment. However, the logic should be easily
                    transferable.</para>
            </note>
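            <para>If you only need the raw numbers, a single SQL join
                can compute the same percentages directly. The following
                is a sketch; quota table layouts vary between OpenStack
                releases, and your MySQL credentials will differ:</para>
            <screen><prompt>#</prompt> <userinput>mysql nova -e "SELECT q.resource, u.in_use, q.hard_limit, \
ROUND(u.in_use / q.hard_limit * 100) AS pct FROM quotas q JOIN quota_usages u \
USING (project_id, resource) WHERE q.project_id = '628df59f091142399e0689a2696f5baa'"</userinput></screen>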
        </section>
        <section xml:id="intelligent_alerting">
            <title>Intelligent Alerting</title>
            <para>Intelligent alerting can be thought of as a form of
                continuous integration for operations. For example, you
                can easily check whether the Image Service is up and
                running by ensuring that the <code>glance-api</code> and
                <code>glance-registry</code> processes are running, or
                by seeing whether <code>glance-api</code> is responding
                on port 9292.</para>
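            <para>The port check itself is easy to script. A minimal
                sketch, assuming <code>glance-api</code> listens on its
                default port on the local host, prints the HTTP status
                code returned by the service:</para>
            <screen><prompt>$</prompt> <userinput>curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9292/</userinput></screen>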
            <para>But how can you tell whether images are being
                successfully uploaded to the Image Service? Maybe the
                disk that the Image Service is storing the images on is
                full, or the S3 back end is down. You could naturally
                check this by doing a quick image upload:</para>
            <programlisting language="bash">#!/bin/bash
#
# assumes that reasonable credentials have been stored at
# /root/openrc

. /root/openrc
wget https://launchpad.net/cirros/trunk/0.3.0/+download/cirros-0.3.0-x86_64-disk.img
glance image-create --name='cirros image' --is-public=true --container-format=bare --disk-format=qcow2 &lt; cirros-0.3.0-x86_64-disk.img</programlisting>
            <para>By taking this script and rolling it into an alert for
                your monitoring system (such as Nagios), you now have an
                automated way of ensuring that image uploads to the
                Image Catalog are working.</para>
            <note>
                <para>You must remove the image after each test. Even
                    better, test whether you can successfully delete an
                    image from the Image Service.</para>
            </note>
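            <para>For example, a cleanup step at the end of the check
                might look like the following sketch, which looks up the
                ID of the image created by the script above and deletes
                it:</para>
            <screen><prompt>$</prompt> <userinput>glance image-delete $(glance image-list | awk '/cirros image/ {print $2}')</userinput></screen>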
            <para>Intelligent alerting takes considerably more time to
                plan and implement than the other alerts described in
                this chapter. A good outline for implementing
                intelligent alerting is:</para>
            <itemizedlist>
                <listitem>
                    <para>Review common actions in your cloud</para>
                </listitem>
                <listitem>
                    <para>Create ways to automatically test these
                        actions</para>
                </listitem>
                <listitem>
                    <para>Roll these tests into an alerting
                        system</para>
                </listitem>
            </itemizedlist>
            <para>Some other examples of intelligent alerting
                include:</para>
            <itemizedlist>
                <listitem>
                    <para>Can instances be launched and
                        destroyed?</para>
                </listitem>
                <listitem>
                    <para>Can users be created?</para>
                </listitem>
                <listitem>
                    <para>Can objects be stored and deleted?</para>
                </listitem>
                <listitem>
                    <para>Can volumes be created and destroyed?</para>
                </listitem>
            </itemizedlist>
        </section>
        <section xml:id="trending">
            <title>Trending</title>
            <para>Trending can give you great insight into how your
                cloud is performing day to day. For example, it can tell
                you whether a busy day was simply a rare occurrence or
                whether you should start adding new compute
                nodes.</para>
            <para>Trending takes a slightly different approach than
                alerting. While alerting is interested in a binary
                result (whether a check succeeds or fails), trending
                records the current state of something at a certain
                point in time. Once enough points in time have been
                recorded, you can see how the value has changed over
                time.</para>
            <para>All of the alert types mentioned earlier can also be
                used for trend reporting. Some other trend examples
                include:</para>
            <itemizedlist>
                <listitem>
                    <para>The number of instances on each compute
                        node</para>
                </listitem>
                <listitem>
                    <para>The types of flavors in use</para>
                </listitem>
                <listitem>
                    <para>The number of volumes in use</para>
                </listitem>
                <listitem>
                    <para>The number of Object Storage requests each
                        hour</para>
                </listitem>
                <listitem>
                    <para>The number of nova-api requests each
                        hour</para>
                </listitem>
                <listitem>
                    <para>The I/O statistics of your storage
                        services</para>
                </listitem>
            </itemizedlist>
            <para>As an example, recording <code>nova-api</code> usage
                can allow you to track the need to scale your cloud
                controller. By keeping an eye on <code>nova-api</code>
                requests, you can determine whether you need to spawn
                more nova-api processes or go as far as introducing an
                entirely new server to run <code>nova-api</code>. To get
                an approximate count of the requests, look for standard
                INFO messages in
                <filename>/var/log/nova/nova-api.log</filename>:</para>
            <screen><prompt>#</prompt> <userinput>grep INFO /var/log/nova/nova-api.log | wc</userinput></screen>
            <para>You can obtain further statistics by looking for the
                number of successful requests:</para>
            <screen><prompt>#</prompt> <userinput>grep " 200 " /var/log/nova/nova-api.log | wc</userinput></screen>
            <para>By running this command periodically and keeping a
                record of the result, you can create a trending report
                over time that shows whether your <code>nova-api</code>
                usage is increasing, decreasing, or holding
                steady.</para>
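            <para>As a sketch, a cron entry such as the following (the
                output path and interval are illustrative) appends a
                timestamped count every five minutes, producing a simple
                time series that you can graph later:</para>
            <programlisting>*/5 * * * * root echo "$(date +\%s) $(grep -c INFO /var/log/nova/nova-api.log)" &gt;&gt; /var/local/nova-api-trend.log</programlisting>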
            <para>A tool such as collectd can be used to store this
                information. While collectd is out of the scope of this
                book, a good starting point would be to use collectd to
                store the result as a COUNTER data type. More
                information can be found in <link
                xlink:href="https://collectd.org/wiki/index.php/Data_source"
                >collectd's documentation</link>
                (https://collectd.org/wiki/index.php/Data_source).</para>
        </section>
    </section>
    <section xml:id="ops-log-monitor-summary">
        <title>Summary</title>
        <para>For stable operations, you want to detect failure promptly
            and determine causes efficiently. With a distributed system,
            it's even more important to track the right items to meet a
            service-level target. Learning where logs are located in the
            file system or through the API gives you an advantage. This
            chapter also discussed how to read, interpret, and
            manipulate information from OpenStack services so that you
            can monitor effectively.</para>
    </section>
</chapter>
|