
<?xml version="1.0" encoding="UTF-8"?>
<chapter version="5.0" xml:id="logging_monitoring"
         xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:xi="http://www.w3.org/2001/XInclude"
         xmlns:ns5="http://www.w3.org/1999/xhtml"
         xmlns:ns4="http://www.w3.org/2000/svg"
         xmlns:ns3="http://www.w3.org/1998/Math/MathML"
         xmlns:ns="http://docbook.org/ns/docbook">
  <?dbhtml stop-chunking?>

  <title>Logging and Monitoring</title>

  <para>Because an OpenStack cloud is composed of many different services,
  there are a large number of log files. This chapter aims to assist you in
  locating and working with them, and describes other ways to track the
  status of your deployment.<indexterm class="singular">
      <primary>debugging</primary>

      <see>logging/monitoring; maintenance/debugging</see>
    </indexterm></para>

  <section xml:id="where_are_logs">
    <title>Where Are the Logs?</title>

    <para>Most services use the convention of writing their log files to
    subdirectories of the <code>/var/log</code> directory, as listed in <xref
    linkend="openstack-log-locations" />.<indexterm class="singular">
        <primary>cloud controllers</primary>

        <secondary>log information</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>log location</secondary>
      </indexterm></para>

    <table rules="all" xml:id="openstack-log-locations">
      <caption>OpenStack log locations</caption>

      <thead>
        <tr>
          <th>Node type</th>

          <th>Service</th>

          <th>Log location</th>
        </tr>
      </thead>

      <tbody>
        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>nova-*</code></para></td>

          <td><para><code>/var/log/nova</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>glance-*</code></para></td>

          <td><para><code>/var/log/glance</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>cinder-*</code></para></td>

          <td><para><code>/var/log/cinder</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>keystone-*</code></para></td>

          <td><para><code>/var/log/keystone</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para><code>neutron-*</code></para></td>

          <td><para><code>/var/log/neutron</code></para></td>
        </tr>

        <tr>
          <td><para>Cloud controller</para></td>

          <td><para>horizon</para></td>

          <td><para><code>/var/log/apache2/</code></para></td>
        </tr>

        <tr>
          <td><para>All nodes</para></td>

          <td><para>misc (swift, dnsmasq)</para></td>

          <td><para><code>/var/log/syslog</code></para></td>
        </tr>

        <tr>
          <td><para>Compute nodes</para></td>

          <td><para>libvirt</para></td>

          <td><para><code>/var/log/libvirt/libvirtd.log</code></para></td>
        </tr>

        <tr>
          <td><para>Compute nodes</para></td>

          <td><para>Console (boot-up messages) for VM instances</para></td>

          <td><para><code>/var/lib/nova/instances/instance-</code><phrase
          role="keep-together"><code><instance id>/console.log</code>
          </phrase></para></td>
        </tr>

        <tr>
          <td><para>Block Storage nodes</para></td>

          <td><para>cinder-volume</para></td>

          <td><para><code>/var/log/cinder/cinder-volume.log</code></para></td>
        </tr>
      </tbody>
    </table>
  </section>

  <section xml:id="how_to_read_logs">
    <title>Reading the Logs</title>

    <para>OpenStack services use the standard logging levels, at increasing
    severity: DEBUG, INFO, AUDIT, WARNING, ERROR, CRITICAL, and TRACE. That
    is, messages appear in the logs only if they are more "severe" than the
    configured log level, with DEBUG allowing all log statements through. For
    example, TRACE is logged only if the software has a stack trace, while
    INFO is logged for every message, including those that are only
    informational.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>logging levels</secondary>
      </indexterm></para>

    <para>To disable DEBUG-level logging, edit
    <filename>/etc/nova/nova.conf</filename> as follows:</para>

    <programlisting language="ini">debug=false</programlisting>

    <para>Keystone is handled a little differently. To modify the logging
    level, edit the <filename>/etc/keystone/logging.conf</filename> file and
    look at the <code>logger_root</code> and <code>handler_file</code>
    sections.</para>
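
    <para>As a sketch, those sections typically look like the following; the
    handler class, log path, and formatter name here are illustrative, so
    match them to the entries already present in your own file:</para>

    <programlisting language="ini"># set the root logger to WARNING and point it at the file handler
[logger_root]
level=WARNING
handlers=file

# the file handler appends to keystone's log file
[handler_file]
class=FileHandler
level=DEBUG
formatter=normal
args=('/var/log/keystone/keystone.log', 'a')</programlisting>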

    <para><phrase role="keep-together">Logging for horizon is configured in
    <filename>/etc/openstack_dashboard/local_</filename></phrase><filename>settings.py</filename>.
    Because horizon is a Django web application, it follows the <link
    xlink:href="http://opsgui.de/NPGgww" xlink:title="Django Logging">Django
    Logging framework conventions</link>.</para>
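
    <para>For example, a minimal sketch of a <code>LOGGING</code> dictionary
    in <filename>local_settings.py</filename> might look like this; the
    handler and logger names are illustrative, and your file likely already
    defines a more complete dictionary that you should extend instead:</para>

    <programlisting language="python">LOGGING = {
    'version': 1,
    'handlers': {
        # send log output to the console (picked up by the web server log)
        'console': {
            'level': 'INFO',  # change to 'DEBUG' for verbose output
            'class': 'logging.StreamHandler',
        },
    },
    'loggers': {
        # route all horizon messages through the console handler
        'horizon': {
            'handlers': ['console'],
            'propagate': False,
        },
    },
}</programlisting>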

    <para>The first step in finding the source of an error is typically to
    search for a CRITICAL, TRACE, or ERROR message in the log, starting at
    the bottom of the log file.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>reading log messages</secondary>
      </indexterm></para>
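
    <para>One quick way to do this from the command line is with
    <code>grep</code>; the log path below assumes the default locations
    listed earlier:</para>

    <screen><prompt>#</prompt> <userinput>grep -E "CRITICAL|ERROR|TRACE" /var/log/nova/nova-api.log | tail -20</userinput></screen>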

    <para>Here is an example of a CRITICAL log message, with the
    corresponding TRACE (Python traceback) immediately following:</para>

    <screen><computeroutput>2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group
 cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder Traceback (most recent call last):
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/bin/cinder-volume", line 48, in <module>
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 422, in wait
2013-02-25 21:05:51 17409 TRACE cinder _launcher.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 127, in wait
2013-02-25 21:05:51 17409 TRACE cinder service.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 166, in wait
2013-02-25 21:05:51 17409 TRACE cinder return self._exit_event.wait()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/event.py", line 116, in wait
2013-02-25 21:05:51 17409 TRACE cinder return hubs.get_hub().switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 177, in switch
2013-02-25 21:05:51 17409 TRACE cinder return self.greenlet.switch()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/eventlet/greenthread.py", line 192, in main
2013-02-25 21:05:51 17409 TRACE cinder result = function(*args, **kwargs)
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 88, in run_server
2013-02-25 21:05:51 17409 TRACE cinder server.start()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/service.py", line 159, in start
2013-02-25 21:05:51 17409 TRACE cinder self.manager.init_host()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/manager.py", line 95,
 in init_host
2013-02-25 21:05:51 17409 TRACE cinder self.driver.check_for_setup_error()
2013-02-25 21:05:51 17409 TRACE cinder File "/usr/lib/python2.7/dist-packages/cinder/volume/driver.py", line 116,
 in check_for_setup_error
2013-02-25 21:05:51 17409 TRACE cinder raise exception.VolumeBackendAPIException(data=exception_message)
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume
 backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder</computeroutput></screen>

    <para>In this example, <literal>cinder-volumes</literal> failed to start
    and has provided a stack trace, since its volume backend was unable to
    set up the storage volume—probably because the LVM volume group that the
    configuration expects does not exist.</para>

    <para>Here is an example error log:</para>

    <screen><computeroutput>2013-02-25 20:26:33 6619 ERROR nova.openstack.common.rpc.common [-] AMQP server on localhost:5672 is unreachable:
 [Errno 111] ECONNREFUSED. Trying again in 23 seconds.</computeroutput></screen>

    <para>In this error, a nova service has failed to connect to the RabbitMQ
    server because it received a connection refused error.</para>
  </section>

  <section xml:id="tracing_instance_request">
    <title>Tracing Instance Requests</title>

    <para>When an instance fails to behave properly, you will often have to
    trace activity associated with that instance across the log files of
    various <code>nova-*</code> services and across both the cloud controller
    and compute nodes.<indexterm class="singular">
        <primary>instances</primary>

        <secondary>tracing instance requests</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>tracing instance requests</secondary>
      </indexterm></para>

    <para>The typical way is to trace the UUID associated with an instance
    across the service logs.</para>

    <para>Consider the following example:</para>

    <screen><prompt>$</prompt> <userinput>nova list</userinput>
<computeroutput>+--------------------------------------+--------+--------+---------------------------+
| ID                                   | Name   | Status | Networks                  |
+--------------------------------------+--------+--------+---------------------------+
| faf7ded8-4a46-413b-b113-f19590746ffe | cirros | ACTIVE | novanetwork=192.168.100.3 |
+--------------------------------------+--------+--------+---------------------------+</computeroutput></screen>

    <para>Here, the ID associated with the instance is
    <code>faf7ded8-4a46-413b-b113-f19590746ffe</code>. If you search for this
    string on the cloud controller in the
    <filename>/var/log/nova-*.log</filename> files, it appears in
    <filename>nova-api.log</filename> and
    <filename>nova-scheduler.log</filename>. If you search for this on the
    compute nodes in <filename>/var/log/nova-*.log</filename>, it appears in
    <filename>nova-network.log</filename> and
    <filename>nova-compute.log</filename>. If no ERROR or CRITICAL messages
    appear, the most recent log entry that reports this UUID may provide a
    hint about what has gone wrong.</para>
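
    <para>A quick way to run this search on each node is with
    <code>grep</code>, using the UUID from the example above; adjust the path
    to wherever your deployment writes its nova logs, as listed
    earlier:</para>

    <screen><prompt>#</prompt> <userinput>grep faf7ded8-4a46-413b-b113-f19590746ffe /var/log/nova/nova-*.log</userinput></screen>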
  </section>

  <section xml:id="add_custom_logging">
    <title>Adding Custom Logging Statements</title>

    <para>If there is not enough information in the existing logs, you may
    need to add your own custom logging statements to the <code>nova-*</code>
    services.<indexterm class="singular">
        <primary>customization</primary>

        <secondary>custom log statements</secondary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>adding custom log statements</secondary>
      </indexterm></para>

    <para>The source files are located in
    <filename>/usr/lib/python2.7/dist-packages/nova</filename>.</para>

    <para>To add logging statements, the following lines should be near the
    top of the file. For most files, they are already there:</para>

    <programlisting language="python">from nova.openstack.common import log as logging
LOG = logging.getLogger(__name__)</programlisting>

    <para>To add a DEBUG logging statement, you would add a line such
    as:</para>

    <programlisting language="python">LOG.debug("This is a custom debugging statement")</programlisting>

    <para>You may notice that all the existing logging messages are preceded
    by an underscore and surrounded by parentheses, for example:</para>

    <programlisting language="python">LOG.debug(_("Logging statement appears here"))</programlisting>

    <para>This formatting is used to support translation of logging messages
    into different languages using the <link
    xlink:href="http://opsgui.de/1eLBlHT">gettext</link> internationalization
    library. You don't need to do this for your own custom log messages.
    However, if you want to contribute code back to the OpenStack project
    that includes logging statements, you must surround your log messages
    with underscores and parentheses.</para>
  </section>

  <section xml:id="rabbitmq">
    <title>RabbitMQ Web Management Interface or rabbitmqctl</title>

    <para>Aside from connection failures, RabbitMQ log files are generally
    not useful for debugging OpenStack-related issues. Instead, we recommend
    you use the RabbitMQ web management interface.<indexterm class="singular">
        <primary>RabbitMQ</primary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>RabbitMQ web management interface</secondary>
      </indexterm> Enable it on your cloud controller:<indexterm
      class="singular">
        <primary>cloud controllers</primary>

        <secondary>enabling RabbitMQ</secondary>
      </indexterm></para>

    <screen><prompt>#</prompt> <userinput>/usr/lib/rabbitmq/bin/rabbitmq-plugins enable rabbitmq_management</userinput></screen>

    <screen><prompt>#</prompt> <userinput>service rabbitmq-server restart</userinput></screen>

    <para>The RabbitMQ web management interface is accessible on your cloud
    controller at <emphasis>http://localhost:55672</emphasis>.</para>

    <note>
      <para>Ubuntu 12.04 installs RabbitMQ version 2.7.1, which uses port
      55672. RabbitMQ versions 3.0 and above use port 15672 instead. You can
      check which version of RabbitMQ you have running on your local Ubuntu
      machine by doing:</para>

      <screen><prompt>$</prompt> <userinput>dpkg -s rabbitmq-server | grep "Version:"</userinput>
<computeroutput>Version: 2.7.1-0ubuntu4</computeroutput></screen>
    </note>

    <para>An alternative to enabling the RabbitMQ web management interface is
    to use the <phrase role="keep-together"><literal>rabbitmqctl</literal></phrase> commands. For example,
    <literal>rabbitmqctl list_queues | grep cinder</literal> displays any
    messages left in the queue. If there are messages, it's a possible sign
    that cinder services didn't connect properly to RabbitMQ and might have
    to be restarted.</para>

    <para>Items to monitor for RabbitMQ include the number of items in each
    of the queues and the processing time statistics for the server.</para>
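
    <para>For example, the following command lists each queue together with
    its current message and consumer counts; these column names are standard
    <literal>rabbitmqctl</literal> queue info items:</para>

    <screen><prompt>#</prompt> <userinput>rabbitmqctl list_queues name messages consumers</userinput></screen>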
  </section>

  <section xml:id="manage_logs_centrally">
    <title>Centrally Managing Logs</title>

    <para>Because your cloud is most likely composed of many servers, you
    must check logs on each of those servers to properly piece an event
    together. A better solution is to send the logs of all servers to a
    central location so that they can all be accessed from the same
    area.<indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>central log management</secondary>
      </indexterm></para>

    <para>Ubuntu uses rsyslog as the default logging service. Since rsyslog
    is natively able to send logs to a remote location, you don't have to
    install anything extra to enable this feature; you need only modify the
    configuration file. In doing this, consider running your logging over a
    management network or using an encrypted VPN to avoid
    interception.</para>

    <section xml:id="rsyslog_client_config">
      <title>rsyslog Client Configuration</title>

      <para>To begin, configure all OpenStack components to log to syslog in
      addition to their standard log file location. Also configure each
      component to log to a different syslog facility. This makes it easier
      to split the logs into individual components on the central
      server:<indexterm class="singular">
          <primary>rsyslog</primary>
        </indexterm></para>

      <para><filename>nova.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL0</programlisting>

      <para><filename>glance-api.conf</filename> and
      <filename>glance-registry.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL1</programlisting>

      <para><filename>cinder.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL2</programlisting>

      <para><filename>keystone.conf</filename>:</para>

      <programlisting language="ini">use_syslog=True
syslog_log_facility=LOG_LOCAL3</programlisting>

      <para>By default, Object Storage logs to syslog.</para>

      <para>Next, create <filename>/etc/rsyslog.d/client.conf</filename> with
      the following line:</para>

      <programlisting language="ini">*.* @192.168.1.10</programlisting>

      <para>This instructs rsyslog to send all logs to the IP listed. In this
      example, the IP points to the cloud controller.</para>
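
      <para>After creating the file, restart rsyslog on the client so the
      change takes effect:</para>

      <screen><prompt>#</prompt> <userinput>service rsyslog restart</userinput></screen>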
    </section>

    <section xml:id="rsyslog_server_config">
      <title>rsyslog Server Configuration</title>

      <para>Designate a server as the central logging server. The best
      practice is to choose a server that is solely dedicated to this
      purpose. Create a file called
      <filename>/etc/rsyslog.d/server.conf</filename> with the following
      contents:</para>

      <programlisting language="ini"># Enable UDP
$ModLoad imudp
# Listen on 192.168.1.10 only
$UDPServerAddress 192.168.1.10
# Port 514
$UDPServerRun 514

# Create logging templates for nova
$template NovaFile,"/var/log/rsyslog/%HOSTNAME%/nova.log"
$template NovaAll,"/var/log/rsyslog/nova.log"

# Log everything else to syslog.log
$template DynFile,"/var/log/rsyslog/%HOSTNAME%/syslog.log"
*.* ?DynFile

# Log various openstack components to their own individual file
local0.* ?NovaFile
local0.* ?NovaAll
& ~</programlisting>

      <para>This example configuration handles the nova service only. It
      first configures rsyslog to act as a server that runs on port 514.
      Next, it creates a series of logging templates. Logging templates
      control where received logs are stored. Using the example above, a nova
      log from c01.example.com goes to the following locations:</para>

      <itemizedlist>
        <listitem>
          <para><filename>/var/log/rsyslog/c01.example.com/nova.log</filename></para>
        </listitem>

        <listitem>
          <para><filename>/var/log/rsyslog/nova.log</filename></para>
        </listitem>
      </itemizedlist>

      <para>This is useful, as logs from c02.example.com go to:</para>

      <itemizedlist>
        <listitem>
          <para><filename>/var/log/rsyslog/c02.example.com/nova.log</filename></para>
        </listitem>

        <listitem>
          <para><filename>/var/log/rsyslog/nova.log</filename></para>
        </listitem>
      </itemizedlist>

      <para>You have an individual log file for each compute node as well as
      an aggregated log that contains nova logs from all nodes.</para>
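
      <para>To confirm that messages are arriving, you can send a test
      message from any client with the standard <literal>logger</literal>
      utility; the tag used here is arbitrary:</para>

      <screen><prompt>#</prompt> <userinput>logger -t logtest "central logging test"</userinput></screen>

      <para>On the logging server, the message should then appear in that
      client's <filename>syslog.log</filename> file under
      <filename>/var/log/rsyslog/</filename>.</para>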
    </section>
  </section>

  <section xml:id="stacktach">
    <!-- FIXME This section needs updating, especially with the advent of
    ceilometer -->

    <title>StackTach</title>

    <para>StackTach is a tool created by Rackspace to collect and report the
    notifications sent by <code>nova</code>. Notifications are essentially
    the same as logs but can be much more detailed. A good overview of
    notifications can be found at <link xlink:href="http://opsgui.de/NPGh3H"
    xlink:title="StackTach GitHub repo">System Usage Data</link>.<indexterm
    class="singular">
        <primary>StackTach</primary>
      </indexterm><indexterm class="singular">
        <primary>logging/monitoring</primary>

        <secondary>StackTach tool</secondary>
      </indexterm></para>

    <para>To enable <code>nova</code> to send notifications, add the
    following to <filename>nova.conf</filename>:</para>

    <programlisting language="ini"><?db-font-size 75%?>notification_topics=monitor
notification_driver=nova.openstack.common.notifier.rabbit_notifier</programlisting>

    <para>Once <code>nova</code> is sending notifications, install and
    configure StackTach. Since StackTach is relatively new and constantly
    changing, installation instructions would quickly become outdated. Please
    refer to the <link xlink:href="http://opsgui.de/1eLBpqQ">StackTach GitHub
    repo</link> for instructions as well as a demo video.</para>
  </section>

  <section xml:id="monitoring">
    <title>Monitoring</title>

    <para>There are two types of monitoring: watching for problems and
    watching usage trends. The former ensures that all services are up and
    running, creating a functional cloud. The latter involves monitoring
    resource usage over time in order to make informed decisions about
    potential bottlenecks and upgrades.<indexterm class="singular">
        <primary>cloud controllers</primary>

        <secondary>process monitoring and</secondary>
      </indexterm></para>

    <?hard-pagebreak ?>

    <sidebar>
      <title>Nagios</title>

      <para>Nagios is an open source monitoring service. It's capable of
      executing arbitrary commands to check the status of server and network
      services, remotely executing arbitrary commands directly on servers,
      and allowing servers to push notifications back in the form of passive
      monitoring. Nagios has been around since 1999. Although newer
      monitoring services are available, Nagios is a tried-and-true systems
      administration staple.<indexterm class="singular">
          <primary>Nagios</primary>
        </indexterm></para>
    </sidebar>

    <section xml:id="process_monitoring">
      <title>Process Monitoring</title>

      <para>A basic type of alert monitoring is to simply check whether a
      required process is running.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>process monitoring</secondary>
        </indexterm><indexterm class="singular">
          <primary>process monitoring</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>process monitoring</secondary>
        </indexterm> For example, ensure that the <code>nova-api</code>
      service is running on the cloud controller:</para>

      <screen><prompt>#</prompt> <userinput>ps aux | grep nova-api</userinput>
<computeroutput>nova 12786 0.0 0.0 37952 1312 ? Ss Feb11 0:00 su -s /bin/sh -c exec nova-api
--config-file=/etc/nova/nova.conf nova
nova 12787 0.0 0.1 135764 57400 ? S Feb11 0:01 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12792 0.0 0.0 96052 22856 ? S Feb11 0:01 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12793 0.0 0.3 290688 115516 ? S Feb11 1:23 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
nova 12794 0.0 0.2 248636 77068 ? S Feb11 0:04 /usr/bin/python
/usr/bin/nova-api --config-file=/etc/nova/nova.conf
root 24121 0.0 0.0 11688 912 pts/5 S+ 13:07 0:00 grep nova-api</computeroutput></screen>

      <para>You can create automated alerts for critical processes by using
      Nagios and NRPE. For example, to ensure that the
      <code>nova-compute</code> process is running on compute nodes, create
      an alert on your Nagios server that looks like this:</para>

      <programlisting>define service {
    host_name c01.example.com
    check_command check_nrpe_1arg!check_nova-compute
    use generic-service
    notification_period 24x7
    contact_groups sysadmins
    service_description nova-compute
}</programlisting>

      <para>Then on the actual compute node, create the following NRPE
      configuration:</para>

      <programlisting>command[check_nova-compute]=/usr/lib/nagios/plugins/check_procs -c 1: \
-a nova-compute</programlisting>

      <para>Nagios checks that at least one <literal>nova-compute</literal>
      service is running at all times.</para>
    </section>

    <section xml:id="resource_alerting">
      <title>Resource Alerting</title>

      <para>Resource alerting provides notifications when one or more
      resources are critically low. While the monitoring thresholds should be
      tuned to your specific OpenStack environment, monitoring resource usage
      is not specific to OpenStack at all—any generic type of alert will work
      fine.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>resource alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>alerts</primary>

          <secondary>resource</secondary>
        </indexterm><indexterm class="singular">
          <primary>resources</primary>

          <secondary>resource alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>resource alerting</secondary>
        </indexterm></para>

      <para>Some of the resources that you want to monitor include:</para>

      <itemizedlist>
        <listitem>
          <para>Disk usage</para>
        </listitem>

        <listitem>
          <para>Server load</para>
        </listitem>

        <listitem>
          <para>Memory usage</para>
        </listitem>

        <listitem>
          <para>Network I/O</para>
        </listitem>

        <listitem>
          <para>Available vCPUs</para>
        </listitem>
      </itemizedlist>

      <para>For example, to monitor disk capacity on a compute node with
      Nagios, add the following to your Nagios configuration:</para>

      <programlisting><?db-font-size 75%?>define service {
    host_name c01.example.com
    check_command check_nrpe!check_all_disks!20% 10%
    use generic-service
    contact_groups sysadmins
    service_description Disk
}</programlisting>

      <para>On the compute node, add the following to your NRPE
      configuration:</para>

      <programlisting><?db-font-size 75%?>command[check_all_disks]=/usr/lib/nagios/plugins/check_disk -w $ARG1$ -c \
$ARG2$ -e</programlisting>

      <para>Nagios alerts you with a WARNING when any disk on the compute
      node is 80 percent full and CRITICAL when 90 percent is full.</para>
    </section>

    <section xml:id="metering_telemetry">
      <title>Metering and Telemetry with Ceilometer</title>

      <para>An integrated OpenStack project (code-named ceilometer) collects
      metering data and provides alerts for Compute, Storage, and Networking.
      Data collected by the metering system could be used for billing.
      Depending on deployment configuration, metered data may be accessible
      to users. The Telemetry service provides a REST API documented at <link
      xlink:href="http://api.openstack.org/api-ref-telemetry.html"></link>.
      You can read more about the project at <link
      xlink:href="http://docs.openstack.org/developer/ceilometer"></link>.<indexterm
      class="singular">
          <primary>monitoring</primary>

          <secondary>metering and telemetry</secondary>
        </indexterm><indexterm class="singular">
          <primary>telemetry/metering</primary>
        </indexterm><indexterm class="singular">
          <primary>metering/telemetry</primary>
        </indexterm><indexterm class="singular">
          <primary>ceilometer</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>ceilometer project</secondary>
        </indexterm></para>
    </section>

    <section xml:id="os_resources">
      <title>OpenStack-Specific Resources</title>

      <para>Resources such as memory, disk, and CPU are generic resources
      that all servers (even non-OpenStack servers) have, and they are
      important to the overall health of the server. When dealing with
      OpenStack specifically, these resources are important for a second
      reason: ensuring that enough are available to launch instances. There
      are a few ways you can see OpenStack resource usage.<indexterm
      class="singular">
          <primary>monitoring</primary>

          <secondary>OpenStack-specific resources</secondary>
        </indexterm><indexterm class="singular">
          <primary>resources</primary>

          <secondary>generic vs. OpenStack-specific</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>OpenStack-specific resources</secondary>
        </indexterm> The first is through the <code>nova</code>
      command:</para>

      <programlisting># nova usage-list</programlisting>

      <para>This command displays a list of how many instances a tenant has
      running and some light usage statistics about the combined instances.
      This command is useful for a quick overview of your cloud, but it
      doesn't really get into a lot of details.</para>

      <para>Next, the <code>nova</code> database contains several tables that
      store usage information.</para>

      <para>The <code>nova.quotas</code> and <code>nova.quota_usages</code>
      tables store quota information. If a tenant's quota is different from
      the default quota settings, its quota is stored in the <phrase
      role="keep-together"><code>nova.quotas</code></phrase> table. For
      example:</para>

      <screen><prompt>mysql></prompt> <userinput>select project_id, resource, hard_limit from quotas;</userinput>
<computeroutput>+----------------------------------+-----------------------------+------------+
| project_id                       | resource                    | hard_limit |
+----------------------------------+-----------------------------+------------+
| 628df59f091142399e0689a2696f5baa | metadata_items              | 128        |
| 628df59f091142399e0689a2696f5baa | injected_file_content_bytes | 10240      |
| 628df59f091142399e0689a2696f5baa | injected_files              | 5          |
| 628df59f091142399e0689a2696f5baa | gigabytes                   | 1000       |
| 628df59f091142399e0689a2696f5baa | ram                         | 51200      |
| 628df59f091142399e0689a2696f5baa | floating_ips                | 10         |
| 628df59f091142399e0689a2696f5baa | instances                   | 10         |
| 628df59f091142399e0689a2696f5baa | volumes                     | 10         |
| 628df59f091142399e0689a2696f5baa | cores                       | 20         |
+----------------------------------+-----------------------------+------------+</computeroutput></screen>

      <para>The <code>nova.quota_usages</code> table keeps track of how many
      resources the tenant currently has in use:</para>

      <screen><prompt>mysql></prompt> <userinput>select project_id, resource, in_use from quota_usages where project_id like '628%';</userinput>
<computeroutput>+----------------------------------+--------------+--------+
| project_id                       | resource     | in_use |
+----------------------------------+--------------+--------+
| 628df59f091142399e0689a2696f5baa | instances    | 1      |
| 628df59f091142399e0689a2696f5baa | ram          | 512    |
| 628df59f091142399e0689a2696f5baa | cores        | 1      |
| 628df59f091142399e0689a2696f5baa | floating_ips | 1      |
| 628df59f091142399e0689a2696f5baa | volumes      | 2      |
| 628df59f091142399e0689a2696f5baa | gigabytes    | 12     |
| 628df59f091142399e0689a2696f5baa | images       | 1      |
+----------------------------------+--------------+--------+</computeroutput></screen>

      <para>By comparing a tenant's hard limit with their current resource
      usage, you can see their usage percentage. For example, if this tenant
      is using 1 floating IP out of 10, then they are using 10 percent of
      their floating IP quota. Rather than doing the calculation manually,
      you can use SQL or the scripting language of your choice and create a
      formatted report:</para>

      <screen><computeroutput>+-----------------------------------+------------+------------+---------------+
| some_tenant                                                                 |
+-----------------------------------+------------+------------+---------------+
| Resource                          | Used       | Limit      |               |
+-----------------------------------+------------+------------+---------------+
| cores                             | 1          | 20         | 5 %           |
| floating_ips                      | 1          | 10         | 10 %          |
| gigabytes                         | 12         | 1000       | 1 %           |
| images                            | 1          | 4          | 25 %          |
| injected_file_content_bytes       | 0          | 10240      | 0 %           |
| injected_file_path_bytes          | 0          | 255        | 0 %           |
| injected_files                    | 0          | 5          | 0 %           |
| instances                         | 1          | 10         | 10 %          |
| key_pairs                         | 0          | 100        | 0 %           |
| metadata_items                    | 0          | 128        | 0 %           |
| ram                               | 512        | 51200      | 1 %           |
| reservation_expire                | 0          | 86400      | 0 %           |
| security_group_rules              | 0          | 20         | 0 %           |
| security_groups                   | 0          | 10         | 0 %           |
| volumes                           | 2          | 10         | 20 %          |
+-----------------------------------+------------+------------+---------------+</computeroutput></screen>

      <para>The preceding information was generated by using a custom script
      that can be found on <link
      xlink:href="http://opsgui.de/NPGjbX">GitHub</link>.</para>

      <note>
        <para>This script is specific to a certain OpenStack installation and
        must be modified to fit your environment. However, the logic should
        easily be transferable.</para>
      </note>
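
      <para>If you only need quick percentages, a single SQL join is a sketch
      of the same calculation; remember that only tenants with non-default
      quotas have rows in the <code>quotas</code> table, so this join misses
      tenants that use the defaults:</para>

      <screen><prompt>mysql></prompt> <userinput>select u.project_id, u.resource, u.in_use, q.hard_limit,
round(100 * u.in_use / q.hard_limit) as percent_used
from quota_usages u join quotas q
on u.project_id = q.project_id and u.resource = q.resource;</userinput></screen>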
    </section>

    <section xml:id="intelligent_alerting">
      <title>Intelligent Alerting</title>

      <para>Intelligent alerting can be thought of as a form of continuous
      integration for operations. For example, you can easily check to see
      whether the Image Service is up and running by ensuring that the
      <code>glance-api</code> and <code>glance-registry</code> processes are
      running or by seeing whether <code>glance-api</code> is responding on
      port 9292.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>intelligent alerting</secondary>
        </indexterm><indexterm class="singular">
          <primary>alerts</primary>

          <secondary>intelligent</secondary>

          <seealso>logging/monitoring</seealso>
        </indexterm><indexterm class="singular">
          <primary>intelligent alerting</primary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>intelligent alerting</secondary>
        </indexterm></para>

      <para>But how can you tell whether images are being successfully
      uploaded to the Image Service? Maybe the disk that the Image Service is
      storing the images on is full or the S3 backend is down. You could
      naturally check this by doing a quick image upload:</para>

      <?hard-pagebreak ?>

      <programlisting language="bash">#!/bin/bash
#
# assumes that reasonable credentials have been stored at
# /root/openrc

. /root/openrc
wget https://launchpad.net/cirros/trunk/0.3.0/+download/cirros-0.3.0-x86_64-disk.img
glance image-create --name='cirros image' --is-public=true \
    --container-format=bare --disk-format=qcow2 < cirros-0.3.0-x86_64-disk.img</programlisting>

      <para>By taking this script and rolling it into an alert for your
      monitoring system (such as Nagios), you now have an automated way of
      ensuring that image uploads to the Image Catalog are working.</para>

      <note>
        <para>You must remove the image after each test. Even better, test
        whether you can successfully delete an image from the Image
        Service.</para>
      </note>
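
      <para>For example, a cleanup step at the end of the script might look
      like the following; this assumes the name resolves to exactly one
      image, so a production check would capture the image ID at creation
      time and delete by ID instead:</para>

      <programlisting language="bash"># remove the test image so the check can be repeated
glance image-delete 'cirros image'</programlisting>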

      <para>Intelligent alerting takes considerably more time to plan and
      implement than the other alerts described in this chapter. A good
      outline to implement intelligent alerting is:</para>

      <itemizedlist>
        <listitem>
          <para>Review common actions in your cloud.</para>
        </listitem>

        <listitem>
          <para>Create ways to automatically test these actions.</para>
        </listitem>

        <listitem>
          <para>Roll these tests into an alerting system.</para>
        </listitem>
      </itemizedlist>

      <para>Some other examples for intelligent alerting include:</para>

      <itemizedlist>
        <listitem>
          <para>Can instances launch and be destroyed?</para>
        </listitem>

        <listitem>
          <para>Can users be created?</para>
        </listitem>

        <listitem>
          <para>Can objects be stored and deleted?</para>
        </listitem>

        <listitem>
          <para>Can volumes be created and destroyed?</para>
        </listitem>
      </itemizedlist>
    </section>

    <section xml:id="trending">
      <title>Trending</title>

      <para>Trending can give you great insight into how your cloud is
      performing day to day. You can learn, for example, if a busy day was
      simply a rare occurrence or if you should start adding new compute
      nodes.<indexterm class="singular">
          <primary>monitoring</primary>

          <secondary>trending</secondary>

          <seealso>logging/monitoring</seealso>
        </indexterm><indexterm class="singular">
          <primary>trending</primary>

          <secondary>monitoring cloud performance with</secondary>
        </indexterm><indexterm class="singular">
          <primary>logging/monitoring</primary>

          <secondary>trending</secondary>
        </indexterm></para>

      <para>Trending takes a slightly different approach than alerting. While
      alerting is interested in a binary result (whether a check succeeds or
      fails), trending records the current state of something at a certain
      point in time. Once enough points in time have been recorded, you can
      see how the value has changed over time.<indexterm class="singular">
          <primary>trending</primary>

          <secondary>vs. alerts</secondary>
        </indexterm><indexterm class="singular">
          <primary>binary</primary>

          <secondary>binary results in trending</secondary>
        </indexterm></para>

      <para>All of the alert types mentioned earlier can also be used for
      trend reporting. Some other trend examples include:<indexterm
      class="singular">
          <primary>trending</primary>

          <secondary>report examples</secondary>
        </indexterm></para>

      <itemizedlist>
        <listitem>
          <para>The number of instances on each compute node</para>
        </listitem>

        <listitem>
          <para>The types of flavors in use</para>
        </listitem>

        <listitem>
          <para>The number of volumes in use</para>
        </listitem>

        <listitem>
          <para>The number of Object Storage requests each hour</para>
        </listitem>

        <listitem>
          <para>The number of <literal>nova-api</literal> requests each
          hour</para>
        </listitem>

        <listitem>
          <para>The I/O statistics of your storage services</para>
        </listitem>
      </itemizedlist>

      <para>As an example, recording <code>nova-api</code> usage can allow
      you to track the need to scale your cloud controller. By keeping an eye
      on <code>nova-api</code> requests, you can determine whether you need
      to spawn more <literal>nova-api</literal> processes or go as far as
      introducing an entirely new server to run <code>nova-api</code>. To get
      an approximate count of the requests, look for standard INFO messages
      in <code>/var/log/nova/nova-api.log</code>:</para>

      <programlisting># grep INFO /var/log/nova/nova-api.log | wc</programlisting>

      <para>You can obtain further statistics by looking for the number of
      successful requests:</para>

      <programlisting># grep " 200 " /var/log/nova/nova-api.log | wc</programlisting>

      <para>By running this command periodically and keeping a record of the
      result, you can create a trending report over time that shows whether
      your <code>nova-api</code> usage is increasing, decreasing, or keeping
      steady.</para>
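
      <para>A minimal sketch of such periodic recording is a cron entry (in
      <filename>/etc/cron.d</filename> style, hence the user field) that
      appends a timestamped count to a file; the output path is arbitrary,
      and the <code>%</code> in the <code>date</code> format must be escaped
      because cron treats it specially:</para>

      <programlisting># count INFO lines in the nova-api log every five minutes
*/5 * * * * root echo "$(date +\%s) $(grep -c INFO /var/log/nova/nova-api.log)" >> /var/tmp/nova-api-trend.log</programlisting>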

      <para>A tool such as collectd can be used to store this information.
      While collectd is out of the scope of this book, a good starting point
      would be to use collectd to store the result as a COUNTER data type.
      More information can be found in <link
      xlink:href="http://opsgui.de/1eLBriA">collectd's
      documentation</link>.</para>
    </section>
  </section>

  <section xml:id="ops-log-monitor-summary">
    <title>Summary</title>

    <para>For stable operations, you want to detect failure promptly and
    determine causes efficiently. With a distributed system, it's even more
    important to track the right items to meet a service-level target.
    Knowing where these logs are located in the file system and how they are
    exposed through the API gives you an advantage. This chapter also showed
    how to read, interpret, and manipulate information from OpenStack
    services so that you can monitor effectively.</para>
  </section>
</chapter>