
<?xml version="1.0" encoding="UTF-8"?>
|
|
<!DOCTYPE appendix [
|
|
<!ENTITY % openstack SYSTEM "openstack.ent">
|
|
%openstack;
|
|
]>
|
|
<appendix xmlns="http://docbook.org/ns/docbook"
|
|
xmlns:xi="http://www.w3.org/2001/XInclude"
|
|
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
|
|
xml:id="app_crypt" label="B">
|
|
<title>Tales From the Cryp^H^H^H^H Cloud</title>
|
|
|
|
<para>Herein lies a selection of tales from OpenStack cloud operators. Read,
|
|
and learn from their wisdom.</para>
|
|
|
|
<section xml:id="double_vlan">
|
|
<title>Double VLAN</title>
|
|
<para>I was on-site in Kelowna, British Columbia, Canada
|
|
setting up a new OpenStack cloud. The deployment was fully
|
|
automated: Cobbler deployed the OS on the bare metal,
|
|
bootstrapped it, and Puppet took over from there. I had
|
|
run the deployment scenario so many times in practice and
|
|
took for granted that everything was working.</para>
|
|
<para>On my last day in Kelowna, I was in a conference call
|
|
from my hotel. In the background, I was fooling around on
|
|
the new cloud. I launched an instance and logged in.
|
|
Everything looked fine. Out of boredom, I ran
|
|
<command>ps aux</command> and
|
|
all of the sudden the instance locked up.</para>
|
|
<para>Thinking it was just a one-off issue, I terminated the
|
|
instance and launched a new one. By then, the conference
|
|
call ended and I was off to the data center.</para>
|
|
<para>At the data center, I was finishing up some tasks and remembered
|
|
the lock-up. I logged into the new instance and ran <command>ps
|
|
aux</command> again. It worked. Phew. I decided to run it one
|
|
more time. It locked up.</para>
|
|
<para>After reproducing the problem several times, I came to
|
|
the unfortunate conclusion that this cloud did indeed have
|
|
a problem. Even worse, my time was up in Kelowna and I had
|
|
to return back to Calgary.</para>
|
|
<para>Where do you even begin troubleshooting something like
|
|
this? An instance that just randomly locks up when a command is
|
|
issued. Is it the image? Nope—it happens on all images.
|
|
Is it the compute node? Nope—all nodes. Is the instance
|
|
locked up? No! New SSH connections work just fine!</para>
|
|
<para>We reached out for help. A networking engineer suggested
|
|
it was an MTU issue. Great! MTU! Something to go on!
|
|
What's MTU and why would it cause a problem?</para>
|
|
<para>MTU is maximum transmission unit. It specifies the
|
|
maximum number of bytes that the interface accepts for
|
|
each packet. If two interfaces have two different MTUs,
|
|
bytes might get chopped off and weird things happen—such
|
|
as random session lockups.</para>
        <note>
            <para>Not all packets have a size of 1500. Running the <command>ls</command>
                command over SSH might only create a single packet
                of less than 1500 bytes. However, running a command with
                heavy output, such as <command>ps aux</command>,
                requires several packets of 1500 bytes.</para>
        </note>
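        <para>A quick way to check for this class of problem is to compare an
            interface's MTU with what actually makes it across the path. The
            following is only a sketch; the interface name and address are
            placeholders:</para>
        <screen><prompt>$</prompt> <userinput>ip link show eth0</userinput>
<prompt>$</prompt> <userinput>ping -M do -s 1472 -c 3 203.0.113.1</userinput></screen>
        <para>The <literal>-M do</literal> flag sets the don't-fragment bit, and
            1472 bytes of payload plus 28 bytes of ICMP and IP headers makes a
            full 1500-byte packet. If that ping fails while a smaller one
            succeeds, something along the path is adding or dropping
            bytes.</para>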
        <para>OK, so where is the MTU issue coming from? Why haven't
            we seen this in any other deployment? What's new in this
            situation? Well, new data center, new uplink, new
            switches, new model of switches, new servers, first time
            using this model of servers… so, basically everything was
            new. Wonderful. We toyed around with raising the MTU in
            various areas: the switches, the NICs on the compute
            nodes, the virtual NICs in the instances; we even had the
            data center raise the MTU for our uplink interface. Some
            changes worked, some didn't. This line of troubleshooting
            didn't feel right, though. We shouldn't have to be
            changing the MTU in these areas.</para>
        <para>As a last resort, our network admin (Alvaro) and I
            sat down with four terminal windows, a pencil, and a piece
            of paper. In one window, we ran ping. In the second
            window, we ran <command>tcpdump</command> on the cloud
            controller. In the third, <command>tcpdump</command> on
            the compute node. And the fourth had <command>tcpdump</command>
            on the instance. For background, this cloud was a
            multi-node, non-multi-host setup.</para>
        <para>One cloud controller acted as a gateway to all compute
            nodes. VlanManager was used for the network config. This
            means that the cloud controller and all compute nodes had
            a different VLAN for each OpenStack project. We used the
            <literal>-s</literal> option of <command>ping</command> to change the packet size.
            We watched as sometimes packets would fully return,
            sometimes they'd only make it out and never back in, and
            sometimes the packets would stop at a random point. We
            changed <command>tcpdump</command> to start displaying the
            hex dump of the packet. We pinged between every
            combination of outside, controller, compute, and
            instance.</para>
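        <para>Nothing about the commands themselves was exotic. Something along
            these lines, with placeholder addresses, is enough to watch a
            tagged packet hop by hop:</para>
        <screen><prompt>$</prompt> <userinput>ping -c 1 -s 1400 203.0.113.20</userinput>
<prompt>$</prompt> <userinput>tcpdump -n -e -XX -i bond0 icmp</userinput></screen>
        <para>The <literal>-e</literal> flag prints the link-level header, so an
            unexpected 802.1Q tag shows up right in the capture, and
            <literal>-XX</literal> dumps each packet in hex.</para>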
        <para>Finally, Alvaro noticed something. When a packet from
            the outside hits the cloud controller, it should not be
            configured with a VLAN. We verified this as true. When the
            packet went from the cloud controller to the compute node,
            it should only have a VLAN if it was destined for an
            instance. This was still true. When the ping reply was
            sent from the instance, it should be in a VLAN. True. When
            it came back to the cloud controller and on its way out to
            the Internet, it should no longer have a VLAN.
            False. Uh oh. It looked as though the VLAN part of the
            packet was not being removed.</para>
        <para>That made no sense.</para>
        <para>While bouncing this idea around in our heads, I was
            randomly typing commands on the compute node:
            <screen><prompt>$</prompt> <userinput>ip a</userinput>
<computeroutput>…
10: vlan100@vlan20: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue master br100 state UP
…</computeroutput></screen>
        </para>
        <para>"Hey Alvaro, can you run a VLAN on top of a
            VLAN?"</para>
        <para>"If you did, you'd add an extra 4 bytes to the
            packet…"</para>
        <para>Then it all made sense…
            <screen><prompt>$</prompt> <userinput>grep vlan_interface /etc/nova/nova.conf</userinput>
<computeroutput>vlan_interface=vlan20</computeroutput></screen>
        </para>
        <para>In <filename>nova.conf</filename>, <code>vlan_interface</code>
            specifies what interface OpenStack should attach all VLANs
            to. The correct setting should have been:
            <programlisting>vlan_interface=bond0</programlisting>
            as this is the server's bonded NIC.</para>
        <para>vlan20 is the VLAN that the data center gave us for
            outgoing Internet access. It's a correct VLAN and
            is also attached to bond0.</para>
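        <para>If you ever need to confirm which parent device a VLAN interface
            is riding on, the kernel reports it directly. A sketch, with the
            output trimmed:</para>
        <screen><prompt>$</prompt> <userinput>ip -d link show vlan20</userinput>
<computeroutput>... vlan20@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 ...
    vlan id 20 ...</computeroutput></screen>
        <para>The <literal>@bond0</literal> suffix in the interface name is the
            parent device; a tenant VLAN stacked on another VLAN would show up
            here as something like <literal>vlan100@vlan20</literal>.</para>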
        <para>By mistake, I configured OpenStack to attach all tenant
            VLANs to vlan20 instead of bond0, thereby stacking one VLAN
            on top of another. This added an extra 4 bytes to
            each packet, so 1504-byte packets were sent out, which
            caused problems when they arrived at an interface that
            only accepted 1500.</para>
        <para>As soon as this setting was fixed, everything
            worked.</para>
    </section>

    <section xml:id="issue">
        <title>"The Issue"</title>
        <para>At the end of August 2012, a post-secondary school in
            Alberta, Canada migrated its infrastructure to an
            OpenStack cloud. As luck would have it, within the first
            day or two of it running, one of their servers just
            disappeared from the network. Blip. Gone.</para>
        <para>After restarting the instance, everything was back up
            and running. We reviewed the logs and saw that at some
            point, network communication stopped and then everything
            went idle. We chalked this up to a random
            occurrence.</para>
        <para>A few nights later, it happened again.</para>
        <para>We reviewed both sets of logs. The one thing that stood
            out the most was DHCP. At the time, OpenStack, by default,
            set DHCP leases for one minute (it's now two minutes).
            This means that every instance
            contacts the cloud controller (DHCP server) to renew its
            fixed IP. For some reason, this instance could not renew
            its IP. We correlated the instance's logs with the logs on
            the cloud controller and put together a
            conversation:</para>
        <orderedlist>
            <listitem>
                <para>Instance tries to renew IP.</para>
            </listitem>
            <listitem>
                <para>Cloud controller receives the renewal request
                    and sends a response.</para>
            </listitem>
            <listitem>
                <para>Instance "ignores" the response and re-sends the
                    renewal request.</para>
            </listitem>
            <listitem>
                <para>Cloud controller receives the second request and
                    sends a new response.</para>
            </listitem>
            <listitem>
                <para>Instance begins sending a renewal request to
                    <code>255.255.255.255</code> since it hasn't
                    heard back from the cloud controller.</para>
            </listitem>
            <listitem>
                <para>The cloud controller receives the
                    <code>255.255.255.255</code> request and sends
                    a third response.</para>
            </listitem>
            <listitem>
                <para>The instance finally gives up.</para>
            </listitem>
        </orderedlist>
        <para>With this information in hand, we were sure that the
            problem had to do with DHCP. We thought that for some
            reason, the instance wasn't getting a new IP address and
            with no IP, it shut itself off from the network.</para>
        <para>A quick Google search turned up this: <link
            xlink:href="https://lists.launchpad.net/openstack/msg11696.html"
            >DHCP lease errors in VLAN mode</link>
            (https://lists.launchpad.net/openstack/msg11696.html),
            which further supported our DHCP theory.</para>
        <para>An initial idea was to just increase the lease time. If
            the instance only renewed once every week, the chances of
            this problem happening would be tremendously smaller than
            every minute. This didn't solve the problem, though. It
            was just covering the problem up.</para>
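        <para>For reference, the lease length is a single option in
            <filename>nova.conf</filename> when nova-network hands out the
            addresses. Something like the following sets it to a week; the
            value here is purely an illustration:</para>
        <programlisting>dhcp_lease_time=604800</programlisting>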
        <para>We decided to have <command>tcpdump</command> run on this
            instance and see if we could catch it in action again.
            Sure enough, we did.</para>
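        <para>Leaving a capture running unattended is easiest when it writes
            straight to a file for later analysis. A minimal sketch, assuming
            eth0 inside the instance:</para>
        <screen><prompt>$</prompt> <userinput>tcpdump -n -i eth0 -s 0 -w /var/tmp/issue.pcap</userinput></screen>
        <para>The resulting file can be replayed at any time with
            <command>tcpdump -r</command> or opened in Wireshark.</para>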
        <para>The <command>tcpdump</command> looked very, very weird. In
            short, it looked as though network communication stopped
            before the instance tried to renew its IP. Since there is
            so much DHCP chatter from a one-minute lease, it's very
            hard to confirm, but even with only milliseconds of
            difference between packets, if one packet arrives first,
            it arrived first, and if that packet reported network
            issues, then it had to have happened before DHCP.</para>
        <para>Additionally, the instance in question was responsible
            for a very, very large backup job each night. While "The
            Issue" (as we were now calling it) didn't happen exactly
            when the backup happened, it was close enough (a few
            hours) that we couldn't ignore it.</para>
        <para>Further days go by and we catch The Issue in action more
            and more. We find that dhclient is not running after The
            Issue happens. Now we're back to thinking it's a DHCP
            issue. Running <command>/etc/init.d/networking restart</command>
            brings everything back up and running.</para>
        <para>Ever have one of those days where all of a sudden you
            get the Google results you were looking for? Well, that's
            what happened here. I was looking for information on
            dhclient and why it dies when it can't renew its lease, and
            all of a sudden I found a bunch of OpenStack and dnsmasq
            discussions that were identical to the problem we were
            seeing!</para>
        <para>
            <link
            xlink:href="http://www.gossamer-threads.com/lists/openstack/operators/18197"
            >Problem with Heavy Network IO and Dnsmasq</link>
            (http://www.gossamer-threads.com/lists/openstack/operators/18197)
        </para>
        <para>
            <link
            xlink:href="http://www.gossamer-threads.com/lists/openstack/dev/14696"
            >instances losing IP address while running, due to No
            DHCPOFFER</link>
            (http://www.gossamer-threads.com/lists/openstack/dev/14696)</para>
        <para>Seriously, Google.</para>
        <para>This bug report was the key to everything:
            <link
            xlink:href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978"
            >KVM images lose connectivity with bridged
            network</link>
            (https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978)</para>
        <para>It was funny to read the report. It was full of people
            who had some strange network problem but didn't quite
            explain it in the same way.</para>
        <para>So it was a qemu/kvm bug.</para>
        <para>At the same time as finding the bug report, a co-worker
            was able to successfully reproduce The Issue! How? He used
            <command>iperf</command> to spew a ton of bandwidth at an instance. Within 30
            minutes, the instance just disappeared from the
            network.</para>
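        <para>The reproduction needed nothing fancier than iperf on both ends.
            A sketch, with a placeholder address for the instance:</para>
        <screen><prompt>instance$</prompt> <userinput>iperf -s</userinput>
<prompt>client$</prompt> <userinput>iperf -c 203.0.113.30 -P 8 -t 3600</userinput></screen>
        <para>Eight parallel streams for an hour gives the kind of sustained
            load described above; adjust to taste.</para>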
        <para>Armed with a patched qemu and a way to reproduce, we set
            out to see if we had finally solved The Issue. After 48
            hours straight of hammering the instance with bandwidth,
            we were confident. The rest is history. You can search the
            bug report for "joe" to find my comments and actual
            tests.</para>
    </section>

    <section xml:id="disappear_images">
        <title>Disappearing Images</title>
        <para>At the end of 2012, Cybera (a nonprofit with a mandate
            to oversee the development of cyberinfrastructure in
            Alberta, Canada) deployed an updated OpenStack cloud for
            their <link xlink:title="DAIR project"
            xlink:href="http://www.canarie.ca/cloud/"
            >DAIR project</link>
            (http://www.canarie.ca/en/dair-program/about). A few days
            into production, a compute node locked up. Upon rebooting
            the node, I checked to see what instances were hosted on
            that node so I could boot them on behalf of the customer.
            Luckily, only one instance.</para>
        <para>The <command>nova reboot</command> command wasn't working, so
            I used <command>virsh</command>, but it immediately came back
            with an error saying it was unable to find the backing
            disk. In this case, the backing disk is the Glance image
            that is copied to
            <filename>/var/lib/nova/instances/_base</filename> when the
            image is used for the first time. Why couldn't it find it?
            I checked the directory and sure enough it was
            gone.</para>
        <para>I reviewed the <code>nova</code> database and saw the
            instance's entry in the <code>nova.instances</code> table.
            The image that the instance was using matched what virsh
            was reporting, so no inconsistency there.</para>
        <para>I checked Glance and noticed that this image was a
            snapshot that the user had created. At least that was good
            news—this user would have been the only user
            affected.</para>
        <para>Finally, I checked StackTach and reviewed the user's events. They
            had created and deleted several snapshots—most likely
            experimenting. Although the timestamps didn't match up, my
            conclusion was that they launched their instance and then deleted
            the snapshot and it was somehow removed from
            <filename>/var/lib/nova/instances/_base</filename>. None of that
            made sense, but it was the best I could come up with.</para>
        <para>It turns out the reason that this compute node locked up
            was a hardware issue. We removed it from the DAIR cloud
            and called Dell to have it serviced. Dell arrived and
            began working. Somehow or another (or a fat finger), a
            different compute node was bumped and rebooted.
            Great.</para>
        <para>When this node fully booted, I ran through the same
            scenario of seeing what instances were running so I could
            turn them back on. There were a total of four. Three
            booted and one gave an error. It was the same error as
            before: unable to find the backing disk. Seriously,
            what?</para>
        <para>Again, it turns out that the image was a snapshot. The
            three other instances that successfully started were
            standard cloud images. Was it a problem with snapshots?
            That didn't make sense.</para>
        <para>A note about DAIR's architecture:
            <filename>/var/lib/nova/instances</filename> is a shared NFS
            mount. This means that all compute nodes have access to
            it, which includes the <code>_base</code> directory.
            Another centralized area is <filename>/var/log/rsyslog</filename>
            on the cloud controller. This directory collects all
            OpenStack logs from all compute nodes. I wondered if there
            were any entries for the file that <command>virsh</command> was
            reporting:
            <screen><computeroutput>dair-ua-c03/nova.log:Dec 19 12:10:59 dair-ua-c03
2012-12-19 12:10:59 INFO nova.virt.libvirt.imagecache
[-] Removing base file:
/var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10</computeroutput></screen>
        </para>
        <para>Ah-hah! So OpenStack was deleting it. But why?</para>
        <para>A feature was introduced in Essex to periodically check
            whether there were any <code>_base</code> files not in use.
            If there
            were, OpenStack Compute would delete them. This idea sounds innocent
            enough and has some good qualities to it. But how did this
            feature end up turned on? It was disabled by default in
            Essex. As it should be. It was <link
            xlink:href="https://bugs.launchpad.net/nova/+bug/1029674"
            >enabled by default in Folsom</link>
            (https://bugs.launchpad.net/nova/+bug/1029674). I cannot
            emphasize enough that:</para>
        <para>
            <emphasis>Actions which delete things should not be
                enabled by default.</emphasis>
        </para>
        <para>Disk space is cheap these days. Data recovery is
            not.</para>
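        <para>If you would rather make that call yourself, the clean-up can be
            turned off, or at least slowed down, in <filename>nova.conf</filename>.
            A sketch of the relevant options:</para>
        <programlisting># Do not delete unused base images at all
remove_unused_base_images=False
# Or, if removal stays enabled, keep base files around for at least a day
remove_unused_original_minimum_age_seconds=86400</programlisting>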
        <para>Secondly, DAIR's shared
            <filename>/var/lib/nova/instances</filename> directory
            contributed to the problem. Since all compute nodes have
            access to this directory, all compute nodes periodically
            review the <code>_base</code> directory. If there is only one instance
            using an image, and the node that the instance is on is
            down for a few minutes, it won't be able to mark the image
            as still in use. Therefore, the image seems like it's not
            in use and is deleted. When the compute node comes back
            online, the instance hosted on that node is unable to
            start.</para>
    </section>

    <section xml:id="valentines">
        <title>The Valentine's Day Compute Node Massacre</title>
        <para>Although the title of this story is much more dramatic
            than the actual event, I don't think, or hope, that I'll
            have the opportunity to use "Valentine's Day Massacre"
            again in a title.</para>
        <para>This past Valentine's Day, I received an alert that a
            compute node was no longer available in the cloud—meaning,
            <screen><prompt>$</prompt> <userinput>nova service-list</userinput></screen>
            showed this particular node in a down state.</para>
        <para>I logged into the cloud controller and was able to both
            <command>ping</command> and SSH into the problematic compute node, which
            seemed very odd. Usually if I receive this type of alert,
            the compute node has totally locked up and would be
            inaccessible.</para>
        <para>After a few minutes of troubleshooting, I saw the
            following details:</para>
        <itemizedlist>
            <listitem>
                <para>A user recently tried launching a CentOS
                    instance on that node</para>
            </listitem>
            <listitem>
                <para>This user was the only user on the node (new
                    node)</para>
            </listitem>
            <listitem>
                <para>The load shot up to 8 right before I received
                    the alert</para>
            </listitem>
            <listitem>
                <para>The bonded 10gb network device (bond0) was in a
                    DOWN state</para>
            </listitem>
            <listitem>
                <para>The 1gb NIC was still alive and active</para>
            </listitem>
        </itemizedlist>
        <para>I looked at the status of both NICs in the bonded pair
            and saw that neither was able to communicate with the
            switch port. Seeing as how each NIC in the bond is
            connected to a separate switch, I thought that the chance
            of a switch port dying on each switch at the same time was
            quite improbable. I concluded that the 10gb dual-port NIC
            had died and needed to be replaced. I created a ticket for the
            hardware support department at the data center where the
            node was hosted. I felt lucky that this was a new node and
            no one else was hosted on it yet.</para>
        <para>An hour later I received the same alert, but for another
            compute node. Crap. OK, now there's definitely a problem
            going on. Just like the original node, I was able to log
            in by SSH. The bond0 NIC was DOWN but the 1gb NIC was
            active.</para>
        <para>And the best part: the same user had just tried creating
            a CentOS instance. What?</para>
        <para>I was totally confused at this point, so I texted our
            network admin to see if he was available to help. He
            logged in to both switches and immediately saw the
            problem: the switches detected spanning tree packets
            coming from the two compute nodes and immediately shut the
            ports down to prevent spanning tree loops:
            <screen><computeroutput>Feb 15 01:40:18 SW-1 Stp: %SPANTREE-4-BLOCK_BPDUGUARD: Received BPDU packet on Port-Channel35 with BPDU guard enabled. Disabling interface. (source mac fa:16:3e:24:e7:22)
Feb 15 01:40:18 SW-1 Ebra: %ETH-4-ERRDISABLE: bpduguard error detected on Port-Channel35.
Feb 15 01:40:18 SW-1 Mlag: %MLAG-4-INTF_INACTIVE_LOCAL: Local interface Port-Channel35 is link down. MLAG 35 is inactive.
Feb 15 01:40:18 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-Channel35 (Server35), changed state to down
Feb 15 01:40:19 SW-1 Stp: %SPANTREE-6-INTERFACE_DEL: Interface Port-Channel35 has been removed from instance MST0
Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet35 (Server35), changed state to down</computeroutput></screen>
        </para>
        <para>He re-enabled the switch ports and the two compute nodes
            immediately came back to life.</para>
        <para>Unfortunately, this story has an open ending... we're
            still looking into why the CentOS image was sending out
            spanning tree packets. Further, we're researching a proper
            way to mitigate this. It's a bigger
            issue than one might think. While it's extremely important
            for switches to prevent spanning tree loops, it's very
            problematic to have an entire compute node be cut off from the
            network when this happens. If a compute node is hosting
            100 instances and one of them sends a spanning tree
            packet, that instance has effectively DDOS'd the other 99
            instances.</para>
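        <para>One possible mitigation, offered here only as a sketch rather
            than a settled answer, is to drop BPDU frames coming from
            instances at the compute node's bridge before they ever reach the
            physical switch, for example with ebtables:</para>
        <screen><prompt>#</prompt> <userinput>ebtables -A FORWARD -d 01:80:c2:00:00:00 -j DROP</userinput></screen>
        <para>The address <literal>01:80:c2:00:00:00</literal> is the
            destination MAC that spanning tree BPDUs are sent to, so this rule
            silently discards them as they are bridged.</para>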
        <para>This is an ongoing and hot topic in networking
            circles—especially with the rise of virtualization and virtual
            switches.</para>
    </section>

    <section xml:id="rabbithole">
        <title>Down the Rabbit Hole</title>
        <para>Users being able to retrieve console logs from running
            instances is a boon for support—many times they can
            figure out what's going on inside their instance and fix
            it without bothering you. Unfortunately,
            sometimes overzealous logging of failures can cause
            problems of its own.</para>
        <para>A report came in: VMs were launching slowly, or not at
            all. Cue the standard checks—nothing on Nagios, but
            there was a spike in network traffic towards the current master of
            our RabbitMQ cluster. Investigation started, but soon the
            other parts of the queue cluster were leaking memory like
            a sieve. Then the alert came in—the master Rabbit server
            went down and connections failed over to the slave.</para>
        <para>At that time, our control services were hosted by
            another team and we didn't have much debugging information
            to determine what was going on with the master, and we
            could not reboot it. That team noted that it failed without
            alert, but managed to reboot it. After an hour, the
            cluster had returned to its normal state and we went home
            for the day.</para>
        <para>Continuing the diagnosis the next morning was kick-started
            by another identical failure. We quickly got the
            message queue running again, and tried to work out why
            Rabbit was suffering from so much network traffic.
            Enabling debug logging on
            <systemitem class="service">nova-api</systemitem> quickly brought
            understanding. A <command>tail -f
            /var/log/nova/nova-api.log</command> was scrolling by
            faster than we'd ever seen before. CTRL+C on that and we
            could plainly see the contents of a system log spewing
            failures over and over again: a system log from one of
            our users' instances.</para>
        <para>After finding the instance ID, we headed over to
            <filename>/var/lib/nova/instances</filename> to find the
            <filename>console.log</filename>:
            <screen><computeroutput>adm@cc12:/var/lib/nova/instances/instance-00000e05# wc -l console.log
92890453 console.log
adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
5.5G console.log</computeroutput></screen></para>
        <para>Sure enough, the user had been periodically refreshing
            the console log page on the dashboard and the 5.5 GB file was
            traversing the Rabbit cluster to get to the
            dashboard.</para>
        <para>We called them and asked them to stop for a while, and
            they were happy to abandon the horribly broken VM. After
            that, we started monitoring the size of console
            logs.</para>
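        <para>The monitoring can be as simple as a periodic job that flags
            oversized console logs. A minimal sketch; the threshold is
            arbitrary:</para>
        <screen><prompt>$</prompt> <userinput>find /var/lib/nova/instances -name console.log -size +100M</userinput></screen>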
        <para>To this day, <link
            xlink:href="https://bugs.launchpad.net/nova/+bug/832507"
            >the issue</link>
            (https://bugs.launchpad.net/nova/+bug/832507) doesn't have
            a permanent resolution, but we look forward to the
            discussion at the next summit.</para>
    </section>

    <section xml:id="haunted">
        <title>Havana Haunted by the Dead</title>
        <para>Felix Lee of Academia Sinica Grid Computing Centre in Taiwan
            contributed this story.</para>
        <para>I just upgraded OpenStack from Grizzly to Havana 2013.2-2 using
            the RDO repository and everything was running pretty
            well—except the EC2 API.</para>
        <para>I noticed that the API would suffer from a heavy load and
            respond slowly to particular EC2 requests such as
            <literal>RunInstances</literal>.</para>
        <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
            Havana:</para>
        <screen><computeroutput>2014-01-10 09:11:45.072 129745 INFO nova.ec2.wsgi.server
[req-84d16d16-3808-426b-b7af-3b90a11b83b0
0c6e7dba03c24c6a9bce299747499e8a 7052bd6714e7460caeb16242e68124f9]
117.103.103.29 "GET
/services/Cloud?AWSAccessKeyId=[something]&amp;Action=RunInstances&amp;ClientToken=[something]&amp;ImageId=ami-00000001&amp;InstanceInitiatedShutdownBehavior=terminate...
HTTP/1.1" status: 200 len: 1109 time: 138.5970151
</computeroutput></screen>
        <para>This request took over two minutes to process, but executed
            quickly on another co-existing Grizzly deployment using the same
            hardware and system configuration.</para>
        <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
            Grizzly:</para>
        <screen><computeroutput>2014-01-08 11:15:15.704 INFO nova.ec2.wsgi.server
[req-ccac9790-3357-4aa8-84bd-cdaab1aa394e
ebbd729575cb404081a45c9ada0849b7 8175953c209044358ab5e0ec19d52c37]
117.103.103.29 "GET
/services/Cloud?AWSAccessKeyId=[something]&amp;Action=RunInstances&amp;ClientToken=[something]&amp;ImageId=ami-00000007&amp;InstanceInitiatedShutdownBehavior=terminate...
HTTP/1.1" status: 200 len: 931 time: 3.9426181
</computeroutput></screen>
        <para>While monitoring system resources, I noticed
            a significant increase in memory consumption while the EC2 API
            processed this request. I thought it wasn't handling memory
            properly—possibly not releasing memory. If the API received
            several of these requests, memory consumption quickly grew until
            the system ran out of RAM and began using swap. Each node has
            48 GB of RAM and the <systemitem class="service">nova-api</systemitem>
            process would consume all of it
            within minutes. Once this happened, the entire system would become
            unusably slow until I restarted the
            <systemitem class="service">nova-api</systemitem> service.</para>
        <para>So, I found myself wondering what changed in the EC2 API on
            Havana that might cause this to happen. Was it a bug or normal
            behavior that I now need to work around?</para>
        <para>After digging into the nova (OpenStack Compute) code, I noticed two areas in
            <filename>api/ec2/cloud.py</filename> potentially impacting my
            system:</para>
        <programlisting language="python">
instances = self.compute_api.get_all(context,
                                     search_opts=search_opts,
                                     sort_dir='asc')

sys_metas = self.compute_api.get_all_system_metadata(
    context, search_filts=[{'key': ['EC2_client_token']},
                           {'value': [client_token]}])
</programlisting>
        <para>Since my database contained many records—over 1 million
            metadata records and over 300,000 instance records in "deleted"
            or "errored" states—each search took a long time. I decided to clean
            up the database by first archiving a copy for backup and then
            performing some deletions using the MySQL client. For example, I
            ran the following SQL command to remove rows of instances deleted
            for over a year:</para>
        <screen><prompt>mysql></prompt> <userinput>delete from nova.instances where deleted=1 and terminated_at &lt; (NOW() - INTERVAL 1 YEAR);</userinput></screen>
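        <para>The archiving step mentioned above can be as simple as dumping
            the database before any deletions, for example:</para>
        <screen><prompt>$</prompt> <userinput>mysqldump -u root -p nova > nova-backup.sql</userinput></screen>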
        <para>Performance increased greatly after deleting the old records and
            my new deployment continues to behave well.</para>
    </section>
</appendix>