<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE appendix [
<!-- Some useful entities borrowed from HTML -->
<!ENTITY ndash "–">
<!ENTITY mdash "—">
<!ENTITY hellip "…">
<!ENTITY plusmn "±">
]>
<appendix xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
  xml:id="app_crypt" label="B">
  <title>Tales From the Cryp^H^H^H^H Cloud</title>

  <para>Herein lies a selection of tales from OpenStack cloud operators. Read,
    and learn from their wisdom.</para>

<section xml:id="double_vlan">
|
||
<title>Double VLAN</title>
|
||
<para>I was on-site in Kelowna, British Columbia, Canada
|
||
setting up a new OpenStack cloud. The deployment was fully
|
||
automated: Cobbler deployed the OS on the bare metal,
|
||
bootstrapped it, and Puppet took over from there. I had
|
||
run the deployment scenario so many times in practice and
|
||
took for granted that everything was working.</para>
|
||
    <para>On my last day in Kelowna, I was on a conference call
      from my hotel. In the background, I was fooling around on
      the new cloud. I launched an instance and logged in.
      Everything looked fine. Out of boredom, I ran
      <command>ps aux</command> and
      all of a sudden the instance locked up.</para>
    <para>Thinking it was just a one-off issue, I terminated the
      instance and launched a new one. By then, the conference
      call had ended and I was off to the data center.</para>
    <para>At the data center, I was finishing up some tasks and
      remembered the lock-up. I logged into the new instance and
      ran <command>ps aux</command> again. It worked. Phew. I decided
      to run it one
      more time. It locked up. WTF.</para>
    <para>After reproducing the problem several times, I came to
      the unfortunate conclusion that this cloud did indeed have
      a problem. Even worse, my time was up in Kelowna and I had
      to return to Calgary.</para>
    <para>Where do you even begin troubleshooting something like
      this? An instance just randomly locks when a command is
      issued. Is it the image? Nope — it happens on all images.
      Is it the compute node? Nope — all nodes. Is the instance
      locked up? No! New SSH connections work just fine!</para>
    <para>We reached out for help. A networking engineer suggested
      it was an MTU issue. Great! MTU! Something to go on!
      What's MTU and why would it cause a problem?</para>
    <para>MTU is the maximum transmission unit. It specifies the
      maximum number of bytes that the interface accepts for
      each packet. If two interfaces have different MTUs,
      bytes might get chopped off and weird things happen --
      such as random session lockups.</para>
    <note>
      <para>Not all packets have a size of 1500. Running the <command>ls</command>
        command over SSH might only create a single packet of
        less than 1500 bytes. However, running a command with
        heavy output, such as <command>ps aux</command>,
        requires several packets of 1500 bytes.</para>
    </note>
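    <para>A quick way to sanity-check the MTU of a path is to ping with
      fragmentation prohibited and a payload sized so the packet comes
      to exactly 1500 bytes (1472 bytes of payload plus 8 bytes of ICMP
      and 20 bytes of IP headers). This is only a sketch with a
      placeholder address, assuming the Linux iputils
      <command>ping</command>:</para>
    <screen><prompt>$</prompt> <userinput>ping -M do -s 1472 10.1.1.1</userinput>
<prompt>$</prompt> <userinput>ping -M do -s 1473 10.1.1.1</userinput></screen>
    <para>If the second command complains (for example, "message too
      long") while the first succeeds, the path MTU really is 1500 and
      anything bigger is being dropped or fragmented along the
      way.</para>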
    <para>OK, so where is the MTU issue coming from? Why haven't
      we seen this in any other deployment? What's new in this
      situation? Well, new data center, new uplink, new
      switches, new model of switches, new servers, first time
      using this model of servers… so, basically everything was
      new. Wonderful. We toyed around with raising the MTU in
      various places: the switches, the NICs on the compute
      nodes, the virtual NICs in the instances; we even had the
      data center raise the MTU for our uplink interface. Some
      changes worked, some didn't. This line of troubleshooting
      didn't feel right, though. We shouldn't have to be
      changing the MTU in these areas.</para>
    <para>As a last resort, our network admin (Alvaro) and I
      sat down with four terminal windows, a pencil, and a piece
      of paper. In one window, we ran ping. In the second
      window, we ran <command>tcpdump</command> on the cloud
      controller. In the third, <command>tcpdump</command> on
      the compute node. And the fourth had <command>tcpdump</command>
      on the instance. For background, this cloud was a
      multi-node, non-multi-host setup.</para>
    <para>One cloud controller acted as a gateway to all compute
      nodes. VlanManager was used for the network config. This
      means that the cloud controller and all compute nodes had
      a different VLAN for each OpenStack project. We used the
      -s option of <command>ping</command> to change the packet size.
      We watched as sometimes packets would fully return,
      sometimes they'd only make it out and never back in, and
      sometimes the packets would stop at a random point. We
      changed <command>tcpdump</command> to start displaying the
      hex dump of the packet. We pinged between every
      combination of outside, controller, compute, and
      instance.</para>
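    <para>The commands themselves were nothing fancy; roughly
      something like the following, with placeholder addresses and
      interface names (the <literal>-XX</literal> flag makes
      <command>tcpdump</command> print each packet in hex and
      ASCII):</para>
    <screen><prompt>$</prompt> <userinput>ping -s 1400 10.1.1.20</userinput>
<prompt>$</prompt> <userinput>tcpdump -n -i bond0 -XX icmp</userinput></screen>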
    <para>Finally, Alvaro noticed something. When a packet from
      the outside hits the cloud controller, it should not be
      configured with a VLAN. We verified this as true. When the
      packet went from the cloud controller to the compute node,
      it should only have a VLAN if it was destined for an
      instance. This was still true. When the ping reply was
      sent from the instance, it should be in a VLAN. True. When
      it came back to the cloud controller and on its way out to
      the public internet, it should no longer have a VLAN.
      False. Uh oh. It looked as though the VLAN part of the
      packet was not being removed.</para>
    <para>That made no sense.</para>
    <para>While bouncing this idea around in our heads, I was
      randomly typing commands on the compute node:
      <screen><userinput><prompt>$</prompt> ip a</userinput>
<computeroutput>…
10: vlan100@vlan20: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue master br100 state UP
…</computeroutput></screen>
    </para>
<para>"Hey Alvaro, can you run a VLAN on top of a
|
||
VLAN?"</para>
|
||
<para>"If you did, you'd add an extra 4 bytes to the
|
||
packet…"</para>
|
||
<para>Then it all made sense…
|
||
<screen><userinput><prompt>$</prompt> grep vlan_interface /etc/nova/nova.conf</userinput>
|
||
<computeroutput>vlan_interface=vlan20</computeroutput></screen>
|
||
</para>
|
||
    <para>In <filename>nova.conf</filename>, <code>vlan_interface</code>
      specifies what interface OpenStack should attach all VLANs
      to. The correct setting should have been
      <programlisting>vlan_interface=bond0</programlisting>
      as that is the server's bonded NIC.</para>
    <para>vlan20 is the VLAN that the data center gave us for
      outgoing public internet access. It's a valid VLAN and it
      is also attached to bond0.</para>
    <para>By mistake, I configured OpenStack to attach all tenant
      VLANs to vlan20 instead of bond0, thereby stacking one VLAN
      on top of another. This added an extra 4 bytes to each
      packet, which caused packets of 1504 bytes to be sent out,
      which caused problems when they arrived at an interface
      that only accepted 1500!</para>
    <para>As soon as this setting was fixed, everything
      worked.</para>
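    <para>Had we thought to look, the stacking was visible all along:
      on Linux, <command>ip -d link show</command> prints a VLAN
      interface's parent device after the <literal>@</literal> along
      with the 802.1Q tag details. After the fix, output along these
      lines is what you want to see (device names match this story;
      the rest of the output is illustrative):</para>
    <screen><prompt>$</prompt> <userinput>ip -d link show vlan100</userinput>
<computeroutput>10: vlan100@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue master br100 state UP
    vlan protocol 802.1Q id 100 &lt;REORDER_HDR&gt;</computeroutput></screen>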
  </section>
  <section xml:id="issue">
    <title>"The Issue"</title>
    <para>At the end of August 2012, a post-secondary school in
      Alberta, Canada migrated its infrastructure to an
      OpenStack cloud. As luck would have it, within the first
      day or two of it running, one of their servers just
      disappeared from the network. Blip. Gone.</para>
    <para>After restarting the instance, everything was back up
      and running. We reviewed the logs and saw that at some
      point, network communication stopped and then everything
      went idle. We chalked this up to a random
      occurrence.</para>
    <para>A few nights later, it happened again.</para>
    <para>We reviewed both sets of logs. The one thing that stood
      out the most was DHCP. At the time, OpenStack, by default,
      set DHCP leases for one minute (it's now two minutes).
      This means that every instance
      contacts the cloud controller (DHCP server) to renew its
      fixed IP. For some reason, this instance could not renew
      its IP. We correlated the instance's logs with the logs on
      the cloud controller and put together a
      conversation:</para>
    <orderedlist>
      <listitem>
        <para>Instance tries to renew IP.</para>
      </listitem>
      <listitem>
        <para>Cloud controller receives the renewal request
          and sends a response.</para>
      </listitem>
      <listitem>
        <para>Instance "ignores" the response and re-sends the
          renewal request.</para>
      </listitem>
      <listitem>
        <para>Cloud controller receives the second request and
          sends a new response.</para>
      </listitem>
      <listitem>
        <para>Instance begins sending a renewal request to
          <code>255.255.255.255</code> since it hasn't
          heard back from the cloud controller.</para>
      </listitem>
      <listitem>
        <para>The cloud controller receives the
          <code>255.255.255.255</code> request and sends
          a third response.</para>
      </listitem>
      <listitem>
        <para>The instance finally gives up.</para>
      </listitem>
    </orderedlist>
    <para>With this information in hand, we were sure that the
      problem had to do with DHCP. We thought that for some
      reason, the instance wasn't getting a new IP address and
      with no IP, it shut itself off from the network.</para>
    <para>A quick Google search turned up this: <link
      xlink:href="https://lists.launchpad.net/openstack/msg11696.html"
      >DHCP lease errors in VLAN mode</link>
      (https://lists.launchpad.net/openstack/msg11696.html)
      which further supported our DHCP theory.</para>
    <para>An initial idea was to just increase the lease time. If
      the instance only renewed once every week, the chances of
      this problem happening would be tremendously smaller than
      if it renewed every minute. This didn't solve the problem,
      though. It was just covering the problem up.</para>
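    <para>For reference, the knob we were reaching for here was the
      lease lifetime in <filename>nova.conf</filename>; if memory
      serves it is the <code>dhcp_lease_time</code> option, expressed
      in seconds (check the option name and default against your
      release before relying on it):</para>
    <programlisting>dhcp_lease_time=604800</programlisting>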
    <para>We decided to have <command>tcpdump</command> run on this
      instance and see if we could catch it in action again.
      Sure enough, we did.</para>
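    <para>Catching something that only happens every few days means
      leaving a capture running without letting it fill the disk. A
      ring-buffer capture along these lines is one way to do it (the
      file path and sizes here are arbitrary choices, not necessarily
      what we used):</para>
    <screen><prompt>$</prompt> <userinput>tcpdump -n -i eth0 -w /var/tmp/the-issue.pcap -C 100 -W 10</userinput></screen>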
    <para>The <command>tcpdump</command> looked very, very weird. In
      short, it looked as though network communication stopped
      before the instance tried to renew its IP. Since there is
      so much DHCP chatter from a one-minute lease, it's very
      hard to confirm it, but even with only milliseconds
      difference between packets, if one packet arrives first,
      it arrived first, and if that packet reported network
      issues, then it had to have happened before DHCP.</para>
    <para>Additionally, this instance in question was responsible
      for a very, very large backup job each night. While "The
      Issue" (as we were now calling it) didn't happen exactly
      when the backup happened, it was close enough (a few
      hours) that we couldn't ignore it.</para>
    <para>Further days go by and we catch The Issue in action more
      and more. We find that dhclient is not running after The
      Issue happens. Now we're back to thinking it's a DHCP
      issue. Running <filename>/etc/init.d/networking</filename> restart
      brings everything back up and running.</para>
    <para>Ever have one of those days where all of a sudden you
      get the Google results you were looking for? Well, that's
      what happened here. I was looking for information on
      dhclient and why it dies when it can't renew its lease and
      all of a sudden I found a bunch of OpenStack and dnsmasq
      discussions that were identical to the problem we were
      seeing!</para>
    <para>
      <link
        xlink:href="http://www.gossamer-threads.com/lists/openstack/operators/18197"
        >Problem with Heavy Network IO and Dnsmasq</link>
      (http://www.gossamer-threads.com/lists/openstack/operators/18197)
    </para>
    <para>
      <link
        xlink:href="http://www.gossamer-threads.com/lists/openstack/dev/14696"
        >instances losing IP address while running, due to No
        DHCPOFFER</link>
      (http://www.gossamer-threads.com/lists/openstack/dev/14696)</para>
    <para>Seriously, Google.</para>
    <para>This bug report was the key to everything:
      <link
        xlink:href="https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978"
        >KVM images lose connectivity with bridged
        network</link>
      (https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978)</para>
    <para>It was funny to read the report. It was full of people
      who had some strange network problem but didn't quite
      explain it in the same way.</para>
    <para>So it was a qemu/kvm bug.</para>
    <para>At the same time as finding the bug report, a co-worker
      was able to successfully reproduce The Issue! How? He used
      <command>iperf</command> to spew a ton of bandwidth at an instance. Within 30
      minutes, the instance just disappeared from the
      network.</para>
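    <para>The reproduction recipe is about as simple as network load
      testing gets; roughly the following (the address is a
      placeholder, and the stream count and duration are illustrative
      rather than exactly what my co-worker used):</para>
    <screen><prompt>$</prompt> <userinput>iperf -s                          # inside the instance</userinput>
<prompt>$</prompt> <userinput>iperf -c 10.1.1.30 -P 8 -t 1800   # from another machine: 8 parallel streams, 30 minutes</userinput></screen>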
    <para>Armed with a patched qemu and a way to reproduce, we set
      out to see if we'd finally solved The Issue. After 48
      hours straight of hammering the instance with bandwidth,
      we were confident. The rest is history. You can search the
      bug report for "joe" to find my comments and actual
      tests.</para>
  </section>
<section xml:id="disappear_images">
|
||
<title>Disappearing Images</title>
|
||
<para>At the end of 2012, Cybera (a nonprofit with a mandate
|
||
to oversee the development of cyberinfrastructure in
|
||
Alberta, Canada) deployed an updated OpenStack cloud for
|
||
their <link xlink:title="DAIR project"
|
||
xlink:href="http://www.canarie.ca/en/dair-program/about"
|
||
>DAIR project</link>
|
||
(http://www.canarie.ca/en/dair-program/about). A few days
|
||
into production, a compute node locks up. Upon rebooting
|
||
the node, I checked to see what instances were hosted on
|
||
that node so I could boot them on behalf of the customer.
|
||
Luckily, only one instance.</para>
|
||
    <para>The <command>nova reboot</command> command wasn't working, so
      I used <command>virsh</command>, but it immediately came back
      with an error saying it was unable to find the backing
      disk. In this case, the backing disk is the Glance image
      that is copied to
      <filename>/var/lib/nova/instances/_base</filename> when the
      image is used for the first time. Why couldn't it find it?
      I checked the directory and sure enough it was
      gone.</para>
    <para>I reviewed the <code>nova</code> database and saw the
      instance's entry in the <code>nova.instances</code> table.
      The image that the instance was using matched what virsh
      was reporting, so no inconsistency there.</para>
    <para>I checked Glance and noticed that this image was a
      snapshot that the user created. At least that was good
      news — this user would have been the only user
      affected.</para>
    <para>Finally, I checked StackTach and reviewed the user's
      events. They had created and deleted several snapshots
      — most likely experimenting. Although the timestamps
      didn't match up, my conclusion was that they launched
      their instance and then deleted the snapshot and it was
      somehow removed from
      <filename>/var/lib/nova/instances/_base</filename>. None of
      that made sense, but it was the best I could come up
      with.</para>
    <para>It turns out the reason that this compute node locked up
      was a hardware issue. We removed it from the DAIR cloud
      and called Dell to have it serviced. Dell arrived and
      began working. Somehow or another (or a fat finger), a
      different compute node was bumped and rebooted.
      Great.</para>
    <para>When this node fully booted, I ran through the same
      scenario of seeing what instances were running so I could
      turn them back on. There were a total of four. Three
      booted and one gave an error. It was the same error as
      before: unable to find the backing disk. Seriously,
      what?</para>
    <para>Again, it turns out that the image was a snapshot. The
      three other instances that successfully started were
      standard cloud images. Was it a problem with snapshots?
      That didn't make sense.</para>
    <para>A note about DAIR's architecture:
      <filename>/var/lib/nova/instances</filename> is a shared NFS
      mount. This means that all compute nodes have access to
      it, which includes the <code>_base</code> directory.
      Another centralized area is <filename>/var/log/rsyslog</filename>
      on the cloud controller. This directory collects all
      OpenStack logs from all compute nodes. I wondered if there
      were any entries for the file that <command>virsh</command> was
      reporting:
      <screen><computeroutput>dair-ua-c03/nova.log:Dec 19 12:10:59 dair-ua-c03
2012-12-19 12:10:59 INFO nova.virt.libvirt.imagecache
[-] Removing base file:
/var/lib/nova/instances/_base/7b4783508212f5d242cbf9ff56fb8d33b4ce6166_10</computeroutput></screen>
    </para>
    <para>Ah-hah! So OpenStack was deleting it. But why?</para>
    <para>A feature was introduced in Essex to periodically check
      and see if there were any <code>_base</code> files not in use.
      If there were, Nova would delete them. This idea sounds innocent
      enough and has some good qualities to it. But how did this
      feature end up turned on? It was disabled by default in
      Essex. As it should be. It was <link
      xlink:href="https://bugs.launchpad.net/nova/+bug/1029674"
      >turned on by default in Folsom</link>
      (https://bugs.launchpad.net/nova/+bug/1029674). I cannot
      emphasize enough that:</para>
    <para>
      <emphasis>Actions which delete things should not be
        enabled by default.</emphasis>
    </para>
    <para>Disk space is cheap these days. Data recovery is
      not.</para>
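    <para>If you run a shared <code>_base</code> directory, it is worth
      double-checking how your release behaves. On Folsom/Grizzly-era
      code the relevant <filename>nova.conf</filename> options looked
      roughly like the following; verify the option names and defaults
      against your own release rather than taking this sketch at face
      value:</para>
    <programlisting># turn the periodic cleanup off entirely, or…
remove_unused_base_images=False
# …keep it on but only remove base files untouched for at least a day
remove_unused_original_minimum_age_seconds=86400</programlisting>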
    <para>Secondly, DAIR's shared
      <filename>/var/lib/nova/instances</filename> directory
      contributed to the problem. Since all compute nodes have
      access to this directory, all compute nodes periodically
      review the <code>_base</code> directory. If there is only one
      instance using an image, and the node that the instance is on is
      down for a few minutes, it won't be able to mark the image
      as still in use. Therefore, the image seems like it's not
      in use and is deleted. When the compute node comes back
      online, the instance hosted on that node is unable to
      start.</para>
  </section>
<section xml:id="valentines">
|
||
<title>The Valentine's Day Compute Node Massacre</title>
|
||
<para>Although the title of this story is much more dramatic
|
||
than the actual event, I don't think, or hope, that I'll
|
||
have the opportunity to use "Valentine's Day Massacre"
|
||
again in a title.</para>
|
||
<para>This past Valentine's Day, I received an alert that a
|
||
compute node was no longer available in the cloud
|
||
— meaning,
|
||
<screen><prompt>$</prompt><userinput>nova-manage service list</userinput></screen>
|
||
showed this particular node with a status of
|
||
<code>XXX</code>.</para>
|
||
    <para>I logged into the cloud controller and was able to both
      <command>ping</command> and SSH into the problematic compute node,
      which seemed very odd. Usually if I receive this type of alert,
      the compute node has totally locked up and would be
      inaccessible.</para>
    <para>After a few minutes of troubleshooting, I saw the
      following details:</para>
    <itemizedlist>
      <listitem>
        <para>A user recently tried launching a CentOS
          instance on that node</para>
      </listitem>
      <listitem>
        <para>This user was the only user on the node (new
          node)</para>
      </listitem>
      <listitem>
        <para>The load shot up to 8 right before I received
          the alert</para>
      </listitem>
      <listitem>
        <para>The bonded 10gb network device (bond0) was in a
          DOWN state</para>
      </listitem>
      <listitem>
        <para>The 1gb NIC was still alive and active</para>
      </listitem>
    </itemizedlist>
    <para>I looked at the status of both NICs in the bonded pair
      and saw that neither was able to communicate with the
      switch port. Seeing as how each NIC in the bond is
      connected to a separate switch, I thought that the chance
      of a switch port dying on each switch at the same time was
      quite improbable. I concluded that the 10gb dual port NIC
      had died and needed to be replaced. I created a ticket for the
      hardware support department at the data center where the
      node was hosted. I felt lucky that this was a new node and
      no one else was hosted on it yet.</para>
    <para>An hour later I received the same alert, but for another
      compute node. Crap. OK, now there's definitely a problem
      going on. Just like the original node, I was able to log
      in by SSH. The bond0 NIC was DOWN but the 1gb NIC was
      active.</para>
    <para>And the best part: the same user had just tried creating
      a CentOS instance. What?</para>
    <para>I was totally confused at this point, so I texted our
      network admin to see if he was available to help. He
      logged in to both switches and immediately saw the
      problem: the switches detected spanning tree packets
      coming from the two compute nodes and immediately shut the
      ports down to prevent spanning tree loops:
      <screen><computeroutput>Feb 15 01:40:18 SW-1 Stp: %SPANTREE-4-BLOCK_BPDUGUARD: Received BPDU packet on Port-Channel35 with BPDU guard enabled. Disabling interface. (source mac fa:16:3e:24:e7:22)
Feb 15 01:40:18 SW-1 Ebra: %ETH-4-ERRDISABLE: bpduguard error detected on Port-Channel35.
Feb 15 01:40:18 SW-1 Mlag: %MLAG-4-INTF_INACTIVE_LOCAL: Local interface Port-Channel35 is link down. MLAG 35 is inactive.
Feb 15 01:40:18 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Port-Channel35 (Server35), changed state to down
Feb 15 01:40:19 SW-1 Stp: %SPANTREE-6-INTERFACE_DEL: Interface Port-Channel35 has been removed from instance MST0
Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ethernet35 (Server35), changed state to down</computeroutput></screen>
    </para>
    <para>He re-enabled the switch ports and the two compute nodes
      immediately came back to life.</para>
    <para>Unfortunately, this story has an open ending... we're
      still looking into why the CentOS image was sending out
      spanning tree packets. Further, we're researching a proper
      way to keep this from happening. It's a bigger
      issue than one might think. While it's extremely important
      for switches to prevent spanning tree loops, it's very
      problematic to have an entire compute node be cut from the
      network when this happens. If a compute node is hosting
      100 instances and one of them sends a spanning tree
      packet, that instance has effectively DDOS'd the other 99
      instances.</para>
    <para>This is an ongoing and hot topic in networking circles
      — especially with the rise of virtualization and virtual
      switches.</para>
  </section>
  <section xml:id="rabbithole">
    <title>Down the Rabbit Hole</title>
    <para>Users being able to retrieve console logs from running
      instances is a boon for support — many times they can
      figure out what's going on inside their instance and fix
      it themselves without bothering you. Unfortunately,
      sometimes overzealous logging of failures can cause
      problems of its own.</para>
    <para>A report came in: VMs were launching slowly, or not at
      all. Cue the standard checks — nothing in Nagios, but
      there was a spike in network traffic towards the current
      master of our RabbitMQ cluster. Investigation started, but
      soon the other parts of the queue cluster were leaking
      memory like a sieve. Then the alert came in — the master
      rabbit server went down. Connections failed over to the
      slave.</para>
    <para>At that time, our control services were hosted by
      another team and we didn't have much debugging information
      to determine what was going on with the master, and
      couldn't reboot it. That team noted that it failed without
      alert, but managed to reboot it. After an hour, the
      cluster had returned to its normal state and we went home
      for the day.</para>
    <para>Continuing the diagnosis the next morning was kick-started
      by another identical failure. We quickly got the
      message queue running again, and tried to work out why
      Rabbit was suffering from so much network traffic.
      Enabling debug logging on
      <systemitem class="service">nova-api</systemitem> quickly brought
      understanding. A <command>tail -f
      /var/log/nova/nova-api.log</command> was scrolling by
      faster than we'd ever seen before. CTRL+C on that and we
      could plainly see the contents of a system log spewing
      failures over and over again - a system log from one of
      our users' instances.</para>
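    <para>For the record, turning on debug logging was nothing exotic;
      on a deployment of that era it amounted to roughly the following
      in <filename>nova.conf</filename>, followed by a restart of
      <systemitem class="service">nova-api</systemitem> (treat this as
      a sketch and check the logging options for your release):</para>
    <programlisting>debug=True
verbose=True</programlisting>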
    <para>After finding the instance ID we headed over to
      <filename>/var/lib/nova/instances</filename> to find the
      <filename>console.log</filename>:
      <screen><computeroutput>
adm@cc12:/var/lib/nova/instances/instance-00000e05# wc -l console.log
92890453 console.log
adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
5.5G console.log</computeroutput></screen></para>
    <para>Sure enough, the user had been periodically refreshing
      the console log page on the dashboard and the 5.5 GB file was
      traversing the rabbit cluster to get to the
      dashboard.</para>
    <para>We called them and asked them to stop for a while, and
      they were happy to abandon the horribly broken VM. After
      that, we started monitoring the size of console
      logs.</para>
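    <para>The monitoring itself can be as simple as a periodic sweep
      for oversized console logs; something like the following, with an
      arbitrary size threshold, is enough to feed an alert (a sketch,
      not the exact check we deployed):</para>
    <screen><prompt>$</prompt> <userinput>find /var/lib/nova/instances -name console.log -size +500M -exec ls -sh {} \;</userinput></screen>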
    <para>To this day, <link
      xlink:href="https://bugs.launchpad.net/nova/+bug/832507"
      >the issue</link>
      (https://bugs.launchpad.net/nova/+bug/832507) doesn't have
      a permanent resolution, but we look forward to the
      discussion at the next summit.</para>
  </section>
<section xml:id="haunted">
|
||
<title>Havana Haunted by the Dead</title>
|
||
<para>Felix Lee of Academia Sinica Grid Computing Centre in Taiwan
|
||
contributed this story.</para>
|
||
<para>I just upgraded OpenStack from Grizzly to Havana 2013.2-2 using
|
||
the RDO repository and everything was running pretty well
|
||
-- except the EC2 API.</para>
|
||
    <para>I noticed that the API would suffer from a heavy load and
      respond slowly to particular EC2 requests such as
      <literal>RunInstances</literal>.</para>
    <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
      Havana:</para>
    <screen><computeroutput>2014-01-10 09:11:45.072 129745 INFO nova.ec2.wsgi.server
[req-84d16d16-3808-426b-b7af-3b90a11b83b0
0c6e7dba03c24c6a9bce299747499e8a 7052bd6714e7460caeb16242e68124f9]
117.103.103.29 "GET
/services/Cloud?AWSAccessKeyId=[something]&amp;Action=RunInstances&amp;ClientToken=[something]&amp;ImageId=ami-00000001&amp;InstanceInitiatedShutdownBehavior=terminate...
HTTP/1.1" status: 200 len: 1109 time: 138.5970151
</computeroutput></screen>
    <para>This request took over two minutes to process, but executed
      quickly on another co-existing Grizzly deployment using the same
      hardware and system configuration.</para>
    <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
      Grizzly:</para>
    <screen><computeroutput>2014-01-08 11:15:15.704 INFO nova.ec2.wsgi.server
[req-ccac9790-3357-4aa8-84bd-cdaab1aa394e
ebbd729575cb404081a45c9ada0849b7 8175953c209044358ab5e0ec19d52c37]
117.103.103.29 "GET
/services/Cloud?AWSAccessKeyId=[something]&amp;Action=RunInstances&amp;ClientToken=[something]&amp;ImageId=ami-00000007&amp;InstanceInitiatedShutdownBehavior=terminate...
HTTP/1.1" status: 200 len: 931 time: 3.9426181
</computeroutput></screen>
    <para>While monitoring system resources, I noticed
      a significant increase in memory consumption while the EC2 API
      processed this request. I thought it wasn't handling memory
      properly -- possibly not releasing memory. If the API received
      several of these requests, memory consumption quickly grew until
      the system ran out of RAM and began using swap. Each node has
      48 GB of RAM and the "nova-api" process would consume all of it
      within minutes. Once this happened, the entire system would become
      unusably slow until I restarted the
      <systemitem class="service">nova-api</systemitem> service.</para>
    <para>So, I found myself wondering what changed in the EC2 API on
      Havana that might cause this to happen. Was it a bug or normal
      behavior that I now need to work around?</para>
    <para>After digging into the Nova code, I noticed two areas in
      <filename>api/ec2/cloud.py</filename> potentially impacting my
      system:</para>
<programlisting language="python">
|
||
instances = self.compute_api.get_all(context,
|
||
search_opts=search_opts,
|
||
sort_dir='asc')
|
||
|
||
sys_metas = self.compute_api.get_all_system_metadata(
|
||
context, search_filts=[{'key': ['EC2_client_token']},
|
||
{'value': [client_token]}])
|
||
</programlisting>
|
||
    <para>Since my database contained many records -- over 1 million
      metadata records and over 300,000 instance records in "deleted"
      or "errored" states -- each search took ages. I decided to clean
      up the database by first archiving a copy for backup and then
      performing some deletions using the MySQL client. For example, I
      ran the following SQL command to remove rows of instances deleted
      for over a year:</para>
    <screen><prompt>mysql></prompt> <userinput>delete from nova.instances where deleted=1 and terminated_at &lt; (NOW() - INTERVAL 1 YEAR);</userinput></screen>
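    <para>For the backup step before a deletion like the one above,
      dumping the affected tables is usually enough; a sketch along
      these lines (the file name is arbitrary, and the table list is
      trimmed to the two tables this story cares about -- check them
      against your schema):</para>
    <screen><prompt>$</prompt> <userinput>mysqldump nova instances instance_system_metadata > nova-ec2-cleanup-backup.sql</userinput></screen>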
    <para>Performance increased greatly after deleting the old records and
      my new deployment continues to behave well.</para>
  </section>
</appendix>
<!-- FIXME: Add more stories, especially more up-to-date ones -->
<!-- STATUS: OK for Havana -->