made some changes to app_crypt xml file

modified nagios to Nagios
changed a few sentences format
capitalized Rabbit for consistency
removed WTF - should be professional

Change-Id: I4af3c8790e842961a1d44231151410ea02248484
Shilla Saebi 2015-09-24 18:56:10 -04:00
parent 367b8c3c15
commit 4c4f0f7191

@@ -32,13 +32,13 @@
<para>At the data center, I was finishing up some tasks and remembered
the lock-up. I logged into the new instance and ran <command>ps
aux</command> again. It worked. Phew. I decided to run it one
-more time. It locked up. WTF.</para>
+more time. It locked up.</para>
<para>After reproducing the problem several times, I came to
the unfortunate conclusion that this cloud did indeed have
a problem. Even worse, my time was up in Kelowna and I had
to return back to Calgary.</para>
<para>Where do you even begin troubleshooting something like
-this? An instance just randomly locks when a command is
+this? An instance that just randomly locks up when a command is
issued. Is it the image? Nope&mdash;it happens on all images.
Is it the compute node? Nope&mdash;all nodes. Is the instance
locked up? No! New SSH connections work just fine!</para>
@@ -126,10 +126,10 @@
is also attached to bond0.</para>
<para>By mistake, I configured OpenStack to attach all tenant
VLANs to vlan20 instead of bond0 thereby stacking one VLAN
-on top of another which then added an extra 4 bytes to
-each packet which cause a packet of 1504 bytes to be sent
+on top of another. This added an extra 4 bytes to
+each packet and caused a packet of 1504 bytes to be sent
out which would cause problems when it arrived at an
-interface that only accepted 1500!</para>
+interface that only accepted 1500.</para>
<para>As soon as this setting was fixed, everything
worked.</para>
</section>
@@ -455,16 +455,16 @@ Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ether
sometimes overzealous logging of failures can cause
problems of its own.</para>
<para>A report came in: VMs were launching slowly, or not at
-all. Cue the standard checks&mdash;nothing on the nagios, but
+all. Cue the standard checks&mdash;nothing on the Nagios, but
there was a spike in network towards the current master of
our RabbitMQ cluster. Investigation started, but soon the
other parts of the queue cluster were leaking memory like
-a sieve. Then the alert came in&mdash;the master rabbit server
-went down. Connections failed over to the slave.</para>
+a sieve. Then the alert came in&mdash;the master Rabbit server
+went down and connections failed over to the slave.</para>
<para>At that time, our control services were hosted by
another team and we didn't have much debugging information
-to determine what was going on with the master, and
-couldn't reboot it. That team noted that it failed without
+to determine what was going on with the master, and we
+could not reboot it. That team noted that it failed without
alert, but managed to reboot it. After an hour, the
cluster had returned to its normal state and we went home
for the day.</para>
@@ -490,7 +490,7 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
5.5G console.log</computeroutput></screen></para>
<para>Sure enough, the user had been periodically refreshing
the console log page on the dashboard and the 5G file was
-traversing the rabbit cluster to get to the
+traversing the Rabbit cluster to get to the
dashboard.</para>
<para>We called them and asked them to stop for a while, and
they were happy to abandon the horribly broken VM. After
@@ -561,7 +561,7 @@ HTTP/1.1" status: 200 len: 931 time: 3.9426181
</programlisting>
<para>Since my database contained many records&mdash;over 1 million
metadata records and over 300,000 instance records in "deleted"
or "errored" states&mdash;each search took ages. I decided to clean
or "errored" states&mdash;each search took a long time. I decided to clean
up the database by first archiving a copy for backup and then
performing some deletions using the MySQL client. For example, I
ran the following SQL command to remove rows of instances deleted