Make some changes to the app_crypt XML file

Changed nagios to Nagios, reworded a few sentences, and capitalized Rabbit for
consistency. Removed "WTF"; the tone should stay professional.

Change-Id: I4af3c8790e842961a1d44231151410ea02248484
commit 4c4f0f7191
parent 367b8c3c15
@@ -32,13 +32,13 @@
 <para>At the data center, I was finishing up some tasks and remembered
 the lock-up. I logged into the new instance and ran <command>ps
 aux</command> again. It worked. Phew. I decided to run it one
-more time. It locked up. WTF.</para>
+more time. It locked up.</para>
 <para>After reproducing the problem several times, I came to
 the unfortunate conclusion that this cloud did indeed have
 a problem. Even worse, my time was up in Kelowna and I had
 to return back to Calgary.</para>
 <para>Where do you even begin troubleshooting something like
-this? An instance just randomly locks when a command is
+this? An instance that just randomly locks up when a command is
 issued. Is it the image? Nope—it happens on all images.
 Is it the compute node? Nope—all nodes. Is the instance
 locked up? No! New SSH connections work just fine!</para>
@@ -126,10 +126,10 @@
 is also attached to bond0.</para>
 <para>By mistake, I configured OpenStack to attach all tenant
 VLANs to vlan20 instead of bond0 thereby stacking one VLAN
-on top of another which then added an extra 4 bytes to
-each packet which cause a packet of 1504 bytes to be sent
+on top of another. This added an extra 4 bytes to
+each packet and caused a packet of 1504 bytes to be sent
 out which would cause problems when it arrived at an
-interface that only accepted 1500!</para>
+interface that only accepted 1500.</para>
 <para>As soon as this setting was fixed, everything
 worked.</para>
 </section>
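A minimal sketch of how the 1500-versus-1504 mismatch described in this hunk could be confirmed from a host. It reuses the bond0 and vlan20 names from the text; the instance address is a hypothetical placeholder and these are illustrative commands, not ones taken from the guide:

    # Each stacked 802.1Q tag adds 4 bytes, so frames sized for a 1500-byte
    # MTU arrive as 1504 bytes at an interface that only accepts 1500.
    $ ip -d link show bond0 | grep -o 'mtu [0-9]*'     # MTU of the parent bond
    $ ip -d link show vlan20 | grep -o 'mtu [0-9]*'    # MTU of the stacked VLAN
    # Probe with "do not fragment" pings: 1472 bytes of payload plus 28 bytes
    # of ICMP/IP headers is exactly 1500 and should pass; 1476 gives 1504 and
    # should fail anywhere the path is limited to 1500.
    $ ping -M do -s 1472 -c 3 192.0.2.10
    $ ping -M do -s 1476 -c 3 192.0.2.10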
@@ -455,16 +455,16 @@ Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ether
 sometimes overzealous logging of failures can cause
 problems of its own.</para>
 <para>A report came in: VMs were launching slowly, or not at
-all. Cue the standard checks—nothing on the nagios, but
+all. Cue the standard checks—nothing on the Nagios, but
 there was a spike in network towards the current master of
 our RabbitMQ cluster. Investigation started, but soon the
 other parts of the queue cluster were leaking memory like
-a sieve. Then the alert came in—the master rabbit server
-went down. Connections failed over to the slave.</para>
+a sieve. Then the alert came in—the master Rabbit server
+went down and connections failed over to the slave.</para>
 <para>At that time, our control services were hosted by
 another team and we didn't have much debugging information
-to determine what was going on with the master, and
-couldn't reboot it. That team noted that it failed without
+to determine what was going on with the master, and we
+could not reboot it. That team noted that it failed without
 alert, but managed to reboot it. After an hour, the
 cluster had returned to its normal state and we went home
 for the day.</para>
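For the queue-cluster symptoms in this hunk, a hedged sketch of the kind of checks one might run against RabbitMQ; these are standard rabbitmqctl calls, and nothing in the text says these exact checks were used:

    # Show which nodes are in the cluster and which are currently running.
    $ rabbitmqctl cluster_status
    # List queues with their message backlog and memory use, to spot a leak.
    $ rabbitmqctl list_queues name messages memory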
@@ -490,7 +490,7 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
 5.5G console.log</computeroutput></screen></para>
 <para>Sure enough, the user had been periodically refreshing
 the console log page on the dashboard and the 5G file was
-traversing the rabbit cluster to get to the
+traversing the Rabbit cluster to get to the
 dashboard.</para>
 <para>We called them and asked them to stop for a while, and
 they were happy to abandon the horribly broken VM. After
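As a companion to the ls -sh check shown in this hunk, a hypothetical one-liner for finding other oversized console logs under the same nova instances path; the 1 GiB threshold is an arbitrary example:

    # List console.log files larger than 1 GiB, biggest first.
    $ find /var/lib/nova/instances -name console.log -size +1G -exec ls -sh {} \; | sort -rh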
@@ -561,7 +561,7 @@ HTTP/1.1" status: 200 len: 931 time: 3.9426181
 </programlisting>
 <para>Since my database contained many records—over 1 million
 metadata records and over 300,000 instance records in "deleted"
-or "errored" states—each search took ages. I decided to clean
+or "errored" states—each search took a long time. I decided to clean
 up the database by first archiving a copy for backup and then
 performing some deletions using the MySQL client. For example, I
 ran the following SQL command to remove rows of instances deleted
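The excerpt ends before the SQL statement itself appears. Purely as a hypothetical sketch of the kind of cleanup the paragraph describes, with database, table, and column names assumed from a stock nova schema rather than taken from the original text:

    # Archive a copy first, then remove long-soft-deleted instance rows with
    # the MySQL client. Verify table and column names against your own schema
    # before running anything like this.
    $ mysqldump nova > nova-backup-$(date +%F).sql
    $ mysql -e "DELETE FROM nova.instances WHERE deleted = 1 AND deleted_at < (NOW() - INTERVAL 1 YEAR);"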