made some changes to app_crypt xml file
modified nagios to Nagios
changed a few sentences' format
capitalized Rabbit for consistency
removed WTF - should be professional

Change-Id: I4af3c8790e842961a1d44231151410ea02248484
This commit is contained in:
parent 367b8c3c15
commit 4c4f0f7191
@@ -32,13 +32,13 @@
 <para>At the data center, I was finishing up some tasks and remembered
 the lock-up. I logged into the new instance and ran <command>ps
 aux</command> again. It worked. Phew. I decided to run it one
-more time. It locked up. WTF.</para>
+more time. It locked up.</para>
 <para>After reproducing the problem several times, I came to
 the unfortunate conclusion that this cloud did indeed have
 a problem. Even worse, my time was up in Kelowna and I had
 to return back to Calgary.</para>
 <para>Where do you even begin troubleshooting something like
-this? An instance just randomly locks when a command is
+this? An instance that just randomly locks up when a command is
 issued. Is it the image? Nope—it happens on all images.
 Is it the compute node? Nope—all nodes. Is the instance
 locked up? No! New SSH connections work just fine!</para>
@@ -126,10 +126,10 @@
 is also attached to bond0.</para>
 <para>By mistake, I configured OpenStack to attach all tenant
 VLANs to vlan20 instead of bond0 thereby stacking one VLAN
-on top of another which then added an extra 4 bytes to
-each packet which cause a packet of 1504 bytes to be sent
+on top of another. This added an extra 4 bytes to
+each packet and caused a packet of 1504 bytes to be sent
 out which would cause problems when it arrived at an
-interface that only accepted 1500!</para>
+interface that only accepted 1500.</para>
 <para>As soon as this setting was fixed, everything
 worked.</para>
 </section>
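An aside on the 4-byte figure in the hunk above: each 802.1Q VLAN tag adds 4 bytes of header, so accidentally stacking a tenant VLAN on top of vlan20 turns a full 1500-byte packet into a 1504-byte one, which an interface with a 1500-byte MTU rejects. A minimal sketch of that arithmetic, illustrative only and not part of the committed XML:

# Illustrative only; not part of the committed file. Each 802.1Q VLAN tag
# adds 4 bytes, so double-tagging pushes a full-size packet past a 1500 MTU.
MTU = 1500              # bytes the receiving interface accepts
VLAN_TAG_OVERHEAD = 4   # bytes added per 802.1Q tag

packet = 1500                               # full-size packet on the inner VLAN
on_the_wire = packet + VLAN_TAG_OVERHEAD    # 1504 bytes after the extra, stacked tag

print(f"{on_the_wire} bytes vs MTU {MTU}:",
      "fits" if on_the_wire <= MTU else "too large, dropped or fragmented")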
@@ -455,16 +455,16 @@ Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ether
 sometimes overzealous logging of failures can cause
 problems of its own.</para>
 <para>A report came in: VMs were launching slowly, or not at
-all. Cue the standard checks—nothing on the nagios, but
+all. Cue the standard checks—nothing on the Nagios, but
 there was a spike in network towards the current master of
 our RabbitMQ cluster. Investigation started, but soon the
 other parts of the queue cluster were leaking memory like
-a sieve. Then the alert came in—the master rabbit server
-went down. Connections failed over to the slave.</para>
+a sieve. Then the alert came in—the master Rabbit server
+went down and connections failed over to the slave.</para>
 <para>At that time, our control services were hosted by
 another team and we didn't have much debugging information
-to determine what was going on with the master, and
-couldn't reboot it. That team noted that it failed without
+to determine what was going on with the master, and we
+could not reboot it. That team noted that it failed without
 alert, but managed to reboot it. After an hour, the
 cluster had returned to its normal state and we went home
 for the day.</para>
@@ -490,7 +490,7 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
 5.5G console.log</computeroutput></screen></para>
 <para>Sure enough, the user had been periodically refreshing
 the console log page on the dashboard and the 5G file was
-traversing the rabbit cluster to get to the
+traversing the Rabbit cluster to get to the
 dashboard.</para>
 <para>We called them and asked them to stop for a while, and
 they were happy to abandon the horribly broken VM. After
@@ -561,7 +561,7 @@ HTTP/1.1" status: 200 len: 931 time: 3.9426181
 </programlisting>
 <para>Since my database contained many records—over 1 million
 metadata records and over 300,000 instance records in "deleted"
-or "errored" states—each search took ages. I decided to clean
+or "errored" states—each search took a long time. I decided to clean
 up the database by first archiving a copy for backup and then
 performing some deletions using the MySQL client. For example, I
 ran the following SQL command to remove rows of instances deleted
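For readers following the cleanup described in the last hunk: the actual SQL statement sits outside this diff, so the snippet below is only a hypothetical sketch of that kind of delete, written with pymysql rather than the mysql CLI the text mentions; the table and column names (instances, deleted, vm_state) are assumptions about a typical nova schema, not taken from the commit.

# Hypothetical sketch only. The real statement is not shown in this hunk, and
# the schema details (instances.deleted, instances.vm_state) are assumptions.
import pymysql

conn = pymysql.connect(host="localhost", user="nova", password="secret", database="nova")
try:
    with conn.cursor() as cur:
        # Archive first (e.g. with mysqldump), as the text advises, before deleting.
        cur.execute(
            "DELETE FROM instances WHERE deleted != 0 OR vm_state = %s",
            ("error",),
        )
        print(f"removed {cur.rowcount} rows")
    conn.commit()
finally:
    conn.close()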