From 4c4f0f71916581dcd7385cc2729b3a67848bf7e4 Mon Sep 17 00:00:00 2001
From: Shilla Saebi
Date: Thu, 24 Sep 2015 18:56:10 -0400
Subject: [PATCH] made some changes to app_crypt xml file

modified nagios to Nagios
changed a few sentences format
capitalized Rabbit for consistency
removed WTF - should be professional

Change-Id: I4af3c8790e842961a1d44231151410ea02248484
---
 doc/openstack-ops/app_crypt.xml | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/doc/openstack-ops/app_crypt.xml b/doc/openstack-ops/app_crypt.xml
index 8d109d21..9ebab50f 100644
--- a/doc/openstack-ops/app_crypt.xml
+++ b/doc/openstack-ops/app_crypt.xml
@@ -32,13 +32,13 @@
         At the data center, I was finishing up some tasks and remembered
         the lock-up. I logged into the new instance and ran
         ps aux again. It worked. Phew. I decided to run it one
-        more time. It locked up. WTF.
+        more time. It locked up.
         After reproducing the problem several times, I came to the
         unfortunate conclusion that this cloud did indeed have a problem.
         Even worse, my time was up in Kelowna and I had to return back to
         Calgary.
         Where do you even begin troubleshooting something like
-        this? An instance just randomly locks when a command is
+        this? An instance that just randomly locks up when a command is
         issued. Is it the image? Nope—it happens on all images. Is it
         the compute node? Nope—all nodes. Is the instance locked up? No!
         New SSH connections work just fine!
@@ -126,10 +126,10 @@
         is also attached to bond0. By mistake, I configured
         OpenStack to attach all tenant VLANs to vlan20 instead of
         bond0 thereby stacking one VLAN
-        on top of another which then added an extra 4 bytes to
-        each packet which cause a packet of 1504 bytes to be sent
+        on top of another. This added an extra 4 bytes to
+        each packet and caused a packet of 1504 bytes to be sent
         out which would cause problems when it arrived at an
-        interface that only accepted 1500!
+        interface that only accepted 1500.
         As soon as this setting was fixed, everything
         worked.
@@ -455,16 +455,16 @@ Feb 15 01:40:19 SW-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface Ether
         sometimes overzealous logging of failures can cause
         problems of its own.
         A report came in: VMs were launching slowly, or not at
-        all. Cue the standard checks—nothing on the nagios, but
+        all. Cue the standard checks—nothing on the Nagios, but
         there was a spike in network towards the current master of
         our RabbitMQ cluster. Investigation started, but soon the
         other parts of the queue cluster were leaking memory like
-        a sieve. Then the alert came in—the master rabbit server
-        went down. Connections failed over to the slave.
+        a sieve. Then the alert came in—the master Rabbit server
+        went down and connections failed over to the slave.
         At that time, our control services were hosted by another
         team and we didn't have much debugging information
-        to determine what was going on with the master, and
-        couldn't reboot it. That team noted that it failed without
+        to determine what was going on with the master, and we
+        could not reboot it. That team noted that it failed without
         alert, but managed to reboot it. After an hour, the cluster
         had returned to its normal state and we went home for the
         day.
@@ -490,7 +490,7 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
 5.5G console.log
         Sure enough, the user had been periodically refreshing the
         console log page on the dashboard and the 5G file was
-        traversing the rabbit cluster to get to the
+        traversing the Rabbit cluster to get to the
         dashboard.
         We called them and asked them to stop for a while, and they
         were happy to abandon the horribly broken VM. After
@@ -561,7 +561,7 @@ HTTP/1.1" status: 200 len: 931 time: 3.9426181
         Since my database contained many records—over 1
         million metadata records and over 300,000 instance records
         in "deleted"
-        or "errored" states—each search took ages. I decided to clean
+        or "errored" states—each search took a long time. I decided to clean
         up the database by first archiving a copy for backup and then
         performing some deletions using the MySQL client. For example,
         I ran the following SQL command to remove rows of instances deleted