From bca2f54ecfd6076d89e89e409179937d9e662404 Mon Sep 17 00:00:00 2001
From: Tom Fifield <tom@openstack.org>
Date: Thu, 23 Jan 2014 13:48:15 -0500
Subject: [PATCH] Add a new story by Felix Lee

This adds a new story to tales from the crypt, about an experience
upgrading from Grizzly to Havana on a decent sized cloud.

Change-Id: I887dc10c0efda14cd322777b44c141880522623c
---
 doc/openstack-ops/app_crypt.xml | 67 +++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/doc/openstack-ops/app_crypt.xml b/doc/openstack-ops/app_crypt.xml
index 13617cc0..3440876c 100644
--- a/doc/openstack-ops/app_crypt.xml
+++ b/doc/openstack-ops/app_crypt.xml
@@ -511,6 +511,73 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
         a permanent resolution, but we look forward to the discussion
         at the next summit.</para>
     </section>
+    <section xml:id="haunted">
+      <title>Havana Haunted by the Dead</title>
+      <para>Felix Lee of Academia Sinica Grid Computing Centre in Taiwan
+        contributed this story.</para>
+      <para>I just upgraded OpenStack from Grizzly to Havana 2013.2-2 using
+        the RDO repository and everything was running pretty well
+        -- except the EC2 API.</para>
+      <para>I noticed that the API would suffer from a heavy load and
+        respond slowly to particular EC2 requests such as
+        <literal>RunInstances</literal>.</para>
+      <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
+        Havana:</para>
+      <screen><computeroutput>2014-01-10 09:11:45.072 129745 INFO nova.ec2.wsgi.server
+[req-84d16d16-3808-426b-b7af-3b90a11b83b0
+0c6e7dba03c24c6a9bce299747499e8a 7052bd6714e7460caeb16242e68124f9]
+117.103.103.29 "GET
+/services/Cloud?AWSAccessKeyId=[something]&Action=RunInstances&ClientToken=[something]&ImageId=ami-00000001&InstanceInitiatedShutdownBehavior=terminate...
+HTTP/1.1" status: 200 len: 1109 time: 138.5970151
+</computeroutput></screen>
+      <para>This request took over two minutes to process, but executed
+        quickly on another co-existing Grizzly deployment using the same
+        hardware and system configuration.</para>
+      <para>Output from <filename>/var/log/nova/nova-api.log</filename> on
+        Grizzly:</para>
+      <screen><computeroutput>2014-01-08 11:15:15.704 INFO nova.ec2.wsgi.server
+[req-ccac9790-3357-4aa8-84bd-cdaab1aa394e
+ebbd729575cb404081a45c9ada0849b7 8175953c209044358ab5e0ec19d52c37]
+117.103.103.29 "GET
+/services/Cloud?AWSAccessKeyId=[something]&Action=RunInstances&ClientToken=[something]&ImageId=ami-00000007&InstanceInitiatedShutdownBehavior=terminate...
+HTTP/1.1" status: 200 len: 931 time: 3.9426181
+</computeroutput></screen>
+      <para>While monitoring system resources, I noticed
+        a significant increase in memory consumption while the EC2 API
+        processed this request. I thought it wasn't handling memory
+        properly -- possibly not releasing memory. If the API received
+        several of these requests, memory consumption quickly grew until
+        the system ran out of RAM and began using swap. Each node has
+        48 GB of RAM and the "nova-api" process would consume all of it
+        within minutes. Once this happened, the entire system would become
+        unusably slow until I restarted the
+        <systemitem class="service">nova-api</systemitem> service.</para>
+      <para>So, I found myself wondering what changed in the EC2 API on
+        Havana that might cause this to happen. Was it a bug or a normal
+        behavior that I now need to work around?</para>
+      <para>After digging into the Nova code, I noticed two areas in
+        <filename>api/ec2/cloud.py</filename> potentially impacting my
+        system:</para>
+      <programlisting language="python">
+        instances = self.compute_api.get_all(context,
+                                             search_opts=search_opts,
+                                             sort_dir='asc')
+
+        sys_metas = self.compute_api.get_all_system_metadata(
+            context, search_filts=[{'key': ['EC2_client_token']},
+                                   {'value': [client_token]}])
+</programlisting>
+      <para>Since my database contained many records -- over 1 million
+        metadata records and over 300,000 instance records in "deleted"
+        or "errored" states -- each search took ages. I decided to clean
+        up the database by first archiving a copy for backup and then
+        performing some deletions using the MySQL client. For example, I
+        ran the following SQL command to remove rows of instances deleted
+        for over a year:</para>
+      <screen><prompt>mysql></prompt> <userinput>delete from nova.instances where deleted=1 and terminated_at < (NOW() - INTERVAL 1 YEAR);</userinput></screen>
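+      <para>One way to take the archive copy mentioned above before
+        deleting anything is <literal>mysqldump</literal>; the table
+        list and dump file name below are only illustrative:</para>
+      <screen><prompt>#</prompt> <userinput>mysqldump -u root -p nova instances instance_system_metadata > nova_archive.sql</userinput></screen>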
+      <para>Performance increased greatly after deleting the old records
+        and my new deployment continues to behave well.</para>
+    </section>
 </appendix>
 <!-- FIXME: Add more stories, especially more up-to-date ones -->
 <!-- STAUS: OK for Havana -->