diff --git a/doc/openstack-ops/app_crypt.xml b/doc/openstack-ops/app_crypt.xml
index 13617cc0..3440876c 100644
--- a/doc/openstack-ops/app_crypt.xml
+++ b/doc/openstack-ops/app_crypt.xml
@@ -511,6 +511,73 @@ adm@cc12:/var/lib/nova/instances/instance-00000e05# ls -sh console.log
     a permanent resolution, but we look forward to the discussion at the
     next summit.
+
+ Havana Haunted by the Dead
+ Felix Lee of Academia Sinica Grid Computing Centre in Taiwan
+ contributed this story.
+ I just upgraded OpenStack from Grizzly to Havana 2013.2-2 using
+ the RDO repository and everything was running pretty well
+ -- except the EC2 API.
+ I noticed that the API would suffer from a heavy load and
+ respond slowly to particular EC2 requests such as
+ RunInstances.
+ Output from /var/log/nova/nova-api.log on
+ Havana:
+ 2014-01-10 09:11:45.072 129745 INFO nova.ec2.wsgi.server
+[req-84d16d16-3808-426b-b7af-3b90a11b83b0
+0c6e7dba03c24c6a9bce299747499e8a 7052bd6714e7460caeb16242e68124f9]
+117.103.103.29 "GET
+/services/Cloud?AWSAccessKeyId=[something]&Action=RunInstances&ClientToken=[something]&ImageId=ami-00000001&InstanceInitiatedShutdownBehavior=terminate...
+HTTP/1.1" status: 200 len: 1109 time: 138.5970151
+
+ This request took over two minutes to process, but executed
+ quickly on another co-existing Grizzly deployment using the same
+ hardware and system configuration.
+ Output from /var/log/nova/nova-api.log on
+ Grizzly:
+ 2014-01-08 11:15:15.704 INFO nova.ec2.wsgi.server
+[req-ccac9790-3357-4aa8-84bd-cdaab1aa394e
+ebbd729575cb404081a45c9ada0849b7 8175953c209044358ab5e0ec19d52c37]
+117.103.103.29 "GET
+/services/Cloud?AWSAccessKeyId=[something]&Action=RunInstances&ClientToken=[something]&ImageId=ami-00000007&InstanceInitiatedShutdownBehavior=terminate...
+HTTP/1.1" status: 200 len: 931 time: 3.9426181
+
+ While monitoring system resources, I noticed
+ a significant increase in memory consumption while the EC2 API
+ processed this request. I thought it wasn't handling memory
+ properly -- possibly not releasing memory. If the API received
+ several of these requests, memory consumption quickly grew until
+ the system ran out of RAM and began using swap. Each node has
+ 48 GB of RAM and the "nova-api" process would consume all of it
+ within minutes. Once this happened, the entire system would become
+ unusably slow until I restarted the
+ nova-api service.
+ So, I found myself wondering what changed in the EC2 API on
+ Havana that might cause this to happen. Was it a bug or normal
+ behavior that I now need to work around?
+ After digging into the Nova code, I noticed two areas in
+ api/ec2/cloud.py potentially impacting my
+ system:
+
+ instances = self.compute_api.get_all(context,
+         search_opts=search_opts,
+         sort_dir='asc')
+
+ sys_metas = self.compute_api.get_all_system_metadata(
+     context, search_filts=[{'key': ['EC2_client_token']},
+                            {'value': [client_token]}])
+
+ Since my database contained many records -- over 1 million
+ metadata records and over 300,000 instance records in "deleted"
+ or "errored" states -- each search took ages. I decided to clean
+ up the database by first archiving a copy for backup and then
+ performing some deletions using the MySQL client. For example, I
+ ran the following SQL command to remove rows of instances deleted
+ for over a year:
+ mysql> delete from nova.instances where deleted=1 and terminated_at < (NOW() - INTERVAL 1 YEAR);
+ Performance increased greatly after deleting the old records and
+ my new deployment continues to behave well.
+
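+ To get a feel for how much dead data those two searches have to
+ wade through, it helps to count the soft-deleted rows first. The
+ queries below are only a rough sketch, assuming the stock nova
+ MySQL schema (the instances and instance_system_metadata
+ tables); depending on the release, the deleted column holds
+ either 1 or the row's own id once a record is soft deleted, so a
+ nonzero test is the more portable check:
+ mysql> select count(*) from nova.instances where deleted != 0;
+ mysql> select count(*) from nova.instance_system_metadata where deleted != 0;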
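+ The clean-up itself can be done as archive-then-delete entirely
+ from the MySQL client. This is a minimal sketch rather than the
+ exact commands from the story; the archive table name is
+ illustrative, and a full dump of the nova database (for example
+ with mysqldump) remains the safer backup, since the archive
+ table only keeps the rows that are about to be removed:
+ mysql> create table nova.instances_archive as
+     ->     select * from nova.instances
+     ->     where deleted=1 and terminated_at < (NOW() - INTERVAL 1 YEAR);
+ mysql> delete from nova.instances where deleted=1 and terminated_at < (NOW() - INTERVAL 1 YEAR);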