diff --git a/doc/admin-guide-cloud-rst/source/objectstorage-admin.rst b/doc/admin-guide-cloud-rst/source/objectstorage-admin.rst
new file mode 100644
index 0000000000..4969ee8562
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage-admin.rst
@@ -0,0 +1,12 @@
+========================================
+System administration for Object Storage
+========================================
+
+By understanding Object Storage concepts, you can better monitor and
+administer your storage solution. The majority of the administration
+information is maintained in developer documentation at
+`docs.openstack.org/developer/swift/ <http://docs.openstack.org/developer/swift/>`__.
+
+See the `OpenStack Configuration
+Reference `__
+for a list of configuration options for Object Storage.
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage-monitoring.rst b/doc/admin-guide-cloud-rst/source/objectstorage-monitoring.rst
new file mode 100644
index 0000000000..6ed516d356
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage-monitoring.rst
@@ -0,0 +1,247 @@
+=========================
+Object Storage monitoring
+=========================
+
+Excerpted from a blog post by `Darrell Bishop `__
+
+An OpenStack Object Storage cluster is a collection of many daemons that
+work together across many nodes. With so many different components, you
+must be able to tell what is going on inside the cluster. Tracking
+server-level meters like CPU utilization, load, memory consumption, disk
+usage and utilization, and so on is necessary, but not sufficient.
+
+What are the different daemons doing on each server? What is the volume
+of object replication on node8? How long is it taking? Are there errors?
+If so, when did they happen?
+
+In such a complex ecosystem, you can use multiple approaches to get the
+answers to these questions. This section describes several approaches.
+
+Swift Recon
+~~~~~~~~~~~
+
+The Swift Recon middleware (see
+http://swift.openstack.org/admin_guide.html#cluster-telemetry-and-monitoring)
+provides general machine statistics, such as load average, socket
+statistics, ``/proc/meminfo`` contents, and so on, as well as
+Swift-specific meters:
+
+- The MD5 sum of each ring file.
+
+- The most recent object replication time.
+
+- Count of each type of quarantined file: Account, container, or
+  object.
+
+- Count of "async\_pendings" (deferred container updates) on disk.
+
+Swift Recon is middleware that is installed in the object server's
+pipeline and takes one required option: a local cache directory. To
+track ``async_pendings``, you must set up an additional cron job for
+each object server. You access data by either sending HTTP requests
+directly to the object server or using the ``swift-recon`` command-line
+client.
+
+There are some good Object Storage cluster statistics here, but the
+general server meters overlap with existing server monitoring systems.
+To get the Swift-specific meters into a monitoring system, they must be
+polled. Swift Recon essentially acts as a middleware meters collector.
+The process that feeds meters to your statistics system, such as
+``collectd`` and ``gmond``, probably already runs on the storage node.
+So, you can choose to either talk to Swift Recon or collect the meters
+directly.
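+
+As a quick illustration of the HTTP interface, the following sketch
+polls a few Recon endpoints on a single object server and prints the
+returned JSON. The storage node host name is hypothetical, and the port
+and endpoint names assume a default object-server configuration with
+Recon enabled; adjust them to match your deployment:
+
+.. code-block:: python
+
+   # Sketch only: poll Swift Recon endpoints on one object server.
+   # "storage-node-01" is a hypothetical host; 6000 is the default
+   # object-server port in a typical deployment.
+   import json
+   import urllib2
+
+   RECON_BASE = 'http://storage-node-01:6000/recon'
+   ENDPOINTS = ['async', 'load', 'replication', 'quarantined']
+
+   for endpoint in ENDPOINTS:
+       url = '%s/%s' % (RECON_BASE, endpoint)
+       try:
+           data = json.load(urllib2.urlopen(url, timeout=5))
+       except Exception as exc:
+           print('%s: error (%s)' % (endpoint, exc))
+           continue
+       print('%s: %s' % (endpoint, json.dumps(data)))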
+
+Swift-Informant
+~~~~~~~~~~~~~~~
+
+Florian Hines developed the Swift-Informant middleware (see
+https://github.com/pandemicsyn/swift-informant) to get real-time
+visibility into Object Storage client requests. It sits in the pipeline
+for the proxy server, and after each request to the proxy server, sends
+three meters to a StatsD server (see
+http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/):
+
+- A counter increment for a meter like ``obj.GET.200`` or
+  ``cont.PUT.404``.
+
+- Timing data for a meter like ``acct.GET.200`` or ``obj.GET.200``.
+  [The README says the meters look like ``duration.acct.GET.200``, but
+  I do not see the ``duration`` in the code. I am not sure what the
+  Etsy server does, but our StatsD server turns timing meters into five
+  derivative meters with new segments appended, so it probably works as
+  coded. The first meter turns into ``acct.GET.200.lower``,
+  ``acct.GET.200.upper``, ``acct.GET.200.mean``,
+  ``acct.GET.200.upper_90``, and ``acct.GET.200.count``].
+
+- A counter increase by the bytes transferred for a meter like
+  ``tfer.obj.PUT.201``.
+
+This is good for getting a feel for the quality of service clients are
+experiencing with the timing meters, as well as getting a feel for the
+volume of the various permutations of request server type, command, and
+response code. Swift-Informant also requires no change to core Object
+Storage code because it is implemented as middleware. However, it gives
+you no insight into the workings of the cluster past the proxy server.
+If the responsiveness of one storage node degrades, you can only see
+that some of your requests are bad, either as high latency or error
+status codes. You do not know exactly why or where that request tried
+to go. Maybe the container server in question was on a good node but
+the object server was on a different, poorly-performing node.
+
+Statsdlog
+~~~~~~~~~
+
+Florian's `Statsdlog `__
+project increments StatsD counters based on logged events. Like
+Swift-Informant, it is also non-intrusive, but statsdlog can track
+events from all Object Storage daemons, not just proxy-server. The
+daemon listens to a UDP stream of syslog messages, and StatsD counters
+are incremented when a log line matches a regular expression. Meter
+names are mapped to regex match patterns in a JSON file, allowing
+flexible configuration of which meters are extracted from the log
+stream.
+
+Currently, only the first matching regex triggers a StatsD counter
+increment, and the counter is always incremented by one. There is no
+way to increment a counter by more than one or to send timing data to
+StatsD based on the log line content. The tool could be extended to
+handle more meters for each line and data extraction, including timing
+data, but a coupling would still exist between the log textual format
+and the log parsing regexes, which would themselves be more complex in
+order to support multiple matches for each line and data extraction.
+Also, log processing introduces a delay between the triggering event
+and sending the data to StatsD. It would be preferable to increment
+error counters where they occur and to send timing data as soon as it
+is known, both to avoid coupling between a log string and a parsing
+regex and to prevent a time delay between events and sending data to
+StatsD.
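+
+To make the mechanism concrete, the following is a minimal sketch of
+the statsdlog idea, not the actual statsdlog code: listen for syslog
+lines on UDP, test each line against an ordered list of regexes, and
+increment a StatsD counter for the first pattern that matches. The
+meter names, patterns, and ports below are made-up examples:
+
+.. code-block:: python
+
+   # Minimal sketch of a statsdlog-style daemon; not the real project.
+   # Meter names, regexes, and ports are illustrative only.
+   import re
+   import socket
+
+   PATTERNS = [
+       ('object-server.errors', re.compile(r'object-server.*ERROR')),
+       ('proxy-server.timeouts', re.compile(r'proxy-server.*Timeout')),
+   ]
+   STATSD_ADDR = ('127.0.0.1', 8125)
+
+   statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+   listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+   listener.bind(('0.0.0.0', 8126))  # hypothetical syslog forwarding port
+
+   while True:
+       line = listener.recv(8192).decode('utf-8', 'replace')
+       for meter, pattern in PATTERNS:
+           if pattern.search(line):
+               # Counters only, always incremented by one, and only the
+               # first matching regex fires -- mirroring the limitations
+               # described above.
+               statsd.sendto(('%s:1|c' % meter).encode('ascii'), STATSD_ADDR)
+               break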
+
+The next section describes another method for gathering Object Storage
+operational meters.
+
+Swift StatsD logging
+~~~~~~~~~~~~~~~~~~~~
+
+StatsD (see
+http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/)
+was designed for application code to be deeply instrumented; meters are
+sent in real-time by the code that just noticed or did something. The
+overhead of sending a meter is extremely low: a ``sendto`` of one UDP
+packet. If that overhead is still too high, the StatsD client library
+can send only a random portion of samples and StatsD approximates the
+actual number when flushing meters upstream.
+
+To avoid the problems inherent with middleware-based monitoring and
+after-the-fact log processing, the sending of StatsD meters is
+integrated into Object Storage itself. The submitted change set (see
+https://review.openstack.org/#change,6058) currently reports 124 meters
+across 15 Object Storage daemons and the tempauth middleware. Details
+of the meters tracked are in the `Administrator's Guide `__.
+
+The sending of meters is integrated with the logging framework. To
+enable, configure ``log_statsd_host`` in the relevant config file. You
+can also specify the port and a default sample rate. The specified
+default sample rate is used unless a specific call to a statsd logging
+method (see the list below) overrides it. Currently, no logging calls
+override the sample rate, but it is conceivable that some meters may
+require accuracy (``sample_rate == 1``) while others may not.
+
+.. code-block:: ini
+
+   [DEFAULT]
+   ...
+   log_statsd_host = 127.0.0.1
+   log_statsd_port = 8125
+   log_statsd_default_sample_rate = 1
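+
+For illustration, the following sketch shows the kind of UDP datagram
+such a configuration produces, including how a sample rate below 1 is
+encoded on the wire. It is not Swift's internal client code, and the
+meter name is hypothetical:
+
+.. code-block:: python
+
+   # Sketch of the StatsD wire format implied by the settings above;
+   # not Swift's actual client code. The meter name is hypothetical.
+   import random
+   import socket
+
+   def send_counter(metric, value=1, sample_rate=1):
+       # With sample_rate < 1, only a random portion of events is sent;
+       # the "|@rate" suffix lets StatsD scale the count back up.
+       if sample_rate < 1 and random.random() >= sample_rate:
+           return
+       payload = '%s:%s|c' % (metric, value)
+       if sample_rate < 1:
+           payload += '|@%s' % sample_rate
+       sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
+       # The entire overhead of reporting is this one sendto() call.
+       sock.sendto(payload.encode('ascii'), ('127.0.0.1', 8125))
+
+   send_counter('proxy-server.Object.GET.200.xfer', 1024, sample_rate=0.5)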
+
+Then the LogAdapter object returned by ``get_logger()``, usually stored
+in ``self.logger``, has these new methods:
+
+- ``set_statsd_prefix(self, prefix)`` Sets the client library stat
+  prefix value which gets prefixed to every meter. The default prefix
+  is the "name" of the logger, such as "object-server",
+  "container-auditor", and so on. This is currently used to turn
+  "proxy-server" into one of "proxy-server.Account",
+  "proxy-server.Container", or "proxy-server.Object" as soon as the
+  Controller object is determined and instantiated for the request.
+
+- ``update_stats(self, metric, amount, sample_rate=1)`` Increments the
+  supplied metric by the given amount. This is used when you need to
+  add or subtract more than one from a counter, like incrementing
+  "suffix.hashes" by the number of computed hashes in the object
+  replicator.
+
+- ``increment(self, metric, sample_rate=1)`` Increments the given
+  counter metric by one.
+
+- ``decrement(self, metric, sample_rate=1)`` Lowers the given counter
+  metric by one.
+
+- ``timing(self, metric, timing_ms, sample_rate=1)`` Records that the
+  given metric took the supplied number of milliseconds.
+
+- ``timing_since(self, metric, orig_time, sample_rate=1)`` Convenience
+  method to record a timing metric whose value is "now" minus an
+  existing timestamp.
+
+Note that these logging methods may safely be called anywhere you have
+a logger object. If StatsD logging has not been configured, the methods
+are no-ops. This avoids messy conditional logic at each place a meter
+is recorded. These example usages show the new logging methods:
+
+.. code-block:: python
+   :linenos:
+
+   # swift/obj/replicator.py
+   def update(self, job):
+       # ...
+       begin = time.time()
+       try:
+           hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'],
+               do_listdir=(self.replication_count % 10) == 0,
+               reclaim_age=self.reclaim_age)
+           # See tpooled_get_hashes "Hack".
+           if isinstance(hashed, BaseException):
+               raise hashed
+           self.suffix_hash += hashed
+           self.logger.update_stats('suffix.hashes', hashed)
+           # ...
+       finally:
+           self.partition_times.append(time.time() - begin)
+           self.logger.timing_since('partition.update.timing', begin)
+
+.. code-block:: python
+   :linenos:
+
+   # swift/container/updater.py
+   def process_container(self, dbfile):
+       # ...
+       start_time = time.time()
+       # ...
+           for event in events:
+               if 200 <= event.wait() < 300:
+                   successes += 1
+               else:
+                   failures += 1
+           if successes > failures:
+               self.logger.increment('successes')
+               # ...
+           else:
+               self.logger.increment('failures')
+               # ...
+           # Only track timing data for attempted updates:
+           self.logger.timing_since('timing', start_time)
+       else:
+           self.logger.increment('no_changes')
+           self.no_changes += 1
+
+The Object Storage development team wanted to use the
+`pystatsd `__ client library (not to
+be confused with a `similar-looking
+project `__ also hosted on GitHub),
+but the released version on PyPI was missing two desired features that
+the latest version in GitHub had: the ability to configure a meters
+prefix in the client object and a convenience method for sending timing
+data between "now" and a "start" timestamp you already have. So they
+just implemented a simple StatsD client library from scratch with the
+same interface. This has the nice fringe benefit of not introducing
+another external library dependency into Object Storage.
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage.rst b/doc/admin-guide-cloud-rst/source/objectstorage.rst
index 890e309ac7..94baa41e8c 100644
--- a/doc/admin-guide-cloud-rst/source/objectstorage.rst
+++ b/doc/admin-guide-cloud-rst/source/objectstorage.rst
@@ -12,6 +12,8 @@ Contents
    objectstorage_features.rst
    objectstorage_characteristics.rst
    objectstorage_components.rst
+   objectstorage-monitoring.rst
+   objectstorage-admin.rst

 .. TODO (karenb)
    objectstorage_ringbuilder.rst
@@ -19,6 +21,4 @@ Contents
    objectstorage_replication.rst
    objectstorage_account_reaper.rst
    objectstorage_tenant_specific_image_storage.rst
-   objectstorage_monitoring.rst
-   objectstorage_admin.rst
    objectstorage_troubleshoot.rst