Merge "Handle non-referenced and duplicated files"
commit 8d32b87efb
@@ -11,6 +11,9 @@
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
<xi:include href="../common/section_objectstorage-arch.xml"/>
<xi:include href="../common/section_objectstorage-replication.xml"/>
<xi:include href="../common/section_objectstorage-account-reaper.xml"/>
<xi:include href="../common/section_objectstorage_tenant-specific-image-storage.xml"/>
<xi:include href="section_object-storage-monitoring.xml"/>
<xi:include href="section_object-storage-admin.xml"/>
<xi:include href="../common/section_objectstorage-troubleshoot.xml"/>
</chapter>
@@ -1,122 +0,0 @@
<?xml version="1.0" encoding="UTF-8"?>
<section xml:id="root-wrap-reference"
xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0">
<title>Secure with root wrappers</title>
<para>The root wrapper enables the unprivileged Compute user to run a number of actions as the root user in the safest manner possible. Historically, Compute used a specific <filename>sudoers</filename> file that listed every command that the Compute user was allowed to run, and used <command>sudo</command> to run those commands as <literal>root</literal>. However, this was difficult to maintain (the <filename>sudoers</filename> file was in packaging), and it did not enable complex filtering of parameters (advanced filters). Rootwrap was designed to solve those issues.</para>
<simplesect>
<title>How rootwrap works</title>
<para>Instead of calling <command>sudo make me a sandwich</command>, Compute services starting with nova- call <command>sudo nova-rootwrap /etc/nova/rootwrap.conf make me a sandwich</command>. A generic sudoers entry lets the Compute user run nova-rootwrap as root. The nova-rootwrap code looks for filter definition directories in its configuration file and loads command filters from them. It then checks whether the command requested by Compute matches one of those filters, in which case it executes the command (as root). If no filter matches, it denies the request.</para>
</simplesect>
<simplesect>
<title>Security model</title>
<para>The escalation path is fully controlled by the root user. A sudoers entry (owned by root) allows Compute to run (as root) a specific rootwrap executable, and only with a specific configuration file (which should be owned by root). nova-rootwrap imports the Python modules it needs from a cleaned (and system-default) PYTHONPATH. The root-owned configuration file points to root-owned filter definition directories, which contain root-owned filter definition files. This chain ensures that the Compute user itself is not in control of the configuration or modules used by the nova-rootwrap executable.</para>
</simplesect>
<simplesect>
<title>Details of rootwrap.conf</title>
<para>You configure nova-rootwrap in the <filename>rootwrap.conf</filename> file. Because it is in the trusted security path, it must be owned and writable only by the root user. Its location is specified both in the sudoers entry and in the <filename>nova.conf</filename> configuration file with the <code>rootwrap_config</code> entry.</para>
<para>It uses an INI file format with these sections and parameters:</para>
<table rules="all" frame="border" xml:id="rootwrap-conf-table-filter-path" width="100%">
<caption>rootwrap.conf configuration options</caption>
<col width="50%"/>
<col width="50%"/>
<thead>
<tr>
<td><para>Configuration option=Default value</para></td>
<td><para>(Type) Description</para></td>
</tr>
</thead>
<tbody>
<tr>
<td><para>[DEFAULT]</para>
<para>filters_path=/etc/nova/rootwrap.d,/usr/share/nova/rootwrap</para></td>
<td><para>(ListOpt) Comma-separated list of directories that contain filter definition files; defines where rootwrap filters are stored. All directories on this line should exist and be owned and writable only by the root user.</para></td>
</tr>
</tbody>
</table>
</simplesect>
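<para>For illustration, a minimal <filename>rootwrap.conf</filename> following the options above might look like this sketch (the paths shown are examples; adjust them to your installation):</para>
<programlisting language="ini"># /etc/nova/rootwrap.conf (example)
[DEFAULT]
# Comma-separated, root-owned directories that hold the .filters files
filters_path=/etc/nova/rootwrap.d,/usr/share/nova/rootwrap</programlisting>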
<simplesect>
<title>Details of .filters files</title>
<para>Filter definition files contain lists of filters that nova-rootwrap will use to allow or deny a specific command. They are generally suffixed by .filters. Because they are in the trusted security path, they must be owned and writable only by the root user. Their location is specified in the rootwrap.conf file.</para>
<para>They use an INI file format with a [Filters] section and several lines, each with a unique parameter name (different for each filter that you define):</para>
<table rules="all" frame="border" xml:id="rootwrap-conf-table-filter-name" width="100%">
<caption>.filters configuration options</caption>
<col width="50%"/>
<col width="50%"/>
<thead>
<tr>
<td><para>Configuration option=Default value</para></td>
<td><para>(Type) Description</para></td>
</tr>
</thead>
<tbody>
<tr>
<td><para>[Filters]</para>
<para>filter_name=kpartx: CommandFilter, /sbin/kpartx, root</para></td>
<td><para>(ListOpt) Comma-separated list that contains first the Filter class to use, followed by that Filter's arguments (which vary depending on the Filter class selected).</para></td>
</tr>
</tbody>
</table>
</simplesect>
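<para>As a hypothetical example of the format described above, a <filename>compute.filters</filename> file placed in one of the filters_path directories could contain entries such as (the command names are illustrative only):</para>
<programlisting language="ini"># /etc/nova/rootwrap.d/compute.filters (example)
[Filters]
# name: FilterClass, command path or name, user to run as
kpartx: CommandFilter, /sbin/kpartx, root
mount: CommandFilter, mount, root</programlisting>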
</section>
@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-account-reaper">
<!-- ... Old module003-ch008-account-reaper edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Account reaper</title>
<para>In the background, the account reaper removes data from the deleted accounts.</para>
<para>A reseller marks an account for deletion by issuing a <code>DELETE</code> request on the account’s
@@ -8,7 +8,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-cluster-architecture">
<!-- ... Old module003-ch007-swift-cluster-architecture edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Cluster architecture</title>
<section xml:id="section_access-tier">
<title>Access tier</title>
@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="objectstorage_characteristics">
<!-- ... Old module003-ch003-obj-store-capabilities edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Object Storage characteristics</title>
<para>The key characteristics of Object Storage are that:</para>
<itemizedlist>
@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-components">
<!-- ... Old module003-ch004-swift-building-blocks edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Components</title>
<para>The components that enable Object Storage to deliver high availability, high durability, and high concurrency are:</para>
@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage_features">
<!-- ... Old module003-ch002-features-benefits edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Features and benefits</title>
<informaltable>
@@ -4,7 +4,6 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="section_objectstorage-intro">
<!-- ... Old module003-ch001-intro-objstore edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named swift) is open source software for creating
redundant, scalable data storage using clusters of standardized servers to store petabytes
@@ -3,7 +3,6 @@
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="section_objectstorage-replication">
<!-- ... Old module003-ch009-replication edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Replication</title>
<para>Because each replica in Object Storage functions
independently and clients generally require only a simple
@@ -3,7 +3,6 @@
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink" version="5.0"
xml:id="section_objectstorage-ringbuilder">
<!-- ... Old module003-ch005-the-ring edited, renamed, and stored in doc/common for use by both Cloud Admin and Operator Training Guides... -->
<title>Ring-builder</title>
<para>Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized
@@ -22,6 +22,7 @@ options. For installation prerequisites and step-by-step walkthroughs, see the
<xi:include href="../common/tables/keystone-api.xml"/>
<xi:include href="../common/tables/keystone-assignment.xml"/>
<xi:include href="../common/tables/keystone-auth.xml"/>
<xi:include href="../common/tables/keystone-auth_token.xml"/>
<xi:include href="../common/tables/keystone-cache.xml"/>
<xi:include href="../common/tables/keystone-catalog.xml"/>
<xi:include href="../common/tables/keystone-credential.xml"/>
@@ -93,6 +93,23 @@
</section>
</section>
<section xml:id="container-sync-realms-configuration">
<title>Container sync realms configuration</title>
<para>Find an example container sync realms configuration at <filename>etc/container-sync-realms.conf-sample</filename> in the source code repository.</para>
<para>The available configuration options are:</para>
<xi:include href="../common/tables/swift-container-sync-realms-DEFAULT.xml"/>
<xi:include href="../common/tables/swift-container-sync-realms-realm1.xml"/>
<xi:include href="../common/tables/swift-container-sync-realms-realm2.xml"/>
<section xml:id="container-sync-realms-conf">
<title>Sample container sync realms configuration file</title>
<programlisting language="ini"><xi:include parse="text" href="http://git.openstack.org/cgit/openstack/swift/plain/etc/container-sync-realms.conf-sample?h=stable/icehouse"/></programlisting>
</section>
</section>
<section xml:id="account-server-configuration">
<title>Account server configuration</title>
<para>Find an example account server configuration at
@@ -140,6 +157,8 @@
href="../common/tables/swift-proxy-server-filter-cache.xml"/>
<xi:include href="../common/tables/swift-proxy-server-filter-catch_errors.xml"/>
<xi:include href="../common/tables/swift-proxy-server-filter-container_sync.xml"/>
<xi:include href="../common/tables/swift-proxy-server-filter-dlo.xml"/>
<xi:include
@@ -12,15 +12,15 @@
</section>
<section xml:id="associate-intro-object-store">
<title>Introduction to Object Storage</title>
<xi:include href="./module003-ch001-intro-objstore.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch001-intro-objectstore']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-intro.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage-intro']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="associate-object-store-features-benefits">
<title>Features and Benefits</title>
<xi:include href="./module003-ch002-features-benefits.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch002-features-benefits']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-features.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage_features']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
@@ -12,42 +12,19 @@
</section>
<section xml:id="operator-intro-object-store">
<title>Review Associate Introduction to Object Storage</title>
<xi:include href="./module003-ch001-intro-objstore.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch001-intro-objectstore']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-object-store-features-benefits">
<title>Review Associate Features and Benefits</title>
<xi:include href="./module003-ch002-features-benefits.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch002-features-benefits']/*[not(self::db:title)])">
<xi:include href="../common/section_objectstorage-intro.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'section_objectstorage-intro']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<xi:include href="../common/section_objectstorage-features.xml"/>
<section xml:id="operator-object-store-node-administration-tasks">
<title>Review Associate Administration Tasks</title>
<para></para>
</section>
<section xml:id="operator-object-store-capabilities">
<title>Object Storage Capabilities</title>
<xi:include href="./module003-ch003-obj-store-capabilities.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch003-obj-store-capabilities']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-building-blocks">
<title>Object Storage Building Blocks</title>
<xi:include href="./module003-ch004-swift-building-blocks.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch004-swift-building-blocks']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-the-ring">
<title>Swift Ring Builder</title>
<xi:include href="./module003-ch005-the-ring.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch005-the-ring']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include></section>
<xi:include href="../common/section_objectstorage-characteristics.xml"/>
<xi:include href="../common/section_objectstorage-components.xml"/>
<xi:include href="../common/section_objectstorage-ringbuilder.xml"/>
<section xml:id="operator-swift-more-concepts">
<title>More Swift Concepts</title>
<xi:include href="./module003-ch006-more-concepts.xml"
@@ -55,25 +32,7 @@
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-cluster-architecture">
<title>Swift Cluster Architecture</title>
<xi:include href="./module003-ch007-swift-cluster-architecture.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch007-cluster-architecture']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-account-reaper">
<title>Swift Account Reaper</title>
<xi:include href="./module003-ch008-account-reaper.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch008-account-reaper']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<section xml:id="operator-swift-replication">
<title>Swift Replication</title>
<xi:include href="./module003-ch009-replication.xml"
xpointer="xmlns(db=http://docbook.org/ns/docbook) xpath(//*[@xml:id = 'module003-ch009-replication']/*[not(self::db:title)])">
<xi:fallback><para><mediaobject><imageobject><imagedata fileref="figures/openstack-training-remote-content-not-available.png" format="PNG"/></imageobject></mediaobject>Remote content not available</para><para>image source</para><para><link xlink:href="https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing">https://docs.google.com/drawings/d/1J2LZSxmc06xKyxMgPjv5fC0blV7qK6956-AeTmFOZD4/edit?usp=sharing</link></para></xi:fallback>
</xi:include>
</section>
<xi:include href="../common/section_objectstorage-arch.xml"/>
<xi:include href="../common/section_objectstorage-account-reaper.xml"/>
<xi:include href="../common/section_objectstorage-replication.xml"/>
</chapter>
@@ -1,32 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch001-intro-objectstore">
<title>Introduction to Object Storage</title>
<para>OpenStack Object Storage (code-named Swift) is open source software for creating redundant, scalable data storage using clusters of standardized servers to store petabytes of accessible data. It is a long-term storage system for large amounts of static data that can be retrieved, leveraged, and updated. Object Storage uses a distributed architecture with no central point of control, providing greater scalability, redundancy and permanence. Objects are written to multiple hardware devices, with the OpenStack software responsible for ensuring data replication and integrity across the cluster. Storage clusters scale horizontally by adding new nodes. Should a node fail, OpenStack works to replicate its content from other active nodes. Because OpenStack uses software logic to ensure data replication and distribution across different devices, inexpensive commodity hard drives and servers can be used in lieu of more expensive equipment.</para>
<para>Object Storage is ideal for cost effective, scale-out storage. It provides a fully distributed, API-accessible storage platform that can be integrated directly into applications or used for backup, archiving and data retention. Block Storage allows block devices to be exposed and connected to compute instances for expanded storage, better performance and integration with enterprise storage platforms, such as NetApp, Nexenta and SolidFire.</para>
</chapter>
@@ -1,204 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch002-features-benefits">
<title>Features and Benefits</title>
<para>
<informaltable class="c19">
<tbody>
<tr>
<th rowspan="1" colspan="1">Features</th>
<th rowspan="1" colspan="1">Benefits</th>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Leverages commodity hardware</emphasis></td>
<td rowspan="1" colspan="1">No lock-in, lower price/GB</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">HDD/node failure agnostic</emphasis></td>
<td rowspan="1" colspan="1">Self healing; reliability; data redundancy protects from failures</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Unlimited storage</emphasis></td>
<td rowspan="1" colspan="1">Huge and flat namespace; highly scalable read/write access; ability to serve content directly from the storage system</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Multi-dimensional scalability</emphasis> (scale-out architecture); scale vertically and horizontally-distributed storage</td>
<td rowspan="1" colspan="1">Back up and archive large amounts of data with linear performance</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Account/Container/Object structure</emphasis>; no nesting, not a traditional file system</td>
<td rowspan="1" colspan="1">Optimized for scale; scales to multiple petabytes and billions of objects</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Built-in replication; 3x+ data redundancy</emphasis> compared to 2x on RAID</td>
<td rowspan="1" colspan="1">Configurable number of account, container, and object copies for high availability</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Easily add capacity</emphasis>, unlike RAID resize</td>
<td rowspan="1" colspan="1">Elastic data scaling with ease</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">No central database</emphasis></td>
<td rowspan="1" colspan="1">Higher performance, no bottlenecks</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">RAID not required</emphasis></td>
<td rowspan="1" colspan="1">Handles lots of small, random reads and writes efficiently</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Built-in management utilities</emphasis></td>
<td rowspan="1" colspan="1">Account management: create, add, verify, and delete users; container management: upload, download, and verify; monitoring: capacity, host, network, log trawling, and cluster health</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Drive auditing</emphasis></td>
<td rowspan="1" colspan="1">Detects drive failures, preempting data corruption</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Expiring objects</emphasis></td>
<td rowspan="1" colspan="1">Users can set an expiration time or a TTL on an object to control access</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Direct object access</emphasis></td>
<td rowspan="1" colspan="1">Enables direct browser access to content, such as for a control panel</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Realtime visibility into client requests</emphasis></td>
<td rowspan="1" colspan="1">Know what users are requesting</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Supports S3 API</emphasis></td>
<td rowspan="1" colspan="1">Utilize tools that were designed for the popular S3 API</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Restrict containers per account</emphasis></td>
<td rowspan="1" colspan="1">Limit access to control usage by user</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Support for NetApp, Nexenta, SolidFire</emphasis></td>
<td rowspan="1" colspan="1">Unified support for block volumes using a variety of storage systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Snapshot and backup API for block volumes</emphasis></td>
<td rowspan="1" colspan="1">Data protection and recovery for VM data</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Standalone volume API available</emphasis></td>
<td rowspan="1" colspan="1">Separate endpoint and API for integration with other compute systems</td>
</tr>
<tr>
<td rowspan="1" colspan="1"><emphasis role="bold">Integration with Compute</emphasis></td>
<td rowspan="1" colspan="1">Fully integrated with Compute for attaching block volumes and reporting on usage</td>
</tr>
</tbody>
</informaltable>
</para>
</chapter>
@@ -1,100 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch003-obj-store-capabilities">
<title>Object Storage Capabilities</title>
<itemizedlist>
<listitem>
<para>OpenStack provides redundant, scalable object storage using clusters of standardized servers capable of storing petabytes of data</para>
</listitem>
<listitem>
<para>Object Storage is not a traditional file system, but rather a distributed storage system for static data such as virtual machine images, photo storage, email storage, backups and archives. Having no central "brain" or master point of control provides greater scalability, redundancy and durability.</para>
</listitem>
<listitem>
<para>Objects and files are written to multiple disk drives spread throughout servers in the data center, with the OpenStack software responsible for ensuring data replication and integrity across the cluster.</para>
</listitem>
<listitem>
<para>Storage clusters scale horizontally simply by adding new servers. Should a server or hard drive fail, OpenStack replicates its content from other active nodes to new locations in the cluster. Because OpenStack uses software logic to ensure data replication and distribution across different devices, inexpensive commodity hard drives and servers can be used in lieu of more expensive equipment.</para>
</listitem>
</itemizedlist>
<para><guilabel>Swift Characteristics</guilabel></para>
<para>The key characteristics of Swift include:</para>
<itemizedlist>
<listitem>
<para>All objects stored in Swift have a URL</para>
</listitem>
<listitem>
<para>All objects stored are replicated 3x in as-unique-as-possible zones, which can be defined as a group of drives, a node, a rack, and so on</para>
</listitem>
<listitem>
<para>All objects have their own metadata</para>
</listitem>
<listitem>
<para>Developers interact with the object storage system through a RESTful HTTP API</para>
</listitem>
<listitem>
<para>Object data can be located anywhere in the cluster</para>
</listitem>
<listitem>
<para>The cluster scales by adding additional nodes without sacrificing performance, which allows a more cost-effective linear storage expansion than fork-lift upgrades</para>
</listitem>
<listitem>
<para>Data doesn’t have to be migrated to an entirely new storage system</para>
</listitem>
<listitem>
<para>New nodes can be added to the cluster without downtime</para>
</listitem>
<listitem>
<para>Failed nodes and disks can be swapped out with no downtime</para>
</listitem>
<listitem>
<para>Runs on industry-standard hardware, such as Dell, HP, and Supermicro</para>
</listitem>
</itemizedlist>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject><imageobject><imagedata fileref="figures/image39.png"/></imageobject></mediaobject>
</figure>
<para>Developers can either write directly to the Swift API or use one of the many client libraries that exist for all popular programming languages, such as Java, Python, Ruby and C#. Amazon S3 and RackSpace Cloud Files users should feel very familiar with Swift. For users who have not used an object storage system before, it will require a different approach and mindset than using a traditional filesystem.</para>
</chapter>
@@ -1,295 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch004-swift-building-blocks">
<title>Building Blocks of Swift</title>
<para>The components that enable Swift to deliver high availability, high durability, and high concurrency are:</para>
<itemizedlist>
<listitem>
<para><emphasis role="bold">Proxy Servers:</emphasis> Handle all incoming API requests.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Rings:</emphasis> Map logical names of data to locations on particular disks.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Zones:</emphasis> Each Zone isolates data from other Zones. A failure in one Zone does not impact the rest of the cluster because data is replicated across the Zones.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Accounts &amp; Containers:</emphasis> Each Account and Container is an individual database that is distributed across the cluster. An Account database contains the list of Containers in that Account. A Container database contains the list of Objects in that Container.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Objects:</emphasis> The data itself.</para>
</listitem>
<listitem>
<para><emphasis role="bold">Partitions:</emphasis> A Partition stores Objects, Account databases, and Container databases. It is an intermediate 'bucket' that helps manage locations where data lives in the cluster.</para>
</listitem>
</itemizedlist>
<figure>
<title>Building Blocks</title>
<mediaobject><imageobject><imagedata fileref="figures/image40.png"/></imageobject></mediaobject>
</figure>
<para><guilabel>Proxy Servers</guilabel></para>
<para>The Proxy Servers are the public face of Swift and handle all incoming API requests. Once a Proxy Server receives a request, it determines the storage node based on the URL of the object, such as <literal>https://swift.example.com/v1/account/container/object</literal>. The Proxy Servers also coordinate responses, handle failures, and coordinate timestamps.</para>
<para>Proxy servers use a shared-nothing architecture and can be scaled as needed based on projected workloads. A minimum of two Proxy Servers should be deployed for redundancy. Should one proxy server fail, the others take over.</para>
<para><guilabel>The Ring</guilabel></para>
<para>A ring represents a mapping between the names of entities stored on disk and their physical location. There are separate rings for accounts, containers, and objects. When other components need to perform any operation on an object, container, or account, they need to interact with the appropriate ring to determine its location in the cluster.</para>
<para>The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the ring is replicated, by default, 3 times across the cluster, and the locations for a partition are stored in the mapping maintained by the ring. The ring is also responsible for determining which devices are used for handoff in failure scenarios.</para>
<para>Data can be isolated with the concept of zones in the ring. Each replica of a partition is guaranteed to reside in a different zone. A zone could represent a drive, a server, a cabinet, a switch, or even a data center.</para>
<para>The partitions of the ring are equally divided among all the devices in the OpenStack Object Storage installation. When partitions need to be moved around, such as when a device is added to the cluster, the ring ensures that a minimum number of partitions are moved at a time, and only one replica of a partition is moved at a time.</para>
<para>Weights can be used to balance the distribution of partitions on drives across the cluster. This can be useful, for example, when different sized drives are used in a cluster.</para>
<para>The ring is used by the Proxy server and several background processes (like replication).</para>
<para>The Ring maps Partitions to physical locations on disk. When other components need to perform any operation on an object, container, or account, they need to interact with the Ring to determine its location in the cluster.</para>
<para>The Ring maintains this mapping using zones, devices, partitions, and replicas. Each partition in the Ring is replicated three times by default across the cluster, and the locations for a partition are stored in the mapping maintained by the Ring. The Ring is also responsible for determining which devices are used for handoff should a failure occur.</para>
<figure>
<title>The Lord of the <emphasis role="bold">Ring</emphasis>s</title>
<mediaobject><imageobject><imagedata fileref="figures/image41.png"/></imageobject></mediaobject>
</figure>
<para>The Ring maps partitions to physical locations on disk.</para>
<para>The rings determine where data should reside in the cluster. There is a separate ring for account databases, container databases, and individual objects, but each ring works in the same way. These rings are externally managed, in that the server processes themselves do not modify the rings; they are instead given new rings modified by other tools.</para>
<para>The ring uses a configurable number of bits from a path’s MD5 hash as a partition index that designates a device. The number of bits kept from the hash is known as the partition power, and 2 to the partition power indicates the partition count. Partitioning the full MD5 hash ring allows other parts of the cluster to work in batches of items at once, which ends up either more efficient or at least less complex than working with each item separately or the entire cluster all at once.</para>
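<para>As a rough illustration of the partition-power arithmetic described above (a real Swift cluster also mixes a cluster-specific hash suffix into the path before hashing, so the values below are examples only):</para>
<programlisting language="python">import hashlib

partition_power = 20
partition_count = 2 ** partition_power      # 1,048,576 partitions
part_shift = 32 - partition_power           # bits dropped from the hash

path = '/account/container/object'
# Use the top four bytes of the MD5 hash as the partition index.
top_four_bytes = hashlib.md5(path.encode('utf-8')).digest()[:4]
partition = int.from_bytes(top_four_bytes, 'big') >> part_shift
print(partition_count, partition)</programlisting>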
<para>Another configurable value is the replica count, which indicates how many of the partition->device assignments comprise a single ring. For a given partition number, each replica’s device will not be in the same zone as any other replica's device. Zones can be used to group devices based on physical locations, power separations, network separations, or any other attribute that would lessen multiple replicas being unavailable at the same time.</para>
<para><guilabel>Zones: Failure Boundaries</guilabel></para>
<para>Swift allows zones to be configured to isolate failure boundaries. Each replica of the data resides in a separate zone, if possible. At the smallest level, a zone could be a single drive or a grouping of a few drives. If there were five object storage servers, then each server would represent its own zone. Larger deployments would have an entire rack (or multiple racks) of object servers, each representing a zone. The goal of zones is to allow the cluster to tolerate significant outages of storage servers without losing all replicas of the data.</para>
<para>As we learned earlier, everything in Swift is stored, by default, three times. Swift will place each replica "as-uniquely-as-possible" to ensure both high availability and high durability. This means that when choosing a replica location, Swift will choose a server in an unused zone before an unused server in a zone that already has a replica of the data.</para>
<figure>
<title>Zones</title>
<mediaobject><imageobject><imagedata fileref="figures/image42.png"/></imageobject></mediaobject>
</figure>
<para>When a disk fails, replica data is automatically distributed to the other zones to ensure there are three copies of the data.</para>
<para><guilabel>Accounts &amp; Containers</guilabel></para>
<para>Each account and container is an individual SQLite database that is distributed across the cluster. An account database contains the list of containers in that account. A container database contains the list of objects in that container.</para>
<figure>
<title>Accounts and Containers</title>
<mediaobject><imageobject><imagedata fileref="figures/image43.png"/></imageobject></mediaobject>
</figure>
<para>To keep track of object data location, each account in the system has a database that references all its containers, and each container database references each object.</para>
<para><guilabel>Partitions</guilabel></para>
<para>A Partition is a collection of stored data, including Account databases, Container databases, and objects. Partitions are core to the replication system.</para>
<para>Think of a Partition as a bin moving throughout a fulfillment center warehouse. Individual orders get thrown into the bin. The system treats that bin as a cohesive entity as it moves throughout the system. A bin full of things is easier to deal with than lots of little things. It makes for fewer moving parts throughout the system.</para>
<para>The system replicators and object uploads/downloads operate on Partitions. As the system scales up, behavior continues to be predictable because the number of Partitions is a fixed number.</para>
<para>The implementation of a Partition is conceptually simple: a partition is just a directory sitting on a disk with a corresponding hash table of what it contains.</para>
<figure>
<title>Partitions</title>
<mediaobject><imageobject><imagedata fileref="figures/image44.png"/></imageobject></mediaobject>
</figure>
<para>Swift partitions contain all data in the system.</para>
<para><guilabel>Replication</guilabel></para>
<para>In order to ensure that there are three copies of the data everywhere, replicators continuously examine each Partition. For each local Partition, the replicator compares it against the replicated copies in the other Zones to see if there are any differences.</para>
<para>How does the replicator know if replication needs to take place? It does this by examining hashes. A hash file is created for each Partition, which contains hashes of each directory in the Partition. For a given Partition, the hash files for each of the Partition's copies are compared. If the hashes are different, then it is time to replicate, and the directory that needs to be replicated is copied over.</para>
<para>This is where the Partitions come in handy. With fewer "things" in the system, larger chunks of data are transferred around (rather than lots of little TCP connections, which is inefficient) and there is a consistent number of hashes to compare.</para>
<para>The cluster has eventually consistent behavior where the newest data wins.</para>
<figure>
<title>Replication</title>
<mediaobject><imageobject><imagedata fileref="figures/image45.png"/></imageobject></mediaobject>
</figure>
<para>If a zone goes down, one of the nodes containing a replica notices and proactively copies data to a handoff location.</para>
<para>To describe how these pieces all come together, let's walk through a few scenarios and introduce the components.</para>
<para><guilabel>Bird's-eye View</guilabel></para>
<para><emphasis role="bold">Upload</emphasis></para>
<para>A client uses the REST API to make an HTTP request to PUT an object into an existing Container. The cluster receives the request. First, the system must figure out where the data is going to go. To do this, the Account name, Container name, and Object name are all used to determine the Partition where this object should live.</para>
<para>Then a lookup in the Ring figures out which storage nodes contain the Partitions in question.</para>
<para>The data is then sent to each storage node where it is placed in the appropriate Partition. A quorum is required: at least two of the three writes must be successful before the client is notified that the upload was successful.</para>
<para>Next, the Container database is updated asynchronously to reflect that there is a new object in it.</para>
<figure>
<title>When End-User uses Swift</title>
<mediaobject><imageobject><imagedata fileref="figures/image46.png"/></imageobject></mediaobject>
</figure>
<para><emphasis role="bold">Download</emphasis></para>
<para>A request comes in for an Account/Container/object. Using the same consistent hashing, the Partition name is generated. A lookup in the Ring reveals which storage nodes contain that Partition. A request is made to one of the storage nodes to fetch the object; if that fails, requests are made to the other nodes.</para>
</chapter>
@@ -1,146 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch005-the-ring">
<title>Ring Builder</title>
<para>The rings are built and managed manually by a utility called the ring-builder. The ring-builder assigns partitions to devices and writes an optimized Python structure to a gzipped, serialized file on disk for shipping out to the servers. The server processes just check the modification time of the file occasionally and reload their in-memory copies of the ring structure as needed. Because of how the ring-builder manages changes to the ring, using a slightly older ring usually just means one of the three replicas for a subset of the partitions will be incorrect, which can be easily worked around.</para>
<para>The ring-builder also keeps its own builder file with the ring information and additional data required to build future rings. It is very important to keep multiple backup copies of these builder files. One option is to copy the builder files out to every server while copying the ring files themselves. Another is to upload the builder files into the cluster itself. Complete loss of a builder file means creating a new ring from scratch; nearly all partitions would end up assigned to different devices, and therefore nearly all stored data would have to be replicated to new locations. So, recovery from a builder file loss is possible, but data will definitely be unreachable for an extended time.</para>
<para><guilabel>Ring Data Structure</guilabel></para>
<para>The ring data structure consists of three top-level fields: a list of devices in the cluster, a list of lists of device ids indicating partition-to-device assignments, and an integer indicating the number of bits to shift an MD5 hash to calculate the partition for the hash.</para>
<para><guilabel>Partition Assignment List</guilabel></para>
<para>This is a list of array('H') of device ids. The outermost list contains an array('H') for each replica. Each array('H') has a length equal to the partition count for the ring. Each integer in the array('H') is an index into the above list of devices. The partition list is known internally to the Ring class as _replica2part2dev_id.</para>
<para>So, to create a list of device dictionaries assigned to a partition, the Python code would look like: devices = [self.devs[part2dev_id[partition]] for part2dev_id in self._replica2part2dev_id]</para>
<para>That code is a little simplistic, as it does not account for the removal of duplicate devices. If a ring has more replicas than devices, then a partition will have more than one replica on one device; that's simply the pigeonhole principle at work.</para>
<para>array('H') is used for memory conservation, as there may be millions of partitions.</para>
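<para>As a minimal, self-contained illustration of that lookup (using made-up device entries and assignments rather than a real builder file), the replica-to-partition-to-device structure can be read like this:</para>
<programlisting language="python">from array import array

# Hypothetical example data: 3 devices, a 4-partition ring, 2 replicas.
devs = [{'id': 0, 'zone': 1}, {'id': 1, 'zone': 2}, {'id': 2, 'zone': 3}]
replica2part2dev_id = [array('H', [0, 1, 2, 0]),   # replica 0: partition -> device id
                       array('H', [1, 2, 0, 1])]   # replica 1: partition -> device id

partition = 3
devices = [devs[part2dev_id[partition]] for part2dev_id in replica2part2dev_id]
print(devices)  # the devices holding the replicas of partition 3</programlisting>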
<para><guilabel>Fractional Replicas</guilabel></para>
<para>A ring is not restricted to having an integer number of replicas. In order to support the gradual changing of replica counts, the ring is able to have a real number of replicas.</para>
<para>When the number of replicas is not an integer, the last element of _replica2part2dev_id will have a length that is less than the partition count for the ring. This means that some partitions will have more replicas than others. For example, if a ring has 3.25 replicas, then 25% of its partitions will have four replicas, while the remaining 75% will have just three.</para>
<para><guilabel>Partition Shift Value</guilabel></para>
<para>The partition shift value is known internally to the Ring class as _part_shift. This value is used to shift an MD5 hash to calculate the partition on which the data for that hash should reside. Only the top four bytes of the hash are used in this process. For example, to compute the partition for the path /account/container/object the Python code might look like: partition = unpack_from('>I', md5('/account/container/object').digest())[0] >> self._part_shift</para>
<para>For a ring generated with part_power P, the partition shift value is 32 - P.</para>
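<para>A self-contained version of that expression, with the imports it needs, is sketched below; the hash-suffix handling that a real Swift cluster applies to the path is omitted, so treat it only as an illustration of the shift:</para>
<programlisting language="python">from hashlib import md5
from struct import unpack_from

part_power = 18                 # example ring with 2**18 partitions
part_shift = 32 - part_power    # the _part_shift value for that ring

path = '/account/container/object'
partition = unpack_from('>I', md5(path.encode()).digest())[0] >> part_shift
assert partition in range(2 ** part_power)
print(partition)</programlisting>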
|
||||
<para><guilabel>Building the Ring</guilabel></para>
|
||||
<para>The initial building of the ring first calculates the
|
||||
number of partitions that should ideally be assigned to
|
||||
each device based the device’s weight. For example, given
|
||||
a partition power of 20, the ring will have 1,048,576
|
||||
partitions. If there are 1,000 devices of equal weight
|
||||
they will each desire 1,048.576 partitions. The devices
|
||||
are then sorted by the number of partitions they desire
|
||||
and kept in order throughout the initialization
|
||||
process.</para>
|
||||
<para>Note: each device is also assigned a random tiebreaker
|
||||
value that is used when two devices desire the same number
|
||||
of partitions. This tiebreaker is not stored on disk
|
||||
anywhere, and so two different rings created with the same
|
||||
parameters will have different partition assignments. For
|
||||
repeatable partition assignments, RingBuilder.rebalance()
|
||||
takes an optional seed value that will be used to seed
|
||||
Python’s pseudo-random number generator.</para>
|
||||
<para>Then, the ring builder assigns each replica of each
|
||||
partition to the device that desires the most partitions
|
||||
at that point while keeping it as far away as possible
|
||||
from other replicas. The ring builder prefers to assign a
|
||||
replica to a device in a regions that has no replicas
|
||||
already; should there be no such region available, the
|
||||
ring builder will try to find a device in a different
|
||||
zone; if not possible, it will look on a different server;
|
||||
failing that, it will just look for a device that has no
|
||||
replicas; finally, if all other options are exhausted, the
|
||||
ring builder will assign the replica to the device that
|
||||
has the fewest replicas already assigned. Note that
|
||||
assignment of multiple replicas to one device will only
|
||||
happen if the ring has fewer devices than it has
|
||||
replicas.</para>
|
||||
<para>When building a new ring based on an old ring, the
|
||||
desired number of partitions each device wants is
|
||||
recalculated. Next the partitions to be reassigned are
|
||||
gathered up. Any removed devices have all their assigned
|
||||
partitions unassigned and added to the gathered list. Any
|
||||
partition replicas that (due to the addition of new
|
||||
devices) can be spread out for better durability are
|
||||
unassigned and added to the gathered list. Any devices
|
||||
that have more partitions than they now desire have random
|
||||
partitions unassigned from them and added to the gathered
|
||||
list. Lastly, the gathered partitions are then reassigned
|
||||
to devices using a similar method as in the initial
|
||||
assignment described above.</para>
<para>Whenever a partition has a replica reassigned, the time
of the reassignment is recorded. This is taken into
account when gathering partitions to reassign so that no
partition is moved twice in a configurable amount of time.
This configurable amount of time is known internally to
the RingBuilder class as min_part_hours. This restriction
is ignored for replicas of partitions on devices that have
been removed, as removing a device only happens on device
failure and there’s no choice but to make a
reassignment.</para>
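<para>The effect of min_part_hours can be pictured with a small check
like the following; the function and field names are illustrative,
not the actual RingBuilder code:</para>
<programlisting language="python">import time

MIN_PART_HOURS = 1

def can_move(last_moved_epoch, device_removed=False):
    # A partition replica may be gathered for reassignment only if its
    # last move is old enough, unless its device was removed outright.
    if device_removed:
        return True
    elapsed_hours = (time.time() - last_moved_epoch) / 3600.0
    return elapsed_hours >= MIN_PART_HOURS</programlisting>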
<para>The above processes don’t always perfectly rebalance a
ring due to the random nature of gathering partitions for
reassignment. To help reach a more balanced ring, the
rebalance process is repeated until near perfect (less
than 1% off) or when the balance doesn’t improve by at
least 1% (indicating we probably can’t get perfect balance
due to wildly imbalanced zones or too many partitions
recently moved).</para>
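<para>That stopping rule amounts to a loop like the sketch below, where
rebalance_once() and current_balance() stand in for the real
RingBuilder internals:</para>
<programlisting language="python">def rebalance_until_settled(rebalance_once, current_balance):
    # current_balance() returns how far, in percent, the worst device is
    # from its ideal number of partitions; 0 means a perfectly balanced ring.
    balance = current_balance()
    while balance > 1:
        rebalance_once()
        new_balance = current_balance()
        improved_enough = (balance - new_balance) >= 1
        balance = new_balance
        if not improved_enough:
            # Improvement has stalled; further passes are unlikely to help.
            break
    return balance</programlisting>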
</chapter>
@ -1,93 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE chapter [
<!ENTITY % openstack SYSTEM "../common/entities/openstack.ent">
%openstack;
]>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch007-cluster-architecture">
<title>Cluster architecture</title>
<para><guilabel>Access Tier</guilabel></para>
<figure>
<title>Object Storage cluster architecture</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image47.png"/>
</imageobject>
</mediaobject>
</figure>
<para>Large-scale deployments segment off an "Access Tier".
This tier is the “Grand Central” of the Object Storage
system. It fields incoming API requests from clients and
moves data in and out of the system. This tier is composed
of front-end load balancers, SSL terminators, and
authentication services, and it runs the (distributed)
brain of the Object Storage system: the proxy server
processes.</para>
<para>Having the access servers in their own tier enables
read/write access to be scaled out independently of
storage capacity. For example, if the cluster is on the
public Internet, requires SSL termination, and has high
demand for data access, many access servers can be
provisioned. However, if the cluster is on a private
network and is used primarily for archival purposes,
fewer access servers are needed.</para>
<para>A load balancer can be incorporated into the access tier,
because this is an HTTP-addressable storage service.</para>
<para>Typically, this tier comprises a collection of 1U
servers. These machines use a moderate amount of RAM and
are network I/O intensive. It is wise to provision them with
two high-throughput (10GbE) interfaces, because these systems
field each incoming API request. One interface is
used for 'front-end' incoming requests and the other for
'back-end' access to the Object Storage nodes to put and
fetch data.</para>
<para><guilabel>Factors to consider</guilabel></para>
<para>For most publicly facing deployments, as well as
private deployments available across a wide-reaching
corporate network, SSL is used to encrypt traffic
to the client. SSL adds significant processing load when
establishing sessions with clients, which means more
capacity must be provisioned in the access layer. SSL may
not be required for private deployments on trusted
networks.</para>
<para><guilabel>Storage Nodes</guilabel></para>
<figure>
<title>Object Storage (Swift)</title>
<mediaobject>
<imageobject>
<imagedata fileref="figures/image48.png"/>
</imageobject>
</mediaobject>
</figure>
<para>The next component is the storage servers themselves.
Generally, most configurations should provide each of the
five zones with an equal amount of storage capacity.
Storage nodes use a reasonable amount of memory and CPU.
Metadata needs to be readily available to quickly return
objects. The object stores run services not only to field
incoming requests from the Access Tier, but also to run
replicators, auditors, and reapers. Object stores can be
provisioned with a single gigabit or a 10-gigabit network
interface depending on the expected workload and desired
performance.</para>
<para>Currently, a 2 TB or 3 TB SATA disk delivers
good performance for the price. Desktop-grade drives can
be used where there are responsive remote hands in the
datacenter, and enterprise-grade drives can be used where
this is not the case.</para>
<para><guilabel>Factors to consider</guilabel></para>
<para>Desired I/O performance for single-threaded requests
should be kept in mind. This system does not use RAID,
so each request for an object is handled by a single
disk. Disk performance impacts single-threaded
response rates.</para>
<para>To achieve higher apparent throughput, the object
storage system is designed with concurrent
uploads/downloads in mind. The network I/O capacity
(1GbE, bonded 1GbE pair, or 10GbE) should match your
desired concurrent throughput needs for reads and
writes.</para>
</chapter>
@ -1,58 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch008-account-reaper">
<title>Account Reaper</title>
<para>The Account Reaper removes data from deleted accounts in the
background.</para>
<para>An account is marked for deletion by a reseller issuing a
DELETE request on the account’s storage URL. This simply puts
the value DELETED into the status column of the account_stat
table in the account database (and replicas), indicating the
data for the account should be deleted later.</para>
<para>There is normally no set retention time and no undelete; it
is assumed the reseller will implement such features and only
call DELETE on the account once it is truly desired that the
account’s data be removed. However, in order to protect the
Swift cluster accounts from an improper or mistaken delete
request, you can set a delay_reaping value in the
[account-reaper] section of the account-server.conf file to
delay the actual deletion of data. At this time, there is no
utility to undelete an account; one would have to update the
account database replicas directly, setting the status column
to an empty string and updating the put_timestamp to be greater
than the delete_timestamp. (On the TODO list is writing a
utility to perform this task, preferably through a REST
call.)</para>
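<para>For example, the following account-server.conf excerpt delays
reaping by two days; the value is illustrative and should be sized
for your own deployment:</para>
<programlisting language="ini">[account-reaper]
# Wait two days (in seconds) after an account is marked DELETED
# before actually removing its data.
delay_reaping = 172800</programlisting>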
<para>The account reaper runs on each account server and scans the
server occasionally for account databases marked for deletion.
It will only trigger on accounts for which that server is the
primary node, so that multiple account servers aren’t all trying
to do the same work at the same time. Using multiple servers
to delete one account might improve deletion speed, but
requires coordination so they aren’t duplicating effort. Speed
really isn’t as much of a concern with data deletion, and large
accounts aren’t deleted that often.</para>
<para>The deletion process for an account itself is pretty
straightforward. For each container in the account, each
object is deleted and then the container is deleted. Any
deletion requests that fail won’t stop the overall process,
but will cause the overall process to fail eventually (for
example, if an object delete times out, the container won’t be
able to be deleted later and therefore the account won’t be
deleted either). The overall process continues even on a
failure so that it doesn’t get hung up reclaiming cluster
space because of one troublesome spot. The account reaper will
keep trying to delete an account until it eventually becomes
empty, at which point the database reclaim process within the
db_replicator will eventually remove the database
files.</para>
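<para>In outline, a per-account pass behaves like the sketch below;
the function names are placeholders standing in for the real reaper
internals and the client calls it makes:</para>
<programlisting language="python">def reap_account(list_containers, list_objects, delete_object,
                 delete_container, delete_account):
    # Delete every object in every container, then the containers, then
    # the account. Failures are counted rather than fatal, so one
    # troublesome object cannot stall space reclamation forever.
    failures = 0
    for container in list_containers():
        for obj in list_objects(container):
            try:
                delete_object(container, obj)
            except Exception:
                failures += 1
        try:
            delete_container(container)
        except Exception:
            failures += 1
    if failures == 0:
        delete_account()
    return failures  # a later pass retries whatever is left</programlisting>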
<para>Sometimes a persistent error state can prevent some object
or container from being deleted. If this happens, you will see
a message such as “Account <name> has not been reaped
since <date>” in the log. You can control when this is
logged with the reap_warn_after value in the [account-reaper]
section of the account-server.conf file. By default this is 30
days.</para>
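<para>A matching account-server.conf excerpt, again with an
illustrative value (the option is expressed in seconds):</para>
<programlisting language="ini">[account-reaper]
# Log a warning if an account still has not been reaped after
# 30 days (the default).
reap_warn_after = 2592000</programlisting>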
</chapter>
@ -1,101 +0,0 @@
<?xml version="1.0" encoding="utf-8"?>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="module003-ch009-replication">
<title>Replication</title>
<para>Because each replica in Swift functions independently, and
clients generally require only a simple majority of nodes
responding to consider an operation successful, transient
failures like network partitions can quickly cause replicas to
diverge. These differences are eventually reconciled by
asynchronous, peer-to-peer replicator processes. The
replicator processes traverse their local filesystems,
concurrently performing operations in a manner that balances
load across physical disks.</para>
<para>Replication uses a push model, with records and files
generally only being copied from local to remote replicas.
This is important because data on the node may not belong
there (as in the case of handoffs and ring changes), and a
replicator can’t know what data exists elsewhere in the
cluster that it should pull in. It’s the duty of any node that
contains data to ensure that data gets to where it belongs.
Replica placement is handled by the ring.</para>
<para>Every deleted record or file in the system is marked by a
tombstone, so that deletions can be replicated alongside
creations. The replication process cleans up tombstones after
a time period known as the consistency window. The consistency
window encompasses replication duration and how long a transient
failure can remove a node from the cluster. Tombstone cleanup
must be tied to replication to reach replica
convergence.</para>
<para>If a replicator detects that a remote drive has failed, the
replicator uses the get_more_nodes interface of the ring to
choose an alternate node with which to synchronize. The
replicator can maintain desired levels of replication in the
face of disk failures, though some replicas may not be in an
immediately usable location. Note that the replicator doesn’t
maintain desired levels of replication when other failures
occur, such as entire node failures, because most failures are
transient.</para>
<para>Replication is an area of active development, and likely
rife with potential improvements to speed and
accuracy.</para>
<para>There are two major classes of replicator: the db
replicator, which replicates accounts and containers, and the
object replicator, which replicates object data.</para>
<para><guilabel>DB Replication</guilabel></para>
<para>The first step performed by db replication is a low-cost
hash comparison to determine whether two replicas already
match. Under normal operation, this check is able to
verify that most databases in the system are already
synchronized very quickly. If the hashes differ, the
replicator brings the databases in sync by sharing records
added since the last sync point.</para>
<para>This sync point is a high water mark noting the last
record at which two databases were known to be in sync,
and is stored in each database as a tuple of the remote
database id and record id. Database ids are unique amongst
all replicas of the database, and record ids are
monotonically increasing integers. After all new records
have been pushed to the remote database, the entire sync
table of the local database is pushed, so the remote
database can guarantee that it is in sync with everything
with which the local database has previously
synchronized.</para>
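<para>The incremental part of that exchange can be sketched as
follows; the names and record format are illustrative placeholders
rather than the actual db replicator code:</para>
<programlisting language="python">def push_new_records(local_records, send_records, sync_points, remote_id):
    # local_records: list of (record_id, row) pairs, record_id increasing.
    # sync_points: dict mapping a remote database id to the highest
    # record_id that remote is already known to have.
    last_synced = sync_points.get(remote_id, -1)
    new_records = [(rid, row) for rid, row in local_records if rid > last_synced]
    if new_records:
        send_records(new_records)
        # Record the new high water mark for this peer.
        sync_points[remote_id] = new_records[-1][0]
    return len(new_records)</programlisting>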
<para>If a replica is found to be missing entirely, the whole
local database file is transmitted to the peer using
rsync(1) and vested with a new unique id.</para>
<para>In practice, DB replication can process hundreds of
databases per concurrency setting per second (up to the
number of available CPUs or disks) and is bound by the
number of DB transactions that must be performed.</para>
<para><guilabel>Object Replication</guilabel></para>
<para>The initial implementation of object replication simply
performed an rsync to push data from a local partition to
all remote servers it was expected to exist on. While this
performed adequately at small scale, replication times
skyrocketed once directory structures could no longer be
held in RAM. We now use a modification of this scheme in
which a hash of the contents for each suffix directory is
saved to a per-partition hashes file. The hash for a
suffix directory is invalidated when the contents of that
suffix directory are modified.</para>
<para>The object replication process reads in these hash
files, calculating any invalidated hashes. It then
transmits the hashes to each remote server that should
hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After
pushing files to the remote server, the replication
process notifies it to recalculate hashes for the rsynced
suffix directories.</para>
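<para>The decision about what to rsync boils down to a hash
comparison like the one below; this is a simplified sketch, not the
actual object replicator:</para>
<programlisting language="python">def suffixes_to_sync(local_hashes, remote_hashes):
    # Both arguments map a suffix directory name to the hash of its
    # contents. Only suffixes whose hashes differ (or that the remote
    # is missing entirely) need to be rsynced.
    return sorted(suffix for suffix, digest in local_hashes.items()
                  if remote_hashes.get(suffix) != digest)

print(suffixes_to_sync({'a83': '1f0', 'c51': '9d2'},
                       {'a83': '1f0', 'c51': '000'}))  # ['c51']</programlisting>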
<para>Performance of object replication is generally bound by
the number of uncached directories it has to traverse,
usually as a result of invalidated suffix directory
hashes. Using write volume and partition counts from our
running systems, it was designed so that around 2% of the
hash space on a normal node will be invalidated per day,
which has experimentally given us acceptable replication
speeds.</para>
</chapter>