Merge "[arch-design] Convert multi-site sections"

This commit is contained in:
Jenkins 2015-11-17 08:09:22 +00:00 committed by Gerrit Code Review
commit 55b172c189
2 changed files with 312 additions and 0 deletions


@@ -2,3 +2,155 @@
Operational considerations
==========================
Multi-site OpenStack cloud deployment using regions requires that the
service catalog contain per-region entries for each service deployed
other than the Identity service. Most off-the-shelf OpenStack deployment
tools have limited support for defining multiple regions in this
fashion.
Deployers should be aware of this and provide the appropriate
customization of the service catalog for their site, either manually or
by customizing the deployment tools in use.
.. note::

   As of the Kilo release, documentation for implementing this feature
   is in progress. See this bug for more information:
   https://bugs.launchpad.net/openstack-manuals/+bug/1340509.
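To make the per-region customization concrete, the following is a
minimal sketch that registers a public endpoint for the same service in
two regions through the Identity v3 REST API. The endpoint URLs, admin
token, and service ID are placeholder values, not part of any real
deployment.

.. code-block:: python

   import requests

   # Placeholder values; substitute your Identity endpoint, a valid
   # admin token, and the ID of the service being registered.
   KEYSTONE = "http://identity.example.com:5000/v3"
   TOKEN = "ADMIN_TOKEN"
   SERVICE_ID = "SERVICE_UUID"

   # One public endpoint entry per region for the same service.
   regions = {
       "RegionOne": "http://site1.example.com:8774/v2.1",
       "RegionTwo": "http://site2.example.com:8774/v2.1",
   }

   for region_id, url in regions.items():
       body = {"endpoint": {
           "interface": "public",
           "region_id": region_id,
           "service_id": SERVICE_ID,
           "url": url,
       }}
       resp = requests.post(KEYSTONE + "/endpoints", json=body,
                            headers={"X-Auth-Token": TOKEN})
       resp.raise_for_status()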
Licensing
~~~~~~~~~
Multi-site OpenStack deployments present additional licensing
considerations over and above regular OpenStack clouds, particularly
where site licenses are in use to provide cost efficient access to
software licenses. The licensing for host operating systems, guest
operating systems, OpenStack distributions (if applicable),
software-defined infrastructure including network controllers and
storage systems, and even individual applications needs to be evaluated.
Topics to consider include:
* The definition of what constitutes a site in the relevant licenses,
  as the term does not necessarily denote a geographic or otherwise
  physically isolated location.
* Differentiations between "hot" (active) and "cold" (inactive) sites,
  where significant savings may be made when one site is a cold standby
  for disaster recovery purposes only.
* Certain locations might require local vendors to provide support and
  services for each site, which may vary depending on the licensing
  agreement in place.
Logging and monitoring
~~~~~~~~~~~~~~~~~~~~~~
Logging and monitoring do not significantly differ for a multi-site
OpenStack cloud. The tools described in the `Logging and monitoring
chapter <http://docs.openstack.org/openstack-ops/content/logging_monitoring.html>`__
of the Operations Guide remain applicable. Logging and monitoring can be
provided on a per-site basis, in a common centralized location, or both.
When attempting to deploy logging and monitoring facilities to a
centralized location, care must be taken with the load placed on the
inter-site networking links.
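As an illustration of combining both approaches, the sketch below uses
Python's standard logging module to keep a per-site log file while also
forwarding records to an assumed central syslog collector; the hostnames
and paths are placeholders.

.. code-block:: python

   import logging
   import logging.handlers

   log = logging.getLogger("site1.nova")
   log.setLevel(logging.INFO)

   # Local, per-site log file.
   log.addHandler(logging.FileHandler("/var/log/site1-nova.log"))

   # Forward a copy to the central collector over the inter-site link;
   # note that this traffic adds load to that link.
   log.addHandler(logging.handlers.SysLogHandler(
       address=("logs.central.example.com", 514)))

   log.info("compute node nova-1 started")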
Upgrades
~~~~~~~~
In multi-site OpenStack clouds deployed using regions, sites are
independent OpenStack installations which are linked together using
shared centralized services such as OpenStack Identity. At a high level,
the recommended order of operations to upgrade an individual OpenStack
environment is (see the `Upgrades
chapter <http://docs.openstack.org/openstack-ops/content/ops_upgrades-general-steps.html>`__
of the Operations Guide for details):
#. Upgrade the OpenStack Identity service (keystone).
#. Upgrade the OpenStack Image service (glance).
#. Upgrade OpenStack Compute (nova), including networking components.
#. Upgrade OpenStack Block Storage (cinder).
#. Upgrade the OpenStack dashboard (horizon).
The process for upgrading a multi-site environment is not significantly
different:
#. Upgrade the shared OpenStack Identity service (keystone) deployment.
#. Upgrade the OpenStack Image service (glance) at each site.
#. Upgrade OpenStack Compute (nova), including networking components, at
   each site.
#. Upgrade OpenStack Block Storage (cinder) at each site.
#. Upgrade the OpenStack dashboard (horizon) at each site, or in the
   single central location if it is shared.
Compute upgrades within each site can also be performed in a rolling
fashion. Compute controller services (API, Scheduler, and Conductor) can
be upgraded prior to upgrading individual compute nodes. This allows
operations staff to keep a site operational for users of Compute
services while performing an upgrade.
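The ordering above can be captured in a simple driver script. The
following sketch only prints the plan; upgrade_site_service() is a
hypothetical hook that would wrap your actual deployment tooling.

.. code-block:: python

   # Sites and the per-site service upgrade order from the steps above.
   SITES = ["site1", "site2"]
   SERVICE_ORDER = ["glance", "nova", "cinder", "horizon"]

   def upgrade_site_service(site, service):
       # Placeholder: call your configuration management tool here.
       print(f"upgrading {service} at {site}")

   # Shared Identity service first, once for the whole deployment.
   upgrade_site_service("central", "keystone")

   # Then the remaining services, in order, at each site.
   for service in SERVICE_ORDER:
       for site in SITES:
           upgrade_site_service(site, service)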
Quota management
~~~~~~~~~~~~~~~~
Quotas are used to set operational limits to prevent system capacities
from being exhausted without notification. They are currently enforced
at the tenant (or project) level rather than at the user level.
Quotas are defined on a per-region basis. Operators can define identical
quotas for tenants in each region of the cloud to provide a consistent
experience, or even create a process for synchronizing allocated quotas
across regions. It is important to note that only the operational limits
imposed by the quotas will be aligned; consumption of quotas by users
will not be reflected between regions.
For example, given a cloud with two regions, if the operator grants a
user a quota of 25 instances in each region then that user may launch a
total of 50 instances spread across both regions. They may not, however,
launch more than 25 instances in any single region.
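A process for synchronizing allocated quotas across regions can be as
simple as applying the same limits in each region. A minimal sketch,
assuming the openstack command-line client is installed and credentials
are set in the environment; the project name, regions, and limit are
illustrative.

.. code-block:: python

   import subprocess

   PROJECT = "demo"
   REGIONS = ["RegionOne", "RegionTwo"]

   # Apply an identical instance quota in every region.
   for region in REGIONS:
       subprocess.run(
           ["openstack", "--os-region-name", region,
            "quota", "set", "--instances", "25", PROJECT],
           check=True)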
For more information on managing quotas, refer to the `Managing projects
and users
chapter <http://docs.openstack.org/openstack-ops/content/projects_users.html>`__
of the OpenStack Operations Guide.
Policy management
~~~~~~~~~~~~~~~~~
OpenStack provides a default set of Role Based Access Control (RBAC)
policies, defined in a ``policy.json`` file, for each service. Operators
edit these files to customize the policies for their OpenStack
installation. If the application of consistent RBAC policies across
sites is a requirement, then it is necessary to ensure proper
synchronization of the ``policy.json`` files to all installations.
This must be done using system administration tools such as rsync, as
functionality for synchronizing policies across regions is not currently
provided within OpenStack.
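A minimal sketch of such synchronization, assuming SSH access from a
management host and illustrative hostnames and paths:

.. code-block:: python

   import subprocess

   # Push the canonical policy file for one service to every site.
   SITES = ["site1.example.com", "site2.example.com"]
   POLICY = "/etc/nova/policy.json"

   for site in SITES:
       subprocess.run(
           ["rsync", "-av", POLICY, f"root@{site}:{POLICY}"],
           check=True)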
Documentation
~~~~~~~~~~~~~
Users must be able to leverage the cloud infrastructure and provision
new resources in the environment. It is important that user
documentation is accessible and gives users sufficient information to
help them leverage the cloud. As an example, by default OpenStack
schedules instances on a compute node automatically. However, when
multiple regions are available, the end user needs to decide in which
region to schedule the new instance. By default, the dashboard presents
the user with the first region in the configuration. The API and CLI
tools do
not execute commands unless a valid region is specified. It is therefore
important to provide documentation to your users describing the region
layout as well as calling out that quotas are region-specific. If a user
reaches their quota in one region, OpenStack does not automatically
build new instances in another. Documenting specific examples helps
users understand how to operate the cloud, thereby reducing calls and
tickets filed with the help desk.
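For example, user documentation might include a snippet like the
following, which picks the region-specific compute endpoint out of an
Identity v3 service catalog; the abbreviated catalog shown is made up
for illustration.

.. code-block:: python

   # An abbreviated Identity v3 service catalog, one entry per region.
   catalog = [
       {"type": "compute", "endpoints": [
           {"interface": "public", "region": "RegionOne",
            "url": "http://site1.example.com:8774/v2.1"},
           {"interface": "public", "region": "RegionTwo",
            "url": "http://site2.example.com:8774/v2.1"},
       ]},
   ]

   def endpoint_for(catalog, service_type, region):
       # Return the public URL for the given service in the given region.
       for service in catalog:
           if service["type"] != service_type:
               continue
           for ep in service["endpoints"]:
               if ep["region"] == region and ep["interface"] == "public":
                   return ep["url"]
       raise LookupError(f"no {service_type} endpoint in {region}")

   print(endpoint_for(catalog, "compute", "RegionTwo"))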


@@ -2,3 +2,163 @@
Technical considerations
========================
There are many technical considerations to take into account when
designing a multi-site OpenStack implementation. An OpenStack cloud
can be designed in a variety of ways to handle individual application
needs. A multi-site deployment has additional challenges compared to
single site installations and therefore is a more complex solution.
When determining capacity options be sure to take into account not just
the technical issues, but also the economic or operational issues that
might arise from specific decisions.
Inter-site link capacity describes the capabilities of the connectivity
between the different OpenStack sites. This includes parameters such as
bandwidth, latency, whether or not a link is dedicated, and any business
policies applied to the connection. The capability and number of the
links between sites determine what kind of options are available for
deployment. For example, if two sites have a pair of high-bandwidth
links available between them, it may be wise to configure a separate
storage replication network between the two sites to support a single
Swift endpoint and a shared Object Storage capability between them. An
example of this technique, as well as a configuration walk-through, is
available at
http://docs.openstack.org/developer/swift/replication_network.html#dedicated-replication-network.
Another option in this scenario is to build a dedicated set of tenant
private networks across the secondary link, using overlay networks with
a third-party system mapping the site overlays to each other.
The capacity requirements of the links between sites are driven by
application behavior. If the link latency is too high, certain
applications that use a large number of small packets, for example RPC
calls, may encounter issues communicating with each other or operating
properly. Additionally, OpenStack may encounter similar types of issues.
To mitigate this, Identity service call timeouts can be tuned to prevent
issues authenticating against a central Identity service.
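As a rough illustration of such tuning, a client talking to a remote
central Identity service can apply an explicit timeout sized for the
inter-site latency; the endpoint and credentials below are placeholders.

.. code-block:: python

   import requests

   KEYSTONE = "http://identity.central.example.com:5000/v3"
   auth = {"auth": {"identity": {"methods": ["password"], "password": {
       "user": {"name": "demo", "domain": {"id": "default"},
                "password": "secret"}}}}}

   # Explicit (connect, read) timeouts; library defaults may be too
   # aggressive over a high-latency inter-site link.
   resp = requests.post(KEYSTONE + "/auth/tokens", json=auth,
                        timeout=(5, 30))
   print(resp.status_code)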
Another network capacity consideration for a multi-site deployment is
the amount and performance of overlay networks available for tenant
networks. If using shared tenant networks across zones, it is imperative
that an external overlay manager or controller be used to map these
overlays together. It is necessary to ensure that the number of possible
IDs is identical between the zones.
.. note::

   As of the Kilo release, OpenStack Networking was not capable of
   managing tunnel IDs across installations. If one site runs out of
   IDs, but another does not, that tenant's network is unable to reach
   the other site.
Capacity can take other forms as well. The ability for a region to grow
depends on scaling out the number of available compute nodes. This topic
is covered in greater detail in the section for compute-focused
deployments. However, it may be necessary to grow cells in an individual
region, depending on the size of your cluster and the ratio of virtual
machines per hypervisor.
A third form of capacity comes in the multi-region-capable components of
OpenStack. Centralized Object Storage is capable of serving objects
through a single namespace across multiple regions. Since this works by
accessing the object store through the swift proxy, it is possible to
overload the proxies. There are two options available to mitigate this
issue:
* Deploy a large number of swift proxies. The drawback is that the
  proxies are not load-balanced and a large file request could
  continually hit the same proxy.
* Add a caching HTTP proxy and load balancer in front of the swift
  proxies. Since swift objects are returned to the requester via HTTP,
  this load balancer would alleviate the load required on the swift
  proxies.
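A dedicated load balancer is the more robust approach, but as a toy
illustration of spreading requests across several swift proxies from the
client side, consider the sketch below; the proxy URLs and token are
made up.

.. code-block:: python

   import itertools

   import requests

   # Rotate object requests across the available swift proxies.
   PROXIES = itertools.cycle([
       "http://proxy1.example.com:8080",
       "http://proxy2.example.com:8080",
   ])

   def get_object(account, container, obj, token):
       proxy = next(PROXIES)
       return requests.get(
           f"{proxy}/v1/{account}/{container}/{obj}",
           headers={"X-Auth-Token": token})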
Utilization
~~~~~~~~~~~
While constructing a multi-site OpenStack environment is the goal of
this guide, the real test is whether an application can utilize it.
The Identity service is normally the first interface for OpenStack users
and is required for almost all major operations within OpenStack.
Therefore, it is important that you provide users with a single URL for
Identity service authentication, and document the configuration of
regions within the Identity service. Each of the sites defined in your
installation is considered to be a region in Identity nomenclature. This
is important for users, as they must specify the region name when
directing actions to an API endpoint or working in the dashboard.
Load balancing is another common issue with multi-site installations.
While it is still possible to run HAProxy instances with
Load-Balancer-as-a-Service, these are restricted to a specific region. Some
applications can manage this using internal mechanisms. Other
applications may require the implementation of an external system,
including global services load balancers or anycast-advertised DNS.
Depending on the storage model chosen during site design, storage
replication and availability are also a concern for end-users. If an
application can support regions, then it is possible to keep the object
storage system separated by region. In this case, users who want to have
an object available to more than one region need to perform cross-site
replication. However, with a centralized swift proxy, the user may need
to benchmark the replication timing of the Object Storage back end.
Benchmarking allows the operational staff to provide users with an
understanding of the amount of time required for a stored or modified
object to become available to the entire environment.
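A minimal sketch of such a benchmark: write an object through one
region's proxy, then poll another region's proxy until the object
appears. The URLs and token are placeholders.

.. code-block:: python

   import time

   import requests

   SRC = "http://site1.example.com:8080/v1/AUTH_demo/bench/probe"
   DST = "http://site2.example.com:8080/v1/AUTH_demo/bench/probe"
   HEADERS = {"X-Auth-Token": "TOKEN"}

   # Store the probe object at the first site.
   requests.put(SRC, data=b"probe", headers=HEADERS).raise_for_status()

   # Poll the second site until the object becomes visible there.
   start = time.time()
   while requests.head(DST, headers=HEADERS).status_code != 200:
       time.sleep(1)
   print(f"object visible after {time.time() - start:.1f}s")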
Performance
~~~~~~~~~~~
Determining the performance of a multi-site installation involves
considerations that do not come into play in a single-site deployment.
Because a multi-site deployment is distributed, performance may be
affected in certain situations.
Since multi-site systems can be geographically separated, there may be
greater latency or jitter when communicating across regions. This can
especially impact systems like the OpenStack Identity service when
making authentication attempts from regions that do not contain the
centralized Identity implementation. It can also affect applications
which rely on Remote Procedure Call (RPC) for normal operation. An
example of this can be seen in high performance computing workloads.
Storage availability can also be impacted by the architecture of a
multi-site deployment. A centralized Object Storage service requires
more time for an object to be available to instances locally in regions
where the object was not created. Some applications may need to be tuned
to account for this effect. Block Storage does not currently have a
method for replicating data across multiple regions, so applications
that depend on available block storage need to manually cope with this
limitation by creating duplicate block storage entries in each region.
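A sketch of that manual workaround, assuming the openstack client and
illustrative region and volume names; keeping the contents of the copies
in sync remains the application's responsibility.

.. code-block:: python

   import subprocess

   REGIONS = ["RegionOne", "RegionTwo"]

   # Create a same-sized volume in every region.
   for region in REGIONS:
       subprocess.run(
           ["openstack", "--os-region-name", region,
            "volume", "create", "--size", "10", "app-data"],
           check=True)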
OpenStack components
~~~~~~~~~~~~~~~~~~~~
Most OpenStack installations require a bare minimum set of components to
function. These include OpenStack Identity (keystone) for
authentication, OpenStack Compute (nova) for compute, OpenStack Image
service (glance) for image storage, OpenStack Networking (neutron) for
networking, and potentially an object store in the form of OpenStack
Object Storage (swift). Deploying a multi-site installation also demands
extra components in order to coordinate between regions. A centralized
Identity service is necessary to provide the single authentication
point. A centralized dashboard is also recommended to provide a single
login point and a mapping to the API and CLI options available. A
centralized Object Storage service may also be used, but will require
the installation of the swift proxy service.
It may also be helpful to install a few optional services in order to
facilitate certain use cases. For example, installing Designate may
assist in automatically generating DNS domains for each region with an
automatically-populated zone full of resource records for each instance.
This facilitates using DNS as a mechanism for determining which region
will be selected for certain applications.
Another useful tool for managing a multi-site installation is
Orchestration (heat). The Orchestration service allows the use of
templates to define a set of instances to be launched together or for
scaling existing sets. It can also be used to set up matching or
differentiated groupings based on regions. For instance, if an
application requires an equally balanced number of nodes across sites,
the same heat template can be used to cover each site with small
alterations to only the region name.
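A minimal sketch of that pattern, assuming the openstack client with the
Orchestration plugin and an illustrative template name:

.. code-block:: python

   import subprocess

   REGIONS = ["RegionOne", "RegionTwo"]

   # Launch the same template in every region, varying only the region
   # name and the resulting stack name.
   for region in REGIONS:
       subprocess.run(
           ["openstack", "--os-region-name", region,
            "stack", "create", "-t", "app.yaml",
            f"app-{region.lower()}"],
           check=True)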