BP-2083-ROBO-Deployment
SOLUTIONS DOCUMENT
Contents
1. Executive Summary
4. Failure Scenarios
    Failed Hard Drive
    Failed Node
5. Appendix
    Best Practices Checklist
About Nutanix
List of Figures
ROBO Deployment and Operations Best Practices
1. Executive Summary
Nutanix provides a powerful converged compute and storage system that offers one-
click simplicity and high availability for remote and branch office (ROBO) locations. This
document describes best practices for deploying and operating ROBO locations on
Nutanix, including guidance on choosing the right Nutanix cluster for seeding data to
overcome slow remote network links.
The Nutanix platform's self-healing design reduces operational and support costs, such
as unnecessary site visits and overtime. With Nutanix, you can proactively schedule
projects and site visits on a regular cadence rather than working around emergencies.
Prism Central, our end-to-end infrastructure management tool, streamlines remote
cluster operations through one-click upgrades and provides simple orchestration for
multiple cluster upgrades. Nutanix makes deploying and operating ROBO locations as
easy as deploying to the public cloud, but with control and security on your terms.
This document covers the following topics:
• Overview of the Nutanix solution for managing multiple remote offices
• Nutanix cluster selection for remote sites
• Network sizing for management and disaster recovery traffic
• Best practices for remote management with Nutanix Cloud Manager Intelligent
Operations
• How to design for failure at remote sites
Table: Document Version History
Version Number | Published     | Notes
1.0            | June 2017     | Original publication.
1.1            | March 2018    | Added information on one- and two-node clusters.
1.2            | February 2019 | Updated Nutanix overview.
Cluster Selection
Picking the right solution always involves trade-offs. While a remote site isn't a
datacenter, uptime is still a crucial concern. Financial constraints and physical layout also
affect what counts as the best architecture for your environment. Nutanix offers a variety
of clusters for remote locations. You can select single- and dual-socket cluster options
and options that can reduce licensing costs.
Three-Node Clusters
Although a three-node system might cost more up front, it's the gold standard for
ROBO locations. Three-node clusters provide excellent data protection by always
committing two copies of your data, which keeps your data safe even during failures.
Three-node clusters also begin rebuilding your data within 60 seconds of a node going
down. Distributed storage rebuilds the data that was on the downed node without any user
intervention.
A self-healing three-node Nutanix cluster also prevents unnecessary trips to remote
sites. We recommend designing these systems with enough capacity to handle an entire
node going down, which allows the loss of multiple hard drives, one at a time. Because
this solution doesn't rely on RAID, the cluster can lose and heal drives, one after the
next, until available space runs out. For sites with high availability requirements or sites
that are difficult to visit, we recommend additional capacity above the n + 1 node count.
Three-node clusters can scale up to eight nodes with 1 Gbps networking and up to any
scale when using 10 Gbps and higher networking. With our reliability and availability, you
can focus on expanding your business rather than wasting resources on emergency site
visits.
Two-Node Clusters
Two-node clusters offer reliability for smaller sites that must be cost-effective and run
with tight margins. These clusters use a witness only in failure scenarios to coordinate
data rebuilds and automatic upgrades. You can deploy the witness offsite with up to 500 ms
of latency for ROBO locations and up to 200 ms when you use Metro Availability. Multiple
two-node and Metro clusters can share the same witness. Nutanix supports two-node
clusters with Nutanix AHV and VMware ESXi only.
One-Node Clusters
One-node clusters are a perfect fit if you have low availability requirements and need
strong overall management for multiple sites. One-node clusters provide resilience
against the loss of a hard drive while still offering great remote management. Nutanix
supports one-node clusters with AHV and ESXi only.
Hypervisor Selection
Nutanix supports a wide range of hypervisors to meet your enterprise's needs. The
three main considerations for choosing the right hypervisor for your environment are
supportability, operations, and licensing costs.
Supportability can include support for your applications, the training your staff needs to
support daily activities, and break-fix. When it comes to supportability, the path of least
resistance often shapes hypervisor selection. ESXi was a prime candidate for early
adopters of virtualization, and because many environments run ESXi, the Nutanix 1000
series offers a mixed-hypervisor deployment consisting of two ESXi nodes and one AHV
storage node. The mixed-hypervisor deployment option provides the same benefits as a
full three-node cluster but avoids the per-CPU licensing that some hypervisor licensing
models would require for the third node.
Operationally, Nutanix aspires to build infrastructure that's invisible to the people using it.
We recommend that customers who want a fully integrated solution select AHV as their
hypervisor. With AHV, virtual machine (VM) and data placement happen automatically
without any required settings. Nutanix also hardens systems by default to meet security
requirements and provides the automation necessary to maintain that security. Nutanix
supplies Security Technical Implementation Guides (STIGs) in machine-readable code for
both AHV and the storage controller.
Nutanix offers cross-hypervisor disaster recovery to replicate VMs from AHV to ESXi
or ESXi to AHV to avoid switching hypervisors in the main datacenter. In the event of a
disaster, administrators can restore their AHV VM to ESXi for quick recovery or replicate
the VM to the remote site with easy workflows.
Witness Requirements
The witness VM requires the following minimum specifications:
• 2 vCPU
• 6 GB of memory
• 25 GB of storage
The witness VM must reside in a separate failure domain, which means that you need
independent power and network connections from each of the two-node clusters. We
recommend locating the witness VM in a third physical site with dedicated network
connections to sites one and two to avoid a single point of failure.
Communication with the witness happens over TCP port 9440; therefore, this port must
be open for the Controller VMs (CVMs) on any two-node clusters using the witness.
Network latency between each two-node cluster and the witness VM must be less than
500 ms for ROBO locations.
The witness VM can reside on any supported hypervisor and run on either Nutanix or
non-Nutanix hardware. You can register multiple (different) two-node cluster pairs to a
single witness VM. One witness VM can support up to 100 ROBO clusters or sites.
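To confirm that a site meets these limits before you register a cluster, you can measure the TCP connect time to the witness on port 9440 from the two-node cluster's network. The following Python sketch is illustrative only: the witness address and thresholds are placeholders, and TCP connect time serves as a rough proxy for network latency.

import socket
import time

WITNESS_IP = "203.0.113.50"      # placeholder: your witness VM address
PORT = 9440                      # witness communication port (TCP)
MAX_LATENCY_MS = 500             # ROBO limit; use 200 for Metro Availability

def tcp_connect_ms(host: str, port: int, timeout: float = 5.0) -> float:
    """Return the TCP connect time to host:port in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000

latency = tcp_connect_ms(WITNESS_IP, PORT)
status = "OK" if latency < MAX_LATENCY_MS else "TOO SLOW"
print(f"Witness {WITNESS_IP}:{PORT} connect time {latency:.1f} ms -> {status}")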
Nutanix Prism
Nutanix Prism provides central access for administrators to configure, monitor, and
manage virtual environments. Powered by advanced data analytics, heuristics, and rich
automation, Nutanix Prism offers unprecedented simplicity by combining several aspects
of datacenter management into a single, consumer-grade solution. Using innovative
machine learning technology, Nutanix Prism can mine large volumes of system data
easily and quickly and generate actionable insights for optimizing all aspects of virtual
infrastructure management. Nutanix Prism is a part of every Nutanix deployment and has
two core components: Prism Element and Prism Central.
Prism Element
Prism Element is a service built into the platform for every Nutanix cluster deployed. It
provides the ability to fully configure, manage, and monitor Nutanix clusters running any
hypervisor.
Because Prism Element manages only the cluster it’s part of, each deployed Nutanix
cluster has a unique Prism Element instance for management. As you deploy multiple
Nutanix clusters, you need to be able to manage all of them from a single Nutanix Prism
instance, so Nutanix introduced Prism Central.
Prism Central
Prism Central offers an organizational view into a distributed Nutanix deployment,
with the ability to attach all remote and local Nutanix clusters to a single Prism Central
deployment. This global management experience offers a single place to monitor
performance, health, and inventory for all Nutanix clusters. Prism Central is available in
a standard version included with every Nutanix deployment and as a separately licensed
Nutanix Cloud Manager Intelligent Operations version that enables several advanced
features.
The standard version of Prism Central offers all the great features of Prism Element
under one umbrella, with a single sign-on for your entire Nutanix environment, and
makes day-to-day management easier by placing all your applications at your fingertips
with the entity explorer. The entity explorer offers customizable tagging for applications
so that, even if they are dispersed among different sites, you can better analyze their
aggregated data in one central location.
Intelligent Operations has additional features to manage large deployments and prevent
emergencies and unnecessary site visits, including the following:
• Customizable dashboards
• Capacity runway to safeguard against exhausting resources
• Capacity planning to safely reclaim resources from old projects and just-in-time
forecasting for new projects
• Advanced search to streamline access to features with minimal training
• Simple multicluster upgrades
We recommend the following best practices when you deploy Prism Central.
Network
• Prism Central uses TCP port 9440 to communicate with the CVMs in a Nutanix cluster.
If your network and servers have a firewall enabled, open port 9440 between the
CVMs and the Prism Central VM to allow access.
• Always deploy with DNS. Prism Central occasionally performs a request on itself; if it
can't resolve the DNS name, some cluster statistics might not be present.
• If you use LDAP or LDAPS for authentication, open port 3268 (for LDAP) or 3269 (for
LDAPS) on the firewall.
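Before you deploy Prism Central, you can verify these network prerequisites with a quick script. The following Python sketch is a hypothetical pre-check; the CVM addresses, Prism Central FQDN, and directory server names are placeholders for your environment.

import socket

CVM_IPS = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]      # placeholder CVM addresses
PRISM_CENTRAL_FQDN = "prism-central.example.com"          # placeholder FQDN
LDAP_SERVER = "dc01.example.com"                          # placeholder directory server

CHECKS = [(ip, 9440) for ip in CVM_IPS]                   # CVM <-> Prism Central traffic
CHECKS += [(LDAP_SERVER, 3268), (LDAP_SERVER, 3269)]      # LDAP and LDAPS

def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Prism Central occasionally resolves its own name, so confirm DNS resolution first.
try:
    print("DNS:", PRISM_CENTRAL_FQDN, "->", socket.gethostbyname(PRISM_CENTRAL_FQDN))
except socket.gaierror:
    print("DNS: cannot resolve", PRISM_CENTRAL_FQDN)

for host, port in CHECKS:
    print(f"{host}:{port} ->", "open" if port_open(host, port) else "blocked")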
Note: Upgrading from a smaller installation to a larger one isn’t as simple as changing CPU and memory. In
some cases, you need to add more storage and edit the configuration files. Contact Nutanix Support before
changing Prism Central sizes.
Statistics
• Prism Central keeps 13 weeks of raw metrics and 53 weeks of hourly metrics.
• Nutanix Support can help you keep statistics over a longer period if needed. However,
once you change the retention time, only stats written after the change have the new
retention time.
Nutanix Central
Nutanix Central is a software as a service (SaaS) unified cloud console for viewing and
managing your Nutanix environments and services deployed on-premises or on Nutanix
Cloud Clusters (NC2). Nutanix customers use multiple Prism Central instances to
manage clusters deployed in their environments that can span on-premises datacenters
and NC2 on public cloud providers. Although Prism Central can manage multiple
clusters, each Prism Central instance represents a separate management domain,
meaning that customers must frequently switch between multiple consoles to manage
their domains. This approach consumes unnecessary time and resources and leads to
productivity loss due to siloed domain management and metrics views.
Nutanix Central simplifies and streamlines the management of multiple Prism Central
instances by providing a single management console. By registering your Prism Central
domains to Nutanix Central, you can accomplish the following:
• Access and manage multiple Prism Central domains through one panel.
• View a unified dashboard with cluster-level domain metrics, including capacity usage
and alert summary statistics.
• Seamlessly navigate to individual Prism Central domains through the unified cloud
console.
• Discover Nutanix portfolio products and preferred partner apps, deploy them in the
Prism Central domain of your choice through a domain-specific marketplace, and
easily manage them through My Apps.
• Use an app switcher to seamlessly move between apps and domains.
Seeding
When you deal with a remote site that has a limited network connection back to the
main datacenter, you might need to seed data to overcome network speed deficits.
Seeding involves using a separate device to physically ship the data to the remote location.
Instead of waiting weeks or months for the initial replication to complete (depending on the
amount of data you need to protect), you can copy the data locally to a separate Nutanix
cluster and ship that cluster to the destination site.
Nutanix checks the snapshot metadata before seeding the device to prevent
unnecessary duplication. Nutanix can apply its native data protection to a seed cluster by
placing VMs in a protection domain and replicating them to a seed cluster. A protection
domain is a collection of VMs that have a similar recovery point objective (RPO). You
must ensure, however, that the seeding snapshot doesn’t expire before you can copy the
data to the destination.
Seed Procedure
The following procedure lets you use seed cluster storage capacity to bypass the
network replication step. During this procedure, the administrator stores a snapshot of
VMs on the seed cluster while it’s installed in the ROBO site, and then physically ships it
to the main datacenter.
1. Install and configure application VMs on a ROBO cluster.
2. Create a protection domain called PD1 on the ROBO cluster for the VMs and
volume groups.
3. Create an out-of-band snapshot (S1) for the protection domain on ROBO with no
expiration.
4. Create an empty protection domain called PD1 (the same name used in Step 2) on
the seed cluster.
5. Deactivate PD1 on the seed cluster.
6. Create remote sites on the ROBO cluster and the seed cluster.
7. Retrieve snapshot S1 from the ROBO cluster to the seed cluster using Prism
Element on the seed cluster.
8. Ship the seed cluster to the datacenter.
9. Re-IP the seed cluster.
10. Create remote sites on the ROBO cluster and the datacenter main cluster (DC1).
11. Create PD1 (same name used in Steps 2 and 4) on DC1.
12. Deactivate PD1 on DC1.
13. Retrieve S1 from the seed cluster to DC1 (using Prism on DC1).
Nutanix Prism generates an alert. Although the retrieval appears to be a full data
replication, the seed cluster transfers only metadata.
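You can also script the protection domain steps of this procedure against the Prism REST API instead of using the Prism Element UI. The following Python sketch covers only steps 2 and 3; the endpoint paths, payload fields, address, and credentials are assumptions for illustration, so verify them in the REST API Explorer for your AOS version before relying on them.

import requests
from requests.auth import HTTPBasicAuth

ROBO_PRISM = "https://robo-cvm.example.com:9440"   # placeholder cluster address
BASE = f"{ROBO_PRISM}/api/nutanix/v2.0"            # assumed v2.0 API base path

session = requests.Session()
session.auth = HTTPBasicAuth("admin", "password")  # placeholder credentials
session.verify = False                             # lab only; use valid certificates in production

# Step 2: create protection domain PD1 on the ROBO cluster and protect the VMs.
session.post(f"{BASE}/protection_domains", json={"value": "PD1"})
session.post(f"{BASE}/protection_domains/PD1/protect_vms",
             json={"names": ["app-vm-01", "app-vm-02"]})   # placeholder VM names

# Step 3: take an out-of-band snapshot (S1) with no expiration so that it
# survives the shipping window (payload fields assumed for illustration).
session.post(f"{BASE}/protection_domains/PD1/oob_schedules",
             json={"app_consistent": False})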
Upgrade Modes
For environment-specific service-level agreements, choose from two upgrade modes:
simultaneous mode or staggered mode.
Simultaneous Mode
Simultaneous mode is important when you're short on time—for example, when
performing a critical update or a security update that you must push to all ROBO sites
and clusters quickly. Simultaneous mode upgrades all the clusters immediately, in
parallel.
Staggered Mode
Staggered mode allows rolling, sequential upgrades of ROBO sites as a batch job,
without any manual intervention—one site upgrades only after the previous site upgrades
successfully. This feature is advantageous because it limits exposure to an issue (if one
emerges) to one site rather than multiple sites. This safeguard is especially valuable for
centralized administrators and others managing multiple ROBO sites. Staggered mode
also lets you choose a custom order for the upgrade sequence.
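Conceptually, staggered mode behaves like the following loop: each site upgrades only after the previous site completes and passes a health check. This Python sketch illustrates the behavior with hypothetical helper functions; it is not how Prism Central implements the feature.

def staggered_upgrade(sites, upgrade, health_check):
    """Upgrade sites one at a time, halting if any site fails its health check."""
    for site in sites:                  # custom order chosen by the administrator
        upgrade(site)                   # upgrade one ROBO cluster at a time
        if not health_check(site):      # gate the next site on success
            raise RuntimeError(f"Upgrade failed at {site}; halting to limit exposure")
    print("All sites upgraded successfully")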
If the pre-upgrade check reports a status other than PASS, resolve the reported issues before you
proceed. If you can't resolve the issues, contact Nutanix Support for assistance.
If you need to upgrade multiple clusters, use Prism Central.
Note: Create cluster labels for clusters in similar business units. If you need to meet an upgrade window,
you can run the upgrades for all the selected clusters in parallel. We recommend completing one upgrade
first before continuing to the rest of the clusters.
Remote Sites
Remote sites are a logical construct. First, configure an AHV cluster—either physical
or cloud-based—that functions as the snapshot destination and as a remote site from
the perspective of the source cluster. Similarly, on this secondary cluster, configure
the primary cluster as a remote site before snapshots from the secondary cluster start
replicating to it.
By configuring backup on Nutanix, you can use its remote site as a replication target.
You can back up data to this site and retrieve snapshots from it to restore locally, but
you can't enable failover protection (running failover VMs directly from the remote site).
Backup also supports using multiple hypervisors. You can configure the disaster recovery
option to use the remote site as both a backup target and a source for dynamic recovery.
In this arrangement, failover VMs run directly from the remote site. Nutanix provides
cross-hypervisor disaster recovery between AHV and ESXi clusters.
For data replication to succeed, configure forward (DNS A) and reverse (DNS PTR) DNS
entries for each ESXi management host on the DNS servers that the Nutanix cluster
uses.
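You can confirm that both record types resolve correctly by performing a forward lookup and then a reverse lookup for each host and comparing the results. The hostnames in the following Python sketch are placeholders.

import socket

ESXI_HOSTS = ["esxi01.robo.example.com", "esxi02.robo.example.com"]  # placeholder hostnames

for name in ESXI_HOSTS:
    try:
        ip = socket.gethostbyname(name)                 # forward (A) lookup
        reverse_name = socket.gethostbyaddr(ip)[0]      # reverse (PTR) lookup
        match = "OK" if reverse_name.rstrip(".").lower() == name.lower() else "MISMATCH"
        print(f"{name} -> {ip} -> {reverse_name}: {match}")
    except socket.gaierror:
        print(f"{name}: A record missing")
    except socket.herror:
        print(f"{name}: PTR record missing")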
We also recommend that you keep both sites at the same AHV version. If both sites
require compression, both must have the compression feature licensed and enabled.
Enable Proxy
The enable proxy option redirects all egress remote replication traffic through one node.
This remote site proxy is different from the Prism proxy. When you select Enable Proxy,
replication traffic goes to the remote site proxy, which forwards it to other nodes in the
cluster. This arrangement significantly reduces the number of firewall rules you need to
set up and maintain.
As a best practice, use the remote site proxy with the cluster external address.
Capabilities
The disaster recovery option requires that both sites either support cross-hypervisor
disaster recovery or have the same hypervisor. Today, Nutanix supports only AHV and
ESXi for cross-hypervisor disaster recovery. When you use the backup option, the sites
can use different hypervisors, but you can't restore VMs on the remote side. The backup
option also works when you back up to AWS and Azure.
Maximum Bandwidth
Maximum bandwidth throttles traffic between sites when no network device can limit
replication traffic. The maximum bandwidth option allows different settings throughout
the day, so you can assign a maximum bandwidth policy when your sites are busy with
production data and turn off the policy when they're less busy. Maximum bandwidth
doesn't imply maximum observed throughput. If you plan to replicate data during the
day, create separate policies for business hours to avoid flooding the outbound network
connection.
Note: When talking with your networking teams, note that this setting is in megabytes per second (MBps),
not megabits per second (Mbps).
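Because network teams usually quote circuits in megabits per second, divide by 8 before you enter a value in the maximum bandwidth policy. A minimal example:

def mbps_to_mbytes_per_second(link_mbps: float) -> float:
    """Convert a circuit speed in Mbps to the MBps value the policy expects."""
    return link_mbps / 8

# Example: reserving half of a 100 Mbps link for replication.
print(mbps_to_mbytes_per_second(100 * 0.5))  # 6.25 MBps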
Remote Container
vStore name mapping identifies the container on the remote cluster used as the
replication target. When you establish the vStore name mapping, we recommend
creating a new, separate remote container with no VMs running on it on the remote
side. This configuration makes it easy to distinguish failed-over VMs and apply policies on the
remote side in case of a failover.
Network Mapping
AHV supports network mapping for disaster recovery migrations moving to and from
AHV. When you delete or change the network attached to a VM specified in the network
map, modify the network map accordingly.
Scheduling
Make the snapshot schedule the same as your RPO. In practical terms, the RPO
determines how much data you can afford to lose in the event of a failure. Taking a
snapshot every 60 minutes for a server that changes infrequently or when you don't need
a low RPO takes up resources that can benefit more critical services.
Set the RPO from the local site. If you schedule a snapshot every hour, bandwidth and
available space at the remote site determine if you can achieve the RPO. In constrained
environments, limited bandwidth might cause the replication to take longer than the one-
hour RPO. We list guidelines for sizing bandwidth and capacity to avoid this scenario
later in this document.
You can create multiple schedules for a protection domain, and you can have multiple
protection domains. The previous figure shows seven daily snapshots, four weekly
snapshots, and three monthly snapshots to cover a three-month retention policy. This
policy is more efficient for managing metadata on the cluster than a daily snapshot with a
180-day retention policy.
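A quick count of the snapshots each policy keeps illustrates the metadata difference:

# Snapshot counts for roughly three months of retention.
tiered = 7 + 4 + 3     # 7 daily + 4 weekly + 3 monthly snapshots
flat_daily = 180       # one snapshot per day retained for 180 days
print(f"Tiered policy keeps {tiered} snapshots; flat daily retention keeps {flat_daily}.")
print(f"That is roughly {flat_daily / tiered:.0f}x more snapshot metadata to manage.")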
The following are the best practices for scheduling:
• Stagger replication schedules across protection domains to spread out replication
impact on performance and bandwidth. If you have a protection domain that starts
every hour, stagger the protection domains by half of the most commonly used RPO.
• Configure snapshot schedules to retain the lowest number of snapshots while still
meeting the retention policy, as shown in the previous figure.
Remote snapshots expire based on how many snapshots exist and how frequently you
take them. For example, if you take daily snapshots and keep a maximum of five, the first
snapshot expires on the sixth day. At that point, you can't recover from the first snapshot
because the system deletes it automatically.
In case of a prolonged network outage, Nutanix always retains the last snapshot to
ensure that you never lose all the snapshots. You can modify the retention schedule from
the nCLI by changing the min-snap-retention-count. This value ensures that you retain
at least the specified number of snapshots, even if all the snapshots have reached the
expiry time. This setting works at the protection domain level.
Local Snapshots
To size storage for local snapshots at the remote site, you need to account for the rate
of change in your environment and how long you plan to keep your snapshots on the
cluster. Reduced snapshot frequency might increase the rate of change due to the
greater chance of common blocks changing before the next snapshot.
To find the space needed to meet your RPO, use the following formula. As you decrease
the RPO for asynchronous replication, you might need to account for an increased rate
of transformed garbage. Transformed garbage is space the system allocated for I/O
optimization or assigned but that no longer has associated metadata. If you're replicating
only once each day, you can remove (change rate per frequency × # of snapshots in a
full Curator scan × 0.1) from the following formula. A full Curator scan runs every six
hours.
snapshot reserve = (frequency of snapshots × change rate per frequency)
+ (change rate per frequency × # of snapshots in a full Curator scan × 0.1)
You can look at your backups and compare their incremental differences to find the
change rate.
Using the local snapshot reserve formula and assuming for demonstration purposes that
the change rate is 35 GB of data every six hours and that we keep 10 snapshots, we get
a 363 GB snapshot reserve:
snapshot reserve = (frequency of snapshots × change rate per frequency) +
(change rate per frequency × # of snapshots in a full Curator scan × 0.1)
= (10 × 35,980 MB) + (35,980 MB × 1 × 0.1)
= 359,800 + (35,980 × 1 × 0.1)
= 359,800 + 3,598
= 363,398 MB
= 363 GB
Remote Snapshots
Remote snapshots use the same process, but you must include the first full copy of the
protection domain plus delta changes based on the set schedule.
snapshot reserve = (frequency of snapshots × change rate per frequency) +
(change rate per frequency × # of snapshots in a full Curator scan × 0.2) +
total size of the source protection domain
To minimize the storage space you need at the remote site, use an average of about 130
percent of the protection domain's size.
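The following Python sketch applies both reserve formulas and reproduces the worked example above. The inputs are the illustrative values from this document; substitute your own measured change rate and protection domain size.

def local_snapshot_reserve(snapshots_kept, change_rate_mb, snaps_per_curator_scan):
    """(snapshots kept x change rate) + (change rate x snapshots per full Curator scan x 0.1)"""
    return (snapshots_kept * change_rate_mb
            + change_rate_mb * snaps_per_curator_scan * 0.1)

def remote_snapshot_reserve(snapshots_kept, change_rate_mb, snaps_per_curator_scan,
                            protection_domain_size_mb):
    """Remote reserve adds the initial full copy and uses a 0.2 factor."""
    return (snapshots_kept * change_rate_mb
            + change_rate_mb * snaps_per_curator_scan * 0.2
            + protection_domain_size_mb)

# Worked example: ~35 GB (35,980 MB) of change every six hours, 10 snapshots kept,
# and one snapshot per full Curator scan.
local_mb = local_snapshot_reserve(10, 35_980, 1)
print(f"Local snapshot reserve: {local_mb:,.0f} MB (~{local_mb / 1000:.0f} GB)")  # 363,398 MB (~363 GB)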
Bandwidth
You must have enough available bandwidth to keep up with the replication schedule. If
you are still replicating when the next snapshot is scheduled, the current replication job
finishes first. The newest outstanding snapshot then starts to replicate the newest data to
the remote side first. To help replication run faster with limited bandwidth, seed data on a
secondary cluster at the primary site before you ship that cluster to the remote site.
To figure out the needed throughput, you must know your RPO. If you set the RPO to
one hour, you must be able to replicate the changes within that time.
Assuming that you know your change rate based on incremental backups or local
snapshots, you can calculate the bandwidth you need. The next example uses a 15 GB
change rate and a one-hour RPO. We didn't use deduplication in the calculation, partly
so that the dedupe savings can serve as a buffer in the overall calculation and partly
because the one-time cost for deduped data going over the wire has less impact once
the data is present at the remote site. We assumed an average of 30 percent bandwidth
savings for compression on the wire.
Bandwidth needed = (RPO change rate × (1 – compression on wire savings %)) /
RPO
Example:
(15 GB × (1 – 0.3)) / 3,600 s
= (15 GB × 0.7) / 3,600 s
= 10.5 GB / 3,600 s
= 10,500 MB / 3,600 s (converting the unit to MB)
= 2.92 MBps
2.92 MBps × 8 = 23.33 Mbps
Bandwidth needed = 23.33 Mbps
You can perform the calculation online using the WolframAlpha computational knowledge
engine.
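You can also run the same arithmetic locally. The following Python sketch reuses the example inputs (15 GB of change per one-hour RPO and 30 percent on-wire compression savings) and includes the megabytes-to-megabits conversion:

def bandwidth_needed_mbps(rpo_change_gb, rpo_seconds, wire_compression_savings=0.3):
    """Return the replication bandwidth needed in megabits per second (Mbps)."""
    data_mb = rpo_change_gb * 1000 * (1 - wire_compression_savings)  # MB sent over the wire
    mbytes_per_second = data_mb / rpo_seconds                        # MBps
    return mbytes_per_second * 8                                     # convert bytes to bits

print(f"Bandwidth needed: {bandwidth_needed_mbps(15, 3600):.2f} Mbps")  # ~23.33 Mbps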
If you have problems meeting your replication schedule, either increase your bandwidth
or increase your RPO. To allow more RPO flexibility, you can run different schedules on
the same protection domain. For example, set one daily replication schedule and create
a separate schedule to take local snapshots every few hours.
4. Failure Scenarios
Companies need to resolve problems at remote branches as quickly as possible, but
some branches are harder to access. Accordingly, you need a self-healing system with
break-fix procedures that less technically skilled staff can manage. In the following
section, we cover the most basic remote branch failure scenarios: losing a hard drive and
losing a node in a three-node cluster.
Failed Hard Drive
Marking a disk offline triggers an alert, and the system immediately removes the offline
disk from the storage pool. Curator then identifies all extents stored on the failed disk, and
the distributed storage fabric makes additional copies of the associated replicas to
restore the desired replication factor. By the time administrators learn of the disk failure
from Prism Element, an SNMP trap, or an email notification, distributed storage is already
healing the cluster.
The distributed storage data rebuild architecture provides faster rebuild times than
traditional RAID data protection schemes, with no performance impact on workloads.
RAID groups or sets usually have a small number of disks. When a RAID set performs
a rebuild operation, it typically selects one disk as the rebuild target. The other disks
in the RAID set must divert enough resources to quickly rebuild the data on the failed
disk. This process can lead to performance penalties for workloads served by the
degraded RAID set. Distributed storage allocates remote copies found on any individual
disk to the remaining disks in the Nutanix cluster. As a result, distributed storage
replication operations are background processes with no impact on cluster operations
or performance. Moreover, Nutanix storage accesses all disks in the cluster at any given
time as a single, unified pool of storage resources.
Every node in the cluster participates in replication, which means that as the cluster size
grows, disk failure recovery time decreases. Because distributed storage allocates the
data needed to rebuild a disk throughout the cluster, more disks contribute to the rebuild
process and accelerate the additional replication of affected extents.
Nutanix maintains consistent performance during the rebuild operations. For hybrid
systems, Nutanix rebuilds cold data to cold data so that large hard drives do not flood the
SSD caches. For all-flash systems, Nutanix protects user I/O by implementing quality of
service for back-end I/O.
In addition to a many-to-many rebuild approach to data availability, the distributed
storage data rebuild architecture ensures that all healthy disks are always available.
Unlike most traditional storage arrays, Nutanix clusters don't need hot spare or standby
drives. Because data can rebuild to any of the remaining healthy disks, you don't need to
reserve physical resources for failures. Once the cluster has healed, it can tolerate losing the next drive or node.
Failed Node
A Nutanix cluster must have at least three nodes. Minimum configuration clusters
provide the same protections as larger clusters, and a three-node cluster can continue
normally after a node failure. The cluster starts to rebuild all the user data right away
because of the separation of cluster data and metadata. However, one condition applies
to three-node clusters: When a node failure occurs in a three-node cluster, you can't
dynamically remove the failed node from the cluster. The cluster continues running
without interruption on two healthy nodes and one failed node, but you can't remove the
failed node until the cluster has three or more healthy nodes. Therefore, the cluster isn't
fully protected until you either fix the problem with the existing node or add a new node to
the cluster and remove the failed one. This condition doesn't apply to clusters with four or
more nodes, where you can dynamically remove the failed node to bring the cluster back
to full health. The newly configured cluster still has at least three nodes, so the cluster is
fully protected. You can then replace the failed hardware for that node and add the node
back into the cluster as a new node.
5. Appendix
Best Practices Checklist
Note: Upgrading from a smaller installation to a larger one isn’t as simple as changing CPU and memory. In
some cases, you need to add more storage and edit the configuration files. Contact Nutanix Support before
changing Prism Central sizes.
Statistics:
• Prism Central keeps 13 weeks of raw metrics and 53 weeks of hourly metrics.
• Nutanix Support can help you keep statistics over a longer period if needed. However,
once you change the retention time, only stats written after the change have the new
retention time.
Cluster registration and licensing:
• Each node registered to and managed by Nutanix Cloud Manager Intelligent
Operations requires an Intelligent Operations license applied through the Prism
Central web console. For example, if you register and manage 10 Nutanix nodes
(regardless of the individual node or cluster license level), you need to apply 10
Intelligent Operations licenses through the Prism Central web console.
• Each consistency group using application-consistent snapshots can contain only one
VM.
Disaster recovery and backup:
• Configure forward (DNS A) and reverse (DNS PTR) DNS entries for each ESXi
management host on the DNS servers used by the Nutanix cluster.
• Stagger replication schedules across protection domains to spread out the replication
impact on performance and bandwidth. If you have a protection domain that starts every
hour, stagger the protection domains by half of the most commonly used RPO.
• Configure snapshot schedules to retain the fewest snapshots while still meeting the
retention policy.
• Configure the CVM external IP address.
• Obtain the mobility driver from Nutanix Guest Tools.
• Avoid migrating VMs with delta disks (hypervisor-based snapshots) or SATA disks.
• Ensure that protected VMs have an empty IDE CD-ROM attached.
• Run AOS 6.5 or later in both clusters.
• Ensure that network mapping is complete.
Sizing:
• Use the application's change rate to size local and remote snapshot usage.
Bandwidth:
• Seed locally for replication if WAN bandwidth is limited.
• Set a high initial retention time for the first replication when you seed.
Single-node backup:
• Keep all protection domains, combined, under 30 VMs total.
• Limit backup retention to a three-month policy. We recommend seven daily, four
weekly, and three monthly backups as a policy.
• Map an NX-1155S to one physical cluster only.
• Set the snapshot schedule to at least six hours.
• Turn off deduplication.
One- and two-node backup:
• Keep all protection domains to no more than five VMs per node.
About Nutanix
Nutanix offers a single platform to run all your apps and data across multiple clouds
while simplifying operations and reducing complexity. Trusted by companies worldwide,
Nutanix powers hybrid multicloud environments efficiently and cost effectively. This
enables companies to focus on successful business outcomes and new innovations.
Learn more at Nutanix.com.