Automating Azure Site Recovery VMs with ARM and some magic

Actually, I got you: there's no magic. Well, maybe a little. We will use an intuitive approach to tie the whole thing together. Now that I have your attention, let's talk about Azure Site Recovery, or ASR for short.

As you build your solution, you will want to automate the recovery process of your virtual machines so you have some peace of mind when it comes to your disaster recovery process.

In this post, I will talk specifically about Azure Site Recovery for Azure-to-Azure recovery.

Quick intro

Azure Site Recovery is a product in the Azure family that helps you implement your business continuity and disaster recovery (BCDR) strategy. Site Recovery works by replicating your disks to another region. When you register your virtual machine in ASR, it installs an extension called the Site Recovery Mobility service, which monitors the writes in the VM and transfers them to a cache storage account located in the same region as the VM it's protecting. ASR then monitors that cache storage account and transfers the data to a (managed) disk or a target storage account. After the data is processed, crash-consistent recovery points are generated every five minutes. App-consistent recovery points are generated according to the setting specified in the replication policy (the minimum is every hour).

Azure Site Recovery replication process

When you initiate a failover, the VMs and their NICs are created in the target resource group, added to the target availability zone or availability set, and associated with the target virtual network and subnet. During a failover, you can use any recovery point [1].

You can set target names for the resources that ASR will create when it fails over. Once you have failed over and are satisfied with your recovery snapshot, you can commit the change, which will delete all the snapshots prior to that point.

Once you have failed over, you can re-protect your VM, which performs the same operation in the opposite direction (the fail back operation). Your target location now becomes your main location, and your main location becomes your target location.

Azure Site Recovery also introduces Recovery Plans. Recovery plans help you to define a systematic recovery process that defines how machines fail over, and how they start and recover after failover. They also help impose order so that recovery is consistently accurate, repeatable, and automated [2].

Vocabulary

The ARM resources for ASR introduce certain concepts that are not described when you do everything in the portal. You are exposed to them when you do the setup using PowerShell.

Fabric:

The fabric object in the vault represents an Azure region. The primary fabric object is created to represent the Azure region that the source virtual machines being protected to the vault belong to.

  • Only one fabric object can be created per region.
  • If you’ve previously enabled Site Recovery replication for a VM in the Azure portal, Site Recovery creates a fabric object automatically. If a fabric object exists for a region, you can’t create a new one.

The recovery fabric object represents the recovery Azure location. If there's a failover, virtual machines are replicated and recovered to the recovery region represented by the recovery fabric [3].

ReplicationProtectionContainer:

The protection container is a container used to group replicated items within a fabric.

Mappings:

Mappings are always created in both directions: Primary => Secondary and Secondary => Primary.

ContainerMappings:

A protection container mapping maps the protection container with a recovery protection container and a replication policy.

NetworkMappings:

A network mapping maps virtual networks in the primary region to virtual networks in the recovery region. It specifies the Azure virtual network in the recovery region that a virtual machine in the primary virtual network should fail over to. One Azure virtual network can be mapped to only a single Azure virtual network in a recovery region.
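To make the two-way rule concrete, here is a minimal Python sketch that derives both mapping directions from a single primary/recovery pair. The `Mapping` type, the `two_way_mappings` function, and the container and policy names are hypothetical and only illustrate the idea; they are not part of any Azure SDK or of the template in my repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One direction of a protection container mapping (hypothetical model)."""
    source: str
    target: str
    policy: str


def two_way_mappings(primary: str, recovery: str, policy: str) -> list:
    """Return the Primary => Secondary mapping and its reverse."""
    return [
        Mapping(source=primary, target=recovery, policy=policy),  # failover direction
        Mapping(source=recovery, target=primary, policy=policy),  # fail back direction
    ]


# Placeholder names, chosen only for the example.
mappings = two_way_mappings("container-primary", "container-recovery", "policy-24h")
for m in mappings:
    print(f"{m.source} => {m.target} ({m.policy})")
```

The second mapping is the one that comes into play after you re-protect and want to fail back.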

ReplicationProtectedItem:

The item that is being protected and replicated.

ReplicationPolicy:

A replication policy defines the settings for the retention history of recovery points. The policy also defines the frequency of app-consistent snapshots [4].

Automation using ARM

In my GitHub repository, I've added a commented sample replication ARM template that protects two virtual machines, web1 and web2.

In the same template, I’ve also added the following 2 things:

  • An Azure Automation account, which we will use in the recovery plan
  • A recovery plan, which helps streamline the failover process and keep it consistent

I’ve also added a few artifacts used for automation in the repository. The ARM template should be deployed in your DR region.

The ARM template I provided is an example of the pieces and how they are linked to each other. Adapt it to your scenario.

Things I've come across

I've come across a few things that gave me a hard time, and I want to share them with you so you don't fall into the same traps.

As of writing, the portal uses an older API version for displaying data

As of writing, the portal displays data using the 2016 API version. A newer version, 2018-07-10, exposes more information. You will need to query the API with that version to see properties that are not shown in the Azure portal, such as recoveryNicName.
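As a rough illustration, the Python sketch below builds the ARM REST URL for a replication protected item with an explicit api-version. The resource path follows the vault, fabric, protection container, and protected item hierarchy described in the Vocabulary section. Every name passed in (subscription, resource group, vault, fabric, container, item) is a placeholder, and `protected_item_url` is a hypothetical helper, not an SDK function.

```python
from urllib.parse import quote, urlencode

ARM_ENDPOINT = "https://management.azure.com"


def protected_item_url(subscription_id, resource_group, vault, fabric,
                       container, item, api_version="2018-07-10"):
    """Build the GET URL for one replication protected item."""
    segments = [
        "subscriptions", subscription_id,
        "resourceGroups", resource_group,
        "providers", "Microsoft.RecoveryServices",
        "vaults", vault,
        "replicationFabrics", fabric,
        "replicationProtectionContainers", container,
        "replicationProtectedItems", item,
    ]
    # Escape each segment so unusual characters cannot break the path.
    path = "/".join(quote(segment, safe="") for segment in segments)
    return f"{ARM_ENDPOINT}/{path}?{urlencode({'api-version': api_version})}"


url = protected_item_url(
    "00000000-0000-0000-0000-000000000000",  # placeholder subscription ID
    "rg-dr", "vault-dr", "fabric-primary", "container-primary", "web1",
)
print(url)
```

You would send this request with a bearer token for Azure Resource Manager. Changing only the `api_version` argument lets you compare what each version returns for the same item.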

Diagnostics Storage Account

The diagnostics storage account can be set in the ARM template for the recovery VMs (when setting up the protected item). It must already exist; otherwise, protecting the virtual machine fails when the protected item is created. When failing back (using the UI), you will need to re-configure the diagnostics storage account of the failed-over VMs. This can be done manually or through automation (if you have many VMs, automation is the better choice).

Recovery plan

In the ARM definition, each group (an object in the groups property) has a groupType. The group type should be Boot. When you add items to the Boot group (that is, ASR will boot those VMs together), they are automatically added to the Failover and Shutdown groups, which are managed by ASR.

Moreover, once deployed, the Recovery Plan cannot be updated through ARM. For instance, if you want to protect another VM or add pre/post actions through ARM, you will need to delete the Recovery Plan and let ARM re-create it. It seems to be a limitation of the platform. If you know a way to update the plan through ARM, definitely let me know!
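To show what a Boot group might look like, here is a small Python sketch that assembles one entry of the groups array and prints it as JSON. Only groupType and the Boot value come from the discussion above. The other property names (`replicationProtectedItems`, `virtualMachineId`, `startGroupActions`, `endGroupActions`) are my recollection of the recovery plan resource schema and should be checked against the template in the repository. `boot_group` is a hypothetical helper and the IDs are placeholders.

```python
import json


def boot_group(protected_items):
    """Build one Boot group for a recovery plan's `groups` array.

    `protected_items` is a list of (protected_item_id, vm_id) pairs.
    """
    return {
        "groupType": "Boot",
        # Assumed property names; verify them against the ARM schema.
        "replicationProtectedItems": [
            {"id": item_id, "virtualMachineId": vm_id}
            for item_id, vm_id in protected_items
        ],
        # Pre and post actions (for example runbooks) would go in these lists.
        "startGroupActions": [],
        "endGroupActions": [],
    }


# Placeholder IDs; real ones are full ARM resource IDs.
group = boot_group([
    ("<protected-item-id-web1>", "<vm-id-web1>"),
    ("<protected-item-id-web2>", "<vm-id-web2>"),
])
print(json.dumps({"groups": [group]}, indent=2))
```

Only the Boot group is built here because, as noted above, ASR manages the Failover and Shutdown groups itself.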

Automation using Azure Automation

ASR clones your virtual machine configuration one to one, but there are things it does not do, such as enabling or assigning a managed identity, adding extra IPs to NICs, or (when failing back) setting the boot diagnostics storage account. You will need to script those. In my example, I've scripted the association of user-assigned managed identities. See the section below for how it can be done.

Re-protection on the main location

I am still working out how to do this programmatically. Using the failover plan UI, you can re-protect back, but it will then ask you to choose the cache storage account (or create a new one). I believe you can do the re-protection using the New-AzRecoveryServicesAsrReplicationProtectedItem cmdlet (I have not tried). If you have ideas on how to do this, please reach out!

RBAC

Since ASR uses an automation account to run commands on your resources (along with updating the mobility agent), make sure the automation account has the proper RBAC permissions on those resources. For instance, to be able to update the mobility agents, it should be Contributor on the Recovery Services vault resource. If you want to associate managed identities with the VMs, it should be able to update the VM (for instance, through the Virtual Machine Contributor role), and it should be able to read the managed identity (for instance, through the Managed Identity Operator role).

Cleanup

ASR takes care of cleaning up the resources it creates. For instance, if you fail over, ASR will create the VMs and the NICs and will attach the disks to those VMs. When you re-protect (replicate from DR->Main), it will create a replica disk in the main region but leave your VMs and NICs intact. When you fail back and re-protect (replicate from Main->DR), it will clean up the NIC and VM resources it created in the DR region.

Miscellaneous

If you created your VMs through ARM and added data disks, you may have used the option "CreateOption": "Empty". When you fail back (DR->Main), ASR will use "CreateOption": "Attach". Make sure you have a condition in your ARM template to take care of that. CreateOption cannot be changed afterwards; it is a read-only property.

Changing the protected items recovery network properties

To change the recovery properties for the NIC that ASR will create, such as the name of the (recovery) NIC, you will need to use the API or use the cmdlets. If you use the cmdlets, use the New-AzRecoveryServicesAsrVMNicConfig cmdlet to create an object which will contain the names/settings you want. Update the protected item with the resulting config object using the Set-AzRecoveryServicesAsrReplicationProtectedItem cmdlet.

The NicId (parameter of the cmdlet) can be found in the replication protected item definition (Get-AzRecoveryServicesAsrReplicationProtectedItem), in the NicDetailsList array property: $replicationProtectedItem.NicDetailsList[0].NicId

Here are the properties that can be useful for you on the New-AzRecoveryServicesAsrVMNicConfig cmdlet:

| Property | Description |
|---|---|
| RecoveryNicName | Specifies the name of the recovery NIC. |
| RecoveryNicStaticIPAddress | Specifies the IP address of the recovery NIC. |
| RecoveryVMSubnetName | Specifies the name of the recovery subnet. |
| RecoveryVMNetworkId | Specifies the ID of the recovery virtual network. This is the resourceId of your recovery virtual network. |
| RecoveryNicResourceGroupName | Specifies the name of the recovery NIC resource group. This is where the NIC will be created. |
| RecoveryLBBackendAddressPoolId | Specifies the IDs of backend address pools for the recovery NIC. These are the IDs of the backend address pools of the load balancer for the virtual machines. They can be found through the Get-AzLoadBalancerBackendAddressPool cmdlet. |

Note: changing the network (NIC) properties can only be done once the item is fully protected.

Automation Account scripts for ASR recovery plan

Runbooks can be invoked automatically by the Recovery Plan when they are added to either the start or end action groups. The Recovery Plan invokes your scripts with a context object. Here's an example of the context variable.

You can see a full, real example of what ASR passes to your script as the recovery plan context parameter by looking at the job history of the automation account associated with your recovery vault. The GUID you see in the VmMap object is the ID that ASR generates the first time you protect an item. This ID is used throughout the lifecycle of your protected item. You can get the ID programmatically using the following snippet, or in the Azure portal UI by clicking on the protected item and opening the Properties blade:

LifecycleId of the protected item

The ID of each protected item is important. In the automation account runbook scripts, it lets you link back to the virtual machines that were part of the plan and were failed over.
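As an illustration of how a script can read those IDs, here is a minimal Python sketch that parses a recovery plan context and lists the failed-over VMs by lifecycle ID. The field names follow the context shape Microsoft documents for recovery plan runbooks, as far as I remember it. The values are made up, and `failed_over_vms` is a hypothetical helper. Compare it with a real context from your job history before relying on it.

```python
import json

# Made-up sample; only the overall shape is meant to match a real context.
SAMPLE_CONTEXT = """
{
  "RecoveryPlanName": "web-recovery-plan",
  "FailoverType": "Test",
  "FailoverDirection": "PrimaryToSecondary",
  "GroupId": "1",
  "VmMap": {
    "7a1069c6-c1d6-49c5-8c5d-33bfce8dd183": {
      "SubscriptionId": "00000000-0000-0000-0000-000000000000",
      "ResourceGroupName": "rg-dr",
      "RoleName": "web1",
      "RecoveryPointId": "TimeStamp"
    }
  }
}
"""


def failed_over_vms(context_json):
    """Return {lifecycle_id: (resource_group, vm_name)} from a plan context."""
    context = json.loads(context_json)
    return {
        lifecycle_id: (vm["ResourceGroupName"], vm["RoleName"])
        for lifecycle_id, vm in context.get("VmMap", {}).items()
    }


vms = failed_over_vms(SAMPLE_CONTEXT)
for lifecycle_id, (resource_group, vm_name) in vms.items():
    print(f"{lifecycle_id}: {vm_name} in {resource_group}")
```

The keys of the resulting dictionary are the lifecycle IDs, which is what makes them usable as a lookup key in the runbooks.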

Using automation account variables

As a general rule, your runbook scripts should not contain those IDs hardcoded. Instead, the IDs should be captured and stored in a complex object that the automation runbook reads. That complex object is saved in the automation account (the one used by ASR) under the Variables blade. Note that complex variables cannot be displayed in the portal UI. To view the content of such variables, you will need to use the command line; see the Get-AzAutomationVariable cmdlet. To deconstruct the variable into a Hashtable, use the function I created, which is available in the script in my repository. You can also use the .ToObject([hashtable]) method if your object has a single level (that is, it contains only top-level properties and nothing nested).

In the automation runbook script, we use the Get-AutomationVariable cmdlet which is only available internally when running a runbook. See the documentation for more information.

For the SiteRecovery-SetUserManagedIdentity runbook to work, it needs the automation account variable VMManagedIdentities. To store that complex object in the automation variable, I saved the following into a JSON file and passed the path of that file to the Import-AutomationVariables helper script.

Note the AsrVmIds. This array will be filled automatically, within the helper script, with the proper lifecycleIds from the protected items.

In the runbook that ASR invokes, we can then match those lifecycle IDs against the metadata saved in the automation account variable to get the information needed to update the virtual machines that were just failed over.
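The matching step can be sketched in Python as follows. This is not the runbook from my repository, which is a PowerShell script. Only the AsrVmIds key comes from the variable described above. The `IdentityResourceIds` key, the overall layout of the variable, and the `identities_for_failed_over_vms` function are assumptions made for the example.

```python
def identities_for_failed_over_vms(vm_map, vm_managed_identities):
    """Match VmMap lifecycle IDs against the saved variable.

    Returns {lifecycle_id: [identity_resource_id, ...]} for VMs in the plan.
    """
    result = {}
    for entry in vm_managed_identities:
        for lifecycle_id in entry["AsrVmIds"]:
            if lifecycle_id in vm_map:  # skip VMs that were not failed over
                result.setdefault(lifecycle_id, []).extend(
                    entry["IdentityResourceIds"]  # assumed key name
                )
    return result


# Hypothetical data: the keys of vm_map stand in for lifecycle IDs.
vm_map = {"id-web1": {"RoleName": "web1"}, "id-web2": {"RoleName": "web2"}}
variable = [
    {"AsrVmIds": ["id-web1", "id-web2"], "IdentityResourceIds": ["<identity-a>"]},
    {"AsrVmIds": ["id-web1", "id-db1"], "IdentityResourceIds": ["<identity-b>"]},
]

matches = identities_for_failed_over_vms(vm_map, variable)
print(matches)
```

VMs listed in the variable but absent from the plan context (such as `id-db1` here) are ignored, so the same variable can serve several recovery plans.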

Last note

As you saw, you can bring a lot of automation into your disaster recovery process. I navigated a lot between the API definitions, the Resource Explorer, and the Azure support team to get all the information I needed. Once this is implemented, your SRE team will love you forever, and it will help bring your RTO down considerably. As for pricing, Azure Site Recovery replication is free for the first 31 days, then it is USD $25 per month per protected item (see the pricing here). Keep me posted on how this goes for you!