How To Run Folding@home on Azure Spot VMs to Help Fight COVID-19

Quickstart

If you want to skip the details and deploy this solution in Azure, here is a link to my GitHub repo with everything used in this post: https://github.com/covid19folding/AzureSpotVMWorker

Overview

Azure Spot VMs are a special offering in Azure that provides major savings over traditional VMs (up to 90% in some cases), but they do not come with any guaranteed uptime. Think of them as the leftover resources in an Azure datacenter that don’t get allocated – you get to use those at a fraction of the cost until someone else needs them. Spot VMs won’t be a good fit for many applications, but they are excellent for things like batch jobs, rendering farms, or compute workloads that can be stopped and resumed.

How does Spot VM eviction work? A limit is set for every Spot VM when it is deployed, and the VM instance gets evicted (deallocated) when that limit is breached. There are two types of limits. The first is a limit on availability – when the SKU of your VM runs out in a particular Azure region, your Spot VM will be the first to go. You get no guarantee on availability with Spot Instances, but that’s why they are a fraction of the price. The second type of limit is based on price. You can set a maximum price you want to pay per hour for a Spot VM, and as demand goes up for that SKU, so will the price. If your price limit is breached or if the SKU becomes unavailable, then the VM gets deallocated.
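The two limits can be summarized as a simple decision rule. The sketch below is only a model of the behavior described above, not Azure's actual eviction logic; the function name, parameter names, and sample prices (USD per hour) are made up for illustration.

```python
from typing import Optional


def would_be_evicted(capacity_available: bool,
                     current_price: float,
                     max_price: Optional[float]) -> bool:
    """Model the two eviction limits for a Spot VM (illustrative only)."""
    # Limit 1: availability. No spare capacity for the SKU means eviction.
    if not capacity_available:
        return True
    # Limit 2: price. None here means "no price limit was set".
    if max_price is None:
        return False
    return current_price > max_price


print(would_be_evicted(True, 0.02, 0.05))   # False: capacity exists, price under limit
print(would_be_evicted(True, 0.06, 0.05))   # True: price limit breached
print(would_be_evicted(False, 0.02, None))  # True: SKU ran out in the region
```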

There is a lot more info on Spot VMs, including an FAQ, in Microsoft’s official Azure Spot VM documentation.

Given current events with the global COVID-19 pandemic, I have decided to deploy Azure Spot VMs to run client workloads for the Folding@home project, which originated at Stanford University. If you haven’t heard of Folding@home, it is a distributed computing project that leverages client CPU and GPU cycles to simulate protein folding. The project has been directly responsible for published papers and progress on cures for Alzheimer’s, cancer, other diseases, and now the coronavirus. It’s a project that is very worthy of your support if you have some extra CPU/GPU cycles available.

Spot VMs are a great fit for Folding@home clients because they are so inexpensive and don’t require 100% up-time. It wouldn’t be economical to run folding workloads on standard VMs, but if we can run them at 1/10th of the cost for 23 hours out of the day, that is quite a bargain. Even more, there are now GPU SKUs available for Spot Instances, which are much more efficient at folding than CPUs alone. The folding client works well in this scenario as it runs on its own and resumes workloads automatically.

Azure Configuration

Before writing this post, I deployed 20 Folding@home workers in Azure to evaluate cost and performance and see whether this would even be economical. I used the following configuration on my Spot VMs to balance performance against cost. I won’t cover these in depth.

  • VM size: Standard NV4as_v4 (4 vcpus, 14 GiB memory) – Azure Spot SKU with GPU.
  • Standard HDD disks – the cheapest disk available which will not impact CPU/GPU folding performance.
  • Gen 2 – this configuration allows for faster VM start/stop times.
  • Azure Bastion – this service allows use of a single IP to remotely access all VMs through the Azure portal.
  • OS: Windows 10 Pro – to optimize GPU performance with client OS drivers. No server infrastructure required.
  • Azure Region: South Central US – only 2 regions currently support Spot VMs, and South Central currently has the lowest demand.
  • Automation: a simple automation runbook from the Azure gallery will be used to start all available VMs every hour in case they get evicted.
  • Virtual Network – a simple virtual network with a single subnet was used for this deployment.

Deployment

Deploying 20 workers manually through the Azure portal would take quite a bit of time. Using ARM templates for deployment is much faster and will help keep your configuration consistent. The ARM template and parameter JSON file I used for this deployment can be found in the GitHub repo here: https://github.com/covid19folding/AzureSpotVMWorker
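As a rough illustration of scripting this for many workers, the sketch below builds one Azure CLI deployment command per VM and prints it without running anything. The resource group name, the `fah-worker` naming scheme, and the local file names `template.json` and `parameters.json` are assumptions; `virtualMachineName` and `networkInterfaceName` are the parameter names used by the template in this post. Any secure parameter the template requires, such as the admin password, is not set here and would still need to be supplied.

```python
import shlex
from typing import List


def build_deploy_commands(resource_group: str, count: int,
                          name_prefix: str = "fah-worker") -> List[List[str]]:
    """Build one `az deployment group create` command per worker VM."""
    commands = []
    for i in range(1, count + 1):
        vm_name = f"{name_prefix}-{i:02d}"  # hypothetical naming scheme
        commands.append([
            "az", "deployment", "group", "create",
            "--resource-group", resource_group,
            "--template-file", "template.json",    # assumed local file name
            "--parameters", "@parameters.json",    # assumed local file name
            # Inline key=value pairs override values from the parameters file.
            "--parameters",
            f"virtualMachineName={vm_name}",
            f"networkInterfaceName={vm_name}-nic",
        ])
    return commands


# Print the commands for review instead of executing them.
for cmd in build_deploy_commands("folding-rg", 3):
    print(shlex.join(cmd))
```

Printing first makes it easy to check the names before handing the commands to a shell or to `subprocess.run`.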

You may find it easier to deploy this template from the Azure portal where you can easily modify settings and view the deployment status, but know there are several different ways to deploy these templates. To deploy the template from the Azure portal, search for “template” in the top search bar and open Templates.

Click Add Template and give it a name and description.

Paste the template.JSON contents into the ARM Template.

Click Add to complete the Template creation. Select the new template and choose Deploy.

At the custom deployment screen, open “Edit Parameters.” Paste or import the parameters.JSON from the repo linked above for the easiest configuration. You will need to update these parameters to match your own existing Azure resources – the networkSecurityGroupId, virtualNetworkId, virtualMachineRG, and subnetName. You should also update the networkInterfaceName and virtualMachineName to follow your own naming scheme. Save the parameters when done.
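If you would rather edit the parameters programmatically, something like the following could work. It is a minimal sketch that assumes the file follows the usual ARM parameters layout (a top-level `parameters` object whose entries each hold a `value`); the sample document and every value in it are placeholders, not the contents of the repo's file.

```python
import copy
from typing import Any, Dict


def apply_overrides(parameters_doc: Dict[str, Any],
                    overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an ARM parameters document with selected values replaced."""
    updated = copy.deepcopy(parameters_doc)
    for name, value in overrides.items():
        # Fail loudly on a misspelled parameter instead of silently adding it.
        if name not in updated["parameters"]:
            raise KeyError(f"parameter not found in file: {name}")
        updated["parameters"][name]["value"] = value
    return updated


# Trimmed stand-in for the real parameters file (placeholder values only).
sample = {
    "contentVersion": "1.0.0.0",
    "parameters": {
        "virtualMachineRG": {"value": "original-rg"},
        "subnetName": {"value": "original-subnet"},
        "virtualMachineName": {"value": "original-vm"},
    },
}

mine = apply_overrides(sample, {
    "virtualMachineRG": "folding-rg",    # hypothetical resource group
    "subnetName": "default",             # hypothetical subnet
    "virtualMachineName": "fah-worker-01",
})
print(mine["parameters"]["virtualMachineRG"]["value"])  # folding-rg
```

The ID parameters (networkSecurityGroupId, virtualNetworkId) expect full resource IDs, which you can copy from each resource's properties in the portal.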

Fill in the Resource Group and enter a Password for your login. When ready, accept the terms and conditions and purchase.

You can view the deployment status in the notification area. Open it to view the details. Any errors will also show here.

Note: this deployment will not configure a Public IP to access your VM. This is to attain the lowest cost possible for worker performance. You may need to add a Public IP to your VM for remote access. If deploying several VMs, you may want to use Azure Bastion to access all VMs remotely in a secure fashion with a single Public IP.

Go to your new VM resource when the deployment is complete.

Here you can see the folding worker VM was deployed using an Azure Spot instance, in the proper region, with the proper size and name from the provided parameters.

The last step is to create an Azure Automation account to handle the automatic start of this VM and any others added to the resource group. This is essential for Spot VMs, since they can be evicted and deallocated at any time if there’s no availability. Create an Automation resource from the portal.

Open your new Automation resource and find the “Start Azure V2 VMs” runbook in the Runbook gallery.

Import this runbook to use it in your environment. This particular runbook has the ability to start all VMs in your folding resource group.

Find your new runbook and open it.

Create a new schedule for your runbook and enter the details for your resource group, start date, and recurrence. I used an hourly recurrence, which was the most frequent available.

You can verify that your runbook is working properly by reviewing the Jobs blade. Each run will start any deallocated worker VM if capacity is available.

Your VM will now be able to stop and start on its own in an automated fashion. This is perfect for an Azure Spot workload.
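Conceptually, each scheduled run does something like the sketch below: look at the power state of every VM in the resource group and start the ones that are deallocated. This is not the gallery runbook's actual code. The function takes the power states and a start callback as inputs so the logic can be read on its own; in a real script those would come from the Azure SDK or CLI. The VM names are hypothetical.

```python
from typing import Callable, Dict, List

# Status code Azure reports in a VM's instance view when it is deallocated.
DEALLOCATED = "PowerState/deallocated"


def start_deallocated_vms(power_states: Dict[str, str],
                          start_vm: Callable[[str], None]) -> List[str]:
    """Call start_vm for every deallocated VM and return the names started."""
    started = []
    for name in sorted(power_states):
        if power_states[name] == DEALLOCATED:
            start_vm(name)
            started.append(name)
    return started


# Hypothetical snapshot of a resource group with one evicted worker.
states = {
    "fah-worker-01": "PowerState/running",
    "fah-worker-02": "PowerState/deallocated",
}
started = start_deallocated_vms(states, lambda name: print(f"starting {name}"))
print(started)  # ['fah-worker-02']
```

If no capacity is available when the start is attempted, the VM simply stays deallocated until a later run succeeds, which is why an hourly schedule is useful.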

VM Configuration for Folding

I won’t go into too much detail in this section because it’s pretty basic, especially for anyone who has run the Folding@Home client before. The software install should be automated using PowerShell or a Configuration Management tool for anything at scale. You could also build it into a custom VM image. These VMs are essentially being deployed as containers for the Folding workload.

First, you’ll need to connect to your VM using RDP or Bastion. Some configuration may be required for this.

The latest GPU drivers for the NV4as_v4 instance are available from Microsoft here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/n-series-amd-driver-setup

Since these are Azure Spot VMs that may be evicted at any time and then restarted from our automation runbook, we want these to log in and start working automatically. To take care of the automatic login, run the Sysinternals Autologon tool found here on the VM: https://docs.microsoft.com/en-us/sysinternals/downloads/autologon

Finally, you can install the Folding@Home client for the VM here: https://foldingathome.org/start-folding/ . I found that the Express install works fine, but you may need to configure the client to start working automatically. If you need a team to join, feel free to join ours – 236437!

Performance

So what’s the performance like? Folding@Home uses a point system where you earn points for completed CPU/GPU workloads on your local client. Each worker VM in this tutorial earns around 12,000 folding points per day when under full load. This is quite a bit lower than a physical workstation with a modern CPU/GPU – those typically earn closer to 100,000-200,000 points per day. But when you factor in the scale you can achieve through cloud computing combined with the low cost of Spot VMs, you can see how easily this can grow into a powerful contributor to the Folding@home project.
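To put the scale argument in numbers, here is the arithmetic for the 20-worker deployment described in this post, using the per-worker and workstation figures quoted above. The variable names are mine.

```python
POINTS_PER_WORKER_PER_DAY = 12_000               # per Spot VM under full load
WORKERS = 20                                     # size of the deployment in this post
WORKSTATION_POINTS_PER_DAY = (100_000, 200_000)  # typical physical workstation range

fleet_points = POINTS_PER_WORKER_PER_DAY * WORKERS
print(fleet_points)  # 240000

low, high = WORKSTATION_POINTS_PER_DAY
# The fleet is worth roughly this many physical workstations.
print(fleet_points / high, fleet_points / low)  # 1.2 2.4
```

So 20 workers land in the range of one to two and a half modern workstations, and the count can be raised as far as quota and budget allow.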

Cost

It’s important to note that Azure Spot pricing is variable and based entirely on regional demand. While it will never exceed the cost of a standard non-Spot VM, you probably don’t want to operate anywhere near that cost. To mitigate this, you can deploy your VM with a maximum price as the eviction limit instead of capacity only, which is what this tutorial uses. You can see this option when deploying a Spot VM manually through the portal, which also lets you compare pricing across regions for the selected VM size.
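In an ARM template, the choice between the two limits comes down to a few properties on the virtual machine resource. The sketch below builds that fragment as a Python dictionary. The property names (`priority`, `evictionPolicy`, `billingProfile.maxPrice`) are the ones ARM uses for Spot VMs, and a `maxPrice` of -1 means the VM is evicted for capacity reasons only. The helper function and the sample price are illustrative, and I am not assuming the repo's template exposes these as parameters.

```python
from typing import Any, Dict, Optional


def spot_vm_properties(max_price_per_hour: Optional[float] = None) -> Dict[str, Any]:
    """Build the Spot-related properties of a VM resource (illustrative helper)."""
    if max_price_per_hour is not None and max_price_per_hour <= 0:
        raise ValueError("max price must be positive; use None for capacity-only eviction")
    return {
        "priority": "Spot",
        "evictionPolicy": "Deallocate",
        # -1 tells Azure not to evict on price; the price is then capped
        # at the standard pay-as-you-go rate for the VM size.
        "billingProfile": {
            "maxPrice": -1 if max_price_per_hour is None else max_price_per_hour
        },
    }


print(spot_vm_properties())      # capacity-only eviction, as in this tutorial
print(spot_vm_properties(0.05))  # hypothetical cap of $0.05 per hour
```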

In this environment with 20 Spot VMs running in South Central, I was able to get the cost per VM down to around $0.40/day. For comparison, a non-Spot VM at the same size is currently priced around $5.60/day. That is a massive amount of savings! Even lower costs could be achieved by using a smaller custom disk for each VM compared to what is offered in the Azure gallery. I recommend monitoring daily costs for Spot VMs in Azure Cost Management.
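The savings are easy to check from the two daily prices quoted above. This is plain arithmetic on those figures; the 30-day month is an assumption for the monthly estimate.

```python
SPOT_PER_DAY = 0.40      # observed Spot cost per VM per day (USD)
STANDARD_PER_DAY = 5.60  # non-Spot cost for the same size per day (USD)
WORKERS = 20


def savings_fraction(spot: float, standard: float) -> float:
    """Fraction of the standard price saved by running on Spot."""
    return 1 - spot / standard


print(f"{savings_fraction(SPOT_PER_DAY, STANDARD_PER_DAY):.1%}")  # 92.9%
print(f"${SPOT_PER_DAY * WORKERS:.2f} vs ${STANDARD_PER_DAY * WORKERS:.2f} per day for the fleet")
print(f"${SPOT_PER_DAY * 30:.2f} per VM for a 30-day month")  # $12.00 per VM for a 30-day month
```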

Another interesting note is that, as of this writing, none of my 20 worker nodes in the South Central US region have been evicted. I deployed the same workers in East US and they were evicted within a day – probably due to lower GPU Spot VM availability in that region. It’s advisable to set up alerts for these evictions for visibility – see my Azure Monitor post for more info on how to do that.

With proper management and automation, Azure Spot VMs can be a powerful, cost-effective tool for your cloud infrastructure workloads.


10 thoughts on “How To Run Folding@home on Azure Spot VMs to Help Fight COVID-19”

    1. Both are great ways to contribute, but I think the Spot VMs are more efficient with their very low pricing. The Spot VMs allow you to do GPU folding, which is much more efficient than the CPU batch jobs in your link. You can run a Spot VM with a GPU attached for around $12 for the entire month. If you were running a non-Spot VM, it would be 10x that price, so probably not worth it. The batch job is easier to deploy and scale – the Spot VM solution takes a little more configuration in my opinion. The cost/performance ratio may also change as Spot VMs become widely available and demand changes.

      1. Ok thanks Josh,

        I’m having a bit of trouble understanding the steps, although I’d consider myself enthusiast level tech savvy. I’ve never had experience with setting anything on Azure before. But I’ll have another look and see if I can work it out.

        One thing that isn’t clear is you mentioned:
        “You will need to update these parameters to match your own Azure resources – the networkSecurityGroupId, virtualNetworkId, virtualMachineRG, and subnetName.”

        Can I just specify something random for that? Or do I have to find that somewhere in my account?

        1. So that ARM template will deploy resources to an existing Resource Group and Virtual Network. You have to have that in place first – and all of those parameters can be found on the resource properties. It’s a pretty basic script in the way… a fancier script could probably check for those resources and create them if needed, or pull the existing properties.

          If you’re newer to Azure, you might find it easier to deploy the VM manually, at least initially. That wizard will create the Virtual Network, Resource Group, etc automatically. You can also choose Azure Spot and the VM size in the portal deployment process. The script I shared is just an export from deploying the first worker VM in the portal.

          Hope that helps!

  1. I keep getting an error when I try to provision via the template. When I try to spin up a Windows 10 VM manually, the NV4as_v4 is greyed out with NV6 being the cheapest available. However, when I go to review and validate, the validation fails with essentially the same error. Is this something you are seeing or have seen before? This only seems to be the case with Spot VMs.

    Error as follows:
    The template deployment failed with error: ‘The resource with id: ‘*’ failed validation with message: ‘The requested size for resource ‘*’ is currently not available in location ‘southcentralus’ zones ” for subscription ‘*’. Please try another size or deploy to a different location or zones. See https://aka.ms/azureskunotavailable for details.’.’. (Code: InvalidTemplateDeployment)

    1. Hey Matt, it’s possible that no Spot VMs are available in the region based off that error. That can change by the minute. You may also want to check your subscription and see what kind of quota is set for Spot VM cores – you may need to request an increase to deploy these.

  2. Hi Josh. Thanks for this great post! A quick question hopefully…

    How did you install the client software and drivers? Did you install those manually on each VM? Or, did you create something like a custom image, etc…?

    Thanks again.

    1. Hey Brian,

      The client software is really simple so I didn’t automate it in the guide. For the systems I deployed, I installed it manually, but you could easily script it or pre-install it in a custom VM image, which would also be a smaller and cheaper disk in Azure!
