This week we had a bit of an issue where our self-hosted build agents started failing when we ran a build. Thanks to Dean Lewis on our team, who spotted a GitHub issue which helped identify the root cause. I then needed to modify our environment build Terraform to implement a workaround for this issue, which is discussed in the rest of this article below.
Error Message
The error message we were getting is:
.NET Framework 4.8 was installed, but a reboot is required
Overview of the Solution
If you just want to know what we did to fix it, I'll outline that here to allow you to get it resolved, but if you want to learn a bit more about our situation then I'll go into more detail below.
The fix was:
- Our self-hosted build agent uses an extension which runs a custom PowerShell script to install a couple of prerequisites
- One of these pre-reqs is Chocolatey
- We modified the script so it installs v1.4.0 of Chocolatey rather than the latest version (see the snippet just below this list)
- That fixes the problem
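In script terms the fix is a one-line change: set the chocolateyVersion environment variable before the existing Chocolatey install line (the full script and the exact change we made are shown further down).
# Pin Chocolatey to 1.4.0 so install.ps1 does not try to pull in .NET Framework 4.8
$env:chocolateyVersion = '1.4.0'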
Scenario
In our scenario we have a self-hosted build agent which is used for a data platform built using Azure Synapse. It is all privately networked and the build agent is used from Azure DevOps to run builds for things that need to be deployed inside the network, such as the Synapse updates.
Recently our builds which had been working fine started getting an error (outlined above).
On our self-hosted build agent we need the agent to install a few pre-reqs, and we use Chocolatey to install them. We are using the Visual Studio 2019 latest agent image.
To install the pre-reqs we have a PowerShell script in a storage account which is configured as a custom extension that the VM scale set automatically runs when spinning up an instance.
PowerShell Script
Below is the PowerShell we run in the extension.
#Install Chocolatey so we can simplify the install of some tools and apps for the build agent
Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
#Install the Azure CLI
choco install azure-cli -y
#After we install the Azure CLI we need to add to the path environment variable so that az commands will work
$env:Path += ";C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin"
#Install NuGet for PowerShell Gallery installations
Install-PackageProvider NuGet -Force
Import-PackageProvider NuGet -Force
#Set the powershell gallery as trusted
Set-PSRepository -Name PSGallery -InstallationPolicy Trusted
We then use Terraform to upload the PowerShell script to storage as part of the build using the resource below. Note that I have other Terraform resources which create the storage account etc.
resource "azurerm_storage_blob" "custom_extension_powershell_script" {
name = local.file_name_powershell_prereq_script
storage_account_name = azurerm_storage_account.build_agent_public_storage.name
storage_container_name = azurerm_storage_container.custom_extensions.name
type = "Block"
source = "BuildAgent-CustomExtension.ps1"
metadata = {
#This tag is used to make the script be refreshed everytime
last_updated = timestamp()
}
}
VM Scale-Set
I have a Terraform resource which builds a VM scale set, which is defined below.
#Virtual machine scale set for DevOps pipeline agents
resource "azurerm_windows_virtual_machine_scale_set" "vm_scaleset_build_agent" {
  name                   = "vm-build-agent-${lower(var.environment_name)}"
  resource_group_name    = data.azurerm_resource_group.resource_group.name
  location               = data.azurerm_resource_group.resource_group.location
  single_placement_group = false
  overprovision          = false
  instances              = 0
  sku                    = "Standard_D2s_v3"
  admin_username         = "[Removed]"
  admin_password         = random_password.build_agent_admin_password.result
  computer_name_prefix   = "edms${lower(var.environment_name)}"

  source_image_reference {
    publisher = "microsoftvisualstudio"
    offer     = "visualstudio2019latest"
    sku       = "vs-2019-ent-latest-ws2019"
    version   = "latest"
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  network_interface {
    name    = "nic-vmss-build-agent-${lower(var.environment_name)}"
    primary = true

    ip_configuration {
      name      = "internal"
      primary   = true
      subnet_id = data.azurerm_subnet.build_agent_subnet.id
    }
  }

  tags = merge(
    var.default_tags,
    {
      #This tag is needed by Azure DevOps and must match the name of the DevOps agent pool
      __AzureDevOpsElasticPool = var.devops_build_agent_pool_name
      logicalResourceName      = "vm-build-agent"
      UsedBy                   = "Azure DevOps Build Agent"
      Description              = "Scale set used as a private self-hosted build agent"
      TerraformReference       = "azurerm_windows_virtual_machine_scale_set.vm_scaleset_build_agent"
    }
  )

  lifecycle {
    ignore_changes = [
      tags["__AzureDevOpsElasticPool"],
      tags["__AzureDevOpsElasticPoolTimeStamp"]
    ]
  }
}
I then apply a custom extension to that VM scale set using the Terraform below. This points the extension at the PowerShell script in my storage account and runs that PowerShell on start-up.
resource "azurerm_virtual_machine_scale_set_extension" "custom_powershell_extension" {
name = "Install-BuildAgent-PreReqs"
virtual_machine_scale_set_id = azurerm_windows_virtual_machine_scale_set.vm_scaleset_build_agent.id
publisher = "Microsoft.Compute"
type = "CustomScriptExtension"
type_handler_version = "1.9"
#This is to allow changing this value to force an update
force_update_tag = "1"
protected_settings = <<PROTECTED_SETTINGS
{
"commandToExecute": "powershell.exe -Command \"./${azurerm_storage_blob.custom_extension_powershell_script.name}; exit 0;\""
}
PROTECTED_SETTINGS
settings = <<SETTINGS
{
"fileUris": [
"${azurerm_storage_blob.custom_extension_powershell_script.url}"
]
}
SETTINGS
}
The problem we identified was that when Azure DevOps starts a VM instance to run a build, the VM fails to execute the extension, reporting that a restart is required because .NET 4.8 needs a reboot, so the extension fails and the DevOps pipeline fails as a result.
Root cause
We spent some time looking into the issue and Dean spotted a GitHub issue about this problem. The link below indicates there is a newer version of Chocolatey which changes Chocolatey's dependency to .NET 4.8. When we run the script to install Chocolatey it will automatically try to install .NET 4.8 if it is not present on the machine, and this causes the agent extension to fail because a restart is required. Restarting the instance could itself cause problems, because the DevOps agent might want a clean agent from the VM scale set for each build.
https://github.com/microsoft/AL-Go/issues/560
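If you want to confirm this on an affected instance, a quick check (a minimal sketch, assuming you can open a PowerShell session on the agent) is to read the .NET Framework release number from the registry and look for a pending reboot:
# Release 528040 or higher means .NET Framework 4.8 or later is installed
$release = (Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full' -Name Release).Release
if ($release -ge 528040) {
    Write-Host ".NET Framework 4.8 or later is installed (Release $release)"
} else {
    Write-Host ".NET Framework 4.8 is not installed (Release $release)"
}

# A reboot left pending by the .NET 4.8 install shows up under Component Based Servicing
$rebootPending = Test-Path 'HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending'
Write-Host "Reboot pending: $rebootPending"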
How we handled it
We had a couple of options here. We could look to get a build agent image with .NET 4.8 already installed, but that was something that would take some time to sort out, and we might find other issues in moving to a newer image at this time.
Fortunately we can work around this quite easily. We can set an environment variable in the script to install a specific version of Chocolatey. We added the below to the top of the PowerShell script.
# We add this variable to force the Chocolatey install to use this version, which doesn't have a dependency on
# .NET 4.8 and so will not require a reboot on the build agent
$env:chocolateyVersion = '1.4.0'
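The Chocolatey install script picks this variable up from the current session, so it just needs to be set before the existing install line runs. If you want to verify it on a freshly provisioned agent (a small sketch, assuming choco is on the path), you can check the installed version:
# Confirm the pinned Chocolatey version on a freshly provisioned agent
$chocoVersion = choco --version
if ($chocoVersion -ne '1.4.0') {
    Write-Warning "Expected Chocolatey 1.4.0 but found $chocoVersion"
}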
We then ran Terraform, which updated the build agent and the PowerShell script in storage to this modified version.
We could then run a build; it installed the older version of Chocolatey and the build that was failing now works.
Longer term we plan to update the build agent to an image with .NET 4.8 already installed, and probably one with the Azure CLI already installed too. The Azure CLI was the original reason we needed the custom script, as its install needs admin privileges, which we would not use inside the pipeline run but can use in the extension execution.
There is info about installing a specific version of chocolatey here:
https://docs.chocolatey.org/en-us/choco/setup#installing-a-particular-version-of-chocolatey
Lessons Learnt
One of the lessons learnt here was that we hadn't run a build for a few weeks while some development was happening, so we didn't spot this issue until an inconvenient time. If we had identified it sooner we could have spent more time exploring updating the build agent rather than working around the issue.
One thing we do have (which I'd also recommend as a good practice if you have a self-hosted build agent) is a diagnostics pipeline in DevOps. This pipeline is configured to run on the build agent and it executes a simplified version of some tasks to verify that the builds we have should run successfully. In my case the diagnostics pipeline did fail when we ran it.
I have changed the diagnostics pipeline to run on a weekly schedule so we can try to mitigate issues like this in the future by finding them more quickly.
Below is an example of my diagnostics pipeline, where I check that some of the pre-reqs are installed on the machine and that we can run some Terraform commands etc.
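As a rough sketch of the kind of check it runs (the tool names here reflect our pre-reqs, and the shape of the script is illustrative rather than the exact pipeline steps), the core step is just PowerShell along these lines:
# Check the pre-reqs installed by the custom extension are available on the agent
$tools = @('choco', 'az', 'terraform')
foreach ($tool in $tools) {
    if (Get-Command $tool -ErrorAction SilentlyContinue) {
        Write-Host "$tool is available"
    } else {
        Write-Error "$tool is missing on this build agent"
    }
}

# A lightweight Terraform command to prove the agent can run Terraform at all
terraform version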