Recently I was helping to troubleshoot a performance issue with a web app on Azure App Service. It was an interesting issue to troubleshoot, so I am making some notes for future reference.
The Quick Read
If you have VNet integration set up and are having performance issues, check that the setting below is present in the app settings:
WEBSITE_DNS_SERVER = 168.63.129.16
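App Service exposes app settings to the running process as environment variables, so the setting can also be checked from code. The snippet below is a minimal sketch of that check. The function name check_dns_setting and the messages it returns are made up for illustration and are not part of any Azure SDK.

```python
import os

# Azure DNS address from the setting above.
AZURE_DNS_IP = "168.63.129.16"


def check_dns_setting(env=os.environ):
    """Return a short status message about WEBSITE_DNS_SERVER.

    `env` is any mapping of settings; it defaults to the process
    environment. This helper is hypothetical, for illustration only.
    """
    value = env.get("WEBSITE_DNS_SERVER")
    if value is None:
        return "WEBSITE_DNS_SERVER is not set"
    if value.strip() == AZURE_DNS_IP:
        return "WEBSITE_DNS_SERVER points at Azure DNS"
    return f"WEBSITE_DNS_SERVER is set to {value}, not {AZURE_DNS_IP}"


if __name__ == "__main__":
    print(check_dns_setting())
```

Logging this once at startup makes it easy to spot an environment where the setting was missed.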
Longer Read
Symptoms
The symptoms of the problem were:
- The code in the app was working: the API call did everything it was supposed to do and returned data successfully.
- On the first call after a period of inactivity, we would see an extended call duration of around a minute.
- Subsequent calls were usually a lot quicker.
- The App Insights logs and dependencies showed that every call out to an external dependency was taking around 7 seconds. This included calls to Dataverse, SQL Azure and Azure AD.
Setup
The Web App was configured with VNet integration for outbound traffic, and we had firewall rules on Key Vault and SQL to restrict traffic to the VNet. Note that we didn't have private endpoints enabled.
The VNet was using a local DNS server rather than Azure DNS, and it was peered with other VNets in a hub-and-spoke model to support on-premises integration scenarios.
Diagnosis
We did various bits of investigation, and all of the VNet integration looked to be set up correctly. The code would run locally without any issues, and we ruled out anything in the startup code that might be causing a problem. It really felt like it might be DNS related.
I then remembered the old magic setting WEBSITE_DNS_SERVER. I've never really explored what happens if this setting isn't added, but it looked like the outbound traffic from the web app was using the on-premises DNS to resolve addresses, and there was a latency overhead in doing that on the first hit. We added the WEBSITE_DNS_SERVER setting, which points the app at Azure DNS (168.63.129.16), and the resolution time for the public dependencies dropped immediately. The duration of that first call went from 57 seconds overall to around 400ms, and the calls out to external dependencies that had been taking 7 seconds dropped to 30ms or so.
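If you want to test the DNS hunch directly rather than infer it from dependency timings, one option is to time name resolution from inside the app's environment. The snippet below is a rough sketch using Python's standard socket.getaddrinfo. It is not what we ran at the time. The helper names time_resolution and report are made up, and example.database.windows.net is a placeholder to swap for your own dependency hostnames.

```python
import socket
import time


def time_resolution(host, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Resolve `host` once and return (succeeded, elapsed_milliseconds).

    `resolver` is injectable so the timing logic can be exercised
    without touching the network.
    """
    start = clock()
    try:
        resolver(host, 443)
        ok = True
    except socket.gaierror:
        ok = False
    return ok, (clock() - start) * 1000.0


def report(hosts, resolver=socket.getaddrinfo):
    """Return one summary line per hostname."""
    lines = []
    for host in hosts:
        ok, ms = time_resolution(host, resolver)
        status = "ok" if ok else "FAILED"
        lines.append(f"{host}: {status} in {ms:.0f} ms")
    return lines


if __name__ == "__main__":
    # Placeholder hostnames: replace with the dependencies your app calls.
    for line in report(["login.microsoftonline.com", "example.database.windows.net"]):
        print(line)
```

Running it before and after changing the setting gives a direct comparison. Results may be cached between runs, so the first lookup after a period of inactivity is the number to watch.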
One to remember for next time. I hope this helps a few other people.