Sometimes when troubleshooting a cost spike the root cause isn't easy to identify. I had that myself this week.
Fortunately I have been using Turbo360 to keep an eye on my credit usage, so I caught this sooner and didn't waste too much money, but it was interesting to troubleshoot so I thought it was worth sharing.
Responding to the Alert
I got an email alert about my budget going red and I took a look at the Turbo360 portal. You can see one of my resource group costs has gone up.

I did a little more analysis on this and I can see the cost driver is actually coming from my Azure Monitor costs for logs rather than the Logic App.

From this I can tell that my Logs normally cost next to nothing (green bar) and my Logic App plan is a stable cost each day.
I next took a look at the costs for individual resources over the course of the month and I can see my logs are costing way more than normal.

I clicked on the name to open up the single view of a resource and I can see there is definitely something funky going on.

Looking at the meter costs I can see the cost is all associated with pay-as-you-go data ingestion, so I must be doing something on the plan that is different to normal.

I know I have a few demos on the plan so I decided to run some queries to see what is happening in the logs.
Which Table is using the most Log Data?
The first query was to identify which part of the log is used more than usual. I could see there are a lot of errors, and the volume of traces is high too.

The below query allows me to see the amount of data logged in each table in the last 30 days.
let since = 30d;
traces
| where timestamp > ago(since)
| summarize Events = count() | extend Table="traces"
| union (
requests
| where timestamp > ago(since)
| summarize Events = count() | extend Table="requests"
)
| union (
dependencies
| where timestamp > ago(since)
| summarize Events = count() | extend Table="dependencies"
)
| union (
exceptions
| where timestamp > ago(since)
| summarize Events = count() | extend Table="exceptions"
)
| union (
pageViews
| where timestamp > ago(since)
| summarize Events = count() | extend Table="pageViews"
)
| union (
availabilityResults
| where timestamp > ago(since)
| summarize Events = count() | extend Table="availabilityResults"
)
| order by Events desc
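As an aside, the repeated unions above can be collapsed with `union withsource=`, and if you want an approximation of data volume rather than just event counts, the `estimate_data_size()` function helps. Note this estimates row size, not the exact billed ingestion, so treat it as a rough guide:

```kusto
// Event counts and approximate data size per table over the last 30 days
let since = 30d;
union withsource=Table traces, requests, dependencies, exceptions, pageViews, availabilityResults
| where timestamp > ago(since)
| summarize Events = count(),
            EstimatedMB = round(sum(estimate_data_size(*)) / 1024.0 / 1024.0, 2)
  by Table
| order by EstimatedMB desc
```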
Which apps are logging the most?
I then wanted to know which of my apps are logging the most. The query below showed me that there was only one app logging events at the moment.
let since = 30d;
union isfuzzy=true traces, requests, dependencies, exceptions, pageViews, availabilityResults
| where timestamp > ago(since)
| summarize TotalEvents = count() by cloudRole = iff(isempty(cloud_RoleName), "(none)", cloud_RoleName)
| order by TotalEvents desc
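If more than one app had been active, a variant of the same query that also breaks the counts down by table would show which telemetry type each app is producing:

```kusto
// Event counts per app per table over the last 30 days
let since = 30d;
union withsource=Table traces, requests, dependencies, exceptions, pageViews, availabilityResults
| where timestamp > ago(since)
| summarize Events = count()
  by cloudRole = iff(isempty(cloud_RoleName), "(none)", cloud_RoleName), Table
| order by Events desc
```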
Let's look at the errors
I then checked the exceptions table to have a look at some of those errors.
exceptions
| order by timestamp desc
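Rather than scrolling through raw rows, a quick summarize by exception type and message surfaces the noisiest errors straight away (outerMessage is the standard message column on the exceptions table):

```kusto
// Group recent exceptions by type and message to find the noisiest ones
exceptions
| where timestamp > ago(7d)
| summarize Occurrences = count(), LastSeen = max(timestamp) by type, outerMessage
| order by Occurrences desc
```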
I can see the below error in it
Messaging entity 'sb://[Mikes-ServiceBus].servicebus.windows.net/ms-railcar-gps-events' is currently disabled. For more information please see https://aka.ms/ServiceBusExceptions . TrackingId:4f84dc3e301a42aab38b5f3675b5222d_G42, SystemTracker:gateway10, Timestamp:2025-08-19T16:15:15 (MessagingEntityDisabled). For troubleshooting information, see https://aka.ms/azsdk/net/servicebus/exceptions/troubleshoot.
I can also see in App Insights there are a lot of Service Bus exceptions.

Aha Moment!
This is when I realized the stupid thing I'd done. A few days ago I'd been messing around with this demo, showing off the fancy feature of disabling a Service Bus queue so that you can stop a receiver from receiving messages.
In my excitement I had forgotten to enable the queue again once I'd finished.
On the face of it this shouldn't cause an issue, because no one can receive messages, so what can go wrong, right?
Well, that's where you're wrong!
The problem is that my Logic App was still trying to process messages from the queue even though I'd disabled it for receivers. You can see below that my Logic App uses the Service Bus connector to receive messages.

The other day I had looked at this Logic App when I noticed my demo wasn't working, and I re-enabled the queue to get it going again. I did notice there was a gap in the trigger history, but most importantly there were no errors in the trigger history!

I also noticed there were no workflow trigger failures in the metrics for the Logic App, looking at the Workflow Trigger Failed Rate counter at the app level rather than the workflow level. This is problematic because errors are happening but they are not showing up here.

The issue here was that the Logic App was still polling quite aggressively for messages, and it was logging errors whenever it couldn't connect to Service Bus. I can use the below query to see how many trace events per hour came from the Service Bus listener getting an error.
traces
| where customDimensions.Category == "Microsoft.Azure.WebJobs.ServiceBus.Listeners.ServiceBusListener"
| where customDimensions.LogLevel == "Error"
| summarize Events = count() by Hour = bin(timestamp, 1h)
| order by Hour desc
I could also use the below query to see how many exceptions were being logged per hour.
exceptions
| where cloud_RoleName == "ms-blog-railcar-gps"
| where type == "Azure.Messaging.ServiceBus.ServiceBusException"
| summarize Events = count() by Hour = bin(timestamp, 1h)
| order by Hour desc
In my case, between those two tables I'm logging around 30k events per hour just for errors saying the Logic App can't connect to Service Bus.
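To get that combined per-hour figure in one go, the two queries can be unioned (the role name here is my app's; swap in your own):

```kusto
// Combined hourly error volume from listener traces and Service Bus exceptions
union
    (traces
     | where customDimensions.Category == "Microsoft.Azure.WebJobs.ServiceBus.Listeners.ServiceBusListener"
     | where customDimensions.LogLevel == "Error"),
    (exceptions
     | where cloud_RoleName == "ms-blog-railcar-gps"
     | where type == "Azure.Messaging.ServiceBus.ServiceBusException")
| summarize Events = count() by Hour = bin(timestamp, 1h)
| order by Hour desc
```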
How did I fix the cost issue?
This was quite simple. I enabled the queue and everything returned to normal.
What should I have done?
When I disabled the queue I should also have disabled the Logic App or the workflow. This would have stopped it polling for messages.
What can I do to catch this next time rather than just relying on my cost guard rail?
The cost alert was a backstop that caught my issue here because it was affecting my costs, but ideally I'd like to catch this sooner from an operational perspective.
Unfortunately the Logic App itself isn't reporting any trigger failures or counters going into an unhealthy state.
There are, however, some other options that can help me.
Option 1 – Monitor App Insights with a query
I can use the below query to monitor the logs in app insights for any service bus listener messages which are reporting an error.
traces
| where cloud_RoleName == "ms-blog-railcar-gps"
| where customDimensions.Category == "Microsoft.Azure.WebJobs.ServiceBus.Listeners.ServiceBusListener"
| where customDimensions.LogLevel == "Error"
| order by timestamp desc
One option may be to take this query, add a summarize, and use Turbo360 to poll it periodically and raise an alert if my Logic App isn't talking to my queue.
In my case, what I ended up doing is monitoring a modified version of the above query: if I detect more than 1000 log entries in the last 5 minutes from the trigger trying to call Service Bus, I use Turbo360 to trigger an action that disables my Logic App, as there is probably a problem.
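As a sketch, the modified alert query looks something like this (the 5 minute window and 1000 threshold are the values I chose; tune them for your own traffic):

```kusto
// Alert-style query: returns a row only when listener errors exceed the threshold
traces
| where timestamp > ago(5m)
| where cloud_RoleName == "ms-blog-railcar-gps"
| where customDimensions.Category == "Microsoft.Azure.WebJobs.ServiceBus.Listeners.ServiceBusListener"
| where customDimensions.LogLevel == "Error"
| summarize Errors = count()
| where Errors > 1000
```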
Option 2 – Monitor Queue State with Turbo360
The second option here is to monitor the state of the Service Bus queue and raise an alert if the queue is not in the active state.

Option 3 – Monitor queue length
I could monitor the queue length, and if I start building up a backlog of messages I can raise an alert for this.
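If you also route Service Bus metrics to a Log Analytics workspace via diagnostic settings, a backlog check can be expressed as a query too. This is a sketch that assumes the AzureMetrics table is populated for the namespace; the threshold of 100 is just an illustration:

```kusto
// Alert when the Service Bus active message backlog grows beyond a threshold
AzureMetrics
| where ResourceProvider == "MICROSOFT.SERVICEBUS"
| where MetricName == "ActiveMessages"
| where TimeGenerated > ago(15m)
| summarize Backlog = max(Maximum) by Resource
| where Backlog > 100
```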

Option 4 – Monitor Service Bus User Errors
I can monitor for user errors accessing Service Bus, which indicate an application might be having problems accessing messages. This is a bit more generic, as it applies across the namespace.

What about options to tune the logging?
I have blogged before about options to tune logging. I think in this case one of the challenges is that tuning might then affect supportability in other ways. This article talks about some options. Although it's about Function Apps, the same rules apply, as I'm using the built-in connectors here.
For example, you might include the Exception telemetry type in the sampling configuration.
