PagerDuty fires. The alert reads: checkout-service p99 latency > 2s (currently 4.7s), triggered 3 min ago. You already know what is slow. You need to know why.
You open your AI assistant, connected to SigNoz via the MCP server, and start asking.
## Prerequisites
- Connect your AI assistant to SigNoz using the MCP Server guide.
- Make sure your services are instrumented with distributed tracing. See Instrument Your Application if you haven't set this up.
## Step 1: Inspect a Slow Trace
> Show me traces from checkout-service slower than 2 seconds in the last 30 minutes. Break down the spans for the slowest one.
The span tree comes back:
    POST /api/checkout (checkout-service, 4,712ms)
    |-- ValidateCart (checkout-service, 8ms)
    |-- GetCustomerProfile (customer-service, 41ms)
    |-- ProcessPayment (payment-service, 4,480ms) <-- 95% of total
    |   |-- ChargeCard (stripe-gateway, 4,430ms)
    |-- SendConfirmation (notification-service, skipped, upstream failure)
Nearly all of the request time, 4,430 of 4,712ms, is spent in the ChargeCard call to the Stripe gateway.
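The bottleneck-hunting step above can be sketched as a short script. The span records mirror the trace shown, but the field names are illustrative, not the SigNoz span schema:

```python
# Hypothetical span records shaped like the trace above; "name", "service",
# and "duration_ms" are illustrative field names, not the SigNoz API schema.
spans = [
    {"name": "POST /api/checkout", "service": "checkout-service", "duration_ms": 4712},
    {"name": "ValidateCart", "service": "checkout-service", "duration_ms": 8},
    {"name": "GetCustomerProfile", "service": "customer-service", "duration_ms": 41},
    {"name": "ProcessPayment", "service": "payment-service", "duration_ms": 4480},
    {"name": "ChargeCard", "service": "stripe-gateway", "duration_ms": 4430},
    {"name": "SendConfirmation", "service": "notification-service", "duration_ms": 0},
]

root = spans[0]

def share_of_total(span, root=root):
    """Fraction of the root request's wall time spent in this span."""
    return span["duration_ms"] / root["duration_ms"]

# The bottleneck is simply the longest non-root span.
bottleneck = max(spans[1:], key=lambda s: s["duration_ms"])
```

Here `bottleneck` is ProcessPayment at ~95% of the request, and walking one level deeper shows almost all of that is its ChargeCard child.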
## Step 2: Is This All Requests or Just the Tail?
> Show me the p50 and p99 latency for checkout-service /api/checkout over the last 2 hours, broken down in 5-minute intervals.
Both p50 and p99 were stable at ~400ms until 1:47 AM, then both jumped. p50 is at 3.8s, p99 at 4.7s. This is not tail latency. Nearly every request is affected. Something broke at 1:47 AM.
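The p50-vs-p99 reasoning can be made concrete with a minimal sketch. It assumes you have raw latencies for one 5-minute bucket; the 400ms baseline and the 3x threshold are illustrative numbers taken from this incident, not general-purpose defaults:

```python
import statistics

def bucket_percentiles(latencies_ms):
    """p50 and p99 for one time bucket, via linear interpolation."""
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return qs[49], qs[98]  # 50th and 99th percentile cut points

def classify(p50_ms, p99_ms, baseline_p50_ms=400, threshold=3.0):
    """A tail-only regression moves p99 but not p50; a systemic one moves both.
    Baseline and threshold are illustrative, tuned to this incident."""
    p50_bad = p50_ms > threshold * baseline_p50_ms
    p99_bad = p99_ms > threshold * baseline_p50_ms
    if p50_bad and p99_bad:
        return "systemic"
    if p99_bad:
        return "tail"
    return "healthy"
```

With p50 at 3.8s and p99 at 4.7s, this classifies the incident as systemic: the whole distribution shifted, not just the slowest requests.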
## Step 3: Compare With a Healthy Trace
> Find me a trace from checkout-service between 2 and 3 hours ago where duration was under 500ms.
A healthy trace from before the spike:
    POST /api/checkout (checkout-service, 387ms)
    |-- ValidateCart (checkout-service, 6ms)
    |-- GetCustomerProfile (customer-service, 38ms)
    |-- ProcessPayment (payment-service, 291ms)
    |   |-- ChargeCard (stripe-gateway, 248ms)
    |-- SendConfirmation (notification-service, 31ms)
Same call chain. The only difference: ChargeCard went from 248ms to 4,430ms. The problem is not in your code. It is downstream.
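The healthy-vs-slow comparison is a per-span diff. A sketch, using the durations from the two traces above (the flat dict shape is illustrative, real spans are nested):

```python
# Per-span durations (ms) from the healthy and slow traces above.
healthy = {"ValidateCart": 6, "GetCustomerProfile": 38, "ProcessPayment": 291,
           "ChargeCard": 248, "SendConfirmation": 31}
slow = {"ValidateCart": 8, "GetCustomerProfile": 41, "ProcessPayment": 4480,
        "ChargeCard": 4430, "SendConfirmation": 0}

# Rank spans by absolute regression.
deltas = sorted(((name, slow[name] - healthy[name]) for name in healthy),
                key=lambda kv: kv[1], reverse=True)

# ProcessPayment's exclusive (self) time barely moved, so the regression
# lives entirely in its ChargeCard child, not in payment-service's own code.
self_delta = (slow["ProcessPayment"] - slow["ChargeCard"]) - \
             (healthy["ProcessPayment"] - healthy["ChargeCard"])
```

ProcessPayment tops the raw deltas only because it contains ChargeCard; its own self time regressed by single-digit milliseconds, which is what lets you rule out your code.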
## Step 4: Check the Dependency
> Show me p99 latency for payment-service over the last 2 hours in 5-minute intervals. Also pull any error or warning logs from payment-service in the last 30 minutes.
Payment-service latency spiked at the exact same time. The logs show the cause:
    01:47:12 WARN Stripe endpoint config reloaded: region changed us-east-1 -> eu-west-1
    01:47:14 WARN ChargeCard latency elevated (2,341ms), retrying
    01:47:15 ERROR ChargeCard timeout after 5000ms
    01:47:18 WARN ChargeCard latency elevated (4,102ms)
A config change at 1:47 AM switched the Stripe endpoint to a different region. Every charge request is now making a cross-Atlantic round trip.
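The cause-before-symptom ordering is what makes the config change the prime suspect. A sketch of that check, parsing the log lines above (the timestamp-level-message format is this example's, not a fixed SigNoz log schema):

```python
from datetime import time

# The payment-service log lines from above.
logs = [
    "01:47:12 WARN Stripe endpoint config reloaded: region changed us-east-1 -> eu-west-1",
    "01:47:14 WARN ChargeCard latency elevated (2,341ms), retrying",
    "01:47:15 ERROR ChargeCard timeout after 5000ms",
    "01:47:18 WARN ChargeCard latency elevated (4,102ms)",
]

def parse(line):
    """Split 'HH:MM:SS LEVEL message' into (time, level, message)."""
    ts, level, msg = line.split(" ", 2)
    h, m, s = map(int, ts.split(":"))
    return time(h, m, s), level, msg

events = [parse(line) for line in logs]
config_events = [e for e in events if "config" in e[2]]
symptoms = [e for e in events if "latency" in e[2] or "timeout" in e[2]]

# The config reload precedes every latency symptom: a cause, not an effect.
root_cause_first = config_events[0][0] < symptoms[0][0]
```

The config reload at 01:47:12 strictly precedes the first latency warning at 01:47:14, which is the ordering you want to confirm before blaming a change.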
## Step 5: Quantify and Decide
> Show me total request count and error rate for checkout-service over the last 2 hours in 5-minute intervals. What percentage of requests are slower than 2 seconds?
847 requests since the spike. 94% are over 2 seconds. Error rate is 12% (timeouts). The trend is flat, not worsening, but nearly every customer is getting a degraded experience. You revert the config change.
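The blast-radius arithmetic is simple enough to sketch. The counts below reproduce the figures in this step and are illustrative, not pulled from a live API:

```python
def blast_radius(total, over_slo, errors):
    """Percent of requests over the SLO and percent erroring."""
    return round(100 * over_slo / total), round(100 * errors / total)

# Illustrative counts matching this incident: 847 requests since the spike,
# 796 over the 2s SLO, 102 timed out.
slow_pct, error_pct = blast_radius(total=847, over_slo=796, errors=102)
```

A 94% degradation rate with a flat trend is the number that justifies an immediate rollback rather than further investigation.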
## Tips for Your Own Investigations
- Check percentiles, not just p99. If p50 is fine but p99 is bad, only a subset of requests are slow. If both are bad, something systemic broke.
- Follow the dependency chain. If the bottleneck span is a call to another service, check that service directly. Correlate latency spikes and error logs across both.
- Quantify before you act. Know the blast radius before you wake someone up or trigger a rollback.
## Under the Hood
During this investigation, the assistant invoked these tools exposed by the SigNoz MCP server:
| Step | MCP Tool | What It Did |
|---|---|---|
| 1 | signoz_search_traces | Found traces matching the duration and time range filter |
| 1 | signoz_get_trace_details | Returned the full span tree for the slowest trace |
| 2 | signoz_aggregate_traces | Computed p50/p99 latency in time-series buckets |
| 3 | signoz_search_traces | Found a healthy baseline trace from before the spike |
| 4 | signoz_get_service_top_operations | Got latency breakdown for the downstream service |
| 4 | signoz_search_logs | Pulled error and warning logs from payment-service |
| 5 | signoz_aggregate_traces | Computed request counts and error rates over time |
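For a sense of what one of these invocations looks like on the wire: the MCP specification defines a `tools/call` JSON-RPC request with a tool `name` and `arguments`. The envelope below follows that spec; the argument names are hypothetical, not the actual signoz_search_traces schema:

```python
# Sketch of an MCP tools/call request (JSON-RPC 2.0, per the MCP spec).
# The "arguments" keys are hypothetical illustrations only.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "signoz_search_traces",
        "arguments": {
            "service": "checkout-service",   # hypothetical argument name
            "min_duration_ms": 2000,         # hypothetical argument name
            "lookback_minutes": 30,          # hypothetical argument name
        },
    },
}
```

Your assistant builds and sends requests like this for you; the value of the MCP integration is that you never have to.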
## Next Steps
- Natural Language Log Exploration - Search and analyze logs without writing queries.
- Reconstruct a Bug from a Trace ID - Debug a support ticket with a trace ID.
If you need help with the steps in this topic, please reach out to us on SigNoz Community Slack.
If you are a SigNoz Cloud user, please use in product chat support located at the bottom right corner of your SigNoz instance or contact us at cloud-support@signoz.io.