How Alien Intelligence Built an AI SRE Workflow with SigNoz

Last Updated: June 11, 2026 · 17 min read

AI workflows are everywhere. For better or worse, everyone is preaching about them. At SigNoz, we’re busy building the observability layer for AI agents. We believe humans and agents can augment each other to solve harder engineering problems in observability.

We recently launched our hosted MCP server, and our users are building interesting workflows to make their lives easier.

We got on a call with one such user. Leo Blondel, CTO at Alien Intelligence, was kind enough to take the time to discuss the AI SRE workflow his team built with SigNoz. If you’re currently building such workflows, this blog will give you a clear look at how they built it, what worked, and what other teams can borrow from it.

What Leo Built at Alien Intelligence

Alien Intelligence is a three-person engineering team running production Kubernetes for a fairly complex product. Their stack includes multiple Kubernetes clusters, PostgreSQL, MinIO, Qdrant, Argo Workflows, FastAPI services, ArgoCD GitOps pipelines, and mTLS-based cross-cluster communication.

For a small team, that creates a familiar problem: the infrastructure is production-grade, but the incident response process still depends on a founder checking Slack at the right time.

When Leo started looking for a solution, he came across Datadog. But they quoted him over $2,000 a month.

They gave me a free trial for two weeks. At the end of it, they came back and said, 'okay, the trial's over – it's going to cost you over 2k.' I was like, sorry, what?

-Leo on Datadog’s cost for monitoring his tech stack with AI workflows

He already knew about SigNoz and wanted to use an OpenTelemetry-native solution. He also liked the fact that we’re open source.

Using SigNoz’s hosted MCP server, he built an AI SRE workflow that takes over the first pass of alert triage: checking whether an alert is real, gathering context, and sending him a Slack message only when it needs attention. It also gave the team a pattern they can reuse for future workflows where humans and agents work together.

Here’s an overview of how the AI SRE workflow looks.

*AI SRE workflow that Alien Intelligence built with SigNoz*

The workflow starts with SigNoz alerts. When something fires, SigNoz POSTs a webhook to a small channel server that forwards the event into a long-running Claude agent. The agent uses the SigNoz MCP server to inspect traces, logs, and metrics. It also has read-only access to Kubernetes and GitLab, a Slack channel for two-way conversation with the on-call human, and its own runbook with per-alert playbooks.
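To make that first hop concrete, here is a minimal sketch of what a channel server might do with an incoming webhook before handing it to the agent. It assumes an Alertmanager-style JSON body (an `alerts` list whose entries carry `status`, `labels`, and `annotations`). The field names, the `AlertEvent` type, and `parse_webhook` are illustrative assumptions, not Alien Intelligence's code, so check the payload your own notification channel actually sends.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertEvent:
    """Short metadata for the main agent; no logs or traces attached."""
    name: str
    severity: str
    status: str
    summary: str


def parse_webhook(body: bytes) -> list[AlertEvent]:
    """Turn a webhook body into compact events.

    Assumes an Alertmanager-style payload; adjust the keys to match
    what your alerting channel really posts.
    """
    payload = json.loads(body)
    events = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        events.append(
            AlertEvent(
                name=labels.get("alertname", "unknown"),
                severity=labels.get("severity", "unknown"),
                status=alert.get("status", "firing"),
                summary=annotations.get("summary", ""),
            )
        )
    return events


# Hypothetical payload, used only to exercise the parser.
example_body = json.dumps({
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "HighErrorRate", "severity": "critical"},
        "annotations": {"summary": "5xx rate above threshold"},
    }]
}).encode()

events = parse_webhook(example_body)
```

Reducing the payload to a few fields at the edge means the long-running agent only ever receives short alert metadata, and it can pull the heavier telemetry on demand through MCP.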

The agent handles the first pass that usually falls to an engineer. It checks whether the alert is real, gathers evidence, looks for the likely root cause, and decides whether to ignore, escalate, or suggest a safe fix.

Start with Alerts, Not Dashboards

Leo described the problem plainly: he hates going through alerts because too often the alert turns out to be transient, and the check still costs ten minutes.

That is the problem the agent takes on first. It starts with alerts, specifically the ones already interrupting the team, instead of trying to "monitor everything" from a dashboard.

An open-ended monitoring job gives the agent too much to do with no clear success condition. It spends context producing observations that still require a human to make sense of. That is not much better than staring at a dashboard.

Alerts are different. Alien Intelligence already had rules in SigNoz covering high error rates, pod crash loops, ArgoCD sync failures, PVC warnings, latency anomalies, database errors, and Falco security events. The team had already decided what mattered enough to fire. The agent's job was to handle the first question every alert demanded:

Is it real? Is it noise? Because there's a lot of noise in alerts as well.

-Leo on the first question his AI SRE agent asks for every SigNoz alert

A transient spike that resolves in 10 minutes does not need to wake anyone. A known dev-cluster heartbeat issue can be logged and dismissed. But when the same alert fires ten times in a day and keeps being marked as noise, the runbook tells the agent to surface it. Maybe the evaluation window is too short. Maybe a test namespace should be excluded. Maybe the threshold is wrong.

The agent handles today's alert and flags what needs to change so tomorrow's alerts are more trustworthy.

Give the Agent the Context a Human SRE Would Use

Say a 401 shows up in SigNoz. The trace can show where it happened, but Leo still needs the next layer: which service changed, whether a merge request caused it, and whether it affects one user or many.

If the agent only reports that there was a 401, it has repeated what the dashboard already showed. He can find that himself in SigNoz in a few clicks. What he wants is the question after that, and the question after that.

What I want is there's been a 401. Why has there been a 401? Okay, it's here. Which services is that? Was there a GitLab merge request? Did something change in the code? Because that 401 wasn't there before.

-Leo on why logs and traces need to be connected with deploys, code, and service context

Following that trail means the agent needs more than telemetry. SigNoz gives it traces, logs, metrics, and alert context. Kubernetes tells it what is happening in the cluster. GitLab tells it what changed recently. The local codebase lets it inspect the suspected service directly. Slack connects it to a human when a decision needs to be made.

Leo’s design follows a simple rule. Give the agent the same sources a human would check. Without those, any investigation stops exactly where a dashboard already stops.

Memory Turns Repeated Noise into Something Useful

By the tenth time the same alert fires for the same service and ends with the same verdict, the agent has enough history to treat it as a pattern instead of another one-off notification. It logs the verdict, moves on for now, and remembers it for the daily noise report.

That is the difference between a one-shot investigation and a system that learns. The agent writes every verdict to a SQL database: what fired, what the root cause was, what action was taken, which merge request was involved, how it resolved.

I've seen this alert. It's happened 10 times today. It was noise, but actually, I'm going to report it because this noise is recurrent.

-Leo on using agent memory to spot noisy alert patterns

His runbook includes a noise-report workflow that groups recurring noisy alerts and proposes concrete fixes at the end of the day: increase the evaluation window, add a minimum duration, exclude a test namespace, adjust the threshold. Ten silent dismissals become one useful conversation about what to fix.
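The grouping step behind such a report can be sketched in a few lines. This is an illustration under assumptions, not the runbook's actual logic: the `(alert_name, service, verdict)` tuple shape, the `noise_report` name, and the default threshold of ten repeats (borrowed from Leo's example) are all hypothetical.

```python
from collections import Counter


def noise_report(verdicts, min_repeats=10):
    """Return (alert, service, count) for alerts repeatedly judged noise.

    `verdicts` is an iterable of (alert_name, service, verdict) tuples
    covering one day. The tuple shape is an assumption for this sketch.
    """
    counts = Counter(
        (alert, service)
        for alert, service, verdict in verdicts
        if verdict == "noise"
    )
    return [
        (alert, service, n)
        for (alert, service), n in counts.most_common()
        if n >= min_repeats
    ]


# Hypothetical day of verdicts: one recurring noisy alert, two one-offs.
day = [("DevHeartbeat", "dev-cluster", "noise")] * 10 + [
    ("HighLatency", "api", "noise"),
    ("HighErrorRate", "api", "real"),
]
report = noise_report(day)
```

Only the alert that crossed the threshold appears in the report, which is what turns ten quiet dismissals into a single item worth a conversation.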

There are two memory layers in the workflow. SQL keeps the operational memory: alerts, verdicts, root causes, actions, and the summary the agent reloads after a restart. Notion keeps the resolved incident record. When an incident closes, the agent writes a full report to a Notion table, so the team has a durable history outside the live triage loop.
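As a rough sketch of the operational layer, the table below stores one row per verdict using Python's built-in `sqlite3` module. The column names, the `open_memory` and `record_verdict` helpers, and the sample row are assumptions made for illustration; the post does not show the team's real schema.

```python
import sqlite3

# Hypothetical schema: one row per investigated alert.
SCHEMA = """
CREATE TABLE IF NOT EXISTS verdicts (
    id            INTEGER PRIMARY KEY,
    fired_at      TEXT NOT NULL,   -- ISO 8601 timestamp
    alert_name    TEXT NOT NULL,   -- what fired
    service       TEXT NOT NULL,
    verdict       TEXT NOT NULL,   -- e.g. 'noise' or 'real'
    root_cause    TEXT,
    action_taken  TEXT,
    merge_request TEXT,            -- MR reference, if one was involved
    resolution    TEXT
)
"""


def open_memory(path=":memory:"):
    """Open (or create) the verdict store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_verdict(conn, fired_at, alert_name, service, verdict,
                   root_cause=None, action_taken=None,
                   merge_request=None, resolution=None):
    """Persist one investigation outcome outside the model's context."""
    conn.execute(
        "INSERT INTO verdicts (fired_at, alert_name, service, verdict,"
        " root_cause, action_taken, merge_request, resolution)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (fired_at, alert_name, service, verdict,
         root_cause, action_taken, merge_request, resolution),
    )
    conn.commit()


conn = open_memory()
record_verdict(conn, "2026-06-01T12:00:00Z", "HighErrorRate", "api", "real",
               root_cause="auth regression", action_taken="escalated")
rows = conn.execute("SELECT alert_name, verdict FROM verdicts").fetchall()
```

Because the rows live on disk rather than in the model's context, a restarted agent can rebuild its working picture from a summary query over this table.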

Keep the Main Agent Lightweight

Your context window is your lifeline. If it fills up, you die and must restart.

-Leo on why the main agent delegates investigations to smaller sub-agents

Leo reads this directly from his CLAUDE.md. It is the first rule of the system, and it is not metaphorical. In his experience, once the context window crosses roughly 50-60 percent, the model's reasoning starts to degrade fast enough to notice.

Every alert brings in logs, traces, kubectl output, git diffs, and cluster state. If that accumulates in one context window over days, the agent becomes unreliable before you catch it. He treated this as a hard design constraint.

The main agent, running on Claude Opus, handles coordination. It receives the alert, reads the metadata, filters obvious noise, and spawns a focused sub-agent for the actual investigation. The sub-agent, running on Claude Sonnet or Haiku, works in a fresh context window and gets one self-contained task: check the systems it needs, return a verdict, and leave the main agent with only the summary.

*Runtime view of Alien Intelligence’s AI SRE workflow: a main agent dispatches fresh sub-agents, stores verdicts in SQLite memory, and escalates only when human attention is needed.*

That model split matches the work each layer is doing. Opus stays as the always-on orchestrator, but it mostly sees short alert metadata and investigation summaries. Sonnet or Haiku handles the bursty, token-heavy digging through logs, traces, Kubernetes output, and Git history.
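The delegation pattern can be sketched without any model API at all. In the illustration below, `handle_alert`, `Verdict`, the known-noise set, and the task wording are hypothetical, and `fake_sub_agent` stands in for whatever call actually starts a sub-agent in a fresh context.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    alert_name: str
    is_real: bool
    summary: str  # the only investigation text the main agent keeps


def handle_alert(
    alert: dict,
    known_noise: set[str],
    investigate: Callable[[str], Verdict],
) -> Verdict:
    """Coordinator step: filter obvious noise, otherwise delegate."""
    name = alert["name"]
    if name in known_noise:
        return Verdict(name, False, "matched known-noise list; not investigated")
    # One self-contained task, so the sub-agent needs no prior context.
    task = (
        f"Investigate alert '{name}' on service "
        f"'{alert.get('service', 'unknown')}'. "
        "Check traces, logs, metrics, and recent changes. "
        "Reply with a verdict and a summary of at most five lines."
    )
    return investigate(task)


def fake_sub_agent(task: str) -> Verdict:
    # Stand-in for a real sub-agent call running in its own context window.
    return Verdict("HighErrorRate", True, "error spike follows a recent merge")


result = handle_alert(
    {"name": "HighErrorRate", "service": "api"}, {"DevHeartbeat"}, fake_sub_agent
)
skipped = handle_alert({"name": "DevHeartbeat"}, {"DevHeartbeat"}, fake_sub_agent)
```

The coordinator never sees logs or diffs. It passes a task string down and gets a short `Verdict` back, which is what keeps its own context small over days of alerts.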

The final safety net is a daily restart at 4am. The agent runs on a VM with PM2, and the startup script pulls the latest SQL summary before the agent starts again. That gives it the operational memory it needs without carrying yesterday's full context window forward.

For anyone building a long-running agent, Leo's setup offers a useful pattern. Keep the main agent small, delegate the noisy work, and store memory outside the model.

Playbooks Are Where the Team's Knowledge Goes

A high-error-rate alert arrives. The agent doesn't guess what to do.

It opens the playbook for that alert and starts at the top: identify the affected service, query recent error traces in SigNoz, check pod health, inspect rollout history, look for GitLab MRs merged in the last 30 to 60 minutes. If a recent merge shows up, it is the prime suspect.
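The last step, narrowing recent merges to a time window before the alert, might look like the sketch below. The `merges` list shape and the `recent_merges` name are assumptions for illustration. In practice the merge data would come from GitLab, which is not shown here.

```python
from datetime import datetime, timedelta, timezone


def recent_merges(merges, alert_time, window_minutes=60):
    """Return merges that landed within `window_minutes` before the alert,
    newest first.

    `merges` is a list of dicts with a timezone-aware `merged_at` datetime;
    this shape is illustrative.
    """
    start = alert_time - timedelta(minutes=window_minutes)
    hits = [m for m in merges if start <= m["merged_at"] <= alert_time]
    return sorted(hits, key=lambda m: m["merged_at"], reverse=True)


# Hypothetical data: one merge 20 minutes before the alert, one 5 hours before.
alert_time = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
merges = [
    {"title": "Rotate auth keys", "merged_at": alert_time - timedelta(minutes=20)},
    {"title": "Update README", "merged_at": alert_time - timedelta(hours=5)},
]
suspects = recent_merges(merges, alert_time)
```

This covers only one step of the high-error-rate playbook described above. The other steps query SigNoz and Kubernetes rather than GitLab.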

Every alert has a playbook like this. Pod crash loops have one. ArgoCD degraded apps have one. Falco security events have one. The open-source repo covers high error rate, high latency, pod crash loops, ArgoCD degradation, database errors, PVC capacity, workflow failures, and Trivy vulnerabilities.

Each alert maps to one playbook file. When the workflow misses something, the fix is small: update the alert or update the playbook. The agent itself does not need to change.
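One way to express that one-to-one mapping is a naming convention that derives the file path from the alert name. The `playbooks/` directory, the slug rule, and both function names below are hypothetical; the open-source repo may organize its files differently.

```python
import re
from pathlib import Path
from typing import Optional


def playbook_path(alert_name: str, root: Path = Path("playbooks")) -> Path:
    """Map an alert name to its playbook file.

    Example: 'High Error Rate' -> playbooks/high-error-rate.md
    """
    slug = re.sub(r"[^a-z0-9]+", "-", alert_name.lower()).strip("-")
    return root / f"{slug}.md"


def load_playbook(alert_name: str, root: Path = Path("playbooks")) -> Optional[str]:
    """Return the playbook text, or None so a missing playbook is reported
    rather than guessed around."""
    path = playbook_path(alert_name, root)
    return path.read_text() if path.is_file() else None
```

With a convention like this, adding coverage for a new failure mode means adding one alert rule and one file, and the lookup code stays untouched.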

The beauty of it is that it's with this playbook and this alert system. Both of them are modular. And so each alert has a playbook file. So each alert comes in and knows what to do with this alert.

-Leo on why pairing each alert with a playbook makes the workflow easy to improve

A client reported a bug recently that the system had not caught. No alert existed for that failure mode. Leo had the agent analyze the logs, find the failure pattern, and write the alert rule it would have needed. The alert went into SigNoz. The playbook was added. The next time that failure happens, the agent catches it.

The runbook grows from real misses, not upfront guesses.

The Agent Can Act, but Only Inside Hard Limits

The agent has real access. It can read the Kubernetes cluster, inspect GitLab, look at the codebase, talk through Slack. In certain situations, it can act: restart a crashed deployment, delete a stuck pod, clear a stuck Argo Workflow, refresh a degraded ArgoCD application.

That is a lot of access.

The list of what it cannot do is longer. And it is not enforced by a prompt.

RBAC blocks namespace deletion, PVC deletion, database access, scaling services to zero, git pushes, and ArgoCD application deletion. The runbook says not to do these things, and the service account permissions make sure the agent never had the ability to begin with.

I don't let the AI run my GitOps and my Kubernetes stack. I let it analyze what's happening, tell me why it's happening, tell me remediation strategies.

-Leo on keeping the agent focused on investigation while hard systems control production changes

A prompt can tell an agent not to delete a PVC. RBAC makes sure it cannot. That distinction is what makes Leo's system stable enough to trust with real production access.
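To show what "never had the ability" can mean in practice, here is a sketch of a namespaced Kubernetes `Role` built as a Python dict and serialized to JSON, a format `kubectl apply -f` accepts alongside YAML. The post does not list the permissions on Leo's service account, so the role name, namespace, and rules are assumptions chosen for illustration. RBAC is allow-list based, so anything not granted (PVC deletion, for example) is denied by default.

```python
import json

READ = ["get", "list", "watch"]

# Hypothetical role; names and rules are illustrative, not Leo's real config.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": "sre-agent", "namespace": "production"},
    "rules": [
        # Read-only visibility into workloads, their logs, and events.
        {"apiGroups": [""], "resources": ["pods", "pods/log", "events"], "verbs": READ},
        {"apiGroups": ["apps"], "resources": ["deployments", "replicasets"], "verbs": READ},
        # One narrow write: delete a stuck pod so its controller recreates it.
        {"apiGroups": [""], "resources": ["pods"], "verbs": ["delete"]},
    ],
}


def allowed(role: dict, resource: str, verb: str) -> bool:
    """Local check that mirrors allow-list semantics for this sketch.

    Simplified: it ignores apiGroups and is not a substitute for the
    API server's own authorization.
    """
    return any(
        resource in rule["resources"] and verb in rule["verbs"]
        for rule in role["rules"]
    )


manifest = json.dumps(role, indent=2)
```

The `allowed` helper exists only to make the sketch checkable. The real enforcement happens in the Kubernetes API server, which rejects any request the bound role does not grant, whatever the prompt says.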

His framing is three layers working together: a knowledge base that gives the agent situational awareness, deterministic systems like Kubernetes RBAC, GitOps, CI/CD, and merge requests that block dangerous changes, and an agentic layer that investigates and recommends while humans retain the judgment calls.

That framing also comes from Leo's background. Before Alien Intelligence, he worked in academia on human-AI interaction, including how AI systems can help people understand complex data. So when he talks about keeping the human in the loop, it is not a throwaway line.

I don't want the AI to replace something. I want the AI to augment.

-Leo on why the human still stays in the loop

What a Month of This Workflow Actually Changed

Before this workflow, an alert meant a decision.

His phone buzzed. Sometimes he was in a meeting. Sometimes it was late, after the baby was down. Sometimes his 14-month-old was screaming in the next room. The question was always the same: real or noise?

More often than not: noise. But he still had to check.

It's more than time. It's brain time. Like it's responsibility and it frees up mental space for other things.

-Leo on how the workflow changed the mental load of being responsible for alerts

Now, if the alert is real, Leo gets a Slack message. What fired. What was checked. What changed recently. What the agent recommends. If a critical alert goes unacknowledged, the agent follows up. It does not forget about something real just because Leo was unavailable.

He shared how the workflow had behaved so far:

False positives? No. False negatives? Yes. Has done it once.

-Leo on how the AI SRE workflow behaved in its first month

In other words, Leo had not seen the agent escalate noise as if it were a real incident. He had seen one missed case, which he used to update the alert and playbook. For a new system running complex production infrastructure, that changes how you relate to a buzzing phone.

I know that if there's a real alert, I actually get a real message. And it means that I need to look at it. And so I don't worry in a way.

-Leo on trusting the agent enough to stop checking every alert himself

The real change is that Leo now spends less judgment on noise, which is exactly what a small, focused team needs.

Dashboards Are Still Useful, but Not Always the First Screen

Leo has dashboards in SigNoz. He set them up when he was first configuring the system. He likes them.

The last time he opened them was to take screenshots for an ISO certification.

I have dashboards just because I set them up at the beginning. I haven't opened them except for screenshots for my ISO certification.

-Leo on how his day-to-day workflow moved from dashboards to agent conversations

Leo pushed further in the call. When you open a dashboard, you see lines, curves, and gauges. A panel might tell you memory pressure is high, but it still does not tell you what to do next. His view is that dashboards matter at scale, when there are dedicated teams watching dedicated screens. For a three-person team where the agent surfaces actionable signals through Slack, the dashboard is one more interface that rarely has more to say than what the agent already sent.

Dashboards are not going away. They are still the right tool for exploration, capacity planning, and understanding system behavior over time. The more precise point is that observability data now has two distinct readers. Humans need dashboards, trace views, and query interfaces to understand. Agents need structured access to the same data through MCP to investigate and summarize.

That means observability tools need two access paths: visual interfaces for humans and structured access for agents.

Why SigNoz Worked for This Workflow

You guys, MCP's pretty dope, honestly. That was really neat.

-Leo on setting up SigNoz MCP for the AI SRE workflow

That was Leo's reaction after setting it up. Unfiltered. It is the most useful kind of feedback: someone who tried a thing and it worked the way it was supposed to.

He had been leaning toward SigNoz before building any of this. OpenTelemetry-native, open source, and a pricing model that worked for his team. After Datadog didn't work out, he configured everything in SigNoz and found the setup faster than expected. The documentation being structured for AI consumption also cut down the prompt engineering needed to get the agent querying correctly.

Two things made SigNoz fit this workflow specifically: direct MCP access to telemetry, and an open-source foundation the agent could reason about.

MCP gave the agent a direct path to the same data Leo would check. Alerts triggered the webhook. SigNoz MCP let the agent read traces, logs, and metrics from the same source. No scraping. No translation layer.

The open-source foundation mattered too. It meant fewer black boxes for the agent to stumble through.

The combination matters because MCP alone was not the full differentiator. Leo said Datadog also has an MCP server, and that it is "not bad." What made SigNoz easier for this workflow was the combination of MCP access, OpenTelemetry-native data, open source code and docs, and a pricing model that worked for his team.

He described SigNoz as "well done" and said it collects data in a way that's "well formatted and well documented." For agents, the bigger point was simple: "there is no black box."

SigNoz gave the agent structured access to telemetry without scraping, translation layers, or black-box guesswork.

Leo’s view was that SigNoz being open and well documented gave the agent more public context to work with than a closed system would. That matters when an agent is trying to query telemetry correctly and understand what it is looking at.

How to Build Your First AI SRE Workflow

Leo's workflow took about three and a half weeks to build. It did not start as a platform.

Start with the part that already hurts. Pick a few alerts that fire often and cost someone their attention. Wire them to an agent through a webhook. Connect the agent to SigNoz MCP. Give it read-only access to the systems a developer would normally check. Write one small playbook per alert.

The first version does not need to remediate anything. A read-only agent that classifies alerts and sends a clear Slack summary is already a meaningful improvement over triaging every notification by hand.
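A summary of that kind can be assembled from the same four things Leo's messages cover: what fired, what was checked, what changed recently, and what the agent recommends. The sketch below only builds the text; posting it to Slack is left out, and the `format_summary` name and field labels are assumptions for illustration.

```python
def format_summary(alert_name, verdict, checked, recent_changes, recommendation):
    """Build a plain-text triage summary suitable for a chat channel."""
    lines = [
        f"Alert: {alert_name}",
        f"Verdict: {verdict}",
        "Checked: " + ", ".join(checked),
        "Recent changes: " + (", ".join(recent_changes) or "none found"),
        f"Recommendation: {recommendation}",
    ]
    return "\n".join(lines)


# Hypothetical values, used only to show the shape of the message.
message = format_summary(
    "HighErrorRate",
    "real",
    ["error traces", "pod health", "rollout history"],
    [],
    "review the failing endpoint before deciding on a rollback",
)
```

Keeping the format fixed makes every message scannable in the same order, which matters when the reader is deciding in a few seconds whether to stop what they are doing.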

A practical sequence:

1. Pick high-volume alerts that already interrupt the team.
2. Add a webhook path from SigNoz to your agent.
3. Connect the agent to SigNoz MCP.
4. Give the agent read-only access to Kubernetes, code, and recent deploys.
5. Write alert-specific playbooks.
6. Store investigation verdicts in a SQL database outside the model context.
7. Review missed issues and noisy alerts every week.
8. Add RBAC-scoped safe actions only after the read-only workflow is reliable.

The broader shift Leo's workflow points to is that observability data now has two kinds of readers. Humans decide what matters and make the judgment calls. Agents gather context, connect signals, and surface the right information at the right time, without requiring someone to pause what they are doing for every alert that fires.

SigNoz gives this workflow the pieces it needs: alerts, logs, metrics, traces, dashboards, and MCP access in one OpenTelemetry-native platform.

You can try this with SigNoz Cloud and the SigNoz MCP server. You can also explore Leo’s open-source AI SRE implementation on GitHub.