We are at the
beginning of the agentic era for operations. Current AIOps tools summarize
dashboards and surface correlations, but most don’t actually investigate
incidents, leaving engineers to spend hours working through complex incidents by hand.
Vinod
Jayaraman, co-founder and
head of engineering at NeuBird.ai,
thinks it’s time for that to change. Agentic AI systems can now perform real SRE investigation work, and NeuBird has applied them to production operations: the system automatically correlates telemetry data across AWS services without human intervention, surfacing root causes that Jayaraman says save engineers hours.
Jayaraman argues this marks the first time AI can reason over telemetry the way an engineer would. The system builds a
service map to understand how components connect before starting an
investigation, and then explores multiple hypotheses in parallel faster than
human experts could.
That’s already happening, but now he wants to take
things further. “One of the topics that is close to our hearts for this
year is how we can close the SRE loop and also get closer to code
generation,” he says. “We also want to have a reasoning graph that
explains how we came up with a root cause analysis (RCA).”
The Future Is Autonomous Investigation, Not Better Dashboards
Alert storms
and dashboard sprawl are symptoms of architectural scale, not solutions to it.
About 95 percent of alerts generated from low-level metric thresholds don’t
need to be investigated. High CPU usage might be good (a compilation job
running) or bad (a runaway process). Context matters.
Humans cannot
keep pace with modern telemetry volume. The next stage of evolution is to reduce the dependence on humans poring over observability data, not to throw more tools at the problem. NeuBird streamlines incident response by
handling the investigation work autonomously.
AI agents must become the first responders to incidents. When incidents are queued up in the
background, the system can refine RCA results using reasoning models without
time pressure. Low confidence scores (say, less than 60 percent) trigger additional investigation
passes rather than immediate escalation.
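The confidence-gated loop described above can be sketched in a few lines. This is a minimal illustration, assuming the 60 percent threshold mentioned in the article; the function names and the shape of the RCA result are hypothetical, not NeuBird's API.

```python
# Hypothetical sketch of confidence-gated investigation passes.
CONFIDENCE_THRESHOLD = 0.6  # below this, run another pass
MAX_PASSES = 3              # then escalate the best effort to a human

def investigate(incident, run_pass):
    """Run investigation passes until the RCA confidence clears the bar."""
    rca = None
    for attempt in range(1, MAX_PASSES + 1):
        rca = run_pass(incident, prior=rca)  # each pass refines the last
        if rca["confidence"] >= CONFIDENCE_THRESHOLD:
            return rca, attempt              # confident enough: surface it
    return rca, MAX_PASSES                   # escalate with best effort
```

Because the incident sits in a background queue, `MAX_PASSES` trades latency for accuracy rather than racing an on-call pager.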
Things need to become more intuitive for SREs, DevOps, and platform engineering teams. “We want them to describe the end outcome that they want to avoid, and do so using natural language, which we call semantic monitoring,” he says.
That means
moving beyond simply setting thresholds on CPU or memory. Instead, an SRE might
say “Monitor for any pod failures in this Kubernetes cluster that last more
than two minutes.” The system breaks that intent down into the individual conditions that need to be monitored under the hood.
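A toy decomposition makes the idea concrete: one natural-language intent fans out into several low-level checks. This is a hand-coded stand-in, hypothetical and not NeuBird's implementation, where a real system would use an LLM to produce the checks; the metric names follow kube-state-metrics conventions.

```python
# Sketch of "semantic monitoring": a high-level intent is decomposed
# into the low-level conditions that actually get watched.
from dataclasses import dataclass

@dataclass
class Check:
    metric: str        # what to watch
    condition: str     # when to fire
    for_duration: str  # how long it must persist

def decompose(intent):
    """Toy decomposition of one hard-coded intent; a real system
    would generate these checks with an LLM."""
    if "pod failures" in intent and "two minutes" in intent:
        return [
            Check("kube_pod_status_phase{phase='Failed'}", "> 0", "2m"),
            Check("kube_pod_container_status_restarts_total", "increasing", "2m"),
            Check("kube_pod_status_ready{condition='false'}", "> 0", "2m"),
        ]
    raise ValueError("intent not understood")

checks = decompose("Monitor for any pod failures in this Kubernetes "
                   "cluster that last more than two minutes")
```

The “last more than two minutes” clause becomes a shared `for_duration` on every generated check, which is what keeps transient pod churn from paging anyone.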
What The Future
Looks Like For SREs And DevOps
Instead of stressing out at the battlefront, site reliability engineers and others tasked with application reliability will soon spend at least some of their time supervising AI agents.
In fact, while
a human stays in control, it’s agents all the way down. Supervisor agents can
monitor other agents’ investigations to keep them on the right track. When an
agent gets stuck looking at the same metrics repeatedly, the supervisor
redirects its attention elsewhere. It’s like having a senior engineer guide a
junior through their first incident response.
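The stuck-agent pattern the supervisor watches for can be sketched as a simple repetition check. This is an illustrative heuristic under assumed parameters (a five-query window, three repeats), not NeuBird's implementation.

```python
# Illustrative supervisor check: flag an investigating agent as "stuck"
# when its recent queries keep hitting the same metric, then redirect it.
from collections import Counter

def is_stuck(recent_queries, window=5, repeat_limit=3):
    """True if, within the last `window` queries, any single metric
    was queried `repeat_limit` or more times."""
    tail = recent_queries[-window:]
    return any(n >= repeat_limit for n in Counter(tail).values())

def supervise(agent_query_log, unexplored_signals):
    """Nudge a looping agent toward a signal it hasn't examined yet."""
    if is_stuck(agent_query_log):
        return f"redirect: examine {unexplored_signals[0]}"
    return "continue"
```

A production supervisor would reason over richer state than a query log, but the shape is the same: detect the loop, redirect attention, like a senior engineer coaching a junior.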
This agentic
support won’t replace people, but it will reduce a lot of the
correlation work that they currently face when tracking problems across complex
distributed systems.
Closing The
SRE/DevOps Loop
The next step
for NeuBird is to go beyond solving immediate problems as a discrete practice
by integrating with other parts of the incident response chain. Jayaraman wants to close the problem resolution loop, in which engineers
diagnose problems, produce resolutions, update software, and prevent
recurrence.
“When you
come up with a probable RCA, things might get updated, but that’s also often
where the ball gets dropped,” says Jayaraman. “We want to shift left,
getting closer to the development cycle where once the RCA is produced, you’re
able to continue on and surface what you
discovered, passing it on to the next step in the pipeline.”
This is where
his concept of code generation comes in. In the future, an SRE agent might
deliver details of a problem to a coding agent that generates a fix and then opens a pull request to resolve the issue.
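The handoff from SRE agent to coding agent can be pictured as packaging the RCA into a work item. The field names here are assumptions for illustration only, not a schema NeuBird has published.

```python
# Hedged sketch of the RCA-to-code handoff: the SRE agent packages its
# findings so a downstream coding agent can open a pull request.
def build_handoff(rca):
    """Turn a completed RCA into a work item for a coding agent."""
    return {
        "title": f"Fix: {rca['root_cause']}",
        "evidence": rca["queries_executed"],   # the trail, not just the verdict
        "suggested_change": rca["remediation"],
        "target_repo": rca["service_repo"],
    }
```

Carrying the evidence trail along with the verdict is what lets the coding agent (and its human reviewer) check the diagnosis before trusting the fix.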
Explainability
And Reasoning Graphs
As NeuBird.ai
explores these broader automation opportunities, trust will be central to
getting engineers on board.
Trust remains
the limiting factor. Some customers want a single, high-confidence root cause analysis rather than multiple hypotheses, which is why low confidence scores trigger additional investigation passes. For deployment changes, as opposed to code changes, the system can apply fixes itself and verify the results.
That’s why
humans are very much still in this loop. NeuBird is developing guardrails for
autonomous approval of simple infrastructure changes (adding a single node for
resource starvation, for instance). More complex fixes need a person to check
the work and pull the lever.
Security
permissions determine how far automation goes. Simple problems might get
autonomous fixes within strict guardrails. Everything else stays in
recommendation mode, with humans making final calls.
That in turn
demands transparency, something the company is building more deeply into its systems. Users can drill down to see why the system arrived at a
particular RCA. The system will show its work: which metrics it checked, what
logs it parsed, and which correlations it found.
“We
present the solution, but also the path by which we arrived at the
solution,” says Jayaraman. “We present the queries that were executed
to get out the information, which led us to believe that this is what the root
cause is.”
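A minimal reasoning-trace structure shows what “presenting the path” might look like in code. The shape is an assumption for illustration, not NeuBird's actual schema: every conclusion carries the queries and observations that produced it.

```python
# Hypothetical reasoning trace: the final RCA is paired with the full
# evidence path of executed queries and what each one showed.
def record_step(trace, query, observation):
    trace.append({"query": query, "observation": observation})
    return trace

def explain(trace, conclusion):
    """Return the RCA together with the path that led to it."""
    return {"root_cause": conclusion, "path": trace}

trace = []
record_step(trace, "avg(cpu) by (pod)", "pod A at 98% for 10m")
record_step(trace, "logs{pod='A'}", "OOM kill loop in worker thread")
report = explain(trace, "runaway worker process on pod A")
```

An engineer reviewing `report` can replay each query in `path` themselves, which is the basis of the trust Jayaraman describes.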
Consider a
message queue overflow scenario. The consumer falls behind. NeuBird AI SRE
provides context and recommends scaling out by adding another node. It produces
specific Terraform script adjustments.
The product
generates an internal confidence score on each RCA. High confidence means a
clear answer. Low confidence triggers deeper investigation. Either way, the
system presents the queries it executed to reach its conclusion, not just the
final answer.
Anyone who has seen two AIs argue with each other will appreciate how fascinating this process is. NeuBird’s Hawkeye uses adversarial thinking to sharpen accuracy. Two models analyze
the same incident independently. Agreement means high confidence. Disagreement
flags uncertainty. The system uses LLMs as judges to evaluate the quality of
work done.
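The agree/disagree mechanic reduces to a small cross-check function. This is a simplified sketch of the pattern described above; the exact comparison and judging logic are assumptions, not NeuBird's implementation.

```python
# Sketch of the two-model cross-check: independent analyses that agree
# raise confidence; disagreement routes the pair to a judge model.
def cross_check(answer_a, answer_b, judge=None):
    """Compare two independently produced RCAs."""
    if answer_a == answer_b:
        return {"rca": answer_a, "confidence": "high"}
    # disagreement flags uncertainty; an LLM judge picks or rejects
    verdict = judge(answer_a, answer_b) if judge else None
    return {"rca": verdict, "confidence": "low"}
```

In practice “agreement” would be semantic rather than string equality, but the control flow, with consensus as a confidence signal and an LLM-as-judge as the tiebreaker, is the same.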
Virtual SREs don’t just chat with each other, either. Static reports are giving way to
conversations, with engineers asking follow-up questions about an RCA,
requesting different analyses, or exploring alternative hypotheses. The system
learns from each interaction, incorporating feedback to improve future incident
analysis.
These reasoning
systems are another area that NeuBird will take further. The future vision is
for these reasoning graphs to evolve into comprehensive automated runbooks.
They will go beyond solving specific problems to address similar issues in the
future. That way, operational memory persists even as teams change. And it can
continuously evolve, updating runbooks as Hawkeye generates more understanding.
Why This Shift
Is Happening Now
SREs and DevOps have needed
capabilities like these for a while. Cloud complexity has exceeded human
cognitive limits. The scale isn’t a technology problem anymore. “In a
complex piece of software, there are many pieces of code. When they all come
together with the variability of your environment and the user traffic that
comes in, there is no perfect code,” Jayaraman points out.
Talent
shortages have compounded the problem: it’s hard to scale SRE teams when skilled engineers are in short supply.
The demand
might have been there for some time, but the capability wasn’t. That’s
changing, as LLM maturity enables reasoning across telemetry data. Models can
now process information more quickly while maintaining accuracy. NeuBird’s
architecture helps here; the company breaks agent tasks into bite-sized chunks,
each using a different model optimized for that task. Smaller reasoning models can often produce better results than large, cumbersome foundation models designed to do everything.
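The per-task routing can be pictured as a simple lookup. The task and model names below are placeholders invented for illustration, not NeuBird's actual components.

```python
# Illustrative task router: each bite-sized agent task is mapped to a
# model sized for it, rather than one large model doing everything.
MODEL_FOR_TASK = {
    "summarize_logs": "small-fast-model",
    "correlate_metrics": "mid-reasoning-model",
    "draft_rca": "large-reasoning-model",
}

def route(task):
    # fall back to the most capable model for anything unrecognized
    return MODEL_FOR_TASK.get(task, "large-reasoning-model")
```

Routing cheap, high-volume tasks such as log summarization to small models keeps latency and cost down, reserving heavyweight reasoning for the final RCA.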
The evolution
of infrastructure as code also makes autonomous operations both feasible and
critical. Agents can now address resource starvation problems through
Terraform, while changes tracked in GitHub repositories close the operational
loop.
Driving Adoption
Aside from
trust, removing security friction and deployment constraints is critical for
this technology. For AI systems to handle production incidents, they need
bulletproof security. Hawkeye processes telemetry in real time without storing it persistently. Customer data stays inside the customer’s AWS environment. When
reasoning happens, only abstracted metadata reaches Amazon Bedrock.
The trust model
relies on read-only permissions for AWS services, protected by AWS IAM.
Customers control access through their own trust policies and can revoke
permissions instantly.
NeuBird is also
easing adoption through flexible deployment options. Half of the company’s
customers run purely in the cloud, while the rest
use virtual private cloud (VPC) or hybrid setups.
The system
connects to Prometheus in a customer’s VPC or on-premises just as easily as
cloud telemetry. From NeuBird’s perspective, both look identical.
When those
customers get an EC2 instance failure, NeuBird
gets there first. The AI agent examines CloudWatch metrics, logs, and
configuration changes to understand what happened, accessing telemetry sources
through read-only connections. By the time the on-call engineer gets to the
issue, there’s already a root cause analysis waiting.
That’s exciting
enough, but the company clearly has great things planned for this year. It’s working quickly, too, expecting to have a declarative definition for the problem resolution pipeline ready by the end of Q1. All eyes are on NeuBird AI to see what it delivers next.
Sponsored by NeuBird.ai
