Nebius selects Komodor's AI SRE platform for reliability

Thu, 25th Jun 2026

Nebius has selected Komodor's autonomous AI SRE platform for reliability operations across its AI cloud. The deployment covers a large Kubernetes- and GPU-based environment.

The move gives Nebius a tool to automate incident investigation in a cloud platform built for AI workloads, where operators must monitor large numbers of clusters, custom resources, and GPU-intensive services.

Nebius runs what it describes as a full-stack AI platform spanning data, model training, and production deployment. That setup has increased the burden on site reliability engineering teams, particularly in environments where troubleshooting depends on correlating data from dashboards, logs, configuration changes, and autoscaling behaviour across distributed systems.

Komodor's platform will give Nebius a single view across its cloud-native infrastructure and continuously correlate topology, telemetry, and configuration data. It is designed to work with highly customised environments, including the specialised abstractions and components used in large AI cloud deployments.

Operational strain

The agreement reflects a broader issue facing cloud providers that serve AI developers and enterprises. As demand for GPU-backed computing rises, operators are building larger Kubernetes estates and more complex orchestration layers, making manual investigation slower and more expensive.

In Nebius' case, the environment includes custom GPU scheduling layers and Cluster API-based fleet management, according to the companies. Those additions can help manage infrastructure at scale, but they also create more dependencies for engineers to track when service incidents occur.

Komodor's Klaudia Agentic AI product is intended to investigate production incidents by correlating signals across multiple clusters and identifying likely root causes. The goal is to reduce the manual work SRE teams would otherwise do across separate tools and datasets to understand what triggered a problem.

Komodor described the deployment as a shift away from investigations that depend heavily on engineering time and specialist knowledge. Nebius plans to use the platform to shorten the path from an observed service issue to root cause analysis while keeping existing SRE workflows in place.

Growing complexity

The adoption highlights how AI infrastructure is changing reliability operations. Running conventional cloud applications at scale has already pushed companies toward more automation, but AI cloud platforms add another layer of complexity because they must manage scarce GPU resources, training jobs, production inference services, and custom scheduling policies at the same time.

That operational mix also has financial implications. Delays in identifying faults can leave expensive GPU resources underused or misallocated, while outages or degraded performance can directly affect customers running model training or deployment workloads.

For providers such as Nebius, reliability work is increasingly tied to cost control as well as uptime. Automation vendors are positioning their products around that overlap, arguing that incident response systems must account for both application behaviour and infrastructure economics.

Komodor, which focuses on cloud-native operations, has raised USD 90 million in venture funding. The company built its business around Kubernetes troubleshooting and incident management and is now placing greater emphasis on AI-assisted and autonomous tools for SRE teams.

The Nebius deployment is also notable because of the type of customer involved. AI cloud providers operate some of the most complex infrastructure stacks in the market. A production use case in that segment therefore tests whether autonomous troubleshooting tools can cope with highly tailored cloud architectures, not just standard enterprise deployments.

Executive views

Komodor linked the decision to the pressure on operations teams.

"As AI workloads amplify operational complexity, the burden on SRE teams to manually manage reliability and cost becomes untenable," said Itiel Shwartz, Co-Founder and CTO of Komodor. "Acting as an autonomous AI SRE layer, Komodor dramatically reduces mean time to resolution (MTTR) in the most complex, distributed environments in the world like the Nebius AI Cloud."

Nebius said it selected the platform to help engineers investigate incidents more quickly across a large Kubernetes footprint.

"Nebius operates AI cloud infrastructure at scale. Uptime and performance are mission-critical, and require fast, well-grounded incident investigation across complex Kubernetes environments," said Danila Shtan, CTO at Nebius. "Komodor helps our teams correlate the signals that matter and shorten the path from symptom to root cause, while fitting into our existing SRE workflows."

Image: Itiel Shwartz and Danila Shtan