Imagine two high-energy particles colliding at such great pressure and velocity that they bring about nuclear fusion, producing a combined new element along with a shower of radiation that affects everything around them. It's an apt analogy for the collision of two major trends in information technology: Observability (shortened to "O11y") and Artificial Intelligence (AI). The fusion of these two trends is only just now occurring, and it will continue to produce showers of radiant energy across the enterprise over the coming decade.

Observability gives us the ability to see into systems and understand their behavior: from the elemental instrumentation that produces atomic-level telemetry, to the agents and collectors that extract and transport that telemetry, to the deep and massive analytical engines that process and query all of it, to the sophisticated real-time methods that visualize it and act upon it.
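To make the first link in that chain concrete, here is a minimal sketch of instrumentation emitting telemetry, using the opentelemetry-python SDK with a console exporter standing in for a real collector pipeline; the service and attribute names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire a tracer provider to a console exporter (a real deployment would
# export to an OTel Collector instead).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span: the atomic-level telemetry that agents,
# collectors, and analytical engines downstream transport and query.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.item_count", 3)
```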

When observability best practices are applied to AI, organizations can understand whether their AI systems are behaving well and, if not, gain guideposts for building better AI. Conversely, AI can make for better observability.

Many established observability vendors have flocked into the AI observability space, but it is also worth watching how fledgling startups like Honeyhive and pure open source tools like Arize's Phoenix are changing the landscape: evaluation, experimentation, testing, debugging, troubleshooting, monitoring, and optimization of LLMs and AI agents are now production-ready capabilities.

On top of these tactical tools, OpenTelemetry (OTel) has recently been addressing more strategic issues, like the need for semantic conventions for AI. Without such conventions, without common protocols and open standards, AI observability remains in the land of Babel. It must be said that there is a long road ahead for such conventions to percolate into general use, from basic AI instrumentation to toolchains and higher-level observability solutions and services. We're still in the early days of making AI, both LLMs and agents, an easily manageable, readily observable thing.
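For a sense of what those conventions look like in practice, here is a hedged sketch of a span around an LLM call carrying OTel's gen_ai.* semantic convention attributes. The conventions are still incubating, so names may shift; the component name, model call, and token counts below are illustrative placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")  # hypothetical component name

# Span name and attributes follow the (still-evolving) GenAI semantic
# conventions, so any OTel-aware backend can interpret the call uniformly.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... invoke the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 187)   # placeholder count
    span.set_attribute("gen_ai.usage.output_tokens", 42)   # placeholder count
```

Because the attributes are standardized rather than vendor-specific, the same span can feed cost dashboards, latency analysis, and model-comparison queries without bespoke parsing.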

Meanwhile, in the converse direction, AI has unleashed myriad new methods to digest and interpret observability telemetry, as well as the metadata surrounding observability issues. Whether for pattern recognition of expected behavior (such as seasonality) or for anomaly detection (deviations from expected behavior), AI offers new capabilities to rationalize the petabytes of data generated daily by modern enterprises.
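As a toy illustration of that seasonality/anomaly split, the sketch below learns a per-minute-of-day baseline from a week of synthetic latency data and flags deviations with a robust z-score. Real systems would use far longer histories and sturdier models; the thresholds and numbers here are arbitrary.

```python
import numpy as np

# One week of synthetic per-minute latency with a daily cycle, plus noise
# and a single injected spike.
rng = np.random.default_rng(0)
minutes = np.arange(7 * 1440)
season = 100 + 30 * np.sin(2 * np.pi * minutes / 1440)  # expected daily pattern
series = season + rng.normal(0, 5, minutes.size)
series[3000] += 80  # the anomaly we hope to catch

# Baseline per minute-of-day: median and MAD are robust to the outlier itself.
by_minute = series.reshape(7, 1440)
baseline = np.median(by_minute, axis=0)
mad = np.median(np.abs(by_minute - baseline), axis=0)

robust_z = np.abs(series - np.tile(baseline, 7)) / (1.4826 * np.tile(mad, 7) + 1e-9)
print(np.where(robust_z > 8)[0])  # should include index 3000; with only one
                                  # week of history, a stray false positive
                                  # is possible
```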

As an industry we've been talking about AI for [IT] Operations, or "AIOps," since as far back as 2016. Large Language Models (LLMs) build on pioneering research stretching back to the turn of the millennium (if not before), and entered broader use with the advent of Google Neural Machine Translation in 2016. Observability was first discussed publicly in Twitter's seminal blog post in 2013. Thus, many may dismiss these concepts as "old news."

That first wave of AIOps can be questioned as more of a hype-cycle phenomenon; by 2020, critics dubbed it a "misleading promise." So it is understandable for AI skeptics to be wary of similar claims today. Yet what's happening in 2025 is fundamentally different. The inflection point can be traced to the release of ChatGPT, built on GPT-3.5, in late 2022. Three years later there's been a Cambrian explosion of entirely new tools, better methodologies, and deeper insights.

LLMs for Natural Language Processing (NLP) are a natural fit for summarizing data into textual output. For observability, this has direct utility in generating incident reports, call summaries, root-cause hypotheses, ticket-to-issue ratio analysis, and so on. Keep (keephq.dev) is an open source AIOps platform that focuses on alert enrichment, correlation, incident context gathering, workflows, integrations, and dashboards.
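As a concrete example of the summarization use case, here is a minimal sketch that drafts an incident report from raw alerts, assuming the OpenAI Python client; the model name and alert strings are stand-ins, and any chat-completions-compatible endpoint would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

alerts = [  # stand-in alert payloads
    "02:14 UTC checkout-api p99 latency 4.2s (SLO: 800ms)",
    "02:16 UTC payments-db connection pool exhausted (500/500)",
    "02:21 UTC checkout-api 5xx rate 12% (threshold: 1%)",
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; swap in whatever you run
    messages=[
        {"role": "system",
         "content": "You are an SRE assistant. Summarize the alerts into a "
                    "draft incident report: impact, likely cause, suggested "
                    "next steps. Flag uncertainty explicitly."},
        {"role": "user", "content": "\n".join(alerts)},
    ],
)
print(response.choices[0].message.content)  # a draft for human review, not a verdict
```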

At the next level, NLP can be used to ask open-ended questions of data-driven observability systems, useful (in theory) for formulating real-time diagnostic strategies and exploring root cause analysis. However, because of the propensity of LLMs to hallucinate and give nondeterministic answers, plus their tendency to lose context over time, we have to question the results any AI provides. While it might be useful for generating hypotheses to investigate, it can't be relied on as anything more than a bystander's opinion, and potentially a highly misleading one.

Still, these capabilities amount to technical support or SRE meta-conversation: organizing information for human-level discussions about incidents and outages. They are not the raw processing of the telemetry itself.

Open source projects like K8sGPT go even deeper, scanning Kubernetes clusters to diagnose and triage issues. The project claims to have "SRE experience codified into its analyzers," which "helps to pull out the most relevant information to enrich it with AI." K8sGPT also gives users the option of putting the AI in the driver's seat to conduct autoremediation, automatically applying suggested fixes.

There are elements of each of these levels of depth in commercial offerings, such as Datadog's Bits AI, which provides higher-level meta-conversation analysis and incident management, as well as some lower-level debugging, code fixes, and autoremediation.

Let’s keep going deeper.

At the lowest level, there are methods that use vector similarity search to discern patterns in raw telemetry. Rather than relying solely on text indexes and literal string searches, you can now use vector indexing and algorithms like Hierarchical Navigable Small World (HNSW) graphs to find phenomena that resemble issues your systems have seen and experienced in the past. Google wrote about this in 2024 in the context of BigQuery vector search; Apache Pinot also added vector indexing in 2024, allowing users to do similarity search for observability using open source real-time analytics.
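Here is a minimal sketch of that idea using the hnswlib library; the 64-dimensional vectors are random placeholders standing in for whatever embedding model you run over log lines or metric windows.

```python
import hnswlib
import numpy as np

dim, n = 64, 10_000
rng = np.random.default_rng(42)
historical = rng.random((n, dim)).astype(np.float32)  # past incident signatures

# Build an HNSW index over the historical embeddings.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(historical, np.arange(n))
index.set_ef(50)  # query-time recall/speed trade-off

# Embed the anomaly under investigation, then pull the five most similar
# past incidents for comparison.
query = rng.random((1, dim)).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0], distances[0])
```

In practice, the payoff comes from linking each returned ID back to its postmortem or runbook, so responders start from precedent rather than a blank page.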

So while terms like "Observability" and "AIOps" are each closing in on a decade of history, what's occurring in 2025, the fusion of the two, is novel. Moreover, both trends now operate at scales and capability levels orders of magnitude beyond their origins. If you dismissed these trends in past years, now is the time for you and your team to reconsider what's possible for your organization. You might just save yourself from excessive alert fatigue, get more proactive in your practices, reduce your MTTR, and achieve significant, measurable ROI.