Imagine two high-energy particles colliding at such great pressure and velocity that they bring about nuclear fusion, producing a combined new element along with a shower of radiation that affects everything around them. It's an apt analogy for the collision of two major trends in information technology: Observability (shortened to "O11y") and Artificial Intelligence (AI). The fusion of these two trends is only just now occurring, and it will continue to produce showers of radiant energy across the enterprise over the coming decade.

Observability gives us the ability to see into systems and understand their behavior: from the elemental instrumentation that produces atomic-level telemetry, to the agents and collectors that extract and transport that telemetry, to the deep and massive analytical engines that process and query all of it, to the sophisticated real-time methods that visualize it and act upon it.
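To make the first link in that chain concrete, here is a minimal sketch of instrumentation emitting telemetry, using the opentelemetry-python SDK with a console exporter standing in for a real collector pipeline; the service and attribute names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Wire a tracer provider to a console exporter (a real deployment would
# export to an OTel Collector instead).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

# Each unit of work becomes a span: the atomic-level telemetry that agents,
# collectors, and analytical engines downstream transport and query.
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.item_count", 3)
```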

When observability best practices are applied to AI, organizations can understand whether their AI systems are behaving well and, if not, gain guideposts for building better AI. Conversely, AI can make for better observability.

Many established observability vendors have flocked into the AI observability space, but it is also worth watching how fledgling startups like Honeyhive and pure open source tools like Arize's Phoenix are changing the landscape: evaluation, experimentation, testing, debugging, troubleshooting, monitoring, and optimization of LLMs and AI agents are now production-ready capabilities.

On top of these tactical tools, OpenTelemetry (OTel) has recently been addressing more strategic issues, like the need for semantic conventions for AI. Without such conventions, without common protocols and open standards, AI observability remains in the land of Babel. It must be said that there is a long road ahead for such conventions to percolate into general use, from basic AI instrumentation to toolchains and higher-level observability solutions and services. We're still in the early days of making AI, both LLMs and agents, an easily manageable, readily observable thing.
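For a sense of what those conventions look like in practice, here is a hedged sketch of a span around an LLM call carrying OTel's gen_ai.* semantic convention attributes. The conventions are still incubating, so names may shift; the component name, model call, and token counts below are illustrative placeholders.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm-gateway")  # hypothetical component name

# Span name and attributes follow the (still-evolving) GenAI semantic
# conventions, so any OTel-aware backend can interpret the call uniformly.
with tracer.start_as_current_span("chat gpt-4o") as span:
    span.set_attribute("gen_ai.operation.name", "chat")
    span.set_attribute("gen_ai.system", "openai")
    span.set_attribute("gen_ai.request.model", "gpt-4o")
    # ... invoke the model here ...
    span.set_attribute("gen_ai.usage.input_tokens", 187)   # placeholder count
    span.set_attribute("gen_ai.usage.output_tokens", 42)   # placeholder count
```

Because the attributes are standardized rather than vendor-specific, the same span can feed cost dashboards, latency analysis, and model-comparison queries without bespoke parsing.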

Meanwhile, in the converse direction, AI has unleashed myriad new methods to digest and interpret observability telemetry, as well as the metadata surrounding observability issues. Whether for pattern recognition of expected behavior (such as seasonality) or for anomaly detection (deviations from expected behavior), AI offers new capabilities to rationalize the petabytes of data generated daily by modern enterprises.
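As a toy illustration of that seasonality/anomaly split, the sketch below learns a per-minute-of-day baseline from a week of synthetic latency data and flags deviations with a robust z-score. Real systems would use far longer histories and sturdier models; the thresholds and numbers here are arbitrary.

```python
import numpy as np

# One week of synthetic per-minute latency with a daily cycle, plus noise
# and a single injected spike.
rng = np.random.default_rng(0)
minutes = np.arange(7 * 1440)
season = 100 + 30 * np.sin(2 * np.pi * minutes / 1440)  # expected daily pattern
series = season + rng.normal(0, 5, minutes.size)
series[3000] += 80  # the anomaly we hope to catch

# Baseline per minute-of-day: median and MAD are robust to the outlier itself.
by_minute = series.reshape(7, 1440)
baseline = np.median(by_minute, axis=0)
mad = np.median(np.abs(by_minute - baseline), axis=0)

robust_z = np.abs(series - np.tile(baseline, 7)) / (1.4826 * np.tile(mad, 7) + 1e-9)
print(np.where(robust_z > 8)[0])  # should include index 3000; with only one
                                  # week of history, a stray false positive
                                  # is possible
```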

As an industry we've been talking about AI for [IT] Operations, or "AIOps," since as far back as 2016. Large Language Models (LLMs) build on pioneering research stretching back to the turn of the millennium (if not before), and entered broader use with the advent of Google Neural Machine Translation in 2016. Observability was first discussed publicly in Twitter's seminal blog post in 2013. Thus, many may dismiss these concepts as "old news."

That first wave of AIOps can be questioned as more of a hype-cycle phenomenon; by 2020, critics dubbed it a "misleading promise." So it is understandable for AI skeptics to be wary of similar claims today. Yet what's happening in 2025 is fundamentally different. The inflection point can be traced to the release of ChatGPT, built on GPT-3.5, in late 2022. Three years later there's been a Cambrian explosion of entirely new tools, better methodologies, and deeper insights.

LLMs for Natural Language Processing (NLP) are a natural fit for summarizing data into textual output. For observability, this has direct utility in generating incident reports, call summaries, root-cause hypotheses, ticket-to-issue ratio analysis, and so on. Keep (keephq.dev) is an open source AIOps platform that focuses on alert enrichment, correlation, incident context gathering, workflows, integrations, and dashboards.
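As a concrete example of the summarization use case, here is a minimal sketch that drafts an incident report from raw alerts, assuming the OpenAI Python client; the model name and alert strings are stand-ins, and any chat-completions-compatible endpoint would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

alerts = [  # stand-in alert payloads
    "02:14 UTC checkout-api p99 latency 4.2s (SLO: 800ms)",
    "02:16 UTC payments-db connection pool exhausted (500/500)",
    "02:21 UTC checkout-api 5xx rate 12% (threshold: 1%)",
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; swap in whatever you run
    messages=[
        {"role": "system",
         "content": "You are an SRE assistant. Summarize the alerts into a "
                    "draft incident report: impact, likely cause, suggested "
                    "next steps. Flag uncertainty explicitly."},
        {"role": "user", "content": "\n".join(alerts)},
    ],
)
print(response.choices[0].message.content)  # a draft for human review, not a verdict
```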

At the next level, NLP can be used to ask open-ended questions of data-driven observability systems, useful (in theory) for formulating real-time diagnostic strategies and exploring root cause analysis. However, because of the propensity of LLMs to hallucinate and give nondeterministic answers, plus their tendency to lose context over time, we have to question the results any AI provides. While it might be useful for generating hypotheses to investigate, it can't be relied on as anything more than a bystander's opinion, and potentially a highly misleading one.

Still, these capabilities amount to technical support or SRE meta-conversation: organizing information for human-level discussions about incidents and outages. They are not the raw processing of the telemetry itself.

Open source projects like K8sGPT go even deeper, scanning Kubernetes clusters to diagnose and triage issues. The project claims to have "SRE experience codified into its analyzers," which "helps to pull out the most relevant information to enrich it with AI." K8sGPT also gives users the option of putting the AI in the driver's seat to conduct autoremediation, automatically applying suggested fixes.

There are elements of each of these levels of depth in commercial offerings, such as Datadog's Bits AI, which provides higher-level meta-conversation analysis and incident management, as well as some lower-level debugging, code fixes, and autoremediation.

Let’s keep going deeper.

At the lowest level, there are methods that use vector similarity search to discern patterns in raw telemetry. Rather than relying solely on text indexes and literal string searches, you can now use vector indexing and algorithms like Hierarchical Navigable Small World (HNSW) graphs to find phenomena that resemble issues your systems have seen and experienced in the past. Google wrote about this in 2024 in the context of BigQuery vector search; Apache Pinot also added vector indexing in 2024, allowing users to do similarity search for observability using open source real-time analytics.
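Here is a minimal sketch of that idea using the hnswlib library; the 64-dimensional vectors are random placeholders standing in for whatever embedding model you run over log lines or metric windows.

```python
import hnswlib
import numpy as np

dim, n = 64, 10_000
rng = np.random.default_rng(42)
historical = rng.random((n, dim)).astype(np.float32)  # past incident signatures

# Build an HNSW index over the historical embeddings.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, ef_construction=200, M=16)
index.add_items(historical, np.arange(n))
index.set_ef(50)  # query-time recall/speed trade-off

# Embed the anomaly under investigation, then pull the five most similar
# past incidents for comparison.
query = rng.random((1, dim)).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels[0], distances[0])
```

In practice, the payoff comes from linking each returned ID back to its postmortem or runbook, so responders start from precedent rather than a blank page.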

So while terms like "Observability" and "AIOps" are each closing in on a decade of history, what's occurring in 2025, the fusion of the two, is novel. Moreover, both trends now operate at scales and capability levels orders of magnitude beyond their origins. If you dismissed these trends in past years, now is the time for you and your team to reconsider what's possible for your organization. You might just save yourself from excessive alert fatigue, get more proactive in your practices, reduce your MTTR, and achieve significant, measurable ROI.