There’s a problem with modern observability that almost nobody talks about openly: your monitoring stack might be hurting the systems it’s supposed to protect.

I don’t mean in a theoretical sense. I mean that the agents and SDKs most teams rely on for visibility impose real overhead on the applications they instrument. CPU, memory, throughput. When we benchmarked our own eBPF-based sensor against leading observability platforms at 3,000 requests per second, Datadog added 249% CPU overhead and 227% memory overhead to the monitored application. OpenTelemetry added 59% CPU and 27% memory. Under CPU-constrained conditions, that overhead translated directly into degraded request handling. Datadog reduced throughput by 71%, OpenTelemetry by 19%.

This is the hidden cost that engineering teams discover too late, usually when they’re already at scale and already paying for it. It’s also what pushed us to build differently, and why I think the observability industry is finally being forced to reckon with a problem it created for itself.

The Instrumentation Trap

I learned this lesson before founding groundcover, and it’s what convinced me the industry had a structural problem worth solving.

At a previous company, we had a data pipeline we couldn’t diagnose. Customers were reporting data loss across a complex system: thirty microservices, message queues, Redis, API calls, everything you’d expect from a modern platform. We had logs, plenty of them, but you can’t read through twenty or thirty million log lines and understand what’s going on. So we instrumented: counters to represent every stage of the pipeline, traces, the full treatment. Two months of work. About a million counters by the end. We finally found the leakage, and then the CTO came back and told us the observability bill had increased fivefold. We had to remove most of what we’d built because we couldn’t afford to run it.

That experience was the trigger for founding groundcover. We weren’t doing observability wrong. We were doing it exactly as the industry prescribes. The problem was the model itself.

The standard approach has an elegant simplicity: if you want visibility into a service, you add an SDK. If you want traces, you wrap your HTTP clients. If you want metrics, you decorate your code. For small systems with a handful of well-understood services, this works fine.

The trap springs at scale. Every service has to adopt the correct SDK version, aligned with its runtime and language. As microservice counts grow into the hundreds, keeping instrumentation consistent across teams becomes a project in itself, one that never quite finishes, because applications keep changing. Services get rewritten. Dependencies get upgraded. New third-party integrations appear that nobody has documented.

That last point is the deeper problem. Engineers can only instrument what they already know about. But modern platforms depend on a sprawling ecosystem of managed databases, feature-flag services, authentication providers, external APIs, and internal microservices, many of which were never formally mapped. If an interaction wasn’t anticipated during development, it simply won’t appear in your telemetry. You have visibility into the things you expected to see, and a blind spot over everything else.

In security, we’d call this the “unknown unknowns” problem. In observability, we just call it normal.

AI-assisted development is making this worse faster than most teams realize. Engineers now generate large volumes of code quickly: new services, new dependencies, and new integration patterns appear at a pace that outstrips any team’s ability to instrument or document them proactively. The gap between what’s running in production and what’s actually observable is widening.

Why the Kernel Changes Everything

eBPF, the extended Berkeley Packet Filter, offers a fundamentally different model. Instead of inserting instrumentation into application code, eBPF programs run directly inside the Linux kernel, observing system behavior from below the application layer entirely.

The architectural implication is significant. Traditional monitoring agents run in user space, which means they have to ask the kernel for the data they need, paying for every request in syscalls, context switches, and data copies across the user/kernel boundary. That overhead is exactly what shows up in benchmark results as a CPU and memory tax on your workloads. eBPF programs run in kernel space and access the same data in place, which is why the overhead profile looks so different in practice.
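To make the model concrete, here is the rough shape of an eBPF program in C. This is an illustrative sketch, not production sensor code: a real program is compiled with clang’s BPF target against kernel headers, SEC() comes from <bpf/bpf_helpers.h>, and the counter would live in a BPF map rather than a global. The hook point (tcp_connect) is chosen for the example.

```c
#include <stdint.h>

/* In a real eBPF build, SEC() places the function in a named ELF
 * section (here "kprobe/tcp_connect") that tells the loader where in
 * the kernel to attach it. Defined as a no-op so this sketch compiles
 * as plain C. */
#define SEC(name)

static uint64_t connect_count; /* real code: a BPF map, read from user space */

/* Attached to the kernel's tcp_connect function: runs in kernel
 * context on every outbound TCP connection attempt, with no changes
 * to any application and no SDK in any service. */
SEC("kprobe/tcp_connect")
int trace_tcp_connect(void *ctx)
{
    (void)ctx;        /* a real program would read socket fields here */
    connect_count++;
    return 0;
}
```

The key property is in the attach point, not the body: the kernel invokes this code on every matching event system-wide, which is why coverage doesn’t depend on which services were instrumented.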

There’s a second advantage that matters just as much: eBPF observes everything that touches the kernel, whether it was instrumented or not. HTTP requests, database calls, outbound network connections, process activity, all of it is visible at the kernel layer, regardless of what language the application is written in, which SDK it uses, or whether anyone thought to instrument it. When customers first deploy our sensor and open the platform for the first time, a consistent pattern emerges: they see their production workloads mapped in a way they’ve never seen before. Third-party applications reporting data to external vendors nobody knew about. Service interactions that don’t appear in any architecture diagram. Weird things, that’s the word I’d use, that you never thought you’d see.

The scariest problems in production are the ones you didn’t know to look for. Two minutes after deploying a sensor on a new cluster, customers can see every API going in and out, including the ones that surprised them. That visibility, arriving before anyone wrote a line of instrumentation code, is what makes the approach feel qualitatively different from what came before.

What eBPF Actually Costs to Run

I want to be honest about the tradeoffs, because the eBPF ecosystem has attracted enough hype that it’s worth separating the genuine advantages from the overselling.

Running programs inside the kernel requires a deep understanding of system boundaries. Every eBPF program must pass the kernel verifier before it executes, a safety check that prevents programs from harming system stability. This is genuinely valuable, but it creates real development friction. At groundcover, we describe it as learning to “dance with the verifier.” It rejects programs without always explaining why. Something as basic as copying data from A to B can require careful attention to avoid out-of-bounds access that trips the verifier. A program that passes on one kernel version may be rejected on another, and since the verifier only runs at load time, you may not discover the incompatibility until you deploy to a different node configuration.
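The out-of-bounds example above can be made concrete. The sketch below is plain C illustrating the idiom, not a loadable program: in real eBPF code the copy would be bpf_probe_read_kernel() and the verifier would statically reject any access whose bounds it cannot prove. The common workaround, in my experience, is to mask the length against a power-of-two bound rather than rely on an ordinary comparison, because masking leaves the verifier a range it can prove.

```c
#include <stdint.h>
#include <string.h>

#define BUF_SZ 256  /* must be a power of two for the masking idiom */

/* The verifier rejects any memory access it cannot prove is in bounds
 * at load time. Masking the length makes the upper bound statically
 * provable: after the &=, len is always < BUF_SZ. */
static uint32_t copy_bounded(uint8_t dst[BUF_SZ], const uint8_t *src,
                             uint32_t len)
{
    len &= BUF_SZ - 1;      /* clamp so the range is provable          */
    memcpy(dst, src, len);  /* real eBPF: bpf_probe_read_kernel(...)   */
    return len;
}
```

An ordinary `if (len > BUF_SZ) return;` guard expresses the same intent, but compiler transformations can leave the verifier unable to track the bound on some kernel versions, which is exactly the kind of load-time surprise described above.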

Stack space for eBPF programs is constrained to 512 bytes, which forces a different style of programming than most engineers are used to. Writing efficient eBPF code at scale requires a discipline that takes time to develop. And translating raw kernel signals (network packets, syscall events, process metadata) into something developers can actually act on is a non-trivial engineering problem. The kernel sees everything, but it doesn’t automatically speak the language of distributed traces and service maps. At the eBPF layer, you’re working in hundreds of nanoseconds, not milliseconds. That translation layer is where most of the hard work lives, and it’s the part that takes real investment to get right.
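The stack limit shapes code structure more than you might expect: any buffer larger than the stack allows has to live in a map, typically a per-CPU array borrowed as scratch space. The sketch below shows that pattern in plain C with the map stubbed out; in a real program the lookup is bpf_map_lookup_elem() on a BPF_MAP_TYPE_PERCPU_ARRAY, and the names here are mine, not any particular library’s.

```c
#include <stdint.h>
#include <stddef.h>

#define SCRATCH_SZ 4096  /* far larger than the 512-byte eBPF stack */

struct event {
    uint8_t payload[SCRATCH_SZ];
};

/* Stub standing in for a single-slot BPF_MAP_TYPE_PERCPU_ARRAY. */
static struct event scratch_map[1];

static struct event *scratch_lookup(uint32_t key)
{
    /* real eBPF: bpf_map_lookup_elem(&scratch, &key) */
    return key < 1 ? &scratch_map[key] : NULL;
}

/* The idiom: never declare large buffers as locals (the verifier
 * rejects programs that exceed the stack limit); borrow per-CPU map
 * memory instead, and always NULL-check the lookup, because the
 * verifier rejects the program if you don't. */
static struct event *get_scratch(void)
{
    uint32_t zero = 0;
    struct event *e = scratch_lookup(zero);
    if (!e)
        return NULL;
    return e;
}
```

Per-CPU storage works here because an eBPF program runs to completion without being preempted by another program on the same CPU, so the scratch slot can’t be clobbered mid-use.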

The operational maturity of the ecosystem has improved considerably. CO-RE (Compile Once, Run Everywhere) has addressed many of the portability problems that plagued earlier eBPF development. Toolchains like libbpf have raised the floor significantly. But teams considering eBPF should plan for the learning curve, not assume it away.

eBPF and OpenTelemetry Are Not Competitors

One of the most common misconceptions I encounter is that eBPF and OpenTelemetry are in tension, that adopting one means moving away from the other. This misunderstands what each technology actually does.

OpenTelemetry operates at the application layer. It gives developers a standardized, vendor-neutral way to emit traces, metrics, and logs from their own code. The signals it produces are rich with business context: domain-specific events, custom attributes, application-level spans that reflect your service’s actual logic. This is valuable data that kernel-level observability cannot replicate, because the kernel has no concept of a “checkout flow” or a “recommendation engine.” That semantic layer only exists inside the application.

eBPF operates at the system layer. It gives you automatic, zero-instrumentation visibility across your entire environment, every service, every network connection, every process, regardless of language or runtime.

The right mental model is that eBPF provides the floor and OpenTelemetry provides the ceiling. eBPF ensures you have coverage across everything, including the things you didn’t know to instrument. OpenTelemetry ensures the things you do care about are instrumented with the precision and context your business needs.

In practice, we see customers needing far less OTel instrumentation than they expect. Most teams think they need to instrument everything. What they actually need is eBPF to cover the full environment automatically, and OTel to pinpoint the 20 or 30 percent of their stack where business-level context genuinely matters: a specific checkout flow, a customer-facing API, a billing event. The combination is powerful precisely because each technology is doing what it does best, rather than both trying to do the same job.

Observability That Doesn’t Wait to Be Asked

The deeper shift that eBPF enables isn’t just technical. It’s philosophical.

The traditional model of observability is reactive and anticipatory. Teams instrument what they know, discover blind spots during incidents, add more instrumentation, and repeat the cycle. The system is only as observable as the engineering time invested in instrumenting it, and that investment is always running behind the pace of development.

The kernel-native model inverts this. When you observe behavior at the system level, you get immediate coverage across everything running in your environment, including services that were deployed five minutes ago, third-party dependencies that were never in your architecture docs, and edge cases that no one thought to plan for. You don’t have to anticipate what to observe. The system tells you what’s happening.

For teams operating at high scale, or teams whose development velocity has outpaced their instrumentation discipline, this isn’t an incremental improvement. It’s a different way of thinking about what observability is supposed to do.

The observability industry has spent years asking engineers to do more upfront work: more instrumentation, more configuration, more maintenance, in exchange for visibility. eBPF-native architectures make a different offer. Visibility first, instrumentation where it adds value. That’s the direction the field is moving, and I think it’s the right one.