
Anthropic’s October research, which showed an AI model reproducing a real intrusion, drew mixed reactions. Some questioned the framing and others questioned the intent, but few platform teams found the result surprising. Many already expect a significant security adjustment as AI workloads grow.
AI systems are scaling faster than the security properties of the infrastructure they depend on. This gap becomes more visible as models become more capable and more widely deployed. For teams that want to reduce risk rather than wait for failure modes to appear, the right place to begin is the infrastructure layer. Modern deployment stacks still assume cooperative workloads that share kernels, drivers and accelerators. Those assumptions do not hold in adversarial settings.
Understanding where isolation breaks is the foundation for building safer AI systems. So what’s broken today?
Containers are not isolation boundaries, and multi-tenancy increases exposure
Containers became popular because they make packaging easy and predictable. Developers can bundle everything an application needs and run it anywhere without rebuilding. That convenience is separate from isolation: a container does not create a strong boundary. It brings its own user space, but it still relies on a shared host kernel. If an attacker reaches that kernel or exploits a kernel flaw, every other container on that host becomes part of the same failure domain.
This risk increases in multi-tenant environments. When workloads from different teams or customers share the same container infrastructure, they also share the same kernel. If one container is compromised and the attacker can access kernel memory, the attacker can observe or interfere with other workloads. Secrets, inference outputs and model weights become visible. For AI systems that process sensitive data, this creates real exposure.
The risk grows further when containers manage access to GPUs. GPU runtimes and drivers pass through the kernel in complex ways. They involve shared memory, IPC surfaces and device-level calls that expand the attack surface. If the kernel is the only enforcement point, then using shared hardware means the entire stack becomes part of the trusted computing base. That is a large amount of code to trust. Code will always have vulnerabilities, so relying on a large surface makes accidental exposure more likely.
The core issue is simple. The container boundary is not the isolation boundary. It can define how software is packaged but not how faults are contained. To protect AI workloads, the enforcement point has to move below the kernel.
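The shared-kernel failure domain can be sketched with a toy model. This is illustrative only: the class and method names below are invented for this sketch, not a real container API.

```python
# Toy model of a shared-kernel container host. Illustrative only:
# these names are invented for the sketch, not a real container API.

class Container:
    def __init__(self, name, secret):
        self.name = name
        self.secret = secret  # stand-in for API keys or model weights

class SharedKernelHost:
    """Every container on the host trusts the same kernel."""
    def __init__(self):
        self.containers = []

    def run(self, container):
        self.containers.append(container)

    def compromise_kernel(self):
        # A single kernel flaw exposes every workload on the host:
        # the container boundary is packaging, not isolation.
        return {c.name: c.secret for c in self.containers}

host = SharedKernelHost()
host.run(Container("tenant-a", "weights-a"))
host.run(Container("tenant-b", "weights-b"))

leaked = host.compromise_kernel()
assert set(leaked) == {"tenant-a", "tenant-b"}  # blast radius: everything
```

The point of the model is the shape of `compromise_kernel`: one fault at the shared enforcement point returns every tenant's data at once.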
VM isolation restores clean boundaries and reduces attack surface
Virtual machines provide a more reliable isolation point because each VM has its own kernel. When the trust boundary is placed at the VM layer, a kernel flaw in one workload does not affect its neighbors. This reduces the shared attack surface and separates tenants more effectively. Lightweight virtual machines make this practical at scale. They preserve the container workflow while adding a protective boundary around each workload.
A microVM can run a container inside a minimal, tightly controlled environment. By using a microVM with a container runtime, developers keep the same packaging and deployment model they rely on today, and the system gains a boundary that does not depend on a shared kernel. This removes an entire class of cross-container risks. It also lets operators reduce how much code sits on the isolation boundary.
Hypervisors written in memory-safe languages help even more. More than half of vulnerabilities in low-level systems come from memory safety issues. Using a memory-safe implementation eliminates many of these faults by construction. A smaller and safer hypervisor means a smaller trusted computing base. This aligns with the goal of reducing how much of the system must be trusted.
The principle is straightforward. Isolation should rely on the smallest set of components that can enforce it consistently. Virtual machines provide that boundary, and microVMs make it practical to use that boundary for container based workflows.
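The contrast with the shared-kernel case can be made concrete with a second toy model. Again, the names are invented for illustration; real microVM runtimes such as Firecracker or Kata Containers work very differently under the hood.

```python
# Toy model of per-workload microVM isolation. Illustrative only:
# class and method names are invented for this sketch.

class MicroVM:
    """Each workload gets its own guest kernel; the VM is the trust boundary."""
    def __init__(self, name, secret):
        self.name = name
        self.secret = secret  # stand-in for API keys or model weights

    def compromise_guest_kernel(self):
        # A kernel flaw inside one VM exposes only that VM's workload.
        # The neighboring tenants sit behind their own kernels.
        return {self.name: self.secret}

vms = [MicroVM("tenant-a", "weights-a"),
       MicroVM("tenant-b", "weights-b")]

leaked = vms[0].compromise_guest_kernel()
assert leaked == {"tenant-a": "weights-a"}  # fault contained to one tenant
```

Same attack, different return value: the compromise yields one tenant's data instead of all of them, which is exactly what "limiting blast radius" means.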
GPUs and confidential computing require isolation closer to the hardware
GPUs introduce additional challenges for multi-tenant AI systems. They were designed for throughput, not for separation between untrusted workloads. Many GPUs do not clear memory between jobs. Residual data can remain in device memory long after a workload finishes. Timing behavior and resource allocation patterns can reveal information about what other tenants are doing. This becomes more problematic as multi-tenant inference becomes common.
Containers that share a GPU also share the driver stack. This creates new paths for observation or interference. Without strong isolation around the accelerator, a compromise in one tenant can expose data processed by another. For AI workloads that operate on sensitive information, this is not acceptable.
Confidential computing helps by encrypting data while it is in use. It reduces how much of the host operating system must be trusted. When combined with VM-based isolation, confidential computing limits what even the hypervisor can see of the data. It shrinks the trusted computing base and limits the impact of a compromise.
The result is a more predictable environment where GPU workloads can run without assuming that all tenants are cooperative or honest. The boundary shifts closer to the hardware and becomes easier to reason about under pressure.
Limiting blast radius is the heart of de-risking AI
Improving AI security is not about adding more controls. It is about reducing how much of the system must be trusted. Clean boundaries matter. Smaller attack surfaces matter. Predictable failure domains matter. Containers alone cannot provide these guarantees. VM-based isolation restores the separation containers lack. Memory-safe hypervisors reduce the risk of kernel compromise. Confidential computing shrinks the trusted computing base. GPU isolation prevents leakage at the accelerator layer.
The industry shifted from single large machines to distributed container platforms to meet demand. That shift improved agility but introduced new forms of exposure. The next evolution is to combine the flexibility of containers with the protection of strong isolation. If AI is going to be deployed across sensitive environments, the infrastructure must tolerate adversarial pressure without exposing the workloads it runs. The goal is to limit what any vulnerability can do and ensure that faults remain contained. That is the path toward de-risking AI in a way that scales with its adoption.
