Did you know gVisor makes containers more secure?
What is gVisor?
gVisor is an application kernel, written in Go, that acts as a sandboxed container runtime and provides an additional layer of isolation between running applications and the host operating system.
It implements a substantial portion of the Linux system call interface and includes an Open Container Initiative (OCI) runtime called runsc, which lets it work with existing container tooling. Because runsc integrates with Docker and Kubernetes, running sandboxed containers is straightforward: gVisor can be used with Docker, Kubernetes, or directly through runsc.
But what is container sandboxing?
Container sandboxing makes containers more isolated by adding a new runtime to the container platform, which keeps programs separated from the rest of the system. Some sandboxing projects do this with lightweight VMs that then start the containers inside them; gVisor, as described below, instead intercepts system calls with a user-space kernel.
Containers share the host kernel with one another, whereas each VM runs its own OS kernel. Containers are therefore less isolated from the host operating system than VMs: a kernel vulnerability exploited from inside one container can compromise the host and every other container on it. In that sense, containers are not as secure as VMs.
Traditional Linux containers are not sandboxed: they rely on kernel features such as namespaces and cgroups for isolation. Sandboxed containers add another layer on top of the security features already found in Linux containers.
Hardening each application individually, for example by writing a seccomp filter or AppArmor profile for each of hundreds of applications in a production environment, would be a very laborious process.
This is where container sandboxing, and a tool like gVisor, comes in to address these needs more efficiently.
How does it work?
gVisor uses paravirtualization to isolate containerized applications from the host system. It intercepts application system calls and acts as the guest kernel between the containerized application and the host kernel.
This provides a level of isolation comparable to that of a virtual machine, but without the heavyweight resource allocation or the fixed resource cost of each VM. It achieves this through several mechanisms: limiting the system calls that reach the host, proxying file system access, and mediating network access.
Furthermore, gVisor itself employs rule-based execution, such as seccomp filters on its own processes, to provide defense-in-depth.
A gVisor sandbox consists of multiple processes, and one or more containers can run inside it.
The Sentry is the largest component of gVisor. Each sandbox has its own isolated instance of the Sentry. It is the kernel that runs the containers and intercepts and responds to system calls made by the application. The Sentry implements all kernel functionality needed by an application, including syscalls, signal delivery, memory management, page faulting logic, threading, and more.
The workflow looks as follows:
- Application makes a system call
- Platform redirects the call to the Sentry
- Sentry does the necessary work to service the call
Note that application system calls are not passed through to the host kernel.
As a userspace application, the Sentry will make some host system calls to support its operation, but it does not allow the application to directly control the system calls it makes.
For example, the Sentry is not able to open files directly; file system operations that extend beyond the sandbox (that is, anything other than internal /proc files, pipes, and so on) are sent to the Gofer component, described below.
The Gofer is a standard host process which is started with each container. Each container running in the sandbox has its own isolated instance of Gofer. This process communicates with the Sentry via the 9P protocol over a socket or shared memory channel.
The Sentry process is started in a restricted seccomp container without access to file system resources. The Gofer mediates all access to these resources, providing an additional level of isolation, and the Sentry in turn exposes that file system access to the containers.
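The division of labor between the Sentry and the Gofer can be sketched with two goroutines exchanging messages over a channel. This is a deliberately simplified stand-in: the real components are separate processes that talk over a socket or shared memory channel, and the `openRequest` message, the `gofer` and `sentryOpen` functions, the in-memory `hostFS` map, and the `/containers/c1/rootfs` path are all hypothetical. What it does show is that the Sentry side holds no file access of its own and must ask the Gofer, which only resolves paths under the container's root.

```go
package main

import (
	"fmt"
	"path"
)

// openRequest is a hypothetical message the Sentry sends when an application
// opens a file that lives outside the sandbox's internal file systems.
type openRequest struct {
	path  string
	reply chan openReply
}

type openReply struct {
	contents string
	err      error
}

// gofer serves requests on behalf of the Sentry. The map stands in for the
// host file system; only entries under root can ever be reached.
func gofer(root string, hostFS map[string]string, requests <-chan openRequest) {
	for req := range requests {
		// Cleaning a rooted path drops leading "..", so the joined result
		// always stays under root.
		full := path.Join(root, path.Clean("/"+req.path))
		data, ok := hostFS[full]
		if !ok {
			req.reply <- openReply{err: fmt.Errorf("ENOENT: %s", req.path)}
			continue
		}
		req.reply <- openReply{contents: data}
	}
}

// sentryOpen models the Sentry handling open(): with no host file access of
// its own, it forwards the request to the Gofer and waits for the answer.
func sentryOpen(goferCh chan<- openRequest, p string) (string, error) {
	reply := make(chan openReply)
	goferCh <- openRequest{path: p, reply: reply}
	r := <-reply
	return r.contents, r.err
}

func main() {
	hostFS := map[string]string{
		"/containers/c1/rootfs/etc/hostname": "sandboxed\n",
		"/etc/shadow":                        "host secrets",
	}
	requests := make(chan openRequest)
	go gofer("/containers/c1/rootfs", hostFS, requests)
	defer close(requests)

	for _, p := range []string{"/etc/hostname", "../../../etc/shadow"} {
		data, err := sentryOpen(requests, p)
		fmt.Printf("%s -> %q, %v\n", p, data, err)
	}
}
```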
What is runsc?
The runsc executable is the entrypoint to running a sandboxed container.
runsc implements the Open Container Initiative (OCI) runtime specification (used by Kubernetes), meaning OCI-compatible filesystem bundles can be run by runsc. A filesystem bundle consists of a config.json file containing the container configuration and a root filesystem for the container.
runsc implements multiple commands that perform various functions such as starting, stopping, listing, and querying the status of containers.
Benefits and limitations
gVisor is written in Go, which:
- avoids security pitfalls that can plague kernels written in memory-unsafe languages
- has strong typing
- has built-in bounds checks
- has no uninitialized variables
- has no use-after-free
- has a built-in race detector
gVisor's design also offers:
- a flexible resource footprint (i.e., based on threads and memory mappings, not fixed guest physical resources)
- lower fixed costs of virtualization
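Several of the Go language properties listed above can be observed directly. The short program below shows that variables start at their zero value, that an out-of-range slice access is caught at run time (as a recoverable panic rather than a silent read of adjacent memory), and that a pointer to a local variable stays valid after the function returns, because the garbage collector, not the programmer, decides when memory is freed. The helper names are illustrative. The race detector is a separate tool enabled with `go run -race` or `go test -race`, so it is not shown here.

```go
package main

import "fmt"

// safeIndex reports whether reading s[i] succeeded. An out-of-range index
// triggers a runtime panic, which is recovered here instead of reading stray memory.
func safeIndex(s []int, i int) (v int, ok bool) {
	defer func() {
		if r := recover(); r != nil {
			ok = false
		}
	}()
	return s[i], true
}

// newCounter returns a pointer to a local variable. Escape analysis places it
// on the heap, so the pointer remains valid after newCounter returns.
func newCounter() *int {
	n := 41
	return &n
}

func main() {
	var zero struct {
		count int
		name  string
		ptr   *int
	}
	fmt.Printf("%d %q %v\n", zero.count, zero.name, zero.ptr) // 0 "" <nil>

	v, ok := safeIndex([]int{1, 2, 3}, 5)
	fmt.Println(v, ok) // 0 false

	c := newCounter()
	*c++
	fmt.Println(*c) // 42
}
```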
Using Go has its challenges, such as:
- missing some features
- lack of ecosystem/library support
- performance overhead introduced by the Go runtime
gVisor's approach also has limitations:
- reduced application compatibility, since not every Linux system call is implemented
- higher per-system-call overhead
- poor performance for system-call-heavy workloads
You can use gVisor to isolate application containers that aren’t entirely trusted (e.g. a new version of an open source project you have never used before). It could be a new project your team has yet to completely vet or anything else you aren’t entirely sure can be trusted in your cluster.
You can also use it together with other tools like Falco to create a multi-layer defense. Falco is an open-source intrusion detection system for containers and cloud-native applications. gVisor’s runtime monitoring infrastructure allows Falco to see what’s happening inside the gVisor sandbox without the user having to do anything different.
You can also set up Kubernetes nodes to run Pods with gVisor using the containerd runtime and the gvisor-containerd-shim. You can use either the io.kubernetes.cri.untrusted-workload annotation or a RuntimeClass to run Pods with runsc. This adds defense-in-depth for your container workloads and protects the host kernel and container runtime from malicious workloads. If a particular Pod is attacked, the damage is contained within that Pod's sandbox, which makes it much harder for an attacker to reach the host or other cluster-level resources.