Broadly speaking, virtualization of computing resources is about isolating resources at different levels. We covered hypervisor-based virtualization in another post. In this article, we dive into OS-level virtualization.
Recall that the gist of virtualization is resource isolation. To support OS-level virtualization, the OS itself must be able to isolate computing resources. There are many implementations of OS-level virtualization.
The Linux kernel provides low-level mechanisms (namespaces, cgroups, and chroot) for building lightweight tools that virtualize the system environment. Docker is one such framework; it builds on chroot, namespaces, and cgroups.
Traditionally, the root directory (/) is the top-level directory shared by all processes in the OS. The chroot() system call allows a process to have its own idea of the root directory. A chroot is an operation that changes the apparent root directory (/) for the currently running process and its children. A program run in such a modified environment cannot access files and commands outside that directory tree. This modified environment is called a chroot jail. The intent of separating a process with chroot() is to improve security by restricting the process from accessing anything outside its environment (that is, from breaking out of the jail). This short video is a great lab.
Although chroot() embodies a basic idea of isolation, it only changes how pathnames are looked up for a process and its children: any name starting with / is resolved relative to the new root. A relative path can still refer to a location outside the new root, for example when the working directory was outside the jail at the time of the call. chroot() is therefore not intended to defend against intentional tampering by privileged users.
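The behavior described above can be tried with a small Python sketch. It is only an illustration under some assumptions: the helper name `list_root_inside_jail` is made up for this example, the platform is Linux, and the call needs root (more precisely, CAP_SYS_CHROOT), so on an unprivileged account the helper simply reports `None`. The chroot happens in a forked child so that the calling process keeps its own root directory.

```python
import json
import os
import tempfile


def list_root_inside_jail(new_root):
    """Fork a child, chroot it into new_root, and return what it sees at "/".

    Returns None if the chroot is not permitted (e.g. not running as root).
    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: try to enter the jail and report back over the pipe.
        os.close(read_fd)
        payload = b"null"
        try:
            os.chroot(new_root)  # change the apparent root directory
            os.chdir("/")        # chroot() does not move the working directory,
                                 # so move it inside the jail explicitly
            payload = json.dumps(sorted(os.listdir("/"))).encode()
        except OSError:
            payload = b"null"    # typically PermissionError when unprivileged
        finally:
            try:
                os.write(write_fd, payload)
            finally:
                os._exit(0)      # never fall through into the parent's code
    # Parent process: collect the child's report.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return json.loads(data)


with tempfile.TemporaryDirectory() as jail:
    open(os.path.join(jail, "hello.txt"), "w").close()
    print("inside the jail, / contains:", list_root_inside_jail(jail))
```

The explicit `os.chdir("/")` after `os.chroot()` matters: without it, the working directory would still point outside the jail, and relative paths would resolve from there.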
Namespaces are the mechanism for abstracting, isolating, and limiting the visibility that a group of processes has over system entities such as process trees, network interfaces, user IDs, and filesystem mounts. There are several kinds of namespaces:
- Mount namespaces – traditionally, there is one global mount namespace seen by all processes. Mount namespaces confine the set of filesystem mount points visible to the processes in a namespace, so that one process group can have a different view of the mounted filesystems than another.
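- UTS namespaces – allow isolation of the hostname per namespace. Each namespace can have its own hostname on the network.
- User namespaces – allow a process to have user and group IDs inside the namespace that differ from its IDs outside, so a process can be root (UID 0) inside the namespace while remaining unprivileged outside.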
- Cgroup namespaces – processes inside a cgroup namespace are only able to view paths relative to their namespace root.
- IPC namespaces – isolate System V inter-process communication objects and POSIX message queues, so that each namespace has its own set. POSIX message queues allow processes to exchange data in the form of messages.
- PID namespaces – traditionally, *nix kernels spawn the init process with PID 1 during system boot, which in turn starts other user-mode processes and is considered the root of the process tree (all other processes start below it in the tree). A PID namespace allows a process to spin off a new tree of processes under it with its own root process (PID 1). PID namespaces isolate process ID numbers and allow the same PID to be used in different PID namespaces. Process IDs only need to be unique within a PID namespace, and are assigned sequentially starting with PID 1. PID namespaces are used in containers.
- Network namespaces – traditionally, all processes in the OS share a single set of network interfaces and routing table entries, and the routing table is modified at the operating system level. With network namespaces, this assumption no longer holds. A network namespace provides abstraction and virtualization of network protocols and interfaces. Each network namespace has its own network device instances that can be configured with individual network addresses. Other networking resources, such as routing tables and port numbers, are isolated as well.
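To see which namespaces a process currently belongs to, one option on Linux is to read the symbolic links under `/proc/<pid>/ns`. The sketch below assumes a Linux system with `/proc` mounted; the helper name `current_namespaces` is made up for this example. Two processes that show the same identifier for an entry are in the same namespace of that type.

```python
import os


def current_namespaces(pid="self"):
    """Return {namespace name: identifier} for a process, e.g. "pid" -> "pid:[...]".

    Returns an empty dict where /proc/<pid>/ns is unavailable (non-Linux
    systems, or a process we are not allowed to inspect).
    """
    ns_dir = f"/proc/{pid}/ns"
    namespaces = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return namespaces
    for name in names:
        try:
            # Each entry is a symlink whose target identifies the namespace.
            namespaces[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            continue
    return namespaces


for name, identifier in current_namespaces().items():
    print(f"{name:20} {identifier}")
```

Running this on the host and again inside a container would normally show different identifiers for most entries, which is the isolation described in the list above.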
Namespaces are created with the “unshare” command or syscall, or by passing the corresponding flags to the clone() syscall. The flags are listed in the man page for namespaces. Note that clone() is a more general form of the fork() syscall.
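As a rough sketch of the unshare route, the Python code below calls the libc `unshare()` function through `ctypes` from a forked child, asks for a new user namespace plus a new UTS namespace, and changes the hostname only inside that namespace. Several things here are assumptions rather than part of the original text: the helper name `hostname_in_new_uts_namespace` is invented for this example, the flag values are those defined in the Linux headers, and the kernel must allow unprivileged user namespaces (some distributions and sandboxes disable them, in which case the helper returns `None`).

```python
import ctypes
import os
import socket

# Namespace flags as defined in the Linux header <linux/sched.h>.
CLONE_NEWNS = 0x00020000      # mount namespace
CLONE_NEWCGROUP = 0x02000000  # cgroup namespace
CLONE_NEWUTS = 0x04000000     # UTS (hostname) namespace
CLONE_NEWIPC = 0x08000000     # IPC namespace
CLONE_NEWUSER = 0x10000000    # user namespace
CLONE_NEWPID = 0x20000000     # PID namespace
CLONE_NEWNET = 0x40000000     # network namespace


def hostname_in_new_uts_namespace(name):
    """Set `name` as hostname inside fresh user+UTS namespaces (child only).

    Returns the hostname the child observed, or None if the kernel refused.
    Hypothetical helper for illustration only.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: detach from the parent's namespaces.
        os.close(read_fd)
        seen = b""
        try:
            libc = ctypes.CDLL(None, use_errno=True)
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(name)  # affects only the new UTS namespace
                seen = socket.gethostname().encode()
        except Exception:
            seen = b""                    # not permitted or not supported
        finally:
            try:
                os.write(write_fd, seen)
            finally:
                os._exit(0)               # never return into the parent's code
    # Parent process: its own hostname is untouched.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return data.decode() or None


print("host sees:", socket.gethostname())
print("child saw:", hostname_in_new_uts_namespace("demo-ns"))
```

The fork keeps the experiment contained: only the child leaves the original namespaces, so the hostname reported by the parent stays the same before and after the call.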
cgroups (control groups) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of a collection of processes (not to be confused with a process group, which has its own meaning). There are two versions of cgroups. Version 1 was merged into the Linux kernel mainline in version 2.6.24, released in 2008; version 2 followed in kernel 4.5 (March 2016), with significant changes to the interface and internal functionality.
Using cgroups, you can allocate resources such as CPU time, network bandwidth, and memory. The structure is similar to the Linux process model, where each process is a child of a parent and ultimately descends from the init process, forming a single tree. cgroups are likewise hierarchical, and child cgroups inherit the attributes of their parent. The difference is that in version 1 multiple cgroup hierarchies can exist within a single system, each managing its own set of resources; version 2 uses a single unified hierarchy.
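One way to observe these hierarchies is the `/proc/<pid>/cgroup` file, where each line has the form `hierarchy-ID:controller-list:cgroup-path`. The parser below is a minimal sketch; the function name `parse_proc_cgroup` and the sample text are made up for illustration. A version 1 system shows one line per hierarchy, while a version 2 entry has hierarchy ID 0 and an empty controller list.

```python
def parse_proc_cgroup(text):
    """Parse the contents of /proc/<pid>/cgroup into a list of tuples.

    Each tuple is (hierarchy_id, [controllers], cgroup_path).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only; the path is the remainder.
        hierarchy_id, controllers, path = line.split(":", 2)
        controller_list = controllers.split(",") if controllers else []
        entries.append((int(hierarchy_id), controller_list, path))
    return entries


# Invented sample mixing version 1 lines with a version 2 line.
sample = """\
12:memory:/user.slice
3:cpu,cpuacct:/user.slice
0::/user.slice/session-1.scope
"""

for hierarchy_id, controllers, path in parse_proc_cgroup(sample):
    print(hierarchy_id, controllers, path)
```

On a Linux machine, passing the contents of `/proc/self/cgroup` to the same function shows which cgroup the current process belongs to in each hierarchy.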
Combining cgroups with namespaces isolates processes into containers within a system, where resources are managed separately. Each container behaves like a lightweight virtual machine: it runs as an individual entity and is oblivious to the other entities on the same system.
Above we covered some kernel features that enable container technology. There are many ways to use these features to implement isolation. We call these implementations container runtimes.
LXC is a user-space interface to these Linux kernel containment features. It allows isolated containers to run on a control host using a single kernel. Users can launch a system init for each container, also referred to as a virtual environment (as opposed to a virtual machine). The author of this article regards LXC as a supercharged chroot on Linux. LXC has a companion tool called LXD that provides a REST API. LXC targeted sysadmin use cases (rather than developers), isolating users' private workloads from one another. In its early days, Docker was built on LXC.
Docker targets developers, and it moved beyond LXC with its own execution environment called libcontainer. With the initial success of Docker, a large community (Docker, CoreOS, Google, etc.) emerged around the idea of using containers as the standard unit of software delivery. They started the Open Container Initiative (OCI) to define industry standards for the container runtime (runtime spec) and the image format (image spec). Docker donated the libcontainer codebase to run independently under OCI, as runc. Docker implements isolation using the following technologies:
- Namespaces: to isolate process IDs, networking, mount points, IPC, and host and domain names;
- Cgroups: to isolate CPU and memory usage between containers;
- UnionFS: to isolate the file system.
Another container runtime technology is OpenVZ, which includes an extension of the Linux kernel. It uses containers to host entire operating systems (not just applications and processes). All OpenVZ containers share the host's Linux kernel. OpenVZ is not widely adopted.
- LXC: LXD (REST API)
- Docker: Docker Engine (daemon and CLI)
Docker is now widely adopted for hosting applications in production environments.
Container and Cloud
Public cloud vendors also offer managed services around Docker. Here are some examples:
- AWS: Elastic Container Service, Elastic Container Registry, Elastic Kubernetes Service
- Azure: Azure Kubernetes Service
- Google Cloud: Google Kubernetes Engine
Cloud services were originally built around the VM as the unit of computing resource. OS-level virtualization allows the container to be that unit instead. Together, these technologies gave rise to serverless architecture and the cloud-native deployment model, which has a significant impact on how software services are created and delivered. The cloud native landscape page illustrates more tools around containers.