Making things simple is a lot of work. At dotCloud, we package terribly complex things – such as deploying and scaling web applications – into the simplest possible experience for developers. But how does it work behind the scenes?
From kernel-level virtualization to monitoring, from high-throughput network routing to distributed locks, from dealing with EBS issues to collecting millions of system metrics per minute.
As someone once commented, scaling a PaaS is “like disneyland for systems engineers on crack”.
Still with us? Read on!
This is the third installment of a series of posts exploring the architecture and internals of platform-as-a-service in general, and dotCloud in particular.
You can find episode 1 on kernel namespaces here. Episode 2 covered cgroups, which you can find here.
For our third episode, we will introduce AUFS.
Part 3: AUFS
AUFS is a union filesystem. The purpose of a union filesystem is to merge two or more directory hierarchies into a single one.
There can be many reasons to do that, but the most common ones are:
- Consolidating a large file repository, which spans multiple devices (disks), exposing it under a single directory, and without using block-level techniques like LVM or RAID
- Combining a large, read-only file system (potentially on a CD, DVD, or network share), containing a ready-to-run system image, with a small, writeable area: the resulting file system looks like the large read-only one, except that you can write on it (and changes are actually stored in the writeable area). This is commonly used in Live CDs.
We’re interested in the latter: it lets us have a common base image for all dotCloud applications, and a separate read-write layer, unique to each app.
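To make the layering idea concrete, here is a toy in-memory model of a two-layer union mount. It is only a sketch of the concept, not how AUFS is implemented: the `UnionFS` class, the file paths and the version strings are all hypothetical, made up for illustration.

```python
class UnionFS:
    """Toy model of a union mount: one shared base, one private layer."""

    def __init__(self, base):
        self.base = base   # shared, read-only layer: path -> content
        self.upper = {}    # private, writable layer for a single app

    def read(self, path):
        # The writable layer shadows the base image.
        if path in self.upper:
            return self.upper[path]
        return self.base[path]

    def write(self, path, content):
        # Writes always land in the private layer, never in the base.
        self.upper[path] = content

    def listdir(self):
        # The merged view is the union of both layers.
        return sorted(set(self.base) | set(self.upper))


# Hypothetical base image shared by two apps.
base_image = {"/bin/sh": "shell v1", "/usr/bin/python": "python 2.6"}
app_a = UnionFS(base_image)
app_b = UnionFS(base_image)

app_a.write("/usr/bin/python", "python 2.7")
app_a.write("/home/app/code.py", "print('hi')")

print(app_a.read("/usr/bin/python"))  # python 2.7 (private copy)
print(app_b.read("/usr/bin/python"))  # python 2.6 (still the base)
print(app_a.listdir())
```

Each app sees what looks like a complete, writable filesystem, while the base dictionary is never modified.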
Storage Savings
Let’s assume that the base image weighs 1 GB. It’s actually more than that, since we’re talking about a full server filesystem, containing everything a dotCloud app could potentially need: Python, Ruby, Perl, Java, C compiler and libraries, and so on. If the whole image had to be cloned each time a dotCloud application is deployed, it would use 1 GB of disk space for each new deployment. With AUFS, the base image is stored only once, and each deployment only consumes the space taken by its own changes. AUFS therefore lets us save on storage costs.
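A quick back-of-the-envelope calculation shows the difference. The 1 GB base image comes from the example above; the 10 MB of private changes per app and the count of 1000 apps are assumptions picked purely for illustration.

```python
def full_copy_mb(apps, base_mb=1024):
    # Every deployment gets its own clone of the base image.
    return apps * base_mb


def layered_mb(apps, base_mb=1024, layer_mb=10):
    # One shared base image, plus a small private layer per deployment.
    # layer_mb is an assumed average, not a measured figure.
    return base_mb + apps * layer_mb


apps = 1000
print(full_copy_mb(apps))  # 1024000 MB, i.e. about 1 TB
print(layered_mb(apps))    # 11024 MB, i.e. about 11 GB
```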
Faster Deployments
But there is more! Copying the whole base image would not only use up precious disk space – it would also take time: a minute or so, depending on the disk speed. The copy would also put a significant I/O load on the disk. On the other hand, creating a new “pseudo-image” using AUFS takes a fraction of a second, and requires virtually no I/O at all. Much better, indeed.
Better Memory Usage
Virtually all operating systems use a feature called buffer cache to make disk access faster. Without it, your system would be 10x, or even 100x or 1000x slower, since it would have to access the disk all the time – even for a simple command like ls! As we will see, AUFS also lets us rake in big savings on this buffer cache.
Every single application will load from disk a number of common files and components: the libc standard library, the /bin/sh standard shell… and a lot of common infrastructure, like crond, sshd, the local Mail Transfer Agent, just to name a few. Additionally, all applications of the same type will load the same files: e.g. Python applications will all load a copy of the Python interpreter.
If each app were running from its own copy, identical copies of those common files would be present multiple times in memory, within the buffer cache. Using AUFS, those common files are in the base image, and the Linux kernel therefore knows how to load them only once in memory. This will typically save a few tens of MB for each app.
Easier Upgrades
If you are familiar with storage technology, you might know about snapshots, and copy-on-write devices; and you might rightfully object that the previously mentioned advantages are also available with those.
That’s true; however, with those systems, it is not possible to update the base image and have the changes reflected in the lightweight “clones”. AUFS, on the other hand, lets you do whatever you want with the base image – the changes will be immediately visible in the AUFS mountpoints using the base image. It means that it is easy to do software upgrades, even while the applications are running; just like on a normal server, except that you can upgrade thousands of servers at a time.
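The difference between a snapshot and a union view can be sketched with Python’s `collections.ChainMap`, which looks keys up in a list of dictionaries in order. This is only an analogy, not AUFS itself; the library path and version strings are hypothetical.

```python
from collections import ChainMap

# Hypothetical shared base image: path -> content.
base_image = {"/usr/lib/libssl.so": "openssl 1.0.0"}

# A snapshot-style clone is frozen at the time it is taken.
snapshot = dict(base_image)

# Three union-style views: a private layer searched first, then the base.
layers = [{} for _ in range(3)]
mounts = [ChainMap(layer, base_image) for layer in layers]

# The third app installed its own build in its private layer.
layers[2]["/usr/lib/libssl.so"] = "custom build"

# Upgrade the base image once.
base_image["/usr/lib/libssl.so"] = "openssl 1.0.1"

views = [m["/usr/lib/libssl.so"] for m in mounts]
print(views)                           # ['openssl 1.0.1', 'openssl 1.0.1', 'custom build']
print(snapshot["/usr/lib/libssl.so"])  # openssl 1.0.0
```

The views that never touched the file pick up the upgrade immediately, the app with a private override keeps it, and the snapshot stays on the old version.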
Allows Arbitrary Changes
All those things can also be done without AUFS. For more than 10 years, skilled UNIX sysadmins have been deploying machines (workstations, X terminals, servers…) with a read-only root file system, allowing read-write access through ad hoc mount points. After all, with some clever configuration and tuning, you don’t need to write anywhere other than a few places like /tmp, /var/run, /var/lock, and of course /home. The last one can be a traditional read-write filesystem, and the others can even be tmpfs mounts.
With that in mind, why use AUFS in the first place?
Because it allows arbitrary changes to the filesystem. You need an extra package, or maybe you want to upgrade the version of Python or Ruby? Without AUFS, if you just rely on a shared read-only root filesystem with writable mount points, you have two possibilities.
Either you upgrade the base image (and potentially affect all other users of the image), or you install whatever you need in /home, /tmp or the like – which means a manual install, with potential side effects or conflicts with previously installed versions.
With AUFS, since your root filesystem is still writeable, just apt-get install whatever you need. The read-only base filesystem won’t be affected; all the changes will be written on your own private layer.
Other Union Filesystems
AUFS is not the only filesystem with those properties; so why use this one specifically? We opted for AUFS because, for what we need to do, we believe that it is the most mature and stable solution – or at least, it was at the time we made the decision.
Caveats
It wasn’t perfect either. We are currently using AUFS 3; when we were using AUFS 2, it had significant issues, notably with mmap (other union filesystems performed even worse for that specific issue).
We worked around those issues by directly mounting read-write volumes at strategic places: the data directories of MySQL, PostgreSQL, MongoDB, Redis; the home directory (in which the application code is executed)… This strategy gave us the required stability, without affecting the great flexibility provided by AUFS.
AUFS at dotCloud
Technically, the main feature that benefits from AUFS is our custom package installation system.
If you need a particular library which is not included in our base image, but does exist in the Ubuntu package repository, then installing it in your service is a breeze! Use the systempackages option in your dotcloud.yml file.
Thanks to AUFS, the package will be installed in your service, without touching the base image used by other users.