Storage Backends
By Neependra Khare

When we start a container we don't copy the image (unless the VFS backend is used), and I/O operations done inside the container don't modify the original image. Because we don't copy the image, starting a container is very fast. Changes made by the container are saved in a separate layer. All of this is done using Copy on Write.

Setup for Hands-on

Please follow the instructions mentioned here, but use this Vagrant file.

Copy On Write (CoW)

  • Implicit Sharing
    • Looks like a copy, but it is just a reference to the original resource
    • A local copy is not created until a modification is required
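
As a quick side illustration of implicit sharing (not part of the lab setup; this assumes a CoW-capable filesystem such as btrfs, where GNU cp can create reflink copies):

$ dd if=/dev/zero of=big.img bs=1M count=1024       # create a 1 GB file
$ time cp big.img full-copy.img                     # regular copy: all data is duplicated
$ time cp --reflink=always big.img cow-copy.img     # CoW copy: returns almost instantly
$ echo x >> cow-copy.img                            # only the modified blocks get their own storage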

Where CoW is used

  • Keep track of changes between the image and our container
  • Process Creation - fork()
  • Memory Snapshot - redis
  • Mapped Memory - mmap()
  • Disk Snapshot - btrfs
  • VM Provisioning - Vagrant
  • Containers - Docker

Storage backends for Docker

Docker supports different storage backends for saving images and running containers on different distros.

On Linux

# pwd
/work/docker/docker/daemon/graphdriver
$ grep -ir -A9 "priority = \[\]string" driver_linux.go 
    priority = []string{
        "aufs",
        "btrfs",
        "zfs",
        "devicemapper",
        "overlay",
        "vfs",
    }

On Windows

$ grep -ir -A4 "priority = \[\]string" driver_windows.go
    priority = []string{
        "windowsfilter",
        "windowsdiff",
        "vfs",
    }

On FreeBSD

$ grep -ir -A2 "priority = \[\]string" driver_freebsd.go
    priority = []string{
        "zfs",
    }

On Unsupported

# grep -ir -A2 "priority = \[\]string" driver_unsupported.go 
    priority = []string{
        "unsupported",
    }
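
Unless a backend is configured explicitly, the daemon walks this priority list and uses the first driver that can be set up on the host. You can check which one was picked, or force a specific one, with something like:

$ docker info | grep 'Storage Driver'
$ docker daemon -s overlay        # start the daemon with a specific backend, as in the examples below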

AUFS

image ref: http://cdn-ak.f.st-hatena.com/images/fotolife/d/dayflower/20080714/20080714131209.png

  • Not in the mainline kernel
  • CoW works at the file level
  • Before writing, the file has to be copied up to the topmost layer

As it works at the file level, AUFS can take advantage of the Linux page cache, so with Docker, starting containers from the same image is very fast.

Performance issues:

  • Copying a large file from a read-only layer is expensive
  • As the number of layers increases, the penalty for looking up a file increases
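
A rough way to observe the copy-up cost on an AUFS-backed daemon is to time the first write to a large file that only exists in a read-only image layer. This is just a sketch; the container name and the file path (a large file shipped in the centos image) are example choices:

$ docker run -id --name aufs-demo centos bash
$ docker exec aufs-demo bash -c 'time echo x >> /usr/lib/locale/locale-archive'   # first write copies the whole file to the top layer
$ docker exec aufs-demo bash -c 'time echo x >> /usr/lib/locale/locale-archive'   # second write is fast, the file is already there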

Device Mapper

Device Mapper creates logical devices on top of physical block devices and provides additional features like:

  • RAID (dm-raid)
  • Multipath (dm-multipath)
  • Encryption (dm-crypt)
  • Delay (dm-delay)
  • Thin Provision (dm-thin)
    • Used to create snapshots using CoW
    • Works at Block Level

As thin-provisioning (thinp) CoW works at the block level, it cannot take advantage of the page cache as we saw with AUFS.

Docker uses Thin Provisioning to save images and run containers.

By default, Docker uses Device Mapper and configures it on top of loopback devices, which are not performant and not recommended for production use.

[root@lab-vm-1 vagrant]# docker info
Containers: 0
Images: 24
Storage Driver: devicemapper
 Pool Name: docker-8:17-787438-pool
 Pool Blocksize: 65.54 kB
 Backing Filesystem: extfs
 Data file: /dev/loop0
 Metadata file: /dev/loop1
 Data Space Used: 2.357 GB
 Data Space Total: 107.4 GB
 Data Space Available: 33.91 GB
 Metadata Space Used: 2.413 MB
 Metadata Space Total: 2.147 GB
 Metadata Space Available: 2.145 GB
 Udev Sync Supported: true
 Deferred Removal Enabled: false
 Data loop file: /var/lib/docker/devicemapper/devicemapper/data
 Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata
 Library Version: 1.02.93 (2015-01-30)
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.0.4-301.fc22.x86_64
Operating System: Fedora 22 (Twenty Two)
CPUs: 1
Total Memory: 993.5 MiB
Name: lab-vm-1
ID: I5MF:I2RU:TPDA:6KME:XTKR:SEQ4:W6KD:MW53:FMCW:FVAH:GUAU:X65K

To get better performance, we should put the thin-provisioned volumes on real block devices.

$ vagrant ssh labvm-1
$ sudo -s
$ cd
$ fdisk -l | less
$ fdisk  -l /dev/sda
$ pvcreate /dev/sda
$ vgcreate direct-lvm /dev/sda
$ lvcreate --wipesignatures y -n data direct-lvm -l 95%VG
$ lvcreate --wipesignatures y -n metadata direct-lvm -l 5%VG
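
Optionally, verify that the two logical volumes came out as expected before pointing Docker at them:

$ lvs direct-lvm        # should list the data and metadata volumes
$ lsblk /dev/sda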

$ systemctl stop docker
$ rm -rf /var/lib/docker

Update /etc/sysconfig/docker-storage and set:

DOCKER_STORAGE_OPTIONS="--storage-opt dm.metadatadev=/dev/direct-lvm/metadata --storage-opt dm.datadev=/dev/direct-lvm/data"
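
If your distribution does not ship /etc/sysconfig/docker-storage, the same options can be passed directly on the daemon command line instead:

$ docker daemon -s devicemapper \
      --storage-opt dm.datadev=/dev/direct-lvm/data \
      --storage-opt dm.metadatadev=/dev/direct-lvm/metadata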

$ systemctl start docker
$ docker info
$ dmsetup table
$ lsblk
$ docker images
$ docker run -id busybox /bin/sh
$ docker ps
$ dmsetup table
$ lsblk

Configuration options for Device Mapper:

  • Pool Name: name of the devicemapper pool for this driver.
  • Pool Blocksize: the block size the thin pool was initialized with. This only changes on creation.
  • Base Device Size: the maximum size of a container and image.
  • Data file: block device (or file) used for the devicemapper data.
  • Metadata file: block device (or file) used for the devicemapper metadata.
  • Data Space Used: how much of the Data file is currently used.
  • Data Space Total: the maximum size of the Data file.
  • Data Space Available: how much free space there is in the Data file. If you are using a loop device this will report the actual space available to the loop device on the underlying filesystem.
  • Metadata Space Used: how much of the Metadata file is currently used.
  • Metadata Space Total: the maximum size of the Metadata file.
  • Metadata Space Available: how much free space there is in the Metadata file. If you are using a loop device this will report the actual space available to the loop device on the underlying filesystem.
  • Udev Sync Supported: whether devicemapper is able to sync with udev. Should be true.
  • Data loop file: file attached to the Data file, if a loopback device is used.
  • Metadata loop file: file attached to the Metadata file, if a loopback device is used.
  • Library Version: version of the libdevmapper library used.
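
Several of these values can be tuned with --storage-opt flags when the daemon starts. For example, the base device size (and hence the maximum size of an image or container filesystem) can be changed with dm.basesize; a sketch, assuming a fresh /var/lib/docker:

$ docker daemon -s devicemapper --storage-opt dm.basesize=20G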

Btrfs

  • CoW at the filesystem level, instead of the file or block level
  • Snapshots are created from subvolumes
  • A btrfs filesystem should be created on /var/lib/docker to use it with Docker

$ vagrant ssh labvm-1
$ sudo -s
$ cd
$ systemctl stop docker
$ fdisk -l | less
$ fdisk  -l /dev/sda
$ yum install -y btrfs-progs btrfs-progs-devel
$ mkfs.btrfs -f /dev/sda
$ echo "/dev/sda /var/lib/docker btrfs defaults 0 0" >> /etc/fstab
$ mount -a
$ btrfs filesystem show /var/lib/docker
$ screen 
$ docker daemon -s btrfs
$ docker pull busybox
$ btrfs subvolume list /var/lib/docker
$ docker run -id  busybox /bin/sh
$ btrfs subvolume list /var/lib/docker
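
Independent of Docker, this is what a subvolume snapshot looks like at the btrfs level. A minimal sketch using the filesystem mounted above (the names are just examples):

$ btrfs subvolume create /var/lib/docker/demo
$ btrfs subvolume snapshot /var/lib/docker/demo /var/lib/docker/demo-snap   # CoW copy of the subvolume
$ btrfs subvolume list /var/lib/docker
$ btrfs subvolume delete /var/lib/docker/demo-snap                          # clean up
$ btrfs subvolume delete /var/lib/docker/demo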

Overlay

image ref. : http://cdn-ak.f.st-hatena.com/images/fotolife/d/dayflower/20080714/20080714131209.png

  • Written to do filesystem virtualization
  • In the mainline kernel
  • Works at the file level
  • Shares the page cache when files are opened read-only

Overlay has three layers:

  • Lower layer
  • Upper layer
  • Overlay layer (the switch)

and a workdir for temporary operations. To mount an overlay filesystem, you would need to run a command like the following:

mount -t overlay overlay \
      -o lowerdir=/lower,upperdir=/storage/upper,workdir=/storage/work \
      /mnt

How overlay works:

  • The user interface is the path walk
  • Works by path coincidence
  • The overlay layer switches between the layers
  • The lower layer is copied up on change
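
A minimal sketch of the copy-up behaviour, using the example directories from the mount command above (the upper and work directories must be on the same filesystem):

$ mkdir -p /lower /storage/upper /storage/work /mnt
$ echo from-lower > /lower/a.txt
$ mount -t overlay overlay -o lowerdir=/lower,upperdir=/storage/upper,workdir=/storage/work /mnt
$ echo change > /mnt/a.txt     # write triggers a copy-up into the upper layer
$ ls /storage/upper            # a.txt now lives here
$ cat /lower/a.txt             # the lower layer is untouched: still "from-lower"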

To use Overlay with Docker

$ systemctl stop docker
$ rm -rf /var/lib/docker
$ docker daemon -s overlay
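
As with the btrfs example, you can verify the new backend from a second terminal (or screen window) once the daemon is up:

$ docker info | grep 'Storage Driver'
$ docker pull busybox
$ docker run -id busybox /bin/sh
$ mount | grep overlay          # one overlay mount per running container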

ZFS

  • Docker recently added support for a ZFS storage driver
  • ZFS is not in the mainline Linux kernel due to licensing issues
  • Similar to Btrfs
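
A rough sketch of putting /var/lib/docker on ZFS, assuming the ZFS on Linux packages are installed and a spare disk is available (the pool and dataset names are just examples; the flow mirrors the btrfs steps above):

$ systemctl stop docker
$ rm -rf /var/lib/docker
$ zpool create -f zpool-docker /dev/sda
$ zfs create -o mountpoint=/var/lib/docker zpool-docker/docker
$ docker daemon -s zfs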

VFS

  • No copy on write, so Docker has to do a full copy of the image for every container
  • Without CoW, it is very slow

Let's see how much time it takes to start a container with a CoW-based backend:

[root@lab-vm-1 vagrant]# time docker run -id centos bash
c5f9ec7ac1c00bd4956ddb2c5655a9a91891bbd95c64abd2aac633401d6b139b

real    0m0.260s
user    0m0.009s
sys 0m0.003s

Now let's use VFS as the Docker storage backend and start the container:

$ systemctl stop docker
$ rm -rf /var/lib/docker
$ docker daemon -s vfs
$ docker pull centos
[root@lab-vm-1 vagrant]# time docker run -id centos bash
4ac88a0cf5d6c5ededd9734c2dc5c887b04ac5907a38cc3ef10df3e8aee65ab7

real    0m2.990s
user    0m0.010s
sys     0m0.003s

As you can see, with a CoW-based backend it took just 0.260 seconds, but with VFS it took about 2.9 seconds.