This commit is a set of fixes and improvement for zfs graph driver,
in particular:
1. Remove mount point after umount in `Get()` error path, as well
as in `Put()` (with `MNT_DETACH` flag). This should solve "failed
to remove root filesystem for <ID> .... dataset is busy" error
reported in Moby issue 35642.
To reproduce the issue:
- start dockerd with zfs
- docker run -d --name c1 --rm busybox top
- docker run -d --name c2 --rm busybox top
- docker stop c1
- docker rm c1
Output when the bug is present:
```
Error response from daemon: driver "zfs" failed to remove root
filesystem for XXX : exit status 1: "/sbin/zfs zfs destroy -r
scratch/docker/YYY" => cannot destroy 'scratch/docker/YYY':
dataset is busy
```
Output when the bug is fixed:
```
Error: No such container: c1
```
(as the container has been successfully autoremoved on stop)
2. Fix/improve error handling in `Get()` -- do not try to umount
if `refcount` > 0
3. Simplifies unmount in `Get()`. Specifically, remove call to
`graphdriver.Mounted()` (which checks if fs is mounted using
`statfs()` and check for fs type) and `mount.Unmount()` (which
parses `/proc/self/mountinfo`). Calling `unix.Unmount()` is
simple and sufficient.
4. Add unmounting of driver's home to `Cleanup()`.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Previously, the code would set the mtime on the directories before
creating files in the directory itself. This was problematic
because it resulted in the mtimes on the directories being
incorrectly set. This change makes it so that the mtime is
set only _after_ all of the files have been created.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
There was a small issue here, where it copied the data using
traditional mechanisms, even when copy_file_range was successful.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This change makes the VFS graphdriver use the kernel-accelerated
(copy_file_range) mechanism of copying files, which is able to
leverage reflinks.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Previously, graphdriver/copy would improperly copy hardlinks as just regular
files. This patch changes that behaviour, and instead the code now keeps
track of inode numbers, and if it sees the same inode number again
during the copy loop, it hardlinks it, instead of copying it.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
The overlay2 storage-driver requires multiple lower dir
support for overlayFs. Support for this feature was added
in kernel 4.x, but some distros (RHEL 7.4, CentOS 7.4) ship with
an older kernel with this feature backported.
This patch adds feature-detection for multiple lower dirs,
and will perform this feature-detection on pre-4.x kernels
with overlayFS support.
With this patch applied, daemons running on a kernel
with multiple lower dir support will now select "overlay2"
as storage-driver, instead of falling back to "overlay".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different wrt to `EEXIST`/`IsExist`:
- for `Mkdir()`, `IsExist` error should (usually) be ignored
(unless you want to make sure directory was not there before)
as it means "the destination directory was already there"
- for `MkdirAll()`, `IsExist` error should NEVER be ignored.
Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.
Also, there are a couple of cases then IsExist is handled as
"directory already exist" which is wrong. As a result, some code
that never worked as intended is now removed.
NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).
For more details, a quote from my runc commit 6f82d4b (July 2015):
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.
It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).
As of recently, overlayfs reacts to these mounts being leaked (see
One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper now does this, overlay can follow suit.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
When starting the daemon, the `/var/lib/docker` directory
is scanned for existing directories, so that the previously
selected graphdriver will automatically be used.
In some situations, empty directories are present (those
directories can be created during feature detection of
graph-drivers), in which case the daemon refuses to start.
This patch improves detection, and skips empty directories,
so that leftover directories don't cause the daemon to
fail.
Before this change:
$ mkdir /var/lib/docker /var/lib/docker/aufs /var/lib/docker/overlay2
$ dockerd
...
Error starting daemon: error initializing graphdriver: /var/lib/docker contains several valid graphdrivers: overlay2, aufs; Please cleanup or explicitly choose storage driver (-s <DRIVER>)
With this patch applied:
$ mkdir /var/lib/docker /var/lib/docker/aufs /var/lib/docker/overlay2
$ dockerd
...
INFO[2017-11-16T17:26:43.207739140Z] Docker daemon commit=ab90bc296 graphdriver(s)=overlay2 version=dev
INFO[2017-11-16T17:26:43.208033095Z] Daemon has completed initialization
And on restart (prior graphdriver is still picked up):
$ dockerd
...
INFO[2017-11-16T17:27:52.260361465Z] [graphdriver] using prior storage driver: overlay2
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 7a1618ced3 regresses running Docker
in user namespaces. The new check for whether quota are supported calls
NewControl() which in turn calls makeBackingFsDev() which tries to
mknod(). Skip quota tests when we detect that we are running in a user
namespace and return ErrQuotaNotSupported to the caller. This just
restores the status quo.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Add a way to specify a custom graphdriver priority list
during build. This can be done with something like
go build -ldflags "-X github.com/docker/docker/daemon/graphdriver.priority=overlay2,devicemapper"
As ldflags are already used by the engine build process, and it seems
that only one (last) `-ldflags` argument is taken into account by go,
an envoronment variable `DOCKER_LDFLAGS` is introduced in order to
be able to append some text to `-ldflags`. With this in place,
using the feature becomes
make DOCKER_LDFLAGS="-X github.com/docker/docker/daemon/graphdriver.priority=overlay2,devicemapper" dynbinary
The idea behind this is, the priority list might be different
for different distros, so vendors are now able to change it
without patching the source code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Make it possible to disable overlay and overlay2 separately.
With this commit, we now have `exclude_graphdriver_overlay` and
`exclude_graphdriver_overlay2` build tags for the engine, which
is in line with any other graph driver.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
From https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt:
> The lower filesystem can be any filesystem supported by Linux and does
> not need to be writable. The lower filesystem can even be another
> overlayfs. The upper filesystem will normally be writable and if it
> is it must support the creation of trusted.* extended attributes, and
> must provide valid d_type in readdir responses, so NFS is not suitable.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In order to avoid reverting our fix for mount leakage in devicemapper,
add a test which checks that devicemapper's Get() and Put() cycle can
survive having a command running in an rprivate mount propagation setup
in-between. While this is quite rudimentary, it should be sufficient.
We have to skip this test for pre-3.18 kernels.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This patch adds the capability for the VFS graphdriver to use
XFS project quotas. It reuses the existing quota management
code that was created by overlay2 on XFS.
It doesn't rely on a filesystem whitelist, but instead
the quota-capability detection code.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This adds a mechanism (read-only) to check for project quota support
in a standard way. This mechanism is leveraged by the tests, which
test for the following:
1. Can we get a quota controller?
2. Can we set the quota for a particular directory?
3. Is the quota being over-enforced?
4. Is the quota being under-enforced?
5. Can we retrieve the quota?
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This changeset allows Docker's VFS, and Overlay to take advantage of
Linux's zerocopy APIs.
The copy function first tries to use the ficlone ioctl. Reason being:
- they do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation
See: http://oss.sgi.com/archives/xfs/2015-12/msg00356.html
If the clone fails, we fall back to copy_file_range, which internally
may fall back to splice, which has an upper limit on the size
of copy it can perform. Given that, we have to loop until the copy
is done.
For a given dirCopy operation, if the clone fails, we will not try
it again during any other file copy. Same is true with copy_file_range.
If all else fails, we fall back to traditional copy.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: John Howard <jhoward@microsoft.com>
This PR has the API changes described in https://github.com/moby/moby/issues/34617.
Specifically, it adds an HTTP header "X-Requested-Platform" which is a JSON-encoded
OCI Image-spec `Platform` structure.
In addition, it renames (almost all) uses of a string variable platform (and associated)
methods/functions to os. This makes it much clearer to disambiguate with the swarm
"platform" which is really os/arch. This is a stepping stone to getting the daemon towards
fully multi-platform/arch-aware, and makes it clear when "operating system" is being
referred to rather than "platform" which is misleadingly used - sometimes in the swarm
meaning, but more often as just the operating system.
When use overlay2 as the graphdriver and the kernel enable
`CONFIG_OVERLAY_FS_REDIRECT_DIR=y`, rename a dir in lower layer
will has a xattr to redirct its dir to source dir. This make the
image layer unportable. This patch fallback to use naive diff driver
when kernel enable CONFIG_OVERLAY_FS_REDIRECT_DIR
Signed-off-by: Lei Jitang <leijitang@huawei.com>
The change in 7a7357dae1 inadvertently
changed the `defer` error code into a no-op. This restores its behavior
prior to that code change, and also introduces a little more error
logging.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
It was causing the error message to be
'overlay' is not supported over <unknown>
instead of
'overlay' is not supported over ecryptfs
Signed-off-by: Iago López Galeiras <iago@kinvolk.io>
This commit reverts a hunk of commit 2f5f0af3f ("Add unconvert linter")
and adds a hint for unconvert linter to ignore excessive conversion as
it is required on 32-bit platforms (e.g. armhf).
The exact error on armhf is this:
19:06:45 ---> Making bundle: dynbinary (in bundles/17.06.0-dev/dynbinary)
19:06:48 Building: bundles/17.06.0-dev/dynbinary-daemon/dockerd-17.06.0-dev
19:10:58 # github.com/docker/docker/daemon/graphdriver/overlay
19:10:58 daemon/graphdriver/overlay/copy.go:161: cannot use stat.Atim.Sec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:161: cannot use stat.Atim.Nsec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:162: cannot use stat.Mtim.Sec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:162: cannot use stat.Mtim.Nsec (type int32) as type int64 in argument to time.Unix
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of providing a generic message listing all possible reasons
why xfs is not available on the system, let's be specific.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If mount fails, the reason might be right there in the kernel log ring buffer.
Let's include it in the error message, it might be of great help.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>