Signed-off-by: John Howard <jhoward@microsoft.com>
This is a fix for a few related scenarios where it's impossible to remove layers or containers
until the host is rebooted. Generally (or at least easiest to repro) through a forced daemon kill
while a container is running.
Possibly worse: following a host reboot, the scratch layer could be leaked and
left on disk under the dataroot\windowsfilter directory even after the container is removed.
One such example of a failure:
1. run a long running container with the --rm flag
docker run --rm -d --name test microsoft/windowsservercore powershell sleep 30
2. Force kill the daemon not allowing it to cleanup. Simulates a crash or a host power-cycle.
3. (re-)Start daemon
4. docker ps -a
PS C:\control> docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7aff773d782b malloc "powershell start-sl…" 11 seconds ago Removal In Progress malloc
5. Try to remove
PS C:\control> docker rm 7aff
Error response from daemon: container 7aff773d782bbf35d95095369ffcb170b7b8f0e6f8f65d5aff42abf61234855d: driver "windowsfilter" failed to remove root filesystem: rename C:\control\windowsfilter\7aff773d782bbf35d95095369ffcb170b7b8f0e6f8f65d5aff42abf61234855d C:\control\windowsfilter\7aff773d782bbf35d95095369ffcb170b7b8f0e6f8f65d5aff42abf61234855d-removing: Access is denied.
PS C:\control>
Step 5 fails.
Signed-off-by: John Howard <jhoward@microsoft.com>
Fixes #36764
@johnstep PTAL. @jterry75 FYI.
There are two commits in this PR. The first ensures that errors are actually returned to the caller - previously they were being thrown away.
The second commit changes the LCOW driver to map, on a per-service-VM basis, "long" container paths such as `/tmp/c8fa0ae1b348f505df2707060f6a49e63280d71b83b7936935c827e2e9bde16d` to much shorter paths based on a per-service-VM counter, such as `/tmp/d3`. The root cause of the failure was that the mount call to create the overlay exceeded the command line length limit; with the shorter paths, the mount command becomes something like the one below.
`mount -t overlay overlay -olowerdir=/tmp/d3:/tmp/d4:/tmp/d5:/tmp/d6:/tmp/d7:/tmp/d8:/tmp/d9:/tmp/d10:/tmp/d11:/tmp/d12:/tmp/d13:/tmp/d14:/tmp/d15:/tmp/d16:/tmp/d17:/tmp/d18:/tmp/d19:/tmp/d20:/tmp/d21:/tmp/d22:/tmp/d23:/tmp/d24:/tmp/d25:/tmp/d26:/tmp/d27:/tmp/d28:/tmp/d29:/tmp/d30:/tmp/d31:/tmp/d32:/tmp/d33:/tmp/d34:/tmp/d35:/tmp/d36:/tmp/d37:/tmp/d38:/tmp/d39:/tmp/d40:/tmp/d41:/tmp/d42:/tmp/d43:/tmp/d44:/tmp/d45:/tmp/d46:/tmp/d47:/tmp/d48:/tmp/d49:/tmp/d50:/tmp/d51:/tmp/d52:/tmp/d53:/tmp/d54:/tmp/d55:/tmp/d56:/tmp/d57:/tmp/d58:/tmp/d59:/tmp/d60:/tmp/d61:/tmp/d62,upperdir=/tmp/d2/upper,workdir=/tmp/d2/work /tmp/c8fa0ae1b348f505df2707060f6a49e63280d71b83b7936935c827e2e9bde16d-mount`
For those worrying about overflow (which I'm sure @thaJeztah will mention...): it's safe to use a counter here as SVMs are disposable in the default configuration. The exception is when running the daemon in unsafe LCOW "global" mode (ie `--storage-opt lcow.globalmode=1`), where the SVMs aren't disposed of and a single one is reused. However, to overflow the command line length it would require several hundred thousand trillion SCSI hot-add operations (a conservative estimate; I should sit down and work it out accurately if I get really bored), and even hitting that would be hard, as just running containers normally uses the VPMEM path for the container's UVM, not SCSI on the global SVM. As a general rule, the counter gets incremented by one per build step (commit, more accurately). Hence you would have to be doing automated builds without restarting the daemon for literally years on end in unsafe mode. 😇
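As an illustration only (not the actual driver code; names and types here are assumptions), the per-service-VM counter mapping could look something like this:
```
package lcow

import (
	"fmt"
	"sync"
)

// shortPathMapper maps a long per-layer path such as
// /tmp/c8fa0ae1b348f5... to a short per-service-VM path such as /tmp/d3,
// keeping the overlay mount command line short. Illustrative sketch only.
type shortPathMapper struct {
	mu      sync.Mutex
	counter uint64
	paths   map[string]string
}

func newShortPathMapper() *shortPathMapper {
	return &shortPathMapper{paths: make(map[string]string)}
}

// shortPath returns the short path for longPath, allocating a new one
// from the per-SVM counter the first time the path is seen.
func (m *shortPathMapper) shortPath(longPath string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	if p, ok := m.paths[longPath]; ok {
		return p
	}
	m.counter++
	p := fmt.Sprintf("/tmp/d%d", m.counter)
	m.paths[longPath] = p
	return p
}
```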
Note that in reality, the previous limit of ~47 layers before hitting the command line length limit is close to what is possible on the platform, at least as of RS5/Windows Server 2019 where, in the HCS v1 schema, a single SCSI controller is used, and that can only support 64 disks per controller per the Hyper-V VDEV. Remember also that one slot is taken up by the SVM's scratch, and another by the container's scratch when committing a layer. So the best you can architecturally get on the platform is around the following (it also differs by 1 depending on whether you are in unsafe or default mode):
```
PS E:\docker\build\36764\short> docker build --no-cache .
Sending build context to Docker daemon 2.048kB
Step 1/4 : FROM alpine as first
---> 11cd0b38bc3c
Step 2/4 : RUN echo test > /test
---> Running in 8ddfe20e5bfb
Removing intermediate container 8ddfe20e5bfb
---> b0103a00b1c9
Step 3/4 : FROM alpine
---> 11cd0b38bc3c
Step 4/4 : COPY --from=first /test /test
---> 54bfae391eba
Successfully built 54bfae391eba
PS E:\docker\build\36764\short> cd ..
PS E:\docker\build\36764> docker build --no-cache .
Sending build context to Docker daemon 4.689MB
Step 1/61 : FROM alpine as first
---> 11cd0b38bc3c
Step 2/61 : RUN echo test > /test
---> Running in 02597ff870db
Removing intermediate container 02597ff870db
---> 3096de6fc454
Step 3/61 : RUN echo test > /test
---> Running in 9a8110f4ff19
Removing intermediate container 9a8110f4ff19
---> 7691808cf28e
Step 4/61 : RUN echo test > /test
---> Running in 9afb8f51510b
Removing intermediate container 9afb8f51510b
---> e42a0df2bb1c
Step 5/61 : RUN echo test > /test
---> Running in fe977ed6804e
Removing intermediate container fe977ed6804e
---> 55850c9b0479
Step 6/61 : RUN echo test > /test
---> Running in be65cbfad172
Removing intermediate container be65cbfad172
---> 0cf8acba70f0
Step 7/61 : RUN echo test > /test
---> Running in fd5b0907b6a9
Removing intermediate container fd5b0907b6a9
---> 257a4493d85d
Step 8/61 : RUN echo test > /test
---> Running in f7ca0ffd9076
Removing intermediate container f7ca0ffd9076
---> 3baa6f4fa2d5
Step 9/61 : RUN echo test > /test
---> Running in 5146814d4727
Removing intermediate container 5146814d4727
---> 485b9d5cf228
Step 10/61 : RUN echo test > /test
---> Running in a090eec1b743
Removing intermediate container a090eec1b743
---> a7eb10155b51
Step 11/61 : RUN echo test > /test
---> Running in 942660b288df
Removing intermediate container 942660b288df
---> 9d286a1e2133
Step 12/61 : RUN echo test > /test
---> Running in c3d369aa91df
Removing intermediate container c3d369aa91df
---> f78be4788992
Step 13/61 : RUN echo test > /test
---> Running in a03c3ac6888f
Removing intermediate container a03c3ac6888f
---> 6504363f61ab
Step 14/61 : RUN echo test > /test
---> Running in 0c3c2fca3f90
Removing intermediate container 0c3c2fca3f90
---> fe3448b8bb29
Step 15/61 : RUN echo test > /test
---> Running in 828d51c76d3b
Removing intermediate container 828d51c76d3b
---> 870684e3aea0
Step 16/61 : RUN echo test > /test
---> Running in 59a2f7c5f3ad
Removing intermediate container 59a2f7c5f3ad
---> cf84556ca5c0
Step 17/61 : RUN echo test > /test
---> Running in bfb4e088eeb3
Removing intermediate container bfb4e088eeb3
---> 9c8f9f652cef
Step 18/61 : RUN echo test > /test
---> Running in f1b88bb5a2d7
Removing intermediate container f1b88bb5a2d7
---> a6233ad21648
Step 19/61 : RUN echo test > /test
---> Running in 45f70577d709
Removing intermediate container 45f70577d709
---> 1b5cc52d370d
Step 20/61 : RUN echo test > /test
---> Running in 2ce231d5043d
Removing intermediate container 2ce231d5043d
---> 4a0e17cbebaa
Step 21/61 : RUN echo test > /test
---> Running in 52e4b0928f1f
Removing intermediate container 52e4b0928f1f
---> 99b50e989bcb
Step 22/61 : RUN echo test > /test
---> Running in f7ba3da7460d
Removing intermediate container f7ba3da7460d
---> bfa3cad88285
Step 23/61 : RUN echo test > /test
---> Running in 60180bf60f88
Removing intermediate container 60180bf60f88
---> fe7271988bcb
Step 24/61 : RUN echo test > /test
---> Running in 20324d396531
Removing intermediate container 20324d396531
---> e930bc039128
Step 25/61 : RUN echo test > /test
---> Running in b3ac70fd4404
Removing intermediate container b3ac70fd4404
---> 39d0a11ea6d8
Step 26/61 : RUN echo test > /test
---> Running in 0193267d3787
Removing intermediate container 0193267d3787
---> 8062d7aab0a5
Step 27/61 : RUN echo test > /test
---> Running in f41f45fb7985
Removing intermediate container f41f45fb7985
---> 1f5f18f2315b
Step 28/61 : RUN echo test > /test
---> Running in 90dd09c63d6e
Removing intermediate container 90dd09c63d6e
---> 02f0a1141f11
Step 29/61 : RUN echo test > /test
---> Running in c557e5386e0a
Removing intermediate container c557e5386e0a
---> dbcd6fb1f6f4
Step 30/61 : RUN echo test > /test
---> Running in 65369385d855
Removing intermediate container 65369385d855
---> e6e9058a0650
Step 31/61 : RUN echo test > /test
---> Running in d861fcc388fd
Removing intermediate container d861fcc388fd
---> 6e4c2c0f741f
Step 32/61 : RUN echo test > /test
---> Running in 1483962b7e1c
Removing intermediate container 1483962b7e1c
---> cf8f142aa055
Step 33/61 : RUN echo test > /test
---> Running in 5868934816c1
Removing intermediate container 5868934816c1
---> d5ff87cdc204
Step 34/61 : RUN echo test > /test
---> Running in e057f3201f3a
Removing intermediate container e057f3201f3a
---> b4031b7ab4ac
Step 35/61 : RUN echo test > /test
---> Running in 22b769b9079c
Removing intermediate container 22b769b9079c
---> 019d898510b6
Step 36/61 : RUN echo test > /test
---> Running in f1d364ef4ff8
Removing intermediate container f1d364ef4ff8
---> 9525cafdf04d
Step 37/61 : RUN echo test > /test
---> Running in 5bf505b8bdcc
Removing intermediate container 5bf505b8bdcc
---> cd5002b33bfd
Step 38/61 : RUN echo test > /test
---> Running in be24a921945c
Removing intermediate container be24a921945c
---> 8675db44d1b7
Step 39/61 : RUN echo test > /test
---> Running in 352dc6beef3d
Removing intermediate container 352dc6beef3d
---> 0ab0ece43c71
Step 40/61 : RUN echo test > /test
---> Running in eebde33e5d9b
Removing intermediate container eebde33e5d9b
---> 46ca4b0dfc03
Step 41/61 : RUN echo test > /test
---> Running in f920313a1e85
Removing intermediate container f920313a1e85
---> 7f3888414d58
Step 42/61 : RUN echo test > /test
---> Running in 10e2f4dc1ac7
Removing intermediate container 10e2f4dc1ac7
---> 14db9e15f2dc
Step 43/61 : RUN echo test > /test
---> Running in c849d6e89aa5
Removing intermediate container c849d6e89aa5
---> fdb770494dd6
Step 44/61 : RUN echo test > /test
---> Running in 419d1a8353db
Removing intermediate container 419d1a8353db
---> d12e9cf078be
Step 45/61 : RUN echo test > /test
---> Running in 0f1805263e4c
Removing intermediate container 0f1805263e4c
---> cd005e7b08a4
Step 46/61 : RUN echo test > /test
---> Running in 5bde05b46441
Removing intermediate container 5bde05b46441
---> 05aa426a3d4a
Step 47/61 : RUN echo test > /test
---> Running in 01ebc84bd1bc
Removing intermediate container 01ebc84bd1bc
---> 35d371fa4342
Step 48/61 : RUN echo test > /test
---> Running in 49f6c2f51dd4
Removing intermediate container 49f6c2f51dd4
---> 1090b5dfa130
Step 49/61 : RUN echo test > /test
---> Running in f8a9089cd725
Removing intermediate container f8a9089cd725
---> b2d0eec0716d
Step 50/61 : RUN echo test > /test
---> Running in a1697a0b2db0
Removing intermediate container a1697a0b2db0
---> 10d96ac8f497
Step 51/61 : RUN echo test > /test
---> Running in 33a2332c06eb
Removing intermediate container 33a2332c06eb
---> ba5bf5609c1c
Step 52/61 : RUN echo test > /test
---> Running in e8920392be0d
Removing intermediate container e8920392be0d
---> 5b3a95685c7e
Step 53/61 : RUN echo test > /test
---> Running in 4b9298587c65
Removing intermediate container 4b9298587c65
---> d4961a349141
Step 54/61 : RUN echo test > /test
---> Running in 8a0c960c2ba1
Removing intermediate container 8a0c960c2ba1
---> b413197fcfa2
Step 55/61 : RUN echo test > /test
---> Running in 536ee3b9596b
Removing intermediate container 536ee3b9596b
---> fc16b69b224a
Step 56/61 : RUN echo test > /test
---> Running in 8b817b8d7b59
Removing intermediate container 8b817b8d7b59
---> 2f0896400ff9
Step 57/61 : RUN echo test > /test
---> Running in ab0ed79ec3d4
Removing intermediate container ab0ed79ec3d4
---> b4fb420e736c
Step 58/61 : RUN echo test > /test
---> Running in 8548d7eead1f
Removing intermediate container 8548d7eead1f
---> 745103fd5a38
Step 59/61 : RUN echo test > /test
---> Running in 1980559ad5d6
Removing intermediate container 1980559ad5d6
---> 08c1c74a5618
Step 60/61 : FROM alpine
---> 11cd0b38bc3c
Step 61/61 : COPY --from=first /test /test
---> 67f053c66c27
Successfully built 67f053c66c27
PS E:\docker\build\36764>
```
Note also that the error messages you get once you go beyond current platform limitations are poor (for example "insufficient resources" followed by a bunch of spew which is incomprehensible to most), and we could do better by detecting this earlier in the daemon. That'll be a (reasonably low-priority) follow-up as and when I have time. Theoretically we *may*, if the platform doesn't require additional changes for RS5, be able to raise the platform limits using the v2 schema, with up to 127 VPMem devices and the possibility of multiple SCSI controllers per SVM/UVM. However, LCOW currently uses HCS v1 schema calls, and there are no plans to rewrite the graphdriver/libcontainerd components outside of moving LCOW fully over to the containerd runtime/snapshotter using the HCS v2 schema, which is still some way off fruition.
PS OK, while waiting for a full run to complete, I did get bored. Turns out it won't overflow the command line length, as max(uint64) is 18446744073709551615, which would still be short enough at 127 layers, double the current platform limit. And I could always change it to hex or base36 to make it even shorter, or remove the 'd' from /tmp/dN. IOW, pretty sure no-one is going to hit the limit even if we could get the platform to 256, which is the current Hyper-V SCSI limit per VM (4x64), although PMEM at 127 would be the next immediate limit.
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.
NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.
The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>
On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.
Signed-off-by: Salahuddin Khan <salah@docker.com>
When go-1.11beta1 is used for building, the following error is
reported:
> 14:56:20 daemon\graphdriver\lcow\lcow.go:236: Debugf format %s reads
> arg #2, but call has 1 arg
While fixing this, let's also fix a few other things in this
very function (startServiceVMIfNotRunning):
1. Do not use fmt.Printf when not required.
2. Use `title` whenever possible.
3. Don't add `id` to messages as `title` already has it.
4. Remove duplicated colons.
5. Try to unify style of messages.
6. s/startservicevmifnotrunning/startServiceVMIfNotRunning/
...
In general, logging/debugging here is a mess and requires much more
love than I can give it at the moment. Areas for improvement:
1. Add a global var logger = logrus.WithField("storage-driver", "lcow")
and use it everywhere else in the code.
2. Use logger.WithField("id", id) whenever possible (same for "context"
and other similar fields).
3. Revise all the errors returned to be uniform.
4. Make use of errors.Wrap[f] whenever possible.
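For illustration, the package-level logger suggested in items 1 and 2 above might look roughly like this (a sketch, not the committed code):
```
package lcow

import "github.com/sirupsen/logrus"

// logger carries the storage-driver field so every message from this
// package is tagged consistently.
var logger = logrus.WithField("storage-driver", "lcow")

func startServiceVMIfNotRunning(id string) error {
	// Attach the per-call fields once and reuse the entry for all messages.
	log := logger.WithField("id", id)
	log.Debug("starting service VM if not already running")
	// ... actual work elided ...
	return nil
}
```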
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Fix the following go-1.11beta1 build error:
> daemon/graphdriver/aufs/aufs.go:376: Wrapf format %s reads arg #1, but call has 0 args
While at it, change '%s' to %q.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes it a bit simpler to remove this interface for v2 plugins
and not break external projects (libnetwork and swarmkit).
Note that before we remove the `Client()` interface from `CompatPlugin`,
libnetwork and swarmkit must be updated to explicitly check for the v1
client interface, as is done in this PR.
This is just a minor tweak that I realized was needed after trying to
implement the corresponding changes in libnetwork.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
In case the aufs driver is not supported because supportsAufs() said so,
it is not possible to get the real reason from the logs.
To fix this, log the error returned.
Note we're not using WithError here as the error message itself is the
sole message we want to print (i.e. there's nothing to add to it).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
ext4 supports d_type by default, but the filetype feature is tunable, so
there is still a chance it is disabled for some reason. In this case,
print an additional message that explicitly tells how to enable d_type support.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
The overlay storage driver currently does not support any option, but was silently
ignoring any option that was passed.
This patch verifies that no options are passed, and produces an error
if any are.
Before this change:
dockerd --storage-driver=overlay --storage-opt dm.thinp_percent=95
INFO[2018-05-11T11:40:40.996597152Z] libcontainerd: started new docker-containerd process pid=256
....
INFO[2018-05-11T11:40:41.135392535Z] Daemon has completed initialization
INFO[2018-05-11T11:40:41.141035093Z] API listen on /var/run/docker.sock
After this change:
dockerd --storage-driver=overlay --storage-opt dm.thinp_percent=95
INFO[2018-05-11T11:39:21.632610319Z] libcontainerd: started new docker-containerd process pid=233
....
Error starting daemon: error initializing graphdriver: overlay: unknown option dm.thinp_percent
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Functions `GetMounts()` and `parseMountTable()` return all the entries
as read and parsed from /proc/self/mountinfo. In many cases the caller
is interested in only one or a few entries, not all of them.
One good example is the `Mounted()` function, which looks for a specific
entry only. Another example is `RecursiveUnmount()`, which is only
interested in mounts under a specific path.
This commit adds a `filter` argument to `GetMounts()` to implement
two things:
1. filter out entries the caller is not interested in
2. stop processing once the caller has found what it wanted
`nil` can be passed to get a backward-compatible behavior, i.e. return
all the entries.
A few filters are implemented:
- `PrefixFilter`: filters out all entries not under `prefix`
- `SingleEntryFilter`: looks for a specific entry
Finally, `Mounted()` is modified to use `SingleEntryFilter()`, and
`RecursiveUnmount()` now uses `PrefixFilter()` (see the usage sketch below).
Unit tests are added to check filters are working.
[v2: ditch NoFilter, use nil]
[v3: ditch GetMountsFiltered()]
[v4: add unit test for filters]
[v5: switch to gotestyourself]
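A usage sketch of the filters, assuming the `pkg/mount` API introduced by this commit:
```
package main

import (
	"fmt"

	"github.com/docker/docker/pkg/mount"
)

func main() {
	// Only parse entries under a prefix instead of the whole mount table.
	entries, err := mount.GetMounts(mount.PrefixFilter("/var/lib/docker"))
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Println(e.Mountpoint)
	}

	// Look for one specific entry; parsing stops once it has been found.
	single, err := mount.GetMounts(mount.SingleEntryFilter("/var/lib/docker/overlay2"))
	if err != nil {
		panic(err)
	}
	fmt.Println("mounted:", len(single) == 1)

	// Passing nil keeps the backward-compatible behavior: all entries.
	all, _ := mount.GetMounts(nil)
	fmt.Println("total entries:", len(all))
}
```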
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 617c352e92 ("Don't create devices if in a user namespace")
introduced a check which was meant to skip the mknod operation when
running in a user namespace, but instead skipped copying FIFO and
socket files.
Signed-off-by: Maxim Ivanov <ivanov.maxim@gmail.com>
Now all of the storage drivers use the field "storage-driver" in their log
messages, which is set to the name of the respective driver.
Storage drivers changed:
- Aufs
- Btrfs
- Devicemapper
- Overlay
- Overlay 2
- Zfs
Signed-off-by: Alejandro González Hevia <alejandrgh11@gmail.com>
Move the "unmount and deactivate" code into a separate method, and
optimize it a bit:
1. Do not use filepath.Walk() as there's no requirement to recursively
go into every directory under home/mnt; a list of directories in mnt
is sufficient. With filepath.Walk(), in case some container fails
to unmount, it would traverse the whole container filesystem, which is
excessive and useless.
2. Do not use GetMounts() and check if a directory is mounted; just
unmount it and ignore "not mounted" error. Note the same error
is returned in case of wrong flags set, but as flags are hardcoded
we can safely ignore such case.
While at it, promote "can't unmount" log level from debug to warning.
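A minimal sketch of the "just unmount and ignore 'not mounted'" pattern described in item 2, assuming unix.Unmount returns EINVAL for a path that is not a mount point:
```
package aufs

import (
	"os"
	"path/filepath"

	"github.com/sirupsen/logrus"
	"golang.org/x/sys/unix"
)

// unmountAll unmounts every directory directly under mntDir, ignoring
// "not mounted" (EINVAL) errors instead of consulting /proc/self/mountinfo
// first. Flags are hardcoded, so EINVAL can only mean "not mounted".
func unmountAll(mntDir string) {
	entries, err := os.ReadDir(mntDir)
	if err != nil {
		return
	}
	for _, e := range entries {
		if !e.IsDir() {
			continue
		}
		p := filepath.Join(mntDir, e.Name())
		if err := unix.Unmount(p, 0); err != nil && err != unix.EINVAL {
			logrus.WithField("storage-driver", "aufs").Warnf("unmount %s: %v", p, err)
		}
	}
}
```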
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Make sure it's clear the error is from unmount.
2. Simplify the code a bit to make it more readable.
[v2: use errors.Wrap]
[v3: use errors.Wrapf]
[v4: lowercase the error message]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Replace EnsureRemoveAll() with Rmdir(), as here we are removing
the container's mount point, which is already properly unmounted
and is therefore an empty directory.
2. Ignore the Rmdir() error (but log it unless it's ENOENT). This
is a mount point, currently unmounted (i.e. an empty directory),
and an older kernel can return EBUSY if e.g. the mount was
leaked to other mount namespaces.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Before this change, volume management was relying on the fact that
everything the plugin mounts is visible on the host within the plugin's
rootfs. In practice this caused some issues with mount leaks, so we
changed the behavior such that mounts are not visible on the plugin's
rootfs, but available outside of it, which breaks volume management.
To fix the issue, allow the plugin to scope the path correctly rather
than assuming that everything is visible in `p.Rootfs`.
In practice this is just scoping the `PropagatedMount` paths to the
correct host path.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Setting up the mounts on the host increases chances of mount leakage and
makes for more cleanup after the plugin has stopped.
With this change all mounts for the plugin are performed by the
container runtime and automatically cleaned up when the container exits.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This was added in #36047 just as a way to make sure the tree is fully
unmounted on shutdown.
For ZFS this could be a breaking change since there was no unmount before.
Someone could have set up the zfs tree themselves. It would be better, if
we really do want the cleanup, to actually walk the unpacked layers checking
for mounts rather than doing a blind recursive unmount of the root.
BTRFS does not use mounts and does not need to unmount anyway.
There was only an unmount to begin with because for some reason the
btrfs tree was being mounted with `private` propagation.
For the other graphdrivers that still have a recursive unmount here...
these were already being unmounted and performing the recursive unmount
shouldn't break anything. If anyone had anything mounted at the
graphdriver location it would have been unmounted on shutdown anyway.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.
There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However with systemd, each system service winds up with its own mount
namespace. This causes a race between daemon startup and other system
services as to whether the mount is actually private.
This also means there is a negative impact when other system services
are started while the daemon is running.
Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain: setting the graphdriver roots to
private disconnects the mount ns hierarchy, preventing propagation of
unmounts... new mounts are of course not propagated either, but the
behavior is racy (or just bad in the case of restarting services), so
it's better to just be able to keep mount propagation intact.
It also does not protect situations like `-v
/var/lib/docker:/var/lib/docker` where all mounts are recursively bound
into the container anyway.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This is a fix to regression in vfs graph driver introduced by
commit 7a1618ced3 ("add quota support to VFS graphdriver").
On some filesystems, vfs fails to init with the following error:
> Error starting daemon: error initializing graphdriver: Failed to mknod
> /go/src/github.com/docker/docker/bundles/test-integration/d6bcf6de610e9/root/vfs/backingFsBlockDev:
> function not implemented
As quota is not essential for vfs, let's ignore (but log as a warning) any error
from quota init.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If mknod() returns ENOSYS, it most probably means quota is not supported
here, so return the appropriate error.
This is a conservative* fix to regression in vfs graph driver introduced
by commit 7a1618ced3 ("add quota support to VFS graphdriver").
On some filesystems, vfs fails to init with the following error:
> Error starting daemon: error initializing graphdriver: Failed to mknod
> /go/src/github.com/docker/docker/bundles/test-integration/d6bcf6de610e9/root/vfs/backingFsBlockDev:
> function not implemented
Reported-by: Brian Goff <cpuguy83@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Files that are suffixed with `_linux.go` or `_windows.go` are
already only built on Linux / Windows, so these build-tags
were redundant.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The overlay2 driver was not setting up the archive.TarOptions field
properly like other storage backend routes to "applyTarLayer"
functionality. The InUserNS field is populated now for overlay2 using
the same query function used by the other storage drivers.
Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
Right now we only log source and destination (and dmesg) if the mount operation
fails. fstype and mount options are easily available. It probably is a good
idea to log these as well, especially since failures can sometimes be caused
by the mount options themselves.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Even though it's highly discouraged, there are existing
installs that are running overlay/overlay2 on filesystems
without d_type support.
This patch allows the daemon to start in such cases, instead of
refusing to start without an option to override.
For fresh installs, backing filesystems without d_type support
will still cause the overlay/overlay2 drivers to be marked as
"unsupported", and skipped during the automatic selection.
This feature is only to keep backward compatibility, but
will be removed at some point.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Support for running overlay/overlay2 on a backing filesystem
without d_type support (most likely: xfs, as ext4 supports
this by default), was deprecated for some time.
Running without d_type support is problematic, and can
lead to difficult to debug issues ("invalid argument" errors,
or unable to remove files from the container's filesystem).
This patch turns the warning that was previously printed
into an "unsupported" error, so that the overlay/overlay2
drivers are not automatically selected when detecting supported
storage drivers.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The fsmagic check was always performed on "data-root" (`/var/lib/docker`),
not on the storage-driver's home directory (e.g. `/var/lib/docker/<somedriver>`).
This caused detection to be done on the wrong filesystem in situations
where `/var/lib/docker/<somedriver>` was a mount, and a different
filesystem than `/var/lib/docker` itself.
This patch checks if the storage-driver's home directory exists, and only
falls back to `/var/lib/docker` if it doesn't exist.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This commit is a set of fixes and improvement for zfs graph driver,
in particular:
1. Remove mount point after umount in `Get()` error path, as well
as in `Put()` (with `MNT_DETACH` flag). This should solve "failed
to remove root filesystem for <ID> .... dataset is busy" error
reported in Moby issue 35642.
To reproduce the issue:
- start dockerd with zfs
- docker run -d --name c1 --rm busybox top
- docker run -d --name c2 --rm busybox top
- docker stop c1
- docker rm c1
Output when the bug is present:
```
Error response from daemon: driver "zfs" failed to remove root
filesystem for XXX : exit status 1: "/sbin/zfs zfs destroy -r
scratch/docker/YYY" => cannot destroy 'scratch/docker/YYY':
dataset is busy
```
Output when the bug is fixed:
```
Error: No such container: c1
```
(as the container has been successfully autoremoved on stop)
2. Fix/improve error handling in `Get()` -- do not try to umount
if `refcount` > 0
3. Simplifies unmount in `Get()`. Specifically, remove call to
`graphdriver.Mounted()` (which checks if fs is mounted using
`statfs()` and check for fs type) and `mount.Unmount()` (which
parses `/proc/self/mountinfo`). Calling `unix.Unmount()` is
simple and sufficient.
4. Add unmounting of driver's home to `Cleanup()`.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Previously, the code would set the mtime on the directories before
creating files in the directory itself. This was problematic
because it resulted in the mtimes on the directories being
incorrectly set. This change makes it so that the mtime is
set only _after_ all of the files have been created.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
There was a small issue here, where it copied the data using
traditional mechanisms, even when copy_file_range was successful.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This change makes the VFS graphdriver use the kernel-accelerated
(copy_file_range) mechanism of copying files, which is able to
leverage reflinks.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Previously, graphdriver/copy would improperly copy hardlinks as just regular
files. This patch changes that behaviour, and instead the code now keeps
track of inode numbers, and if it sees the same inode number again
during the copy loop, it hardlinks it, instead of copying it.
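A sketch of the inode-tracking idea (illustrative, not the exact graphdriver/copy code):
```
package copydir

import (
	"io"
	"os"
	"syscall"
)

// copyFileOrLink copies a regular file, recreating hardlinks: the first
// time an inode is seen the content is copied; later paths with the same
// inode are hardlinked to that first copy instead of being copied again.
func copyFileOrLink(srcPath, dstPath string, seenInodes map[uint64]string) error {
	fi, err := os.Lstat(srcPath)
	if err != nil {
		return err
	}
	if st, ok := fi.Sys().(*syscall.Stat_t); ok && st.Nlink > 1 {
		if first, seen := seenInodes[st.Ino]; seen {
			return os.Link(first, dstPath) // same inode: recreate the hardlink
		}
		seenInodes[st.Ino] = dstPath
	}
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()
	dst, err := os.OpenFile(dstPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, fi.Mode())
	if err != nil {
		return err
	}
	defer dst.Close()
	_, err = io.Copy(dst, src)
	return err
}
```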
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
The overlay2 storage-driver requires multiple lower dir
support for overlayFs. Support for this feature was added
in kernel 4.x, but some distros (RHEL 7.4, CentOS 7.4) ship with
an older kernel with this feature backported.
This patch adds feature-detection for multiple lower dirs,
and will perform this feature-detection on pre-4.x kernels
with overlayFS support.
With this patch applied, daemons running on a kernel
with multiple lower dir support will now select "overlay2"
as storage-driver, instead of falling back to "overlay".
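Feature detection of multiple lower dirs is typically done with a throw-away test mount, roughly like this sketch (assuming standard overlay mount options; not the exact detection code):
```
package overlaycheck

import (
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// supportsMultipleLowerDirs probes whether the running kernel accepts an
// overlay mount with more than one lowerdir, by attempting a test mount
// in a temporary directory and unmounting it again.
func supportsMultipleLowerDirs() (bool, error) {
	td, err := os.MkdirTemp("", "overlay-multilower-check")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(td)

	for _, d := range []string{"lower1", "lower2", "upper", "work", "merged"} {
		if err := os.Mkdir(filepath.Join(td, d), 0o755); err != nil {
			return false, err
		}
	}
	opts := "lowerdir=" + filepath.Join(td, "lower2") + ":" + filepath.Join(td, "lower1") +
		",upperdir=" + filepath.Join(td, "upper") +
		",workdir=" + filepath.Join(td, "work")
	merged := filepath.Join(td, "merged")
	if err := unix.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		return false, nil // mount refused: no multiple-lowerdir support
	}
	_ = unix.Unmount(merged, 0)
	return true, nil
}
```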
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different with respect to `EEXIST`/`IsExist`:
- for `Mkdir()`, an `IsExist` error should (usually) be ignored
(unless you want to make sure the directory was not there before)
as it means "the destination directory was already there"
- for `MkdirAll()`, an `IsExist` error should NEVER be ignored.
Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.
Also, there are a couple of cases where IsExist is handled as
"directory already exists", which is wrong. As a result, some code
that never worked as intended is now removed.
NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).
For more details, a quote from my runc commit 6f82d4b (July 2015):
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as the one MkdirAll needs to use
for a directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, the current MkdirAll implementation [1] returns
ENOTDIR in most of the cases described in #2, with the exception of a
race between MkdirAll and someone else creating the last component of
the MkdirAll argument as a file. In this very case MkdirAll() will
indeed return EEXIST.
Because of #1, an IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring an IsExist error is just plain wrong,
as the directory we require is not created. It's cleaner to report
the error right away.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
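To make the distinction concrete, a small sketch of the two patterns:
```
package mkdirpatterns

import "os"

// For Mkdir, "already exists" is usually fine and can be ignored
// (unless you need to know the directory was not there before).
func ensureDir(path string) error {
	if err := os.Mkdir(path, 0o755); err != nil && !os.IsExist(err) {
		return err
	}
	return nil
}

// For MkdirAll, an existing directory is not an error at all, so any
// error it returns (including IsExist, which means a non-directory is
// in the way) must be reported, never ignored.
func ensureTree(path string) error {
	return os.MkdirAll(path, 0o755)
}
```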
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.
It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).
Overlayfs has recently started to react to these mounts being leaked.
One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper already does this; overlay can follow suit.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
When starting the daemon, the `/var/lib/docker` directory
is scanned for existing directories, so that the previously
selected graphdriver will automatically be used.
In some situations, empty directories are present (those
directories can be created during feature detection of
graph-drivers), in which case the daemon refuses to start.
This patch improves detection, and skips empty directories,
so that leftover directories don't cause the daemon to
fail.
Before this change:
$ mkdir /var/lib/docker /var/lib/docker/aufs /var/lib/docker/overlay2
$ dockerd
...
Error starting daemon: error initializing graphdriver: /var/lib/docker contains several valid graphdrivers: overlay2, aufs; Please cleanup or explicitly choose storage driver (-s <DRIVER>)
With this patch applied:
$ mkdir /var/lib/docker /var/lib/docker/aufs /var/lib/docker/overlay2
$ dockerd
...
INFO[2017-11-16T17:26:43.207739140Z] Docker daemon commit=ab90bc296 graphdriver(s)=overlay2 version=dev
INFO[2017-11-16T17:26:43.208033095Z] Daemon has completed initialization
And on restart (prior graphdriver is still picked up):
$ dockerd
...
INFO[2017-11-16T17:27:52.260361465Z] [graphdriver] using prior storage driver: overlay2
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 7a1618ced3 regresses running Docker
in user namespaces. The new check for whether quota are supported calls
NewControl() which in turn calls makeBackingFsDev() which tries to
mknod(). Skip quota tests when we detect that we are running in a user
namespace and return ErrQuotaNotSupported to the caller. This just
restores the status quo.
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Add a way to specify a custom graphdriver priority list
during build. This can be done with something like
go build -ldflags "-X github.com/docker/docker/daemon/graphdriver.priority=overlay2,devicemapper"
As ldflags are already used by the engine build process, and it seems
that only one (last) `-ldflags` argument is taken into account by go,
an environment variable `DOCKER_LDFLAGS` is introduced in order to
be able to append some text to `-ldflags`. With this in place,
using the feature becomes
make DOCKER_LDFLAGS="-X github.com/docker/docker/daemon/graphdriver.priority=overlay2,devicemapper" dynbinary
The idea behind this is, the priority list might be different
for different distros, so vendors are now able to change it
without patching the source code.
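On the Go side, `-X` overrides a plain string variable at link time, so the consuming code could look something like this sketch (the variable name matches the command above; the default list and parsing here are purely illustrative):
```
package graphdriver

import "strings"

// priority can be overridden at build time with:
//   go build -ldflags "-X github.com/docker/docker/daemon/graphdriver.priority=overlay2,devicemapper"
// The default value below is illustrative only.
var priority = "overlay2,devicemapper,vfs"

// priorityList splits the comma-separated priority string into driver names.
func priorityList() []string {
	return strings.Split(priority, ",")
}
```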
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Make it possible to disable overlay and overlay2 separately.
With this commit, we now have `exclude_graphdriver_overlay` and
`exclude_graphdriver_overlay2` build tags for the engine, which
is in line with any other graph driver.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
From https://www.kernel.org/doc/Documentation/filesystems/overlayfs.txt:
> The lower filesystem can be any filesystem supported by Linux and does
> not need to be writable. The lower filesystem can even be another
> overlayfs. The upper filesystem will normally be writable and if it
> is it must support the creation of trusted.* extended attributes, and
> must provide valid d_type in readdir responses, so NFS is not suitable.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In order to avoid reverting our fix for mount leakage in devicemapper,
add a test which checks that devicemapper's Get() and Put() cycle can
survive having a command running in an rprivate mount propagation setup
in-between. While this is quite rudimentary, it should be sufficient.
We have to skip this test for pre-3.18 kernels.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This patch adds the capability for the VFS graphdriver to use
XFS project quotas. It reuses the existing quota management
code that was created by overlay2 on XFS.
It doesn't rely on a filesystem whitelist, but instead uses
the quota-capability detection code.
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This adds a mechanism (read-only) to check for project quota support
in a standard way. This mechanism is leveraged by the tests, which
test for the following:
1. Can we get a quota controller?
2. Can we set the quota for a particular directory?
3. Is the quota being over-enforced?
4. Is the quota being under-enforced?
5. Can we retrieve the quota?
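A usage sketch of the kind of check these tests perform; the package path and method names (NewControl, SetQuota, GetQuota) are taken from the graphdriver quota code and should be treated as assumptions here:
```
package main

import (
	"fmt"

	"github.com/docker/docker/daemon/graphdriver/quota"
)

func main() {
	// 1. Can we get a quota controller for this backing filesystem?
	ctl, err := quota.NewControl("/var/lib/docker/vfs")
	if err != nil {
		fmt.Println("project quota not supported here:", err)
		return
	}

	// 2. Set a 64 MiB quota on a directory...
	dir := "/var/lib/docker/vfs/dir/example"
	if err := ctl.SetQuota(dir, quota.Quota{Size: 64 * 1024 * 1024}); err != nil {
		fmt.Println("failed to set quota:", err)
		return
	}

	// 5. ...and read it back.
	var q quota.Quota
	if err := ctl.GetQuota(dir, &q); err != nil {
		fmt.Println("failed to read the quota back:", err)
		return
	}
	fmt.Println("quota size:", q.Size)
}
```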
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
This changeset allows Docker's VFS and Overlay drivers to take advantage of
Linux's zerocopy APIs.
The copy function first tries to use the ficlone ioctl, because:
- clones do not allow partial success (aka short writes)
- clones are expected to be a fast metadata operation
See: http://oss.sgi.com/archives/xfs/2015-12/msg00356.html
If the clone fails, we fall back to copy_file_range, which internally
may fall back to splice, which has an upper limit on the size
of copy it can perform. Given that, we have to loop until the copy
is done.
For a given dirCopy operation, if the clone fails, we will not try
it again during any other file copy. Same is true with copy_file_range.
If all else fails, we fall back to traditional copy.
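A sketch of the fallback chain just described, using the x/sys/unix wrappers for FICLONE and copy_file_range (the per-dirCopy caching of which mechanism works is omitted; treat names here as illustrative):
```
package copydir

import (
	"io"
	"os"

	"golang.org/x/sys/unix"
)

// copyRegular copies src to dst, trying the cheapest mechanism first:
// FICLONE (a reflink: all-or-nothing, fast metadata operation), then
// copy_file_range in a loop (it may copy less than asked), and finally
// a traditional userspace copy if everything else fails.
func copyRegular(srcPath, dstPath string) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()
	fi, err := src.Stat()
	if err != nil {
		return err
	}
	dst, err := os.OpenFile(dstPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, fi.Mode())
	if err != nil {
		return err
	}
	defer dst.Close()

	// 1. Reflink: cannot partially succeed, so no loop is needed.
	if err := unix.IoctlFileClone(int(dst.Fd()), int(src.Fd())); err == nil {
		return nil
	}

	// 2. copy_file_range: loop because the kernel may do a short copy.
	remaining := fi.Size()
	for remaining > 0 {
		n, err := unix.CopyFileRange(int(src.Fd()), nil, int(dst.Fd()), nil, int(remaining), 0)
		if err != nil || n == 0 {
			// 3. Fall back to a traditional copy from the beginning.
			if _, err := src.Seek(0, io.SeekStart); err != nil {
				return err
			}
			if _, err := dst.Seek(0, io.SeekStart); err != nil {
				return err
			}
			if err := dst.Truncate(0); err != nil {
				return err
			}
			_, err = io.Copy(dst, src)
			return err
		}
		remaining -= int64(n)
	}
	return nil
}
```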
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: John Howard <jhoward@microsoft.com>
This PR has the API changes described in https://github.com/moby/moby/issues/34617.
Specifically, it adds an HTTP header "X-Requested-Platform" which is a JSON-encoded
OCI Image-spec `Platform` structure.
In addition, it renames (almost all) uses of a string variable platform (and associated)
methods/functions to os. This makes it much easier to disambiguate from the swarm
"platform", which is really os/arch. This is a stepping stone to getting the daemon towards
fully multi-platform/arch-aware, and makes it clear when "operating system" is being
referred to rather than "platform" which is misleadingly used - sometimes in the swarm
meaning, but more often as just the operating system.
When overlay2 is used as the graphdriver and the kernel is built with
`CONFIG_OVERLAY_FS_REDIRECT_DIR=y`, renaming a directory in a lower layer
sets an xattr that redirects the directory to its source. This makes the
image layer unportable. This patch falls back to the naive diff driver
when the kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
The change in 7a7357dae1 inadvertently
changed the `defer` error code into a no-op. This restores its behavior
prior to that code change, and also introduces a little more error
logging.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
It was causing the error message to be
'overlay' is not supported over <unknown>
instead of
'overlay' is not supported over ecryptfs
Signed-off-by: Iago López Galeiras <iago@kinvolk.io>
This commit reverts a hunk of commit 2f5f0af3f ("Add unconvert linter")
and adds a hint for unconvert linter to ignore excessive conversion as
it is required on 32-bit platforms (e.g. armhf).
The exact error on armhf is this:
19:06:45 ---> Making bundle: dynbinary (in bundles/17.06.0-dev/dynbinary)
19:06:48 Building: bundles/17.06.0-dev/dynbinary-daemon/dockerd-17.06.0-dev
19:10:58 # github.com/docker/docker/daemon/graphdriver/overlay
19:10:58 daemon/graphdriver/overlay/copy.go:161: cannot use stat.Atim.Sec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:161: cannot use stat.Atim.Nsec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:162: cannot use stat.Mtim.Sec (type int32) as type int64 in argument to time.Unix
19:10:58 daemon/graphdriver/overlay/copy.go:162: cannot use stat.Mtim.Nsec (type int32) as type int64 in argument to time.Unix
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of providing a generic message listing all possible reasons
why xfs is not available on the system, let's be specific.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If mount fails, the reason might be right there in the kernel log ring buffer.
Let's include it in the error message, it might be of great help.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the update to Debian Stretch, devmapper unit test fails. One
reason is, the combination of somewhat old (less than 3.16) kernel and
relatively new xfsprogs leads to creating a filesystem which is not supported
by the kernel:
> [12206.467518] XFS (dm-1): Superblock has unknown read-only compatible features (0x1) enabled.
> [12206.472046] XFS (dm-1): Attempted to mount read-only compatible filesystem read-write.
> Filesystem can only be safely mounted read only.
> [12206.472079] XFS (dm-1): SB validate failed with error 22.
Ideally, that would be automatically and implicitly handled by xfsprogs.
In real life, we have to take care about it here. Sigh.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Static build with devmapper is impossible now since libudev is required
and no static version of libudev is available (as static libraries are
not supported by systemd which udev is part of).
This should not hurt anyone as "[t]he primary user of static builds
is the Editions, and docker in docker via the containers, and none
of those use device mapper".
Also, since the need for static libdevmapper is gone, there is no need
to self-compile libdevmapper -- let's use the one from Debian Stretch.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Make sure to call C.free on every C string allocated with C.CString, in
every exit path.
C.CString allocates memory in the C heap using malloc. It is the caller's
responsibility to free it. See
https://golang.org/cmd/cgo/#hdr-Go_references_to_C for details.
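The pattern looks like this in cgo (a sketch; the function is illustrative, not the devmapper code):
```
package devmapper

/*
#include <stdlib.h>
#include <string.h>
*/
import "C"
import "unsafe"

// cStringLength demonstrates the rule: C.CString allocates on the C heap
// with malloc, so the caller must C.free it on every exit path; a defer
// right after the allocation makes that hard to forget.
func cStringLength(name string) int {
	cName := C.CString(name)
	defer C.free(unsafe.Pointer(cName)) // freed no matter how we return
	return int(C.strlen(cName))
}
```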
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This enables docker cp and ADD/COPY docker build support for LCOW.
Originally, the graphdriver.Get() interface returned a local path
to the container root filesystem. This does not work for LCOW, so
the Get() method now returns an interface that LCOW implements to
support copying to and from the container.
Signed-off-by: Akash Gupta <akagup@microsoft.com>
libdm currently has a fairly substantial DoS bug that makes certain
operations fail on a libdm device if the device has active references
through mountpoints. This is a significant problem with the advent of
mount namespaces and MS_PRIVATE, and can cause certain --volume mounts
to cause libdm to no longer be able to remove containers:
% docker run -d --name testA busybox top
% docker run -d --name testB -v /var/lib/docker:/docker busybox top
% docker rm -f testA
[fails on libdm with dm_task_run errors.]
This also solves the problem of unprivileged users being able to DoS
docker by using unprivileged mount namespaces to preserve mounts that
Docker has dropped.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In d42dbdd3d4 the code was re-arranged to
better report errors, and ignore non-errors.
In doing so we removed a deferred remove of the AUFS diff path, but did
not replace it with a non-deferred one.
This fixes the issue and makes the code a bit more readable.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
I was able to successfully use device mapper autoconfig feature
(commit 5ef07d79c) but it stopped working after a reboot.
Investigation showed that the dm device was not activated because of
a missing binary, that is not used during initial setup, but every
following time. Here's an error shown when trying to manually activate
the device:
> kir@kd:~/go/src/github.com/docker/docker$ sudo lvchange -a y /dev/docker/thinpool
> /usr/sbin/thin_check: execvp failed: No such file or directory
> Check of pool docker/thinpool failed (status:2). Manual repair required!
Surely, there is no solution to this other than to have a package that
provides the thin_check binary installed beforehand. Because the issue
revealed itself well after the DM setup was performed, it was
somewhat harder to investigate.
With this in mind, let's check for binary presence before setting up DM,
refusing to proceed if the binary is not there, saving a user from later
frustration.
While at it, eliminate repeated binary checking code. The downside is
that the binary lookup now happens more than once -- I think the
clarity of the code outweighs this minor de-optimization.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
I tried using dm.directlvm_device but it ended up with the following
error:
> Error starting daemon: error initializing graphdriver: error
> writing docker thinp autoextend profile: open
> /etc/lvm/profile/docker-thinpool.profile: no such file or directory
The reason is /etc/lvm/profile directory does not exist. I think it is
better to try creating it beforehand.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
This changes the graphdriver to perform dynamic sandbox management.
Previously, as a temporary 'hack', the service VM had a prebuilt
sandbox in it. With this change, management is under the control
of the client (docker) and executes a mkfs.ext4 on it. This enables
sandboxes of non-default sizes too (a TODO previously in the code).
It also addresses https://github.com/moby/moby/pull/33969#discussion_r127287887
Requires:
- go-winio: v0.4.3
- opengcs: v0.0.12
- hcsshim: v0.6.x
Make sure the user understands this is about the in-kernel driver
(not the dockerd driver or something else).
While at it, amend the comment as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Switch some more usage of the Stat function and the Stat_t type from the
syscall package to golang.org/x/sys. Those were missing in PR #33399.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Specifically, none of the graphdrivers are supposed to return a
not-exist type of error on remove (or at least that's how they are
currently handled).
Found that AUFS still had one case where a not-exist error could escape,
when checking if the directory is mounted we call a `Statfs` on the
path.
This fixes AUFS to not return an error in this case, but also
double-checks at the daemon level on layer remove that the error is not
a `not-exist` type of error.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
This changes the LCOW driver to support both global SVM lifetime and
per-instance lifetime. It also corrects the scratch implementation.
Changes most references of syscall to golang.org/x/sys.
Ones that aren't changed include Errno, Signal and SysProcAttr,
as they haven't been implemented in x/sys yet.
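For example, a call like syscall.Stat becomes the x/sys equivalent (a sketch):
```
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Before: var st syscall.Stat_t; syscall.Stat(path, &st)
	// After: the same call from golang.org/x/sys/unix.
	var st unix.Stat_t
	if err := unix.Stat("/var/lib/docker", &st); err != nil {
		fmt.Println("stat failed:", err)
		return
	}
	fmt.Printf("inode=%d mode=%o\n", st.Ino, st.Mode)
}
```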
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed, per 33267e036f:
char on s390x in the /x/sys/unix package is now signed, so
change the build tags accordingly.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
Because we use our own logging callbacks in order to use libdm
effectively, it is quite difficult to debug complicated devicemapper
issues (because any warnings or notices from libdm are muted by our own
callback function). e07d3cd9a ("devmapper: Fix libdm logging") further
reduced the ability of this debugging by only allowing _LOG_FATAL errors
to be passed to the output.
Unfortunately libdm is very chatty, so in order to avoid making the logs
even more crowded, add a dm.libdm_log_level storage option that allows
people who are debugging the lovely world of libdm to be able to dive in
without recompiling binaries.
The valid values of dm.libdm_log_level map directly to the libdm logging
levels, and are in the range [2,7] as of the time of writing with 7
being _LOG_DEBUG and 2 being _LOG_FATAL. The default is _LOG_FATAL.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
LogInit used to act as a manual way of registering the *necessary*
pkg/devicemapper logging callbacks. In addition, it was used to split up
the logic of pkg/devicemapper into daemon/graphdriver/devmapper (such
that some things were logged from libdm).
The manual aspect of this API was completely nonsensical and was just
begging for incorrect usage of pkg/devicemapper, so remove that semantic
and always register our own libdm callbacks.
In addition, recombine the split out logging callbacks into
pkg/devicemapper so that the default logger is local to the library and
also shown to be the recommended logger. This makes the code
substantially easier to read. Also the new DefaultLogger now has
configurable upper-bound for the log level, which allows for dynamically
changing the logging level.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
There have been some cases where, after umount, a device can be busy for a very
short duration. Maybe it's udev rules, or maybe it is runc-related races,
or probably it is something else. We don't know yet.
If deferred removal is enabled but deferred deletion is not, then for the
case of "docker run -ti --rm fedora bash", a container will exit, device
will be deferred removed and then immediately a call will come to delete
the device. It is possible that deletion will fail if device was busy
at that time.
A device can't be deleted if it can't be removed/deactivated first. There
is only one exception and that is when deferred deletion is on. In that
case graph driver will keep track of deleted device and try to delete it
later and return success to caller.
Always make sure that device deactivation is synchronous when device is
being deleted (except the case when deferred deletion is enabled).
This should also take care of small races when device is busy for a short
duration and it is being deleted.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This commit adds the overlay2.size option to the daemon
storage opts.
The user can override this option by the "docker run --storage-opt"
options.
Signed-off-by: Dhawal Yogesh Bhanushali <dbhanushali@vmware.com>
This enables deferred device deletion/removal by default if the driver
version in the kernel is new enough to support the feature.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
we see a lot of
```
level=debug msg="Failed to unmount a03b1bb6f569421857e5407d73d89451f92724674caa56bfc2170de7e585a00b-init overlay: device or resource busy"
```
in the daemon logs, and there are a lot of leftover mountpoints.
This causes container removal to fail.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
There is no case which would result in this error. The root user always exists, and if the id maps are empty, the default value of 0 is correct.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
The test was failing because TarOptions was using a non-pointer for
ChownOpts, which meant the check for nil was never true, and
createTarFile was never using the hdr.UID/GID
Signed-off-by: Daniel Nephin <dnephin@docker.com>
This commit is an extension of fix for 29325 based on the review comment.
In this commit, the quota size for btrfs is kept in `/var/lib/docker/btrfs/quotas`
so that a daemon restart keeps quota.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the issue raised in 29325 where
btrfs quota groups are not clean up even after containers
have been destroyed.
The reason for the issue is that btrfs quota groups have
to be explicitly destroyed. This fix fixes this issue.
This fix is tested manually in Ubuntu 16.04,
with steps specified in 29325.
This fix fixes 29325.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
OverlayFS is supported on top of btrfs as of Linux Kernel 4.7.
Skip the hard enforcement when on kernel 4.7 or newer and
respect the kernel check override flag on older kernels.
https://btrfs.wiki.kernel.org/index.php/Changelog#By_feature
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
Before this, if `forceRemove` is set the container data will be removed
no matter what, including if there are issues with removing container
on-disk state (rw layer, container root).
In practice this causes a lot of issues with leaked data sitting on
disk that users are not able to clean up themselves.
This is particularly a problem while the `EBUSY` errors on remove are so
prevalent. So for now let's not keep this behavior.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Instead of forcing users to manually configure a block device to use
with devmapper, this gives the user the option to let the devmapper
driver configure a device for them.
Adds several new options to the devmapper storage-opts:
- dm.directlvm_device="" - path to the block device to configure for
direct-lvm
- dm.thinp_percent=95 - sets the percentage of space to use for
storage from the passed in block device
- dm.thinp_metapercent=1 - sets the percentage of space to use for metadata
storage from the passed in block device
- dm.thinp_autoextend_threshold=80 - sets the threshold for when `lvm`
should automatically extend the thin pool as a percentage of the total
storage space
- dm.thinp_autoextend_percent=20 - sets the percentage to increase the
thin pool by when an autoextend is triggered.
Defaults are taken from
[here](https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/#/configure-direct-lvm-mode-for-production)
The only option that is required is `dm.directlvm_device` for docker to
set everything up.
Changes to these settings are not currently supported and will error
out.
Future work could support allowing changes to these values.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
If initDevmapper fails after creating the thin pool, the thin pool is not removed;
as a result, the same LVM device cannot be used to create another thin pool.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
This allows graphdrivers to declare that they can reproduce the original
diff stream for a layer. If they do so, the layer store will not use
tar-split processing, but will still verify the digest on layer export.
This makes it easier to experiment with non-default diff formats.
Signed-off-by: Alfred Landrum <alfred.landrum@docker.com>
Since it was introduced no reports were made and lsof seems to cause
issues on some systems.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
When doing devices.cancelDeferredRemoval, the device could already have been removed
and return ErrEnxio, but the code continues to check whether it needs to do a suspend.
doSuspend := devinfo != nil && devinfo.Exists != 0 uses a devinfo which was
obtained before devices.cancelDeferredRemoval(baseInfo), so it is outdated; the device
has been removed and there is no need to suspend. If it does suspend, it will return
"devicemapper: Error running deviceSuspend dm_task_run failed".
Signed-off-by: Lei Jitang <leijitang@huawei.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
This fixes https://github.com/docker/docker/issues/30278 where
there is a race condition in HCS for RS1 and RS2 builds, and enumeration
of compute systems can return access is denied if a silo is being
torn down in the kernel while HCS is attempting to enumerate them.
I often get complaints that container removal failed and users got the following
error message.
"Driver devicemapper failed to remove root filesystem 18a69ba82aaf7a039ce7d44156215012d703001643079775190ac7dd6c6acf56:Device is Busy"
This error message talks about container id but does not give any info
about which particular device id is busy. Most likely device is mounted
in some other mount namespace and if one knows the device id, they
can try to do some debugging figuring which process and which mount
namespace is keeping the device busy and how did we reach that stage.
Without that information, it becomes almost impossible to debug the
problem.
So to improve the debuggability, when device removal fails, also return
device id in error message. Now new message looks as follows.
"Driver devicemapper failed to remove root filesystem 18a69ba82aaf7a039ce7d44156215012d703001643079775190ac7dd6c6acf56: Failed to remove device dbc15bdf9994a17c613d8ef9e924f3cffbf67f91e4f709295c901ad628377991:Device is Busy"
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This fix tries to address the issue raised in 29810
where btrfs subvolume removal failed when docker
is in an unprivileged lxc container. The failure
was caused by `Failed to rescan btrfs quota` with
`operation not permitted`.
However, if disk quota is not enabled, there is no
need to run a btrfs rescan in the first place.
This fix checks for `quotaEnabled` and only run btrfs
rescan if `quotaEnabled` is true.
This fix fixes 29810.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Go style calls for mixed caps instead of all caps:
https://golang.org/doc/effective_go.html#mixed-caps
Change LOOKUP, ACQUIRE, and RELEASE to Lookup, Acquire, and Release.
This vendors a fork of libnetwork for now, to deal with a cyclic
dependency issue. The change will be upstreamed to libnetwork once this is
merged.
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Adds 2 new methods to v2 plugin `Acquire` and `Release` which allow
refcounting directly at the plugin level instead of just the store.
Since a graphdriver is initialized exactly once, and is really managed
by a separate object, it didn't really seem right to call
`getter.Get()` to refcount graphdriver plugins.
On shutdown it was particularly weird where we'd either need to keep a
driver reference in the daemon, or keep a reference to the plugin getter in
the layer store, and even then still store extra details on whether the
graphdriver is a plugin or not.
Instead the plugin proxy itself will handle calling the necessary
refcounting methods directly on the plugin object.
Also adds a new interface in `plugingetter` to account for these new
functions which are not going to be implemented by v1 plugins.
Changes terms `plugingetter.CREATE` and `plugingetter.REMOVE` to
`ACQUIRE` and `RELEASE` respectively, which seems to be better
adjectives for what we're doing.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>