Move the "unmount and deactivate" code into a separate method, and
optimize it a bit:
1. Do not use filepath.Walk() as there's no requirement to recursively
go into every directory under home/mnt; a list of directories in mnt
is sufficient. With filepath.Walk(), if some container fails
to unmount, it goes through the whole container filesystem, which is
excessive and useless.
2. Do not use GetMounts() and check if a directory is mounted; just
unmount it and ignore the "not mounted" error. Note the same error
is returned if wrong flags are set, but as the flags are hardcoded
we can safely ignore that case.
While at it, promote "can't unmount" log level from debug to warning.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Right now we only log source and destination (and dmesg) if a mount operation
fails. fstype and mount options are easily available, so it is probably a good
idea to log these as well, especially because failures can sometimes happen due to
mount options.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different with respect to `EEXIST`/`IsExist`:
- for `Mkdir()`, `IsExist` error should (usually) be ignored
(unless you want to make sure directory was not there before)
as it means "the destination directory was already there"
- for `MkdirAll()`, `IsExist` error should NEVER be ignored.
Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.
Also, there are a couple of cases when IsExist is handled as
"directory already exists" which is wrong. As a result, some code
that never worked as intended is now removed.
NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).
For more details, a quote from my runc commit 6f82d4b (July 2015):
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll needs to use for
a directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, the current MkdirAll implementation [1] returns
ENOTDIR in most of the cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of providing a generic message listing all possible reasons
why xfs is not available on the system, let's be specific.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If mount fails, the reason might be right there in the kernel log ring buffer.
Let's include it in the error message, it might be of great help.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since the update to Debian Stretch, devmapper unit test fails. One
reason is, the combination of somewhat old (less than 3.16) kernel and
relatively new xfsprogs leads to creating a filesystem which is not supported
by the kernel:
> [12206.467518] XFS (dm-1): Superblock has unknown read-only compatible features (0x1) enabled.
> [12206.472046] XFS (dm-1): Attempted to mount read-only compatible filesystem read-write.
> Filesystem can only be safely mounted read only.
> [12206.472079] XFS (dm-1): SB validate failed with error 22.
Ideally, that would be automatically and implicitly handled by xfsprogs.
In real life, we have to take care of it here. Sigh.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
libdm currently has a fairly substantial DoS bug that makes certain
operations fail on a libdm device if the device has active references
through mountpoints. This is a significant problem with the advent of
mount namespaces and MS_PRIVATE, and can cause certain --volume mounts
to cause libdm to no longer be able to remove containers:
% docker run -d --name testA busybox top
% docker run -d --name testB -v /var/lib/docker:/docker busybox top
% docker rm -f testA
[fails on libdm with dm_task_run errors.]
This also solves the problem of unprivileged users being able to DoS
docker by using unprivileged mount namespaces to preserve mounts that
Docker has dropped.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Make sure the user understands this is about the in-kernel driver
(not the dockerd driver or something else).
While at it, amend the comment as well.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Switch some more usage of the Stat function and the Stat_t type from the
syscall package to golang.org/x/sys. Those were missing in PR #33399.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Changes most references of syscall to golang.org/x/sys/.
The ones that aren't changed include Errno, Signal and SysProcAttr,
as they haven't been implemented in /x/sys/.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed
per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
Because we use our own logging callbacks in order to use libdm
effectively, it is quite difficult to debug complicated devicemapper
issues (because any warnings or notices from libdm are muted by our own
callback function). e07d3cd9a ("devmapper: Fix libdm logging") further
reduced the ability to debug such issues by only allowing _LOG_FATAL errors
to be passed to the output.
Unfortunately libdm is very chatty, so in order to avoid making the logs
even more crowded, add a dm.libdm_log_level storage option that allows
people who are debugging the lovely world of libdm to be able to dive in
without recompiling binaries.
The valid values of dm.libdm_log_level map directly to the libdm logging
levels, and are in the range [2,7] as of the time of writing with 7
being _LOG_DEBUG and 2 being _LOG_FATAL. The default is _LOG_FATAL.
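A sketch of how such a value could be validated. The function name is
hypothetical, and treating an empty value as "use the default" is an
assumption of this sketch:

```go
package main

import (
	"fmt"
	"strconv"
)

// Bounds of the range given above: 2 is _LOG_FATAL, 7 is _LOG_DEBUG.
const (
	logLevelFatal = 2
	logLevelDebug = 7
)

// parseLibdmLogLevel validates a dm.libdm_log_level value.
// Assumption: an empty value selects the default, _LOG_FATAL.
func parseLibdmLogLevel(val string) (int, error) {
	if val == "" {
		return logLevelFatal, nil
	}
	level, err := strconv.Atoi(val)
	if err != nil {
		return 0, fmt.Errorf("could not parse dm.libdm_log_level=%q: %w", val, err)
	}
	if level < logLevelFatal || level > logLevelDebug {
		return 0, fmt.Errorf("dm.libdm_log_level must be in range [%d,%d], got %d",
			logLevelFatal, logLevelDebug, level)
	}
	return level, nil
}

func main() {
	for _, v := range []string{"", "7", "9", "debug"} {
		level, err := parseLibdmLogLevel(v)
		fmt.Println(level, err)
	}
}
```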
Signed-off-by: Aleksa Sarai <asarai@suse.de>
LogInit used to act as a manual way of registering the *necessary*
pkg/devicemapper logging callbacks. In addition, it was used to split up
the logic of pkg/devicemapper into daemon/graphdriver/devmapper (such
that some things were logged from libdm).
The manual aspect of this API was completely nonsensical and was just
begging for incorrect usage of pkg/devicemapper, so remove that semantic
and always register our own libdm callbacks.
In addition, recombine the split out logging callbacks into
pkg/devicemapper so that the default logger is local to the library and
also shown to be the recommended logger. This makes the code
substantially easier to read. Also the new DefaultLogger now has
configurable upper-bound for the log level, which allows for dynamically
changing the logging level.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
There have been some cases where, after umount, a device can be busy for a very
short duration. Maybe it's udev rules, or maybe it is runc related races,
or probably it is something else. We don't know yet.
If deferred removal is enabled but deferred deletion is not, then for the
case of "docker run -ti --rm fedora bash", a container will exit, device
will be deferred removed and then immediately a call will come to delete
the device. It is possible that deletion will fail if device was busy
at that time.
A device can't be deleted if it can't be removed/deactivated first. There
is only one exception and that is when deferred deletion is on. In that
case graph driver will keep track of deleted device and try to delete it
later and return success to caller.
Always make sure that device deactivation is synchronous when device is
being deleted (except the case when deferred deletion is enabled).
This should also take care of small races when device is busy for a short
duration and it is being deleted.
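The rule boils down to a single condition. A sketch with hypothetical
names (the real driver keeps these settings on its device set):

```go
package main

import "fmt"

// useDeferredRemoval reports whether deactivation may be deferred while
// a device is being deleted. Deferring is only acceptable when deferred
// deletion is also enabled, because then a failed delete is retried
// later; otherwise deactivation has to be synchronous.
func useDeferredRemoval(deferredRemove, deferredDelete bool) bool {
	return deferredRemove && deferredDelete
}

func main() {
	cases := [][2]bool{{false, false}, {true, false}, {false, true}, {true, true}}
	for _, c := range cases {
		fmt.Printf("deferredRemove=%v deferredDelete=%v -> defer deactivation: %v\n",
			c[0], c[1], useDeferredRemoval(c[0], c[1]))
	}
}
```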
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This enables deferred device deletion/removal by default if the driver
version in the kernel is new enough to support the feature.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Instead of forcing users to manually configure a block device to use
with devmapper, this gives the user the option to let the devmapper
driver configure a device for them.
Adds several new options to the devmapper storage-opts:
- dm.directlvm_device="" - path to the block device to configure for
direct-lvm
- dm.thinp_percent=95 - sets the percentage of space to use for
storage from the passed in block device
- dm.thinp_metapercent=1 - sets the percentage of space to use for metadata
storage from the passed in block device
- dm.thinp_autoextend_threshold=80 - sets the threshold for when `lvm`
should automatically extend the thin pool as a percentage of the total
storage space
- dm.thinp_autoextend_percent=20 - sets the percentage to increase the
thin pool by when an autoextend is triggered.
Defaults are taken from
[here](https://docs.docker.com/engine/userguide/storagedriver/device-mapper-driver/#/configure-direct-lvm-mode-for-production)
The only option that is required is `dm.directlvm_device` for docker to
set everything up.
Changes to these settings are not currently supported and will error
out.
Future work could support allowing changes to these values.
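A sketch of how these options and their defaults fit together. The
struct, its field names and the parser are hypothetical; only the
option names and default values come from the list above:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// directLVMConfig mirrors the options listed above (hypothetical type).
type directLVMConfig struct {
	Device              string
	ThinpPercent        uint64
	ThinpMetaPercent    uint64
	AutoExtendThreshold uint64
	AutoExtendPercent   uint64
}

// parseDirectLVM starts from the listed defaults and applies key=value
// overrides. dm.directlvm_device is the only required option.
func parseDirectLVM(opts []string) (directLVMConfig, error) {
	cfg := directLVMConfig{
		ThinpPercent:        95,
		ThinpMetaPercent:    1,
		AutoExtendThreshold: 80,
		AutoExtendPercent:   20,
	}
	for _, opt := range opts {
		key, val, ok := strings.Cut(opt, "=")
		if !ok {
			return cfg, fmt.Errorf("malformed option %q", opt)
		}
		var dst *uint64
		switch key {
		case "dm.directlvm_device":
			cfg.Device = val
			continue
		case "dm.thinp_percent":
			dst = &cfg.ThinpPercent
		case "dm.thinp_metapercent":
			dst = &cfg.ThinpMetaPercent
		case "dm.thinp_autoextend_threshold":
			dst = &cfg.AutoExtendThreshold
		case "dm.thinp_autoextend_percent":
			dst = &cfg.AutoExtendPercent
		default:
			return cfg, fmt.Errorf("unknown option %q", key)
		}
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			return cfg, fmt.Errorf("invalid value for %s: %w", key, err)
		}
		*dst = n
	}
	if cfg.Device == "" {
		return cfg, errors.New("dm.directlvm_device is required")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseDirectLVM([]string{"dm.directlvm_device=/dev/xvdf"})
	fmt.Printf("%+v %v\n", cfg, err)
}
```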
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
If initDevmapper fails after creating the thin-pool, the thin-pool is not removed,
which means we can't use the same lvm to create another thin-pool.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
When doing devices.cancelDeferredRemoval, the device could have been removed,
in which case ErrEnxio is returned, but the code continues to check whether a suspend is needed.
doSuspend := devinfo != nil && devinfo.Exists != 0 uses a devinfo which was
obtained before devices.cancelDeferredRemoval(baseInfo), so it is outdated: the device
has been removed and there is no need to suspend. If a suspend is attempted, it returns
devicemapper: Error running deviceSuspend dm_task_run failed.
Signed-off-by: Lei Jitang <leijitang@huawei.com>
There is no need to populate device id during unregisterDevice(). Nobody
makes use of this information. We just need to remove file associated
with device and that file is looked up using the hash and not the
device id which is used for thin pool operations.
So get rid of device id argument to unregisterDevice().
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
This fix tries to fix logrus formatting by removing `f` from
`logrus.[Error|Warn|Debug|Fatal|Panic|Info]f` when formatting string
is not present.
Fixed issue #23459
Signed-off-by: Daehyeok Mun <daehyeok@gmail.com>
We just introduced a new tunable dm.xfs_nospace_max_retries. But this tunable
will work only on new kernels where xfs supports this feature. On older
kernels xfs does not allow tuning this behavior.
There are two issues. The first one is that if xfsSetNospaceRetries() fails,
it returns an error but leaves the device activated and mounted. We should
unmount the device and deactivate it before returning.
The second issue is that if docker is started on an older kernel with
dm.xfs_nospace_max_retries specified, then docker will silently ignore the
fact that the /sys file to tweak this behavior is not present and will continue.
But I think it might be better to fail container creation/start if the kernel
does not support this feature.
This patch fixes it. After this patch, the user will get an error like the following
when a container is run.
# docker run -ti fedora bash
docker: Error response from daemon: devmapper: user specified daemon option dm.xfs_nospace_max_retries but it does not seem to be supported on this system :open /sys/fs/xfs/dm-5/error/metadata/ENOSPC/max_retries: no such file or directory.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
When xfs filesystem is being used on top of thin pool, xfs can get ENOSPC
errors from thin pool when thin pool is full. As of now xfs retries the
IO and keeps on retrying and does not give up. This can result in container
application being stuck for a very long time. In fact I have seen instances
of unkillable processes. So that means once the thin pool is full and a process
gets stuck, the container can't be stopped/killed either and the only option left
seems to be a power cycle of the box.
In another instance, writer did not block but failed after a while. But
when I tried to exit/stop the container, unmounting xfs hung and the only
thing I could do was power cycle the machine.
Now the upstream kernel has committed patches that allow user space to
customize xfs behavior in case of errors. One of the knobs is
max_retries, which specifies how many times an IO should be retried
when ENOSPC is encountered.
This patch provides a tunable knob (dm.xfs_nospace_max_retries) so
that the user can specify a value for max_retries and tune xfs behavior. If
one sets this value to 0, xfs will not retry IO when an ENOSPC error is
encountered. It will instead give up and shut down the filesystem.
This knob can be useful if one is running into unkillable
processes/containers issue on top of xfs.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Problem Description:
An example scenario that involves deferred removal
1. A new base image gets created (e.g. 'docker load -i'). The base device is activated and
mounted at some point in time during image creation.
2. While image creation is in progress, a privileged container is started
from another image and the host's mount name space is shared with this
container ('docker run --privileged -v /:/host').
3. Image creation completes and the base device gets unmounted. However,
as the privileged container still holds a reference on the base image
mount point, the base device cannot be removed right away. So it gets
flagged for deferred removal.
4. Next, the privileged container terminates and thus its reference to the
base image mount point gets released. The base device (which is flagged
for deferred removal) may now be cleaned up by the device-mapper. This
opens up an opportunity for a race between a 'kworker' thread (executing
the do_deferred_remove() function) and the Docker daemon (executing the
CreateSnapDevice() function).
This PR cancels the deferred removal if the device is marked for it, and reschedules the
deferred removal later, after the device is resumed successfully.
Signed-off-by: Shishir Mahajan <shishir.mahajan@redhat.com>
This fix tries to fix logrus formatting by removing `f` from
`logrus.[Error|Warn|Debug|Fatal|Panic|Info]f` when formatting string
is not present.
This fix fixes #23459.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
device Base should not exist on failure:
--- FAIL: TestDevmapperCreateBase (0.06s)
graphtest_unix.go:122: stat
/tmp/docker-graphtest-079240530/devicemapper/mnt/Base/rootfs/a subdir:
no such file or directory
--- FAIL: TestDevmapperCreateSnap (0.00s)
graphtest_unix.go:219: devmapper: device Base already
exists.
it should be:
--- FAIL: TestDevmapperCreateBase (0.25s)
graphtest_unix.go:122: stat
/tmp/docker-graphtest-828994195/devicemapper/mnt/Base/rootfs/a subdir:
no such file or directory
--- FAIL: TestDevmapperCreateSnap (0.13s)
graphtest_unix.go:122: stat
/tmp/docker-graphtest-828994195/devicemapper/mnt/Snap/rootfs/a subdir:
no such file or directory
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Right now there is no way to know what minimum free space threshold the
daemon is applying. It would be good to export it through docker info so
that the user knows the current value. This could also be useful to
higher level management tools, which can look at this value and set up their
own internal thresholds for image garbage collection etc.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>