The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
(cherry picked from commit c55a4ac779)
Signed-off-by: Cory Snider <csnider@mirantis.com>
Movified from 686be57d0a, and re-ran
gofmt again to address for files not present in 20.10 and vice-versa.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 686be57d0a)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Do not use 0701 perms.
0701 dir perms allows anyone to traverse the docker dir.
It happens to allow any user to execute, as an example, suid binaries
from image rootfs dirs because it allows traversal AND critically
container users need to be able to do execute things.
0701 on lower directories also happens to allow any user to modify
things in, for instance, the overlay upper dir which neccessarily
has 0755 permissions.
This changes to use 0710 which allows users in the group to traverse.
In userns mode the UID owner is (real) root and the GID is the remapped
root's GID.
This prevents anyone but the remapped root to traverse our directories
(which is required for userns with runc).
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit ef7237442147441a7cadcda0600be1186d81ac73)
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
overlay2 no longer sets `archive.OverlayWhiteoutFormat` when
running in UserNS, so we can remove the complicated logic in the
archive package.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 6322dfc217)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When running in userns, returns error (i.e. "use naive, not native")
immediately.
No substantial change to the logic.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 67aa418df2)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Previously, `d.naiveDiff.ApplyDiff` was not used even when
`useNaiveDiff()==true`
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit dd97134232)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The "userxattr" option is needed for mounting overlayfs inside a user namespace with kernel >= 5.11.
The "userxattr" option is NOT needed for the initial user namespace (aka "the host").
Also, Ubuntu (since circa 2015) and Debian (since 10) with kernel < 5.11 can mount the overlayfs in a user namespace without the "userxattr" option.
The corresponding kernel commit: 2d2f2d7322ff43e0fe92bf8cccdc0b09449bf2e1
> **ovl: user xattr**
>
> Optionally allow using "user.overlay." namespace instead of "trusted.overlay."
> ...
> Disable redirect_dir and metacopy options, because these would allow privilege escalation through direct manipulation of the
> "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
Fix issue 42055
Related to containerd/containerd PR 5076
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
(cherry picked from commit 11ef8d3ba9)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Various dirs in /var/lib/docker contain data that needs to be mounted
into a container. For this reason, these dirs are set to be owned by the
remapped root user, otherwise there can be permissions issues.
However, this uneccessarily exposes these dirs to an unprivileged user
on the host.
Instead, set the ownership of these dirs to the real root (or rather the
UID/GID of dockerd) with 0701 permissions, which allows the remapped
root to enter the directories but not read/write to them.
The remapped root needs to enter these dirs so the container's rootfs
can be configured... e.g. to mount /etc/resolve.conf.
This prevents an unprivileged user from having read/write access to
these dirs on the host.
The flip side of this is now any user can enter these directories.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This is a fix for https://github.com/docker/for-linux/issues/1012.
The code was not considering that C strings are NULL-terminated so
we need to leave one extra byte.
Without this fix, the testcase in https://github.com/docker/for-linux/issues/1012
fails with
```
Step 61/1001 : RUN echo 60 > 60
---> Running in dde85ac3b1e3
Removing intermediate container dde85ac3b1e3
---> 80a12a18a241
Step 62/1001 : RUN echo 61 > 61
error creating overlay mount to /23456789112345678921234/overlay2/d368abcc97d6c6ebcf23fa71225e2011d095295d5d8c9b31d6810bea748bdf07-init/merged: no such file or directory
```
with the output of `dmesg -T` as:
```
[Sat Dec 19 02:35:40 2020] overlayfs: failed to resolve '/23456789112345678921234/overlay2/89e435a1b24583c463abb73e8abfad8bf8a88312ef8253455390c5fa0a765517-init/wor': -2
```
with this fix, you get the expected:
```
Step 126/1001 : RUN echo 125 > 125
---> Running in 2f2e56da89e0
max depth exceeded
```
Signed-off-by: Oscar Bonilla <6f6231@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
`fuse-overlayfs` provides rootless overlayfs functionality without depending
on any kernel patch.
Aside from rootless, `fuse-overlayfs` could be potentially used for eliminating
`chown()` calls that happen in userns-remap mode, because `fuse-overlayfs` also
provides shiftfs functionality.
System requirements:
* fuse-overlayfs needs to be installed. Tested with 0.7.6.
* kernel >= 4.18
Unit test: `go test -exec sudo -v ./daemon/graphdriver/fuse-overlayfs`
The implementation is based on Podman's `overlay` driver which supports
both kernel-mode overlayfs and fuse-overlayfs in the single driver instance:
https://github.com/containers/storage/blob/39a8d5ed/drivers/overlay/overlay.go
However, Moby's implementation aims to decouple `fuse-overlayfs` driver from the
kernel-mode driver (`overlay2`) for simplicity.
Fix#40218
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Now that we do check if overlay is working by performing an actual
overlayfs mount, there's no need in extra checks for the kernel version
or the filesystem type. Actual mount check is sufficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Before this commit, overlay check was performed by looking for
`overlay` in /proc/filesystem. This obviously might not work
for rootless Docker (fs is there, but one can't use it as non-root).
This commit changes the check to perform the actual mount, by reusing
the code previously written to check for multiple lower dirs support.
The old check is removed from both drivers, as well as the additional
check for the multiple lower dirs support in overlay2 since it's now
a part of the main check.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This moves supportsMultipleLowerDir() to overlayutils
so it can be used from both overlay and overlay2.
The only changes made were:
* replace logger with logrus
* don't use workDirName mergedDirName constants
* add mnt var to improve readability a bit
This is a preparation for the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When mounting overlays which have children, enforce that
the mount is always performed as read only. Newer versions
of the kernel return a device busy error when a lower directory
is in use as an upper directory in another overlay mount.
Adds committed file to indicate when an overlay is being used
as a parent, ensuring it will no longer be mounted with an
upper directory.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
As pointed out in https://github.com/moby/moby/issues/37970,
Docker overlay driver can't work with index=on feature of
the Linux kernel "overlay" filesystem. In case the global
default is set to "yes", Docker will fail with EBUSY when
trying to mount, like this:
> error creating overlay mount to ...../merged: device or resource busy
and the kernel log should contain something like:
> overlayfs: upperdir is in-use by another mount, mount with
> '-o index=off' to override exclusive upperdir protection.
A workaround is to set index=off in overlay kernel module
parameters, or even recompile the kernel with
CONFIG_OVERLAY_FS_INDEX=n in .config. Surely this is not
always practical or even possible.
The solution, as pointed out my Amir Goldstein (as well as
the above kernel message:) is to use 'index=off' option
when mounting.
NOTE since older (< 4.13rc1) kernels do not support "index="
overlayfs parameter, try to figure out whether the option
is supported. In case it's not possible to figure out,
assume it is not.
NOTE the default can be changed anytime (by writing to
/sys/module/overlay/parameters/index) so we need to always
use index=off.
[v2: move the detection code to Init()]
[v3: don't set index=off if stat() failed]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.
NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.
The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>
On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.
Signed-off-by: Salahuddin Khan <salah@docker.com>
Now all of the storage drivers use the field "storage-driver" in their log
messages, which is set to name of the respective driver.
Storage drivers changed:
- Aufs
- Btrfs
- Devicemapper
- Overlay
- Overlay 2
- Zfs
Signed-off-by: Alejandro GonzÃlez Hevia <alejandrgh11@gmail.com>
The idea behind making the graphdrivers private is to prevent leaking
mounts into other namespaces.
Unfortunately this is not really what happens.
There is one case where this does work, and that is when the namespace
was created before the daemon's namespace.
However with systemd each system servie winds up with it's own mount
namespace. This causes a race betwen daemon startup and other system
services as to if the mount is actually private.
This also means there is a negative impact when other system services
are started while the daemon is running.
Basically there are too many things that the daemon does not have
control over (nor should it) to be able to protect against these kinds
of leakages. One thing is certain, setting the graphdriver roots to
private disconnects the mount ns heirarchy preventing propagation of
unmounts... new mounts are of course not propagated either, but the
behavior is racey (or just bad in the case of restarting services)... so
it's better to just be able to keep mount propagation in tact.
It also does not protect situations like `-v
/var/lib/docker:/var/lib/docker` where all mounts are recursively bound
into the container anyway.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The overlay2 driver was not setting up the archive.TarOptions field
properly like other storage backend routes to "applyTarLayer"
functionality. The InUserNS field is populated now for overlay2 using
the same query function used by the other storage drivers.
Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
Even though it's highly discouraged, there are existing
installs that are running overlay/overlay2 on filesystems
without d_type support.
This patch allows the daemon to start in such cases, instead of
refusing to start without an option to override.
For fresh installs, backing filesystems without d_type support
will still cause the overlay/overlay2 drivers to be marked as
"unsupported", and skipped during the automatic selection.
This feature is only to keep backward compatibility, but
will be removed at some point.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Support for running overlay/overlay2 on a backing filesystem
without d_type support (most likely: xfs, as ext4 supports
this by default), was deprecated for some time.
Running without d_type support is problematic, and can
lead to difficult to debug issues ("invalid argument" errors,
or unable to remove files from the container's filesystem).
This patch turns the warning that was previously printed
into an "unsupported" error, so that the overlay/overlay2
drivers are not automatically selected when detecting supported
storage drivers.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The fsmagic check was always performed on "data-root" (`/var/lib/docker`),
not on the storage-driver's home directory (e.g. `/var/lib/docker/<somedriver>`).
This caused detection to be done on the wrong filesystem in situations
where `/var/lib/docker/<somedriver>` was a mount, and a different
filesystem than `/var/lib/docker` itself.
This patch checks if the storage-driver's home directory exists, and only
falls back to `/var/lib/docker` if it doesn't exist.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The overlay2 storage-driver requires multiple lower dir
support for overlayFs. Support for this feature was added
in kernel 4.x, but some distros (RHEL 7.4, CentOS 7.4) ship with
an older kernel with this feature backported.
This patch adds feature-detection for multiple lower dirs,
and will perform this feature-detection on pre-4.x kernels
with overlayFS support.
With this patch applied, daemons running on a kernel
with multiple lower dir support will now select "overlay2"
as storage-driver, instead of falling back to "overlay".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different wrt to `EEXIST`/`IsExist`:
- for `Mkdir()`, `IsExist` error should (usually) be ignored
(unless you want to make sure directory was not there before)
as it means "the destination directory was already there"
- for `MkdirAll()`, `IsExist` error should NEVER be ignored.
Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.
Also, there are a couple of cases then IsExist is handled as
"directory already exist" which is wrong. As a result, some code
that never worked as intended is now removed.
NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).
For more details, a quote from my runc commit 6f82d4b (July 2015):
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This removes and recreates the merged dir with each umount/mount
respectively.
This is done to make the impact of leaking mountpoints have less
user-visible impact.
It's fairly easy to accidentally leak mountpoints (even if moby doesn't,
other tools on linux like 'unshare' are quite able to incidentally do
so).
As of recently, overlayfs reacts to these mounts being leaked (see
One trick to force an unmount is to remove the mounted directory and
recreate it. Devicemapper now does this, overlay can follow suit.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>