Commit graph

194 commits

Author SHA1 Message Date
Sebastiaan van Stijn
011e1c71ff
Merge pull request #43131 from thaJeztah/move_cpu_realtime_checks
daemon: move check for CPU-realtime daemon options
2022-03-07 19:27:12 +01:00
Sebastiaan van Stijn
5263bea70f
daemon: move check for CPU-realtime daemon options
Perform the validation when the daemon starts instead of performing these
validations for each individual container, so that we can fail early.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-03-03 19:50:27 +01:00
Sebastiaan van Stijn
821b4d4108
daemon/config: DefaultShmSize: minor tweak and improve docs
I had to check what the actual size was, so added it to the const's documentation.

While at it, also made use of it in a test, so that we're testing against the expected
value, and changed one alias to be consistent with other places where we alias this
import.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2022-02-18 23:29:36 +01:00
Sebastiaan van Stijn
9d9b8e0cf3
daemon.WithDevices(): use containerd's HostDevices()
Trying to reduce the use of libcontainer/devices, as it's considered
to be an "internal" package by runc.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-12-01 15:42:18 +01:00
Sebastiaan van Stijn
a826ca3aef
daemon.WithCommonOptions() fix detection of user-namespaces
Commit dae652e2e5 added support for non-privileged
containers to use ICMP_PROTO (used for `ping`). This option cannot be set for
containers that have user-namespaces enabled.

However, the detection looks to be incorrect; HostConfig.UsernsMode was added
in 6993e891d1 / ee2183881b,
and the property only has meaning if the daemon is running with user namespaces
enabled. In other situations, the property has no meaning.
As a result of the above, the sysctl would only be set for containers running
with UsernsMode=host on a daemon running with user-namespaces enabled.

This patch adds a check if the daemon has user-namespaces enabled (RemappedRoot
having a non-empty value), or if the daemon is running inside a user namespace
(e.g. rootless mode) to fix the detection.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-08-30 19:48:29 +02:00
Eng Zer Jun
c55a4ac779
refactor: move from io/ioutil to io and os package
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.

Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
2021-08-27 14:56:57 +08:00
Brian Goff
9674540ccf
Merge pull request #42520 from thaJeztah/remove_lcow_step5_alternative
Remove LCOW (step 5): volumes/mounts: remove LCOW code (alternative)
2021-07-26 10:24:52 -07:00
Sebastiaan van Stijn
9b795c3e50
pkg/sysinfo.New(), daemon.RawSysInfo(): remove "quiet" argument
The "quiet" argument was only used in a single place (at daemon startup), and
every other use had to pass "false" to prevent this function from logging
warnings.

Now that SysInfo contains the warnings that occurred when collecting the
system information, we can make leave it up to the caller to use those
warnings (and log them if wanted).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-14 23:10:07 +02:00
Sebastiaan van Stijn
300c11c7c9
volume/mounts: remove "containerOS" argument from NewParser (LCOW code)
This changes mounts.NewParser() to create a parser for the current operatingsystem,
instead of one specific to a (possibly non-matching, in case of LCOW) OS.

With the OS-specific handling being removed, the "OS" parameter is also removed
from `daemon.verifyContainerSettings()`, and various other container-related
functions.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-07-02 13:51:55 +02:00
Sebastiaan van Stijn
472f21b923
replace uses of deprecated containerd/sys.RunningInUserNS()
This utility was moved to a separate package, which has no
dependencies.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-18 11:01:24 +02:00
Sebastiaan van Stijn
aa4dce742f
daemon: improve handling of ROOTLESSKIT_PARENT_EUID
- daemon.WithRootless():  make sure ROOTLESSKIT_PARENT_EUID is valid int
- daemon.RawSysInfo(): minor simplification, and rename variable that
  clashed with imported package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-06-05 21:12:32 +02:00
Sebastiaan van Stijn
2834f842ee
Use containerd's apparmor package to detect if apparmor can be used
The runc/libcontainer apparmor package on master no longer checks if apparmor_parser
is enabled, or if we are running docker-in-docker.

While those checks are not relevant to runc (as it doesn't load the profile), these
checks _are_ relevant to us (and containerd). So switching to use the containerd
apparmor package, which does include the needed checks.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-04-08 20:22:08 +02:00
Akihiro Suda
248f98ef5e
rootless: bind mount: fix "operation not permitted"
The following was failing previously, because `getUnprivilegedMountFlags()` was not called:
```console
$ sudo mount -t tmpfs -o noexec none /tmp/foo
$ $ docker --context=rootless run -it --rm -v /tmp/foo:/mnt:ro alpine
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:520: container init caused: rootfs_linux.go:60: mounting "/tmp/foo" to rootfs at "/home/suda/.local/share/docker/overlay2/b8e7ea02f6ef51247f7f10c7fb26edbfb308d2af8a2c77915260408ed3b0a8ec/merged/mnt" caused: operation not permitted: unknown.
```

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2021-04-01 14:58:11 +09:00
Sebastiaan van Stijn
6458f750e1
use containerd/cgroups to detect cgroups v2
libcontainer does not guarantee a stable API, and is not intended
for external consumers.

this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-09 15:00:32 +01:00
Sebastiaan van Stijn
65a33d02f6
Simplify getUser() to use libcontainer built-in functionality
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-09-09 13:25:59 +02:00
Aleksa Sarai
3108ae6226
oci: correctly use user.GetExecUser interface
A nil interface in Go is not the same as a nil pointer that satisfies
the interface. libcontainer/user has special handling for missing
/etc/{passwd,group} files but this is all based on nil interface checks,
which were broken by Docker's usage of the API.

When combined with some recent changes in runc that made read errors
actually be returned to the caller, this results in spurrious -EINVAL
errors when we should detect the situation as "there is no passwd file".

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-07-29 14:04:47 +02:00
Brian Goff
24f173a003 Replace service "Capabilities" w/ add/drop API
After dicussing with maintainers, it was decided putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.

This brings in the changes already merged into swarmkit.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-07-27 10:09:42 -07:00
Kir Kolyshkin
e3cff19dd1 Untangle CPU RT controller init
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.

This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).

This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.

Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.

This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Sebastiaan van Stijn
4534a7afc3
daemon: use containerd/sys to detect UserNamespaces
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.

In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-15 13:06:08 +02:00
Justin Cormack
dae652e2e5
Add default sysctls to allow ping sockets and privileged ports with no capabilities
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2020-06-04 18:11:08 +01:00
Akihiro Suda
33ee7941d4 support --privileged --cgroupns=private on cgroup v1
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-21 23:11:32 +09:00
Sebastiaan van Stijn
5d040cbd16
daemon: fix capitalization of some functions
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:19 +02:00
Sebastiaan van Stijn
eeef12f469
daemon: address some minor linting issues and nits
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:17 +02:00
Kir Kolyshkin
39048cf656 Really switch to moby/sys/mount*
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.

This commit was generated by the following bash script:

```
set -e -u -o pipefail

for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
	sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
		-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
		$file
	goimports -w $file
done
```

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 09:46:25 -07:00
Akihiro Suda
ca4b51868a rootless: support --exec-opt native.cgroupdriver=systemd
Support cgroup as in Rootless Podman.

Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-02-14 15:32:31 +09:00
Sebastiaan van Stijn
e6c1820ef5
Merge pull request #40174 from AkihiroSuda/cgroup2
support cgroup2
2020-01-09 20:09:11 +01:00
Akhil Mohan
86ebbe16de
remove host directory check
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2020-01-02 14:28:51 +05:30
Akihiro Suda
19baeaca26 cgroup2: enable cgroup namespace by default
For cgroup v1, we were unable to change the default because of
compatibility issue.

For cgroup v2, we should change the default right now because switching
to cgroup v2 is already breaking change.

See also containers/libpod#4363 containers/libpod#4374

Privileged containers also use cgroupns=private by default.
https://github.com/containers/libpod/pull/4374#issuecomment-549776387

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 02:58:40 +09:00
Akhil Mohan
35b9e6989f
Make --device flag work in privileged mode
When a container is started in privileged mode, the device mappings
provided by `--device` flag was ignored. Now the device mappings will
be considered even in privileged mode.

Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2019-12-06 18:43:56 +05:30
wenlxie
03b3ec1dd5
make --device works at privileged mode
Signed-off-by: wenlxie <wenlxie@ebay.com>
2019-12-06 18:17:03 +05:30
Olli Janatuinen
1308a3a99f Move DefaultCapabilities() to caps package
Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
2019-11-14 21:13:16 +02:00
Justin Cormack
dde030a6b1
Merge pull request #40083 from thaJeztah/daemon_consts
daemon: use constants for AppArmor and Seccomp
2019-10-17 11:12:37 -07:00
Grant Millar
df7b8f458a daemon: Use short libnetwork ID in exec-root & update libnetwork
Signed-off-by: Grant Millar <rid@cylo.io>
2019-10-15 11:40:24 +01:00
Sebastiaan van Stijn
a33cf495f2
daemon: use constants for AppArmor profiles
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-10-13 19:16:12 +02:00
Sebastiaan van Stijn
07ff4f1de8
goimports: fix imports
Format the source according to latest goimports.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-18 12:56:54 +02:00
Rob Gulewich
072400fc4b Make cgroup namespaces configurable
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
  container creation time to control what cgroup namespace the container
  is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
  what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
  compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Rob Gulewich
256eb04d69 Start containers in their own cgroup namespaces
This is enabled for all containers that are not run with --privileged,
if the kernel supports it.

Fixes #38332

Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-05-07 10:22:16 -07:00
Michael Crosby
c478553640 Export all spec generation opts
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-04-10 15:38:36 -04:00
Michael Crosby
cb902f4430 Refactor few spec generation ops
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-04-09 16:51:40 -04:00
John Howard
a3eda72f71
Merge pull request #38541 from Microsoft/jjh/containerd
Windows: Experimental: ContainerD runtime
2019-03-19 21:09:19 -07:00
Tibor Vass
8f936ae8cf Add DeviceRequests to HostConfig to support NVIDIA GPUs
This patch hard-codes support for NVIDIA GPUs.
In a future patch it should move out into its own Device Plugin.

Signed-off-by: Tibor Vass <tibor@docker.com>
2019-03-18 17:19:45 +00:00
John Howard
d4ceb61f2b LCOW:Reworking spec builder
Signed-off-by: John Howard <jhoward@microsoft.com>
2019-03-12 18:41:55 -07:00
Sebastiaan van Stijn
dd94555787
Merge pull request #32519 from darkowlzz/32443-docker-update-pids-limit
Add pids-limit support in docker update
2019-02-23 15:20:59 +01:00
Sunny Gogoi
74eb258ffb Add pids-limit support in docker update
- Adds updating PidsLimit in UpdateContainer().
- Adds setting PidsLimit in toContainerResources().

Signed-off-by: Sunny Gogoi <indiasuny000@gmail.com>
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2019-02-21 14:17:38 -08:00
Akihiro Suda
ec87479b7e allow running dockerd in an unprivileged user namespace (rootless mode)
Please refer to `docs/rootless.md`.

TLDR:
 * Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
 * `dockerd-rootless.sh --experimental`
 * `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2019-02-04 00:24:27 +09:00
Olli Janatuinen
80d7bfd54d Capabilities refactor
- Add support for exact list of capabilities, support only OCI model
- Support OCI model on CapAdd and CapDrop but remain backward compatibility
- Create variable locally instead of declaring it at the top
- Use const for magic "ALL" value
- Rename `cap` variable as it overlaps with `cap()` built-in
- Normalize and validate capabilities before use
- Move validation for conflicting options to validateHostConfig()
- TweakCapabilities: simplify logic to calculate capabilities

Signed-off-by: Olli Janatuinen <olli.janatuinen@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-01-22 21:50:41 +02:00
Michael Crosby
b940cc5cff Move caps and device spec utils to oci pkg
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-12-11 10:20:25 -05:00
Aleksa Sarai
7417f50575
oci: include the domainname in "kernel.domainname"
The OCI doesn't have a specific field for an NIS domainname[1] (mainly
because FreeBSD and Solaris appear to have a similar concept but it is
configured entirely differently).

However, on Linux, the NIS domainname can be configured through both the
setdomainname(2) syscall but also through the "kernel.domainname"
sysctl. Since the OCI has a way of injecting sysctls this means we don't
need to have any OCI changes to support NIS domainnames (and we can
always switch if the OCI picks up such support in the future).

It should be noted that because we have to generate this each spec
creation we also have to make sure that it's not clobbered by the
HostConfig. I'm pretty sure making this change generic (so that
HostConfig will not clobber any pre-set sysctls) will not cause other
issues to crop up.

[1]: https://github.com/opencontainers/runtime-spec/issues/592

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-11-30 17:31:38 +11:00
Akihiro Suda
596cdffb9f mount: add BindOptions.NonRecursive (API v1.40)
This allows non-recursive bind-mount, i.e. mount(2) with "bind" rather than "rbind".

Swarm-mode will be supported in a separate PR because of mutual vendoring.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-11-06 17:51:58 +09:00
Sebastiaan van Stijn
deac65c929
Merge pull request #37850 from AkihiroSuda/propagate-exec-root-to-libnetwork
daemon: propagate exec-root to libnetwork-setkey
2018-09-28 15:20:37 +02:00
Brian Goff
12d5eb8e22
Merge pull request #37703 from kolyshkin/rm-dead-code
daemon/setMounts(): remove dead code
2018-09-25 16:07:15 -07:00
Akihiro Suda
40385208cb daemon: propagate exec-root to libnetwork-setkey
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-15 13:49:30 +09:00
Kir Kolyshkin
ac8c3debdb daemon/setMounts(): remove dead code
Since PR 11353 (commit 7804cd36ee "Filter out default mounts that
are override by user") there can be no duplicated mounts in the list,
so the check is redundant.

This should speed up container start by a nanosecond or two.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-08-27 15:40:10 -07:00
Kir Kolyshkin
bcacbf523b Fix docker --init with /dev bind mount
In case a user wants to have a child reaper inside a container
(i.e. run "docker --init") AND a bind-mounted /dev, the following
error occurs:

> docker run -d -v /dev:/dev --init busybox top
> 088c96808c683077f04c4cc2711fddefe1f5970afc085d59e0baae779745a7cf
> docker: Error response from daemon: OCI runtime create failed: container_linux.go:296: starting container process caused "exec: "/dev/init": stat /dev/init: no such file or directory": unknown.

This happens because if a user-suppled /dev is provided, all the
built-in /dev/xxx mounts are filtered out.

To solve, let's move in-container init to /sbin, as the chance that
/sbin will be bind-mounted to a container is smaller than that for /dev.
While at it, let's give it more unique name (docker-init).

NOTE it still won't work for the case of bind-mounted /sbin.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-08-27 15:38:46 -07:00
Salahuddin Khan
763d839261 Add ADD/COPY --chown flag support to Windows
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.

NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.

The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>

On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.

Signed-off-by: Salahuddin Khan <salah@docker.com>
2018-08-13 21:59:11 -07:00
Kazuhiro Sera
1e49fdcafc Fix the several typos detected by github.com/client9/misspell
Signed-off-by: Kazuhiro Sera <seratch@gmail.com>
2018-08-09 00:45:00 +09:00
John Starks
e9268d9642 lcow: Allow the client to add device cgroup rules
Signed-off-by: John Starks <jostarks@microsoft.com>
2018-06-15 16:14:17 -07:00
John Starks
349aeeab7c lcow: Allow the client to add or remove capabilities
Signed-off-by: John Starks <jostarks@microsoft.com>
2018-06-15 16:03:33 -07:00
Jess Frazelle
3694c1e34e
api: add configurable MaskedPaths and ReadOnlyPaths to the API
This adds MaskedPaths and ReadOnlyPaths options to HostConfig for containers so
that a user can override the default values.

When the value sent through the API is nil the default is used.
Otherwise the default is overridden.

Adds integration tests for MaskedPaths and ReadonlyPaths.

Signed-off-by: Jess Frazelle <acidburn@microsoft.com>
2018-06-05 12:33:14 -04:00
Sebastiaan van Stijn
f23c00d870
Various code-cleanup
remove unnescessary import aliases, brackets, and so on.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2018-05-23 17:50:54 +02:00
Sebastiaan van Stijn
31aca4bef4
Merge pull request #36991 from kolyshkin/slice-in-place
daemon.setMounts(): copy slice in place
2018-05-14 13:49:47 +02:00
Kir Kolyshkin
d8fd6137a1 daemon.getSourceMount(): fix for / mount point
A recent optimization in getSourceMount() made it return an error
in case when the found mount point is "/". This prevented bind-mounted
volumes from working in such cases.

A (rather trivial but adeqate) unit test case is added.

Fixes: 871c957242 ("getSourceMount(): simplify")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-05-10 12:53:37 -07:00
Kir Kolyshkin
d4c94e83ca daemon.setMounts(): copy slice in place
It does not make sense to copy a slice element by element, then discard
the source one. Let's do copy in place instead which is way more
efficient.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-05-03 10:26:06 -07:00
Vincent Demeester
53982e3fc1
Merge pull request #36091 from kolyshkin/mount
pkg/mount improvements
2018-04-21 11:03:54 +02:00
Kir Kolyshkin
871c957242 getSourceMount(): simplify
The flow of getSourceMount was:
 1 get all entries from /proc/self/mountinfo
 2 do a linear search for the `source` directory
 3 if found, return its data
 4 get the parent directory of `source`, goto 2

The repeated linear search through the whole mountinfo (which can have
thousands of records) is inefficient. Instead, let's just

 1 collect all the relevant records (only those mount points
   that can be a parent of `source`)
 2 find the record with the longest mountpath, return its data

This was tested manually with something like

```go
func TestGetSourceMount(t *testing.T) {
	mnt, flags, err := getSourceMount("/sys/devices/msr/")
	assert.NoError(t, err)
	t.Logf("mnt: %v, flags: %v", mnt, flags)
}
```

...but it relies on having a specific mount points on the system
being used for testing.

[v2: add unit tests for ParentsFilter]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-04-19 14:49:17 -07:00
Kir Kolyshkin
bb934c6aca pkg/mount: implement/use filter for mountinfo parsing
Functions `GetMounts()` and `parseMountTable()` return all the entries
as read and parsed from /proc/self/mountinfo. In many cases the caller
is only interested only one or a few entries, not all of them.

One good example is `Mounted()` function, which looks for a specific
entry only. Another example is `RecursiveUnmount()` which is only
interested in mount under a specific path.

This commit adds `filter` argument to `GetMounts()` to implement
two things:
 1. filter out entries a caller is not interested in
 2. stop processing if a caller is found what it wanted

`nil` can be passed to get a backward-compatible behavior, i.e. return
all the entries.

A few filters are implemented:
 - `PrefixFilter`: filters out all entries not under `prefix`
 - `SingleEntryFilter`: looks for a specific entry

Finally, `Mounted()` is modified to use `SingleEntryFilter()`, and
`RecursiveUnmount()` is using `PrefixFilter()`.

Unit tests are added to check filters are working.

[v2: ditch NoFilter, use nil]
[v3: ditch GetMountsFiltered()]
[v4: add unit test for filters]
[v5: switch to gotestyourself]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-04-19 14:48:09 -07:00
Brian Goff
6a70fd222b Move mount parsing to separate package.
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-04-19 06:35:54 -04:00
Justin Cormack
a729853bc7
Always make sysfs read-write with privileged
It does not make any sense to vary this based on whether the
rootfs is read only. We removed all the other mount dependencies
on read-only eg see #35344.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-04-06 16:17:18 +01:00
Justin Cormack
15ff09395c
If container will run as non root user, drop permitted, effective caps early
As soon as the initial executable in the container is executed as a non root user,
permitted and effective capabilities are dropped. Drop them earlier than this, so
that they are dropped before executing the file. The main effect of this is that
if `CAP_DAC_OVERRIDE` is set (the default) the user will not be able to execute
files they do not have permission to execute, which previously they could.

The old behaviour was somewhat surprising and the new one is definitely correct,
but it is not in any meaningful way exploitable, and I do not think it is
necessary to backport this fix. It is unlikely to have any negative effects as
almost all executables have world execute permission anyway.

Use the bounding set not the effective set as the canonical set of capabilities, as
effective will now vary.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2018-03-19 14:45:27 -07:00
Kir Kolyshkin
d6ea46ceda container.BaseFS: check for nil before deref
Commit 7a7357dae1 ("LCOW: Implemented support for docker cp + build")
changed `container.BaseFS` from being a string (that could be empty but
can't lead to nil pointer dereference) to containerfs.ContainerFS,
which could be be `nil` and so nil dereference is at least theoretically
possible, which leads to panic (i.e. engine crashes).

Such a panic can be avoided by carefully analysing the source code in all
the places that dereference a variable, to make the variable can't be nil.
Practically, this analisys are impossible as code is constantly
evolving.

Still, we need to avoid panics and crashes. A good way to do so is to
explicitly check that a variable is non-nil, returning an error
otherwise. Even in case such a check looks absolutely redundant,
further changes to the code might make it useful, and having an
extra check is not a big price to pay to avoid a panic.

This commit adds such checks for all the places where it is not obvious
that container.BaseFS is not nil (which in this case means we do not
call daemon.Mount() a few lines earlier).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-03-13 21:24:48 -07:00
Kir Kolyshkin
cad74056c0 daemon/setMounts(): do not make /dev/shm ro
It has been pointed out that if --read-only flag is given, /dev/shm
also becomes read-only in case of --ipc private.

This happens because in this case the mount comes from OCI spec
(since commit 7120976d74), and is a regression caused by that
commit.

The meaning of --read-only flag is to only have a "main" container
filesystem read-only, not the auxiliary stuff (that includes /dev/shm,
other mounts and volumes, --tmpfs, /proc, /dev and so on).

So, let's make sure /dev/shm that comes from OCI spec is not made
read-only.

Fixes: 7120976d74 ("Implement none, private, and shareable ipc modes")

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-03-08 14:04:03 -08:00
Sebastiaan van Stijn
0076343b29
Merge pull request #33702 from aaronlehmann/templated-secrets-and-configs
Templated secrets and configs
2018-02-21 13:39:10 +01:00
Brian Goff
c02171802b Merge configs/secrets in unix implementation
On unix, merge secrets/configs handling. This is important because
configs can contain secrets (via templating) and potentially a config
could just simply have secret information "by accident" from the user.
This just make sure that configs are as secure as secrets and de-dups a
lot of code.
Generally this makes everything simpler and configs more secure.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-02-16 11:25:14 -05:00
Brian Goff
8e8f5f4457 Always mount configs with tmpfs
This makes configs and secrets behavior identical.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-02-16 11:25:14 -05:00
Aaron Lehmann
cd3d0486a6 Store configs that contain secrets on tmpfs
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
2018-02-16 11:25:14 -05:00
Brian Goff
487c6c7e73 Ensure daemon root is unmounted on shutdown
This is only for the case when dockerd has had to re-mount the daemon
root as shared.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-02-15 15:58:20 -05:00
Sebastiaan van Stijn
ea34f82711
Merge pull request #36055 from cpuguy83/slave_mounts_for_root
Use rslave propagation for mounts from daemon root
2018-02-15 12:57:25 +01:00
Brian Goff
589a0afa8c Use rslave propagation for mounts from daemon root
By default, if a user requests a bind mount it uses private propagation.
When the source path is a path within the daemon root this, along with
some other propagation values that the user can use, causes issues when
the daemon tries to remove a mountpoint because a container will then
have a private reference to that mount which prevents removal.

Unmouting with MNT_DETATCH can help this scenario on newer kernels, but
ultimately this is just covering up the problem and doesn't actually
free up the underlying resources until all references are destroyed.

This change does essentially 2 things:

1. Change the default propagation when unspecified to `rslave` when the
source path is within the daemon root path or a parent of the daemon
root (because everything is using rbinds).
2. Creates a validation error on create when the user tries to specify
an unacceptable propagation mode for these paths...
basically the only two acceptable modes are `rslave` and `rshared`.

In cases where we have used the new default propagation but the
underlying filesystem is not setup to handle it (fs must hvae at least
rshared propagation) instead of erroring out like we normally would,
this falls back to the old default mode of `private`, which preserves
backwards compatibility.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-02-07 14:27:09 -05:00
Daniel Nephin
4f0d95fa6e Add canonical import comment
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2018-02-05 16:51:57 -05:00
Michael Crosby
59ec65cd8c Use proc/exe for reexec
You don't need to resolve the symlink for the exec as long as the
process is to keep running during execution.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-26 14:13:43 -05:00
Brian Goff
eaa5192856 Make container resource mounts unbindable
It's a common scenario for admins and/or monitoring applications to
mount in the daemon root dir into a container. When doing so all mounts
get coppied into the container, often with private references.
This can prevent removal of a container due to the various mounts that
must be configured before a container is started (for example, for
shared /dev/shm, or secrets) being leaked into another namespace,
usually with private references.

This is particularly problematic on older kernels (e.g. RHEL < 7.4)
where a mount may be active in another namespace and attempting to
remove a mountpoint which is active in another namespace fails.

This change moves all container resource mounts into a common directory
so that the directory can be made unbindable.
What this does is prevents sub-mounts of this new directory from leaking
into other namespaces when mounted with `rbind`... which is how all
binds are handled for containers.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2018-01-16 15:09:05 -05:00
Vincent Demeester
f70c715be0
Merge pull request #35316 from kolyshkin/facepalm
Fix honoring tmpfs-size for user /dev/shm mount
2017-11-14 11:13:59 +01:00
Kir Kolyshkin
31d30a985d Fix user mount /dev/shm size
Commit 7120976d74 ("Implement none, private, and shareable ipc
modes") introduces a bug: if a user-specified mount for /dev/shm
is provided, its size is overriden by value of ShmSize.

A reproducer is simple:

 docker run --rm
	--mount type=tmpfs,dst=/dev/shm,tmpfs-size=100K \
	alpine df /dev/shm

This commit is an attempt to fix the bug, as well as optimize things
a but and make the code easier to read.

https://github.com/moby/moby/issues/35271

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-11-12 21:42:59 -08:00
Chao Wang
5c154cfac8 Copy Inslice() to those parts that use it
Signed-off-by: Chao Wang <wangchao.fnst@cn.fujitsu.com>
2017-11-10 13:42:38 +08:00
Justin Cormack
f5c70c5b75
Merge pull request #35365 from Microsoft/jjh/removeduplicateoomscoreadj
Remove duplicate redundant setting of OOMScoreAdj in OCI spec
2017-11-03 13:59:51 +00:00
Daniel J Walsh
5f3bd2473e /dev should not be readonly with --readonly flag
/dev is mounted on a tmpfs inside of a container.  Processes inside of containers
some times need to create devices nodes, or to setup a socket that listens on /dev/log
Allowing these containers to run with the --readonly flag makes sense.  Making a tmpfs
readonly does not add any security to the container, since there is plenty of places
where the container can write tmpfs content.

I have no idea why /dev was excluded.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2017-11-02 10:28:51 -04:00
John Howard
f0b44881b5 Remove dupl setting of OOMScoreAdj in OCI spec
Signed-off-by: John Howard <jhoward@microsoft.com>
2017-11-01 11:01:43 -07:00
Kenfe-Mickael Laventure
ddae20c032
Update libcontainerd to use containerd 1.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-10-20 07:11:37 -07:00
Aleksa Sarai
c0f883fdee
daemon: oci: obey CL_UNPRIVILEGED for user namespaced daemon
When runc is bind-mounting a particular path "with options", it has to
do so by first creating a bind-mount and the modifying the options of
said bind-mount via remount. However, in a user namespace, there are
restrictions on which flags you can change with a remount (due to
CL_UNPRIVILEGED being set in this instance). Docker historically has
ignored this, and as a result, internal Docker mounts (such as secrets)
haven't worked with --userns-remap. Fix this by preserving
CL_UNPRIVILEGED mount flags when Docker is spawning containers with user
namespaces enabled.

Ref: https://github.com/opencontainers/runc/pull/1603
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-10-16 02:52:56 +11:00
Victor Vieux
a5f9783c93 Merge pull request #34252 from Microsoft/akagup/lcow-remotefs-sandbox
LCOW: Support for docker cp, ADD/COPY on build
2017-09-15 16:49:48 -07:00
Simon Ferquel
e89b6e8c2d Volume refactoring for LCOW
Signed-off-by: Simon Ferquel <simon.ferquel@docker.com>
2017-09-14 12:33:31 -07:00
Akash Gupta
7a7357dae1 LCOW: Implemented support for docker cp + build
This enables docker cp and ADD/COPY docker build support for LCOW.
Originally, the graphdriver.Get() interface returned a local path
to the container root filesystem. This does not work for LCOW, so
the Get() method now returns an interface that LCOW implements to
support copying to and from the container.

Signed-off-by: Akash Gupta <akagup@microsoft.com>
2017-09-14 12:07:52 -07:00
Daniel Nephin
f7f101d57e Add gosimple linter
Update gometalinter

Signed-off-by: Daniel Nephin <dnephin@docker.com>
2017-09-12 12:09:59 -04:00
Yong Tang
cb952bf006 Merge pull request #34625 from dnephin/more-linters
Add interfacer and unconvert linters
2017-09-01 08:46:08 -07:00
Daniel Nephin
2f5f0af3fd Add unconvert linter
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2017-08-24 15:08:31 -04:00
Kenfe-Mickael Laventure
45d85c9913
Update containerd to 06b9cb35161009dcb7123345749fef02f7cea8e0
This also update:
 - runc to 3f2f8b84a77f73d38244dd690525642a72156c64
 - runtime-specs to v1.0.0

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-21 12:04:07 -07:00
Daniel Nephin
9b47b7b151 Fix golint errors.
Signed-off-by: Daniel Nephin <dnephin@docker.com>
2017-08-18 14:23:44 -04:00
Brian Goff
ebcb7d6b40 Remove string checking in API error handling
Use strongly typed errors to set HTTP status codes.
Error interfaces are defined in the api/errors package and errors
returned from controllers are checked against these interfaces.

Errors can be wraeped in a pkg/errors.Causer, as long as somewhere in the
line of causes one of the interfaces is implemented. The special error
interfaces take precedence over Causer, meaning if both Causer and one
of the new error interfaces are implemented, the Causer is not
traversed.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-08-15 16:01:11 -04:00
Kir Kolyshkin
7120976d74 Implement none, private, and shareable ipc modes
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.

Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).

This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:

 - 'shareable':	enables sharing this container's IPC with others
		(this used to be the implicit default);

 - 'private':	disables sharing this container's IPC.

In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.

While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:

> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...

...so here's yet yet another mode:

 - 'none':	no /dev/shm mount inside the container (though it still
		has its own private IPC namespace).

Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.

Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).

Some other changes this patch introduces are:

1. A mount for /dev/shm is added to default OCI Linux spec.

2. IpcMode.Valid() is simplified to remove duplicated code that parsed
   'container:ID' form. Note the old version used to check that ID does
   not contain a semicolon -- this is no longer the case (tests are
   modified accordingly). The motivation is we should either do a
   proper check for container ID validity, or don't check it at all
   (since it is checked in other places anyway). I chose the latter.

3. IpcMode.Container() is modified to not return container ID if the
   mode value does not start with "container:", unifying the check to
   be the same as in IpcMode.IsContainer().

3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
   to add checks for newly added values.

[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
     container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2017-08-14 10:50:39 +03:00
Derek McGowan
1009e6a40b
Update logrus to v1.0.1
Fixes case sensitivity issue

Signed-off-by: Derek McGowan <derek@mcgstyle.net>
2017-07-31 13:16:46 -07:00