Commit graph

376 commits

Author SHA1 Message Date
gunadhya
cda6988478
Fix Error in daemon_unix.go and docker_cli_run_unit_test.go
Signed-off-by: gunadhya <6939749+gunadhya@users.noreply.github.com>
(cherry picked from commit 64465f3b5f)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2021-02-17 21:17:28 +01:00
Brian Goff
e908cc3901 Use real root with 0701 perms
Various dirs in /var/lib/docker contain data that needs to be mounted
into a container. For this reason, these dirs are set to be owned by the
remapped root user, otherwise there can be permissions issues.
However, this uneccessarily exposes these dirs to an unprivileged user
on the host.

Instead, set the ownership of these dirs to the real root (or rather the
UID/GID of dockerd) with 0701 permissions, which allows the remapped
root to enter the directories but not read/write to them.
The remapped root needs to enter these dirs so the container's rootfs
can be configured... e.g. to mount /etc/resolve.conf.

This prevents an unprivileged user from having read/write access to
these dirs on the host.
The flip side of this is now any user can enter these directories.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2021-01-26 17:23:32 +00:00
Sebastiaan van Stijn
1c0af18c6c
vendor: opencontainers/selinux v1.8.0, and remove selinux build-tag and stubs
full diff: https://github.com/opencontainers/selinux/compare/v1.7.0...v1.8.0

Remove "selinux" build tag

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-12-24 00:47:16 +01:00
Sebastiaan van Stijn
cf31b9622a
Merge pull request #41622 from bboehmke/ipv6_nat
IPv6 iptables config option
2020-12-07 11:59:42 +01:00
Benjamin Böhmke
cd63cc846e mark ip6tables as experimental feature
Signed-off-by: Benjamin Böhmke <benjamin@boehmke.net>
2020-12-02 22:23:33 +01:00
Sebastiaan van Stijn
6458f750e1
use containerd/cgroups to detect cgroups v2
libcontainer does not guarantee a stable API, and is not intended
for external consumers.

this patch replaces some uses of libcontainer/cgroups with
containerd/cgroups.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-11-09 15:00:32 +01:00
Benjamin Böhmke
66459cc623 Added ip6tables config option
Signed-off-by: Benjamin Böhmke <benjamin@boehmke.net>
2020-11-05 16:18:23 +01:00
Sebastiaan van Stijn
182795cff6
Do not call mount.RecursiveUnmount() on Windows
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-29 23:00:16 +01:00
Sebastiaan van Stijn
cf7a5be0f2
daemon: don't adjust oom-score if score is 0
This patch makes two changes if --oom-score-adj is set to 0

- do not adjust the oom-score-adjust cgroup for dockerd
- do not set the hard-coded -999 score for containerd if
  containerd is running as child process

Before this change:

oom-score-adj | dockerd       | containerd as child-process
--------------|---------------|----------------------------
-             | -500          | -500 (same as dockerd)
-100          | -100          | -100 (same as dockerd)
 0            |  0            | -999 (hard-coded default)

With this change:

oom-score-adj | dockerd       | containerd as child-process
--------------|---------------|----------------------------
-             | -500          | -500 (same as dockerd)
-100          | -100          | -100 (same as dockerd)
0             | not adjusted  | not adjusted

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-10-05 19:52:02 +02:00
Jeff Zvier
a7c279f203 Add more error message for ops when container limit use an device which not exist
Signed-off-by: Jeff Zvier <zvier20@gmail.com>
2020-08-11 06:33:22 +08:00
Akihiro Suda
51e3cd4761
statsV2: implement Failcnt
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-07-30 14:31:20 +09:00
Akihiro Suda
b8ca7de823
Deprecate KernelMemory
Kernel memory limit is not supported on cgroup v2.
Even on cgroup v1, kernel memory limit (`kmem.limit_in_bytes`) has been deprecated since kernel 5.4.
0158115f70

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-07-24 20:44:29 +09:00
Brian Goff
260c26b7be
Merge pull request #41016 from kolyshkin/cgroup-init 2020-07-16 11:26:52 -07:00
Brian Goff
61b73ee714
Merge pull request #41182 from cpuguy83/runtime_configure_shim 2020-07-14 14:16:04 -07:00
Brian Goff
f63f73a4a8 Configure shims from runtime config
In dockerd we already have a concept of a "runtime", which specifies the
OCI runtime to use (e.g. runc).
This PR extends that config to add containerd shim configuration.
This option is only exposed within the daemon itself (cannot be
configured in daemon.json).
This is due to issues in supporting unknown shims which will require
more design work.

What this change allows us to do is keep all the runtime config in one
place.

So the default "runc" runtime will just have it's already existing shim
config codified within the runtime config alone.
I've also added 2 more "stock" runtimes which are basically runc+shimv1
and runc+shimv2.
These new runtime configurations are:

- io.containerd.runtime.v1.linux - runc + v1 shim using the V1 shim API
- io.containerd.runc.v2 - runc + shim v2

These names coincide with the actual names of the containerd shims.

This allows the user to essentially control what shim is going to be
used by either specifying these as a `--runtime` on container create or
by setting `--default-runtime` on the daemon.

For custom/user-specified runtimes, the default shim config (currently
shim v1) is used.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-07-13 14:18:02 -07:00
Sebastiaan van Stijn
d2e23405be
Set minimum memory limit to 6M, to account for higher startup memory use
For some time, we defined a minimum limit for `--memory` limits to account for
overhead during startup, and to supply a reasonable functional container.

Changes in the runtime (runc) introduced a higher memory footprint during container
startup, which now lead to obscure error-messages that are unfriendly for users:

    run --rm --memory=4m alpine echo success
    docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"4194304\\\\\\\" to \\\\\\\"/sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes\\\\\\\": write /sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
    ERRO[0000] error waiting for container: context canceled

Containers that fail to start because of this limit, will not be marked as OOMKilled,
which makes it harder for users to find the cause of the failure.

Note that _after_ this memory is only required during startup of the container. After
the container was started, the container may not consume this memory, and limits
could (manually) be lowered, for example, an alpine container running only a shell
can run with 512k of memory;

    echo 524288  > /sys/fs/cgroup/memory/docker/acdd326419f0898be63b0463cfc81cd17fb34d2dae6f8aa3768ee6a075ca5c86/memory.limit_in_bytes

However, restarting the container will reset that manual limit to the container's
configuration. While `docker container update` would allow for the updated limit to
be persisted, (re)starting the container after updating produces the same error message
again, so we cannot use different limits for `docker run` / `docker create` and `docker update`.

This patch raises the minimum memory limnit to 6M, so that a better error-message is
produced if a user tries to create a container with a memory-limit that is too low:

    docker create --memory=4m alpine echo success
    docker: Error response from daemon: Minimum memory limit allowed is 6MB.

Possibly, this constraint could be handled by runc, so that different runtimes
could set a best-matching limit (other runtimes may require less overhead).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-01 13:29:07 +02:00
Kir Kolyshkin
e3cff19dd1 Untangle CPU RT controller init
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.

This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).

This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.

Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.

This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Kir Kolyshkin
afbeaf6f29 pkg/sysinfo: rm duplicates
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.

Same for CPU realtime.

For compatibility reasons, /info stays the same for now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Sebastiaan van Stijn
4534a7afc3
daemon: use containerd/sys to detect UserNamespaces
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.

In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-15 13:06:08 +02:00
Sebastiaan van Stijn
3aac5f0bbb
Merge pull request #41018 from akhilerm/identity-mapping
remove group name from identity mapping
2020-06-08 15:15:05 +02:00
Akhil Mohan
7ad0da7051
remove group name from identity mapping
NewIdentityMapping took group name as an argument, and used
the group name also to parse the /etc/sub{uid,gui}. But as per
linux man pages, the sub{uid,gid} file maps username or uid,
not a group name.

Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.

Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2020-06-03 20:04:42 +05:30
Brian Goff
763f9e799b
Merge pull request #40846 from AkihiroSuda/cgroup2-use-systemd-by-default
cgroup2: use "systemd" cgroup driver by default when available
2020-05-28 11:37:39 -07:00
Sebastiaan van Stijn
b453b64d04
Merge pull request #40845 from AkihiroSuda/allow-privileged-cgroupns-private-on-cgroup-v1
support `--privileged --cgroupns=private` on cgroup v1
2020-05-07 21:11:42 +02:00
Brian Goff
f6163d3f7a
Merge pull request #40673 from kolyshkin/scan
Simplify daemon.overlaySupportsSelinux(), fix use of bufio.Scanner.Err()
2020-04-29 17:18:37 -07:00
Akihiro Suda
4714ab5d6c cgroup2: use "systemd" cgroup driver by default when available
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.

This commit does not affect cgroup v1 hosts.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-22 05:13:37 +09:00
Akihiro Suda
33ee7941d4 support --privileged --cgroupns=private on cgroup v1
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-21 23:11:32 +09:00
Akihiro Suda
f350b53241 cgroup2: implement docker info
ref: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-17 07:20:01 +09:00
Sebastiaan van Stijn
eb14d936bf
daemon: rename variables that collide with imported package names
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:23 +02:00
Sebastiaan van Stijn
5d040cbd16
daemon: fix capitalization of some functions
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-14 17:22:19 +02:00
Sebastiaan van Stijn
af0415257e
Merge pull request #40694 from kolyshkin/moby-sys-mount-part-II
switch to moby/sys/{mount,mountinfo} part II
2020-04-02 21:52:21 +02:00
Akihiro Suda
3802830989 cgroup2: implement docker stats
The following fields are unsupported:

* BlkioStats: all fields other than IoServiceBytesRecursive
* CPUStats: CPUUsage.PercpuUsage
* MemoryStats: MaxUsage and Failcnt

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-02 17:51:34 +09:00
Kir Kolyshkin
5b658a0348 daemon.overlaySupportsSelinux: simplify check
1. Sscanf is very slow, and we don't use the first two fields -- get rid of it.

2. Since the field we search for is at the end of line and prepended by
   a space, we can just use strings.HaveSuffix.

3. Error checking for bufio.Scanner should be done after the Scan()
   loop, not inside it.

Fixes: 885b29df09
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 14:32:42 -07:00
Kir Kolyshkin
39048cf656 Really switch to moby/sys/mount*
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.

This commit was generated by the following bash script:

```
set -e -u -o pipefail

for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
	sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
		-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
		$file
	goimports -w $file
done
```

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 09:46:25 -07:00
Akihiro Suda
92e7f8f67c daemon: fail early if rootless && cgroupdriver == "systemd" && cgroup v1
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-11 12:49:03 +09:00
Akihiro Suda
ca4b51868a rootless: support --exec-opt native.cgroupdriver=systemd
Support cgroup as in Rootless Podman.

Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-02-14 15:32:31 +09:00
Arko Dasgupta
f800d5f786 Set the bip network value as the subnet
Dont assign the --bip value directly to the subnet
for the default bridge. Instead use the network value
from the ParseCIDR output

Addresses: https://github.com/moby/moby/issues/40392

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2020-02-10 17:38:54 -08:00
Sebastiaan van Stijn
ca20bc4214
Merge pull request #40007 from arkodg/add-host-docker-internal
Support host.docker.internal in dockerd on Linux
2020-01-27 13:42:26 +01:00
Arko Dasgupta
92e809a680 Support host.docker.internal in dockerd on Linux
Docker Desktop (on MAC and Windows hosts) allows containers
running inside a Linux VM to connect to the host using
the host.docker.internal DNS name, which is implemented by
VPNkit (DNS proxy on the host)

This PR allows containers to connect to Linux hosts
by appending a special string "host-gateway" to --add-host
e.g. "--add-host=host.docker.internal:host-gateway" which adds
host.docker.internal DNS entry in /etc/hosts and maps it to host-gateway-ip

This PR also add a daemon flag call host-gateway-ip which defaults to
the default bridge IP
Docker Desktop will need to set this field to the Host Proxy IP
so DNS requests for host.docker.internal can be routed to VPNkit

Addresses: https://github.com/docker/for-linux/issues/264

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2020-01-22 13:30:00 -08:00
Sebastiaan van Stijn
be095a1859
Merge pull request #40366 from arkodg/check-cidr-ipv6
Handle the error case when fixed-cidr-ipv6 is empty and ipv6 is enabled
2020-01-14 13:53:45 +01:00
Arko Dasgupta
bdad16b0ee Handle error case when fixed-cidr-ipv6 is empty
When IPv6 is enabled, make sure fixed-cidr-ipv6 is set
by the user since there is no default IPv6 local subnet
in the IPAM

Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
2020-01-13 09:56:41 -08:00
Akihiro Suda
491531c12b cgroup2: mark cpu-rt-{period,runtime} unimplemented
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 02:58:40 +09:00
Akihiro Suda
19baeaca26 cgroup2: enable cgroup namespace by default
For cgroup v1, we were unable to change the default because of
compatibility issue.

For cgroup v2, we should change the default right now because switching
to cgroup v2 is already breaking change.

See also containers/libpod#4363 containers/libpod#4374

Privileged containers also use cgroupns=private by default.
https://github.com/containers/libpod/pull/4374#issuecomment-549776387

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 02:58:40 +09:00
Akihiro Suda
612343618d cgroup2: use shim V2
* Requires containerd binaries from containerd/containerd#3799 . Metrics are unimplemented yet.
* Works with crun v0.10.4, but `--security-opt seccomp=unconfined` is needed unless using master version of libseccomp
  ( containers/crun#156, seccomp/libseccomp#177 )
* Doesn't work with master runc yet
* Resource limitations are unimplemented

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-01 02:58:40 +09:00
Yong Tang
f09dc2f4fc Fix docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid
This fix tries to address the issue raised in 39353 where
docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid.

The issue was that, mapping to `/etc/sub[u,g]id` in docker does not
allow numeric ID.

This fix fixes the issue by probing other combinations (uid:groupname, username:gid, uid:gid)
when normal username:groupname fails.

This fix fixes 39353.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2019-11-07 20:17:11 +00:00
Sebastiaan van Stijn
9a7e96b5b7
Rename "v1" to "statsV1"
follow-up to 27552ceb15, where this
was left as a review comment, but the PR was already merged.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-11-01 16:18:06 +01:00
Sebastiaan van Stijn
27552ceb15
bump containerd/cgroups 5fbad35c2a7e855762d3c60f2e474ffcad0d470a
full diff: c4b9ac5c76...5fbad35c2a

- containerd/cgroups#82 Add go module support
- containerd/cgroups#96 Move metrics proto package to stats/v1
- containerd/cgroups#97 Allow overriding the default /proc folder in blkioController
- containerd/cgroups#98 Allows ignoring memory modules
- containerd/cgroups#99 Add Go 1.13 to Travis
- containerd/cgroups#100 stats/v1: export per-cgroup stats

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-10-31 01:09:12 +01:00
Sebastiaan van Stijn
05469b5fa2
daemon: add "isWindows" const
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-10-17 23:49:43 +02:00
Sebastiaan van Stijn
422067ba7b
Return "invalid parameter" when linking to non-existing container
Trying to link to a non-existing container is not valid, and should return an
"invalid parameter" (400) error. Returning a "not found" error in this situation
would make the client report the container's image could not be found.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-10 23:06:56 +02:00
Rob Gulewich
530f2d65c3 Explicity set Cgroup NS mode to "host" when running privileged
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
2019-08-23 11:27:27 -07:00
Sebastiaan van Stijn
1ea8b413d1
initBridgeDriver: minor cleanup and linting fixes
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-08-09 18:34:35 +02:00