Kernel memory limit is not supported on cgroup v2.
Even on cgroup v1, kernel memory limit (`kmem.limit_in_bytes`) has been deprecated since kernel 5.4.
0158115f70
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
In dockerd we already have a concept of a "runtime", which specifies the
OCI runtime to use (e.g. runc).
This PR extends that config to add containerd shim configuration.
This option is only exposed within the daemon itself (cannot be
configured in daemon.json).
This is due to issues in supporting unknown shims which will require
more design work.
What this change allows us to do is keep all the runtime config in one
place.
So the default "runc" runtime will just have it's already existing shim
config codified within the runtime config alone.
I've also added 2 more "stock" runtimes which are basically runc+shimv1
and runc+shimv2.
These new runtime configurations are:
- io.containerd.runtime.v1.linux - runc + v1 shim using the V1 shim API
- io.containerd.runc.v2 - runc + shim v2
These names coincide with the actual names of the containerd shims.
This allows the user to essentially control what shim is going to be
used by either specifying these as a `--runtime` on container create or
by setting `--default-runtime` on the daemon.
For custom/user-specified runtimes, the default shim config (currently
shim v1) is used.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
For some time, we defined a minimum limit for `--memory` limits to account for
overhead during startup, and to supply a reasonable functional container.
Changes in the runtime (runc) introduced a higher memory footprint during container
startup, which now lead to obscure error-messages that are unfriendly for users:
run --rm --memory=4m alpine echo success
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"4194304\\\\\\\" to \\\\\\\"/sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes\\\\\\\": write /sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Containers that fail to start because of this limit, will not be marked as OOMKilled,
which makes it harder for users to find the cause of the failure.
Note that _after_ this memory is only required during startup of the container. After
the container was started, the container may not consume this memory, and limits
could (manually) be lowered, for example, an alpine container running only a shell
can run with 512k of memory;
echo 524288 > /sys/fs/cgroup/memory/docker/acdd326419f0898be63b0463cfc81cd17fb34d2dae6f8aa3768ee6a075ca5c86/memory.limit_in_bytes
However, restarting the container will reset that manual limit to the container's
configuration. While `docker container update` would allow for the updated limit to
be persisted, (re)starting the container after updating produces the same error message
again, so we cannot use different limits for `docker run` / `docker create` and `docker update`.
This patch raises the minimum memory limnit to 6M, so that a better error-message is
produced if a user tries to create a container with a memory-limit that is too low:
docker create --memory=4m alpine echo success
docker: Error response from daemon: Minimum memory limit allowed is 6MB.
Possibly, this constraint could be handled by runc, so that different runtimes
could set a best-matching limit (other runtimes may require less overhead).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.
This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).
This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.
Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.
This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.
Same for CPU realtime.
For compatibility reasons, /info stays the same for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
NewIdentityMapping took group name as an argument, and used
the group name also to parse the /etc/sub{uid,gui}. But as per
linux man pages, the sub{uid,gid} file maps username or uid,
not a group name.
Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.
This commit does not affect cgroup v1 hosts.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The following fields are unsupported:
* BlkioStats: all fields other than IoServiceBytesRecursive
* CPUStats: CPUUsage.PercpuUsage
* MemoryStats: MaxUsage and Failcnt
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
1. Sscanf is very slow, and we don't use the first two fields -- get rid of it.
2. Since the field we search for is at the end of line and prepended by
a space, we can just use strings.HaveSuffix.
3. Error checking for bufio.Scanner should be done after the Scan()
loop, not inside it.
Fixes: 885b29df09
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Support cgroup as in Rootless Podman.
Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Dont assign the --bip value directly to the subnet
for the default bridge. Instead use the network value
from the ParseCIDR output
Addresses: https://github.com/moby/moby/issues/40392
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
Docker Desktop (on MAC and Windows hosts) allows containers
running inside a Linux VM to connect to the host using
the host.docker.internal DNS name, which is implemented by
VPNkit (DNS proxy on the host)
This PR allows containers to connect to Linux hosts
by appending a special string "host-gateway" to --add-host
e.g. "--add-host=host.docker.internal:host-gateway" which adds
host.docker.internal DNS entry in /etc/hosts and maps it to host-gateway-ip
This PR also add a daemon flag call host-gateway-ip which defaults to
the default bridge IP
Docker Desktop will need to set this field to the Host Proxy IP
so DNS requests for host.docker.internal can be routed to VPNkit
Addresses: https://github.com/docker/for-linux/issues/264
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
When IPv6 is enabled, make sure fixed-cidr-ipv6 is set
by the user since there is no default IPv6 local subnet
in the IPAM
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
* Requires containerd binaries from containerd/containerd#3799 . Metrics are unimplemented yet.
* Works with crun v0.10.4, but `--security-opt seccomp=unconfined` is needed unless using master version of libseccomp
( containers/crun#156, seccomp/libseccomp#177 )
* Doesn't work with master runc yet
* Resource limitations are unimplemented
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This fix tries to address the issue raised in 39353 where
docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid.
The issue was that, mapping to `/etc/sub[u,g]id` in docker does not
allow numeric ID.
This fix fixes the issue by probing other combinations (uid:groupname, username:gid, uid:gid)
when normal username:groupname fails.
This fix fixes 39353.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Trying to link to a non-existing container is not valid, and should return an
"invalid parameter" (400) error. Returning a "not found" error in this situation
would make the client report the container's image could not be found.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows our tests, which all share a containerd instance, to be a
bit more isolated by setting the containerd namespaces to the generated
daemon ID's rather than the default namespaces.
This came about because I found in some cases we had test daemons
failing to start (really very slow to start) because it was (seemingly)
processing events from other tests.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
While investigating a test failure, I found this in the logs:
```
time="2019-07-04T15:06:32.622506760Z" level=warning msg="Error while setting daemon root propagation, this is not generally critical but may cause some functionality to not work or fallback to less desirable behavior" dir=/go/src/github.com/docker/docker/bundles/test-integration/d1285b8250308/root error="error writing file to signal mount cleanup on shutdown: open /tmp/dxr/d1285b8250308/unmount-on-shutdown: no such file or directory"
```
This path is generated from the daemon's exec-root, which appears to not
exist yet. This change just makes sure it exists before we try to write
a file.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Previously `docker info` had reported "cgroupfs" as the cgroup driver
but the driver wasn't actually used at all.
This PR reports "none" as the cgroup driver so as to avoid confusion.
e.g. kubeadm/kubelet will detect cgroupless-ness by checking this docker
info field. https://github.com/rootless-containers/usernetes/pull/97
Note that user still cannot specify `native.cgroupdriver=none` manually.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
container creation time to control what cgroup namespace the container
is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
- Don't set `PidsLimit` when creating a container and
no limit was set (or the limit was set to "unlimited")
- Don't set `PidsLimit` if the host does not have pids-limit
support (previously "unlimited" was set).
- Do not generate a warning if the host does not have pids-limit
support, but pids-limit was set to unlimited (having no
limit set, or the limit set to "unlimited" is equivalent,
so no warning is nescessary in that case).
- When updating a container, convert `0`, and `-1` to
"unlimited" (`0`).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>