Kernel memory limit is not supported on cgroup v2.
Even on cgroup v1, kernel memory limit (`kmem.limit_in_bytes`) has been deprecated since kernel 5.4.
0158115f70
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
In dockerd we already have a concept of a "runtime", which specifies the
OCI runtime to use (e.g. runc).
This PR extends that config to add containerd shim configuration.
This option is only exposed within the daemon itself (cannot be
configured in daemon.json).
This is due to issues in supporting unknown shims which will require
more design work.
What this change allows us to do is keep all the runtime config in one
place.
So the default "runc" runtime will just have it's already existing shim
config codified within the runtime config alone.
I've also added 2 more "stock" runtimes which are basically runc+shimv1
and runc+shimv2.
These new runtime configurations are:
- io.containerd.runtime.v1.linux - runc + v1 shim using the V1 shim API
- io.containerd.runc.v2 - runc + shim v2
These names coincide with the actual names of the containerd shims.
This allows the user to essentially control what shim is going to be
used by either specifying these as a `--runtime` on container create or
by setting `--default-runtime` on the daemon.
For custom/user-specified runtimes, the default shim config (currently
shim v1) is used.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
For some time, we defined a minimum limit for `--memory` limits to account for
overhead during startup, and to supply a reasonable functional container.
Changes in the runtime (runc) introduced a higher memory footprint during container
startup, which now lead to obscure error-messages that are unfriendly for users:
run --rm --memory=4m alpine echo success
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"4194304\\\\\\\" to \\\\\\\"/sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes\\\\\\\": write /sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Containers that fail to start because of this limit, will not be marked as OOMKilled,
which makes it harder for users to find the cause of the failure.
Note that _after_ this memory is only required during startup of the container. After
the container was started, the container may not consume this memory, and limits
could (manually) be lowered, for example, an alpine container running only a shell
can run with 512k of memory;
echo 524288 > /sys/fs/cgroup/memory/docker/acdd326419f0898be63b0463cfc81cd17fb34d2dae6f8aa3768ee6a075ca5c86/memory.limit_in_bytes
However, restarting the container will reset that manual limit to the container's
configuration. While `docker container update` would allow for the updated limit to
be persisted, (re)starting the container after updating produces the same error message
again, so we cannot use different limits for `docker run` / `docker create` and `docker update`.
This patch raises the minimum memory limnit to 6M, so that a better error-message is
produced if a user tries to create a container with a memory-limit that is too low:
docker create --memory=4m alpine echo success
docker: Error response from daemon: Minimum memory limit allowed is 6MB.
Possibly, this constraint could be handled by runc, so that different runtimes
could set a best-matching limit (other runtimes may require less overhead).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.
This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).
This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.
Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.
This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.
Same for CPU realtime.
For compatibility reasons, /info stays the same for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
NewIdentityMapping took group name as an argument, and used
the group name also to parse the /etc/sub{uid,gui}. But as per
linux man pages, the sub{uid,gid} file maps username or uid,
not a group name.
Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.
This commit does not affect cgroup v1 hosts.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The following fields are unsupported:
* BlkioStats: all fields other than IoServiceBytesRecursive
* CPUStats: CPUUsage.PercpuUsage
* MemoryStats: MaxUsage and Failcnt
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
1. Sscanf is very slow, and we don't use the first two fields -- get rid of it.
2. Since the field we search for is at the end of line and prepended by
a space, we can just use strings.HaveSuffix.
3. Error checking for bufio.Scanner should be done after the Scan()
loop, not inside it.
Fixes: 885b29df09
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Support cgroup as in Rootless Podman.
Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Dont assign the --bip value directly to the subnet
for the default bridge. Instead use the network value
from the ParseCIDR output
Addresses: https://github.com/moby/moby/issues/40392
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
Docker Desktop (on MAC and Windows hosts) allows containers
running inside a Linux VM to connect to the host using
the host.docker.internal DNS name, which is implemented by
VPNkit (DNS proxy on the host)
This PR allows containers to connect to Linux hosts
by appending a special string "host-gateway" to --add-host
e.g. "--add-host=host.docker.internal:host-gateway" which adds
host.docker.internal DNS entry in /etc/hosts and maps it to host-gateway-ip
This PR also add a daemon flag call host-gateway-ip which defaults to
the default bridge IP
Docker Desktop will need to set this field to the Host Proxy IP
so DNS requests for host.docker.internal can be routed to VPNkit
Addresses: https://github.com/docker/for-linux/issues/264
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
When IPv6 is enabled, make sure fixed-cidr-ipv6 is set
by the user since there is no default IPv6 local subnet
in the IPAM
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
* Requires containerd binaries from containerd/containerd#3799 . Metrics are unimplemented yet.
* Works with crun v0.10.4, but `--security-opt seccomp=unconfined` is needed unless using master version of libseccomp
( containers/crun#156, seccomp/libseccomp#177 )
* Doesn't work with master runc yet
* Resource limitations are unimplemented
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This fix tries to address the issue raised in 39353 where
docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid.
The issue was that, mapping to `/etc/sub[u,g]id` in docker does not
allow numeric ID.
This fix fixes the issue by probing other combinations (uid:groupname, username:gid, uid:gid)
when normal username:groupname fails.
This fix fixes 39353.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Trying to link to a non-existing container is not valid, and should return an
"invalid parameter" (400) error. Returning a "not found" error in this situation
would make the client report the container's image could not be found.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows our tests, which all share a containerd instance, to be a
bit more isolated by setting the containerd namespaces to the generated
daemon ID's rather than the default namespaces.
This came about because I found in some cases we had test daemons
failing to start (really very slow to start) because it was (seemingly)
processing events from other tests.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
While investigating a test failure, I found this in the logs:
```
time="2019-07-04T15:06:32.622506760Z" level=warning msg="Error while setting daemon root propagation, this is not generally critical but may cause some functionality to not work or fallback to less desirable behavior" dir=/go/src/github.com/docker/docker/bundles/test-integration/d1285b8250308/root error="error writing file to signal mount cleanup on shutdown: open /tmp/dxr/d1285b8250308/unmount-on-shutdown: no such file or directory"
```
This path is generated from the daemon's exec-root, which appears to not
exist yet. This change just makes sure it exists before we try to write
a file.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Previously `docker info` had reported "cgroupfs" as the cgroup driver
but the driver wasn't actually used at all.
This PR reports "none" as the cgroup driver so as to avoid confusion.
e.g. kubeadm/kubelet will detect cgroupless-ness by checking this docker
info field. https://github.com/rootless-containers/usernetes/pull/97
Note that user still cannot specify `native.cgroupdriver=none` manually.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This adds both a daemon-wide flag and a container creation property:
- Set the `CgroupnsMode: "host|private"` HostConfig property at
container creation time to control what cgroup namespace the container
is created in
- Set the `--default-cgroupns-mode=host|private` daemon flag to control
what cgroup namespace containers are created in by default
- Set the default if the daemon flag is unset to "host", for backward
compatibility
- Default to CgroupnsMode: "host" for client versions < 1.40
Signed-off-by: Rob Gulewich <rgulewich@netflix.com>
- Don't set `PidsLimit` when creating a container and
no limit was set (or the limit was set to "unlimited")
- Don't set `PidsLimit` if the host does not have pids-limit
support (previously "unlimited" was set).
- Do not generate a warning if the host does not have pids-limit
support, but pids-limit was set to unlimited (having no
limit set, or the limit set to "unlimited" is equivalent,
so no warning is nescessary in that case).
- When updating a container, convert `0`, and `-1` to
"unlimited" (`0`).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Please refer to `docs/rootless.md`.
TLDR:
* Make sure `/etc/subuid` and `/etc/subgid` contain the entry for you
* `dockerd-rootless.sh --experimental`
* `docker -H unix://$XDG_RUNTIME_DIR/docker.sock run ...`
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Many startup tasks have to run for each container, and thus using a
WaitGroup (which doesn't have a limit to the number of parallel tasks)
can result in Docker exceeding the NOFILE limit quite trivially. A more
optimal solution is to have a parallelism limit by using a semaphore.
In addition, several startup tasks were not parallelised previously
which resulted in very long startup times. According to my testing, 20K
dead containers resulted in ~6 minute startup times (during which time
Docker is completely unusable).
This patch fixes both issues, and the parallelStartupTimes factor chosen
(128 * NumCPU) is based on my own significant testing of the 20K
container case. This patch (on my machines) reduces the startup time
from 6 minutes to less than a minute (ideally this could be further
reduced by removing the need to scan all dead containers on startup --
but that's beyond the scope of this patchset).
In order to avoid the NOFILE limit problem, we also detect this
on-startup and if NOFILE < 2*128*NumCPU we will reduce the parallelism
factor to avoid hitting NOFILE limits (but also emit a warning since
this is almost certainly a mis-configuration).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Disabling the oom-killer for a container without setting a memory limit
is dangerous, as it can result in the container consuming unlimited memory,
without the kernel being able to kill it. A check for this situation is curently
done in the CLI, but other consumers of the API won't receive this warning.
This patch adds a check for this situation to the daemon, so that all consumers
of the API will receive this warning.
This patch will have one side-effect; docker cli's that also perform this check
client-side will print the warning twice; this can be addressed by disabling
the cli-side check for newer API versions, but will generate a bit of extra
noise when using an older CLI.
With this patch applied (and a cli that does not take the new warning into account);
```
docker create --oom-kill-disable busybox
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
669933b9b237fa27da699483b5cf15355a9027050825146587a0e5be0d848adf
docker run --rm --oom-kill-disable busybox
WARNING: Disabling the OOM killer on containers without setting a '-m/--memory' limit may be dangerous.
WARNING: OOM killer is disabled for the container, but no memory limit is set, this can result in the system running out of resources.
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fix tries to address the issue raised in 37038 where
there were no memory.kernelTCP support for linux.
This fix add MemoryKernelTCP to HostConfig, and pass
the config to runtime-spec.
Additional test case has been added.
This fix fixes 37038.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Using a value such as `--cpuset-mems=1-9223372036854775807` would cause
`dockerd` to run out of memory allocating a map of the values in the
validation code. Set limits to the normal limit of the number of CPUs,
and improve the error handling.
Reported by Huawei PSIRT.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This implements chown support on Windows. Built-in accounts as well
as accounts included in the SAM database of the container are supported.
NOTE: IDPair is now named Identity and IDMappings is now named
IdentityMapping.
The following are valid examples:
ADD --chown=Guest . <some directory>
COPY --chown=Administrator . <some directory>
COPY --chown=Guests . <some directory>
COPY --chown=ContainerUser . <some directory>
On Windows an owner is only granted the permission to read the security
descriptor and read/write the discretionary access control list. This
fix also grants read/write and execute permissions to the owner.
Signed-off-by: Salahuddin Khan <salah@docker.com>
In particular, these two:
> daemon/daemon_unix.go:1129: Wrapf format %v reads arg #1, but call has 0 args
> daemon/kill.go:111: Warn call has possible formatting directive %s
and a few more.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This moves the platform specific stuff in a separate package and keeps
the `volume` package and the defined interfaces light to import.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This makes sure that if the daemon root was already a self-binded mount
(thus meaning the daemonc only performed a remount) that the daemon does
not try to unmount.
Example:
```
$ sudo mount --bind /var/lib/docker /var/lib/docker
$ sudo dockerd &
```
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Current implementaion of docke daemon doesn't pass down the
`--oom-kill-disable` option specified by the end user to the containerd
when spawning a new docker instance with help from `runc` component, which
results in the `--oom-kill-disable` doesn't work no matter the flag is `true`
or `false`.
This PR will fix this issue reported by #36090
Signed-off-by: Dennis Chen <dennis.chen@arm.com>
Signed-off-by: Jianyong Wu <jianyong.wu@arm.com>
This change sets an explicit mount propagation for the daemon root.
This is useful for people who need to bind mount the docker daemon root
into a container.
Since bind mounting the daemon root should only ever happen with at
least `rlsave` propagation (to prevent the container from holding
references to mounts making it impossible for the daemon to clean up its
resources), we should make sure the user is actually able to this.
Most modern systems have shared root (`/`) propagation by default
already, however there are some cases where this may not be so
(e.g. potentially docker-in-docker scenarios, but also other cases).
So this just gives the daemon a little more control here and provides
a more uniform experience across different systems.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
The re-coalesces the daemon stores which were split as part of the
original LCOW implementation.
This is part of the work discussed in https://github.com/moby/moby/issues/34617,
in particular see the document linked to in that issue.
This subtle bug keeps lurking in because error checking for `Mkdir()`
and `MkdirAll()` is slightly different wrt to `EEXIST`/`IsExist`:
- for `Mkdir()`, `IsExist` error should (usually) be ignored
(unless you want to make sure directory was not there before)
as it means "the destination directory was already there"
- for `MkdirAll()`, `IsExist` error should NEVER be ignored.
Mostly, this commit just removes ignoring the IsExist error, as it
should not be ignored.
Also, there are a couple of cases then IsExist is handled as
"directory already exist" which is wrong. As a result, some code
that never worked as intended is now removed.
NOTE that `idtools.MkdirAndChown()` behaves like `os.MkdirAll()`
rather than `os.Mkdir()` -- so its description is amended accordingly,
and its usage is handled as such (i.e. IsExist error is not ignored).
For more details, a quote from my runc commit 6f82d4b (July 2015):
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
83c2152de5 sets the kernel param for
fs.may_detach_mounts, but this is not neccessary for the daemon to
operate. Instead of erroring out (and thus aborting startup) just log
the error.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This is kernel config available in RHEL7.4 based kernels that enables
mountpoint removal where the mountpoint exists in other namespaces.
In particular this is important for making this pattern work:
```
umount -l /some/path
rm -r /some/path
```
Where `/some/path` exists in another mount namespace.
Setting this value will prevent `device or resource busy` errors when
attempting to the removal of `/some/path` in the example.
This setting is the default, and non-configurable, on upstream kernels
since 3.15.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This enables docker cp and ADD/COPY docker build support for LCOW.
Originally, the graphdriver.Get() interface returned a local path
to the container root filesystem. This does not work for LCOW, so
the Get() method now returns an interface that LCOW implements to
support copying to and from the container.
Signed-off-by: Akash Gupta <akagup@microsoft.com>
This also update:
- runc to 3f2f8b84a77f73d38244dd690525642a72156c64
- runtime-specs to v1.0.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Use strongly typed errors to set HTTP status codes.
Error interfaces are defined in the api/errors package and errors
returned from controllers are checked against these interfaces.
Errors can be wraeped in a pkg/errors.Causer, as long as somewhere in the
line of causes one of the interfaces is implemented. The special error
interfaces take precedence over Causer, meaning if both Causer and one
of the new error interfaces are implemented, the Causer is not
traversed.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Since the commit d88fe447df ("Add support for sharing /dev/shm/ and
/dev/mqueue between containers") container's /dev/shm is mounted on the
host first, then bind-mounted inside the container. This is done that
way in order to be able to share this container's IPC namespace
(and the /dev/shm mount point) with another container.
Unfortunately, this functionality breaks container checkpoint/restore
(even if IPC is not shared). Since /dev/shm is an external mount, its
contents is not saved by `criu checkpoint`, and so upon restore any
application that tries to access data under /dev/shm is severily
disappointed (which usually results in a fatal crash).
This commit solves the issue by introducing new IPC modes for containers
(in addition to 'host' and 'container:ID'). The new modes are:
- 'shareable': enables sharing this container's IPC with others
(this used to be the implicit default);
- 'private': disables sharing this container's IPC.
In 'private' mode, container's /dev/shm is truly mounted inside the
container, without any bind-mounting from the host, which solves the
issue.
While at it, let's also implement 'none' mode. The motivation, as
eloquently put by Justin Cormack, is:
> I wondered a while back about having a none shm mode, as currently it is
> not possible to have a totally unwriteable container as there is always
> a /dev/shm writeable mount. It is a bit of a niche case (and clearly
> should never be allowed to be daemon default) but it would be trivial to
> add now so maybe we should...
...so here's yet yet another mode:
- 'none': no /dev/shm mount inside the container (though it still
has its own private IPC namespace).
Now, to ultimately solve the abovementioned checkpoint/restore issue, we'd
need to make 'private' the default mode, but unfortunately it breaks the
backward compatibility. So, let's make the default container IPC mode
per-daemon configurable (with the built-in default set to 'shareable'
for now). The default can be changed either via a daemon CLI option
(--default-shm-mode) or a daemon.json configuration file parameter
of the same name.
Note one can only set either 'shareable' or 'private' IPC modes as a
daemon default (i.e. in this context 'host', 'container', or 'none'
do not make much sense).
Some other changes this patch introduces are:
1. A mount for /dev/shm is added to default OCI Linux spec.
2. IpcMode.Valid() is simplified to remove duplicated code that parsed
'container:ID' form. Note the old version used to check that ID does
not contain a semicolon -- this is no longer the case (tests are
modified accordingly). The motivation is we should either do a
proper check for container ID validity, or don't check it at all
(since it is checked in other places anyway). I chose the latter.
3. IpcMode.Container() is modified to not return container ID if the
mode value does not start with "container:", unifying the check to
be the same as in IpcMode.IsContainer().
3. IPC mode unit tests (runconfig/hostconfig_test.go) are modified
to add checks for newly added values.
[v2: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-51345997]
[v3: addressed review at https://github.com/moby/moby/pull/34087#pullrequestreview-53902833]
[v4: addressed the case of upgrading from older daemon, in this case
container.HostConfig.IpcMode is unset and this is valid]
[v5: document old and new IpcMode values in api/swagger.yaml]
[v6: add the 'none' mode, changelog entry to docs/api/version-history.md]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Switch some more usage of the Stat function and the Stat_t type from the
syscall package to golang.org/x/sys. Those were missing in PR #33399.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
If we get "container not found" error from containerd, it's possibly
because that this container has already been stopped. It will be ok to
ignore this error and just return an empty stats.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
If a caller specifies an SELinux type or MCS Label and still wants to
share an IPC Namespace or the host namespace, we should allow them.
Currently we are ignoring the label specification if ipcmod=container
or pidmode=host.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
cpu.rt_runtime
PR https://github.com/docker/docker/pull/23430 introduced a couple more
flags including `--cpu-rt-runtime` to the docker daemon. It appears
recent changes or merge issues may have broken this. It currently does
not take the cgroup mount point into account when determining the cgroup
files to write values to. This breaks docker setting its own
`cpu.rt_runtime` for the daemon. This also means containers aren't able
to set theirs.
Also, the cgroups.FindCgroupMountpointAndRoot returns back a mount point
that includes the cgroup of the currently running container when docker
is run inside a docker container. this breaks the `--cpu-rt-runtime`
flag when running docker in docker. A fix has been placed here, but
potentially could be pulled up into libcontainer if this is a better
place for it.
Signed-off-by: Erik St. Martin <alakriti@gmail.com>
This also moves some cli specific in `cmd/dockerd` as it does not
really belong to the `daemon/config` package.
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
This fix fixes issue raised in 29492 where it was not
possible to specify a default `--default-shm-size` in daemon
configuration for each `docker run``.
The flag `--default-shm-size` which is reloadable, has been
added to the daemon configuation.
Related docs has been updated.
This fix fixes 29492.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
There was no validation for `docker run --tmpfs foo`.
In this PR, only two obvious rules are implemented:
- path must be absolute
- path must not be "/"
We should add more rules carefully.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This patch fixed below 4 types of code line
1. Remove unnecessary variable assignment
2. Use variables declaration instead of explicit initial zero value
3. Change variable name to underbar when variable not used
4. Add erro check and return for ignored error
Signed-off-by: Daehyeok Mun <daehyeok@gmail.com>
commit 56f77d5ade
added support for cpu-rt-period and cpu-rt-runtime,
but always initialized the cgroup path, even if not
used.
As a result, containers failed to start on a
read-only filesystem.
This patch only creates the cgroup path if
one of these options is set.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
… or could be in `opts` package. Having `runconfig/opts` and `opts`
doesn't really make sense and make it difficult to know where to put
some code.
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
Move plugins to shared distribution stack with images.
Create immutable plugin config that matches schema2 requirements.
Ensure data being pushed is same as pulled/created.
Store distribution artifacts in a blobstore.
Run init layer setup for every plugin start.
Fix breakouts from unsafe file accesses.
Add support for `docker plugin install --alias`
Uses normalized references for default names to avoid collisions when using default hosts/tags.
Some refactoring of the plugin manager to support the change, like removing the singleton manager and adding manager config struct.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
Upon each container create I'm seeing these warning **every** time in the
daemon output:
```
WARN[0002] Your kernel does not support swap memory limit
WARN[0002] Your kernel does not support cgroup rt period
WARN[0002] Your kernel does not support cgroup rt runtime
```
Showing them for each container.create() fills up the logs and encourages
people to ignore the output being generated - which means its less likely
they'll see real issues when they happen. In short, I don't think we
need to show these warnings more than once, so let's only show these
warnings at daemon start-up time.
Signed-off-by: Doug Davis <dug@us.ibm.com>
This fix fixes error messages for `--cpus` from daemon.
When `docker run` takes `--cpus`, it will translate into NanoCPUs
and pass the value to daemon. The `NanoCPU` is not visible to the user.
The error message generated from daemon used 'NanoCPU' which may cause
some confusion to the user.
This fix fixes this issue by returning the error in CPUs instead.
This fix fixes 28456.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
This fix tries to address the proposal raised in 27921 and add
`--cpus` flag for `docker run/create`.
Basically, `--cpus` will allow user to specify a number (possibly partial)
about how many CPUs the container will use. For example, on a 2-CPU system
`--cpus 1.5` means the container will take 75% (1.5/2) of the CPU share.
This fix adds a `NanoCPUs` field to `HostConfig` since swarmkit alreay
have a concept of NanoCPUs for tasks. The `--cpus` flag will translate
the number into reused `NanoCPUs` to be consistent.
This fix adds integration tests to cover the changes.
Related docs (`docker run` and Remote APIs) have been updated.
This fix fixes 27921.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
These features were originally scheduled
for removal in docker 1.13, but we changed
our deprecation policy to keep features
for three releases instead of two.
This updates the deprecation version
to match the deprecation policy.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This fix tries to fix logrus formatting by removing `f` from
`logrus.[Error|Warn|Debug|Fatal|Panic|Info]f` when formatting string
is not present.
Fixed issue #23459
Signed-off-by: Daehyeok Mun <daehyeok@gmail.com>
When processing the --userns-remap flag, add the
capability to call out to `getent` if the user and
group information is not found via local file
parsing code already in libcontainer/user.
Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com>
This fix tries to address the issue raised in 26341
where multiple addresses in a bridge may cause `--fixed-cidr`
to not have the correct addresses.
The issue is that `netutils.ElectInterfaceAddresses(bridgeName)`
only returns the first IPv4 address.
This fix (together with the PR created in libnetwork )
changes `ElectInterfaceAddresses()` and `addresses()`
so that all IPv4 addresses are returned. This will allow the
possibility of selectively choose the address needed.
In `daemon_unix.go`, bridge address is chosen by comparing with
the `--fixed-cidr` first, thus resolve the issue in 26341.
This fix is tested manually, as is described in 26341:
```
brctl addbr cbr0
ip addr add 10.111.111.111/20 dev cbr0 label cbr0:main
ip addr add 10.222.222.222/12 dev cbr0 label cbr0:docker
ip link set cbr0 up
docker daemon --bridge=cbr0 --iptables=false --ip-masq=false --fixed-cidr=10.222.222.222/24
docker run --rm busybox ip route get 8.8.8.8 | grep -Po 'src.*'
src 10.222.222.0
```
This fix fixes 26341.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
containers may specify these cgroup values at runtime. This will allow
processes to change their priority to real-time within the container
when CONFIG_RT_GROUP_SCHED is enabled in the kernel. See #22380.
Also added sanity checks for the new --cpu-rt-runtime and --cpu-rt-period
flags to ensure that that the kernel supports these features and that
runtime is not greater than period.
Daemon will support a --cpu-rt-runtime flag to initialize the parent
cgroup on startup, this prevents the administrator from alotting runtime
to docker after each restart.
There are additional checks that could be added but maybe too far? Check
parent cgroups to ensure values are <= parent, inspecting rtprio ulimit
and issuing a warning.
Signed-off-by: Erik St. Martin <alakriti@gmail.com>
This fix tries to fix an incorrect `WARNING` output in `docker run/create`:
```
ubuntu@ubuntu:~/docker$ docker run -d --cpu-percent 80 busybox top
WARNING: %s does not support CPU percent. Percent discarded.
WARNING: linux
e963d1108e455e7f8f57626ca1305b5f1999e46025d2865b9a21fc8abc51a546
```
The reason was that in `daemon/daemon_unix.go`, the warning string
was not combined with `fmt.Sprintf` before appended to the output.
This fix fixes this issue.
This fix has been manually tested and verified:
```
ubuntu@ubuntu:~/docker$ docker run -d --cpu-percent 80 busybox top
WARNING: linux does not support CPU percent. Percent discarded.
fcf53f79d389235bae846d3d40804834659ac025edbc0d075ed91841a8e4c740
```
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>