Commit graph

7184 commits

Author SHA1 Message Date
Brian Goff
1022c6608e
Merge pull request #41083 from thaJeztah/more_warnings
info: add warnings about missing blkio cgroup support
2020-07-09 11:51:09 -07:00
Brian Goff
dd46bbca08
Merge pull request #41168 from thaJeztah/raise_minimum_memory_limit
Set minimum memory limit to 6M, to account for higher startup memory use
2020-07-09 11:48:34 -07:00
Sebastiaan van Stijn
b42ac8d370
daemon/stats: use const for clockTicksPerSecond
The value comes from `C.sysconf(C._SC_CLK_TCK)`, and on Linux it's a
constant which is safe to be hard coded. See for example in the Musl
libc source code https://git.musl-libc.org/cgit/musl/tree/src/conf/sysconf.c#n29

This removes the github.com/opencontainers/runc/libcontainer/system
dependency from this package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-08 14:22:04 +02:00
Wang Yumu
840a12ac90 Add DefaultAddressPools to docker info #40388
Signed-off-by: Wang Yumu <37442693@qq.com>
2020-07-08 00:53:11 +08:00
Tõnis Tiigi
2b1bd64310
Merge pull request #41157 from AkihiroSuda/improve-info-warn
info: improve "WARNING: Running in rootless-mode without cgroup"
2020-07-02 11:35:57 -07:00
Sebastiaan van Stijn
d2e23405be
Set minimum memory limit to 6M, to account for higher startup memory use
For some time, we defined a minimum limit for `--memory` limits to account for
overhead during startup, and to supply a reasonable functional container.

Changes in the runtime (runc) introduced a higher memory footprint during container
startup, which now lead to obscure error-messages that are unfriendly for users:

    run --rm --memory=4m alpine echo success
    docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"4194304\\\\\\\" to \\\\\\\"/sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes\\\\\\\": write /sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
    ERRO[0000] error waiting for container: context canceled

Containers that fail to start because of this limit, will not be marked as OOMKilled,
which makes it harder for users to find the cause of the failure.

Note that _after_ this memory is only required during startup of the container. After
the container was started, the container may not consume this memory, and limits
could (manually) be lowered, for example, an alpine container running only a shell
can run with 512k of memory;

    echo 524288  > /sys/fs/cgroup/memory/docker/acdd326419f0898be63b0463cfc81cd17fb34d2dae6f8aa3768ee6a075ca5c86/memory.limit_in_bytes

However, restarting the container will reset that manual limit to the container's
configuration. While `docker container update` would allow for the updated limit to
be persisted, (re)starting the container after updating produces the same error message
again, so we cannot use different limits for `docker run` / `docker create` and `docker update`.

This patch raises the minimum memory limnit to 6M, so that a better error-message is
produced if a user tries to create a container with a memory-limit that is too low:

    docker create --memory=4m alpine echo success
    docker: Error response from daemon: Minimum memory limit allowed is 6MB.

Possibly, this constraint could be handled by runc, so that different runtimes
could set a best-matching limit (other runtimes may require less overhead).

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-01 13:29:07 +02:00
Akihiro Suda
97708281eb
info: improve "WARNING: Running in rootless-mode without cgroup"
The cgroup v2 mode uses systemd driver by default.
Suggesting to set exec-opt "native.cgroupdriver=systemd" isn't meaningful.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-06-29 20:59:47 +09:00
Sebastiaan van Stijn
3258d565cf
Fix status code for missing --volumes-from container
If the container specified in `--volumes-from` did not exist, the
API returned a 404 status, which was interpreted by the CLI as the
specified _image_ to be missing (even if that was not the case).

This patch changes these error to return a 400 (bad request);

Before this change:

    # make sure the image is present
    docker pull busybox
    docker create --volumes-from=nosuchcontainer busybox
    # Unable to find image 'busybox:latest' locally
    # latest: Pulling from library/busybox
    # Digest: sha256:95cf004f559831017cdf4628aaf1bb30133677be8702a8c5f2994629f637a209
    # Status: Image is up to date for busybox:latest
    # Error response from daemon: No such container: nosuchcontainer

After this change:

    # make sure the image is present
    docker pull busybox
    docker create --volumes-from=nosuchcontainer busybox
    # Error response from daemon: No such container: nosuchcontainer

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-29 13:28:14 +02:00
Kir Kolyshkin
e3cff19dd1 Untangle CPU RT controller init
Commit 56f77d5ade added code that is doing some very ugly things.
In partucular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
not not a good thing to do.

This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).

This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.

Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.

This could be reworked more radically, but at least this this commit
we are calling expensive functions once, and only if necessary.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Kir Kolyshkin
afbeaf6f29 pkg/sysinfo: rm duplicates
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.

Same for CPU realtime.

For compatibility reasons, /info stays the same for now.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-26 16:19:52 -07:00
Akihiro Suda
36218123ff
Merge pull request #41022 from thaJeztah/smarter_resolv
Better selection of DNS server
2020-06-25 21:22:33 +09:00
Sebastiaan van Stijn
4534a7afc3
daemon: use containerd/sys to detect UserNamespaces
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.

In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-15 13:06:08 +02:00
Brian Goff
d984d3053b
Merge pull request #41075 from wangyumu/fix-syslog-empty-lines
Fixes #41010 skip empty lines
2020-06-11 12:53:37 -07:00
Brian Goff
7fa2026620
Merge pull request #40938 from thaJeztah/move_pidslimit
API: swarm: move PidsLimit to TaskTemplate.Resources
2020-06-11 12:04:44 -07:00
Sebastiaan van Stijn
d378625554
info: add warnings about missing blkio cgroup support
These warnings were only logged, and could therefore be overlooked
by users. This patch makes these more visible by returning them as
warnings in the API response.

We should probably consider adding "boolean" (?) fields for these
as well, so that they can be consumed in other ways. In addition,
some of these warnings could potentially be grouped to reduce the
number of warnings that are printed.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-08 17:16:44 +02:00
Sebastiaan van Stijn
3aac5f0bbb
Merge pull request #41018 from akhilerm/identity-mapping
remove group name from identity mapping
2020-06-08 15:15:05 +02:00
Wang Yumu
96556854a7 Fixes #41010 skip empty lines
Signed-off-by: Wang Yumu <37442693@qq.com>
2020-06-06 12:36:50 +08:00
Tibor Vass
5ffd677824
Merge pull request #41020 from thaJeztah/fix_sandbox_cleanup
allocateNetwork: fix network sandbox not cleaned up on failure
2020-06-05 09:55:54 -07:00
Sebastiaan van Stijn
687bdc7c71
API: swarm: move PidsLimit to TaskTemplate.Resources
The initial implementation followed the Swarm API, where
PidsLimit is located in ContainerSpec. This is not the
desired place for this property, so moving the field to
TaskTemplate.Resources in our API.

A similar change should be made in the SwarmKit API (likely
keeping the old field for backward compatibility, because
it was merged some releases back)

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-06-05 12:50:38 +02:00
Tibor Vass
fa38a6cd21
Merge pull request #40937 from thaJeztah/split_resource_types
API: split types for Resources Reservations and Limits
2020-06-04 17:33:47 -07:00
Sebastiaan van Stijn
888da28d42
Merge pull request #41030 from justincormack/default-sysctls
Add default sysctls to allow ping sockets and privileged ports with no capabilities
2020-06-04 22:31:51 +02:00
Justin Cormack
dae652e2e5
Add default sysctls to allow ping sockets and privileged ports with no capabilities
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2020-06-04 18:11:08 +01:00
Akhil Mohan
7ad0da7051
remove group name from identity mapping
NewIdentityMapping took group name as an argument, and used
the group name also to parse the /etc/sub{uid,gui}. But as per
linux man pages, the sub{uid,gid} file maps username or uid,
not a group name.

Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.

Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
2020-06-03 20:04:42 +05:30
Brian Goff
89382f2f20
Merge pull request #41023 from thaJeztah/better_logs
daemon.allocateNetwork: include original error in logs
2020-05-28 13:42:42 -07:00
Brian Goff
763f9e799b
Merge pull request #40846 from AkihiroSuda/cgroup2-use-systemd-by-default
cgroup2: use "systemd" cgroup driver by default when available
2020-05-28 11:37:39 -07:00
Sebastiaan van Stijn
a5324d6950
Better selection of DNS server
Commit e353e7e3f0 updated selection of the
`resolv.conf` file to use in situations where systemd-resolvd is used as
a resolver.

If a host uses `systemd-resolvd`, the system's `/etc/resolv.conf` file is
updated to set `127.0.0.53` as DNS, which is the local IP address for
systemd-resolvd. The DNS servers that are configured by the user will now
be stored in `/run/systemd/resolve/resolv.conf`, and systemd-resolvd acts
as a forwarding DNS for those.

Originally, Docker copied the DNS servers as configured in `/etc/resolv.conf`
as default DNS servers in containers, which failed to work if systemd-resolvd
is used (as `127.0.0.53` is not available inside the container's networking
namespace). To resolve this, e353e7e3f0 instead
detected if systemd-resolvd is in use, and in that case copied the "upstream"
DNS servers from the `/run/systemd/resolve/resolv.conf` configuration.

While this worked for most situations, it had some downsides, among which:

- we're skipping systemd-resolvd altogether, which means that we cannot take
  advantage of addition functionality provided by it (such as per-interface
  DNS servers)
- when updating DNS servers in the system's configuration, those changes were
  not reflected in the container configuration, which could be problematic in
  "developer" scenarios, when switching between networks.

This patch changes the way we select which resolv.conf to use as template
for the container's resolv.conf;

- in situations where a custom network is attached to the container, and the
  embedded DNS is available, we use `/etc/resolv.conf` unconditionally. If
  systemd-resolvd is used, the embedded DNS forwards external DNS lookups to
  systemd-resolvd, which in turn is responsible for forwarding requests to
  the external DNS servers configured by the user.
- if the container is running in "host mode" networking, we also use the
  DNS server that's configured in `/etc/resolv.conf`. In this situation, no
  embedded DNS server is available, but the container runs in the host's
  networking namespace, and can use the same DNS servers as the host (which
  could be systemd-resolvd or DNSMasq
- if the container uses the default (bridge) network, no embedded DNS is
  available, and the container has its own networking namespace. In this
  situation we check if systemd-resolvd is used, in which case we skip
  systemd-resolvd, and configure the upstream DNS servers as DNS for the
  container. This situation is the same as is used currently, which means
  that dynamically switching DNS servers won't be supported for these
  containers.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-25 18:20:56 +02:00
Sebastiaan van Stijn
288ed93dc5
daemon.allocateNetwork: include original error in logs
When failing to destroy a stale sandbox, we logged that the removal
failed, but omitted the original error message.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-25 18:10:58 +02:00
Sebastiaan van Stijn
84ef60cba2
allocateNetwork: don't assign unneeded variables
allocateNetwork() can return early, in which case these variables were unused.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-25 14:12:33 +02:00
Sebastiaan van Stijn
b98b8df886
allocateNetwork: fix network sandbox not cleaned up on failure
The defer function was checking for the local `err` variable, not
on the error that was returned by the function. As a result, the
sandbox would never be cleaned up for containers that used "none"
networking, and a failiure occured during setup.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-25 14:10:48 +02:00
Tibor Vass
5c10ea6ae8
Merge pull request #40725 from cpuguy83/check_img_platform
Accept platform spec on container create
2020-05-21 11:33:27 -07:00
Akihiro Suda
50867791d6
Merge pull request #40967 from tonistiigi/tls-fix
registry: fix mtls config dir passing
2020-05-20 00:26:13 +09:00
Akihiro Suda
225d64ebf1
Merge pull request #40969 from cpuguy83/fix_flakey_log_rotate_test
Fix flakey test for log file rotate.
2020-05-20 00:24:58 +09:00
Brian Goff
5ea5c02c88 Fix flakey test for log file rotate.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-05-18 10:27:53 -07:00
Sebastiaan van Stijn
84748c7d4e
API: split types for Resources Reservations and Limits
This introduces A new type (`Limit`), which allows Limits
and "Reservations" to have different options, as it's not
possible to make "Reservations" for some kind of limits.

The `GenericResources` have been removed from the new type;
the API did not handle specifying `GenericResources` as a
_Limit_ (only as _Reservations_), and this field would
therefore always be empty (omitted) in the `Limits` case.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-18 14:21:23 +02:00
Jaime Cepeda
f48b7d66f3
Fix filter on expose and publish
- Add tests to ensure it's working
- Rename variables for better clarification
- Fix validation test
- Remove wrong filter assertion based on publish filter
- Change port on test

Signed-off-by: Jaime Cepeda <jcepedavillamayor@gmail.com>
2020-05-15 11:12:03 +02:00
Tonis Tiigi
fdb71e410c registry: fix mtls config dir passing
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2020-05-14 12:02:09 -07:00
Tibor Vass
298ba5b131
Merge pull request #40427 from thaJeztah/prometheus_remove_experimental
Do not require "experimental" for metrics API
2020-05-08 11:10:53 -07:00
Brian Goff
75d655320e
Merge pull request #40920 from cpuguy83/log_rotate_error_handling
logfile: Check if log is closed on close error during rotate
2020-05-07 14:45:42 -07:00
Sebastiaan van Stijn
b453b64d04
Merge pull request #40845 from AkihiroSuda/allow-privileged-cgroupns-private-on-cgroup-v1
support `--privileged --cgroupns=private` on cgroup v1
2020-05-07 21:11:42 +02:00
Brian Goff
47d9489e7c
Merge pull request #40907 from thaJeztah/bump_selinux
vendor: opencontainers/selinux v1.5.1
2020-05-07 11:51:08 -07:00
Brian Goff
3989f91075 logfile: Check if log is closed on close error during rotate
This prevents getting into a situation where a container log cannot make
progress because we tried to rotate a file, got an error, and now the
file is closed. The next time we try to write a log entry it will try
and rotate again but error that the file is already closed.

I wonder if there is more we can do to beef up this rotation logic.
Found this issue while investigating missing logs with errors in the
docker daemon logs like:

```
Failed to log message for json-file: error closing file: close <file>:
file already closed
```

I'm not sure why the original rotation failed since the data was no
longer available.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2020-05-07 11:37:06 -07:00
Brian Goff
232ebf7fa4
Merge pull request #40868 from thaJeztah/what_is_the_cause
Replace errors.Cause() with errors.Is() / errors.As()
2020-05-07 11:33:26 -07:00
Sebastiaan van Stijn
a8216806ce
vendor: opencontainers/selinux v1.5.1
full diff: https://github.com/opencontainers/selinux/compare/v1.3.3...v1.5.1

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-05 20:33:06 +02:00
Brian Goff
f6163d3f7a
Merge pull request #40673 from kolyshkin/scan
Simplify daemon.overlaySupportsSelinux(), fix use of bufio.Scanner.Err()
2020-04-29 17:18:37 -07:00
Sebastiaan van Stijn
c3b3aedfa4
Merge pull request #40662 from AkihiroSuda/cgroup2-dockerinfo
cgroup2: implement `docker info`
2020-04-29 22:57:00 +02:00
Sebastiaan van Stijn
07d60bc257
Replace errors.Cause() with errors.Is() / errors.As()
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-29 00:28:41 +02:00
Wilson Júnior
964731e1d3
Improve error feedback when plugin does not implement desired interface
Signed-off-by: Wilson Júnior <wilsonpjunior@gmail.com>
2020-04-21 18:06:24 -03:00
Akihiro Suda
4714ab5d6c cgroup2: use "systemd" cgroup driver by default when available
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.

This commit does not affect cgroup v1 hosts.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-22 05:13:37 +09:00
Sebastiaan van Stijn
8312004f41
remove uses of deprecated pkg/term
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-04-21 16:29:27 +02:00
Akihiro Suda
33ee7941d4 support --privileged --cgroupns=private on cgroup v1
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-21 23:11:32 +09:00