The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These warnings were only logged, and could therefore be overlooked
by users. This patch makes these more visible by returning them as
warnings in the API response.
We should probably consider adding "boolean" (?) fields for these
as well, so that they can be consumed in other ways. In addition,
some of these warnings could potentially be grouped to reduce the
number of warnings that are printed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The initial implementation followed the Swarm API, where
PidsLimit is located in ContainerSpec. This is not the
desired place for this property, so moving the field to
TaskTemplate.Resources in our API.
A similar change should be made in the SwarmKit API (likely
keeping the old field for backward compatibility, because
it was merged some releases back)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilties
for non root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
NewIdentityMapping took group name as an argument, and used
the group name also to parse the /etc/sub{uid,gui}. But as per
linux man pages, the sub{uid,gid} file maps username or uid,
not a group name.
Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
Commit e353e7e3f0 updated selection of the
`resolv.conf` file to use in situations where systemd-resolvd is used as
a resolver.
If a host uses `systemd-resolvd`, the system's `/etc/resolv.conf` file is
updated to set `127.0.0.53` as DNS, which is the local IP address for
systemd-resolvd. The DNS servers that are configured by the user will now
be stored in `/run/systemd/resolve/resolv.conf`, and systemd-resolvd acts
as a forwarding DNS for those.
Originally, Docker copied the DNS servers as configured in `/etc/resolv.conf`
as default DNS servers in containers, which failed to work if systemd-resolvd
is used (as `127.0.0.53` is not available inside the container's networking
namespace). To resolve this, e353e7e3f0 instead
detected if systemd-resolvd is in use, and in that case copied the "upstream"
DNS servers from the `/run/systemd/resolve/resolv.conf` configuration.
While this worked for most situations, it had some downsides, among which:
- we're skipping systemd-resolvd altogether, which means that we cannot take
advantage of addition functionality provided by it (such as per-interface
DNS servers)
- when updating DNS servers in the system's configuration, those changes were
not reflected in the container configuration, which could be problematic in
"developer" scenarios, when switching between networks.
This patch changes the way we select which resolv.conf to use as template
for the container's resolv.conf;
- in situations where a custom network is attached to the container, and the
embedded DNS is available, we use `/etc/resolv.conf` unconditionally. If
systemd-resolvd is used, the embedded DNS forwards external DNS lookups to
systemd-resolvd, which in turn is responsible for forwarding requests to
the external DNS servers configured by the user.
- if the container is running in "host mode" networking, we also use the
DNS server that's configured in `/etc/resolv.conf`. In this situation, no
embedded DNS server is available, but the container runs in the host's
networking namespace, and can use the same DNS servers as the host (which
could be systemd-resolvd or DNSMasq
- if the container uses the default (bridge) network, no embedded DNS is
available, and the container has its own networking namespace. In this
situation we check if systemd-resolvd is used, in which case we skip
systemd-resolvd, and configure the upstream DNS servers as DNS for the
container. This situation is the same as is used currently, which means
that dynamically switching DNS servers won't be supported for these
containers.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When failing to destroy a stale sandbox, we logged that the removal
failed, but omitted the original error message.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The defer function was checking for the local `err` variable, not
on the error that was returned by the function. As a result, the
sandbox would never be cleaned up for containers that used "none"
networking, and a failiure occured during setup.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This introduces A new type (`Limit`), which allows Limits
and "Reservations" to have different options, as it's not
possible to make "Reservations" for some kind of limits.
The `GenericResources` have been removed from the new type;
the API did not handle specifying `GenericResources` as a
_Limit_ (only as _Reservations_), and this field would
therefore always be empty (omitted) in the `Limits` case.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Add tests to ensure it's working
- Rename variables for better clarification
- Fix validation test
- Remove wrong filter assertion based on publish filter
- Change port on test
Signed-off-by: Jaime Cepeda <jcepedavillamayor@gmail.com>
This prevents getting into a situation where a container log cannot make
progress because we tried to rotate a file, got an error, and now the
file is closed. The next time we try to write a log entry it will try
and rotate again but error that the file is already closed.
I wonder if there is more we can do to beef up this rotation logic.
Found this issue while investigating missing logs with errors in the
docker daemon logs like:
```
Failed to log message for json-file: error closing file: close <file>:
file already closed
```
I'm not sure why the original rotation failed since the data was no
longer available.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.
This commit does not affect cgroup v1 hosts.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Support for PidsLimit was added to SwarmKit in docker/swarmkit/pull/2415,
but never exposed through the Docker remove API.
This patch exposes the feature in the repote API.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When a container is left running after the daemon exits (e.g. the daemon
is SIGKILL'd or crashes), it should stop any running containers when the
daemon starts back up.
What actually happens is the daemon only sends the container's
configured stop signal and does not check if it has exited.
If the container does not actually exit then it is left running.
This fixes this unexpected behavior by calling the same function to shut
down the container that the daemon shutdown process does.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Before this change, the log decoder function provided by the log driver
to logfile would not be able to re-use buffers, causing undeeded
allocations and memory bloat for dockerd.
This change introduces an interface that allows the log driver to manage
it's memory usge more effectively.
This only affects json-file and local log drivers.
`json-file` still is not great just because of how the json decoder in the
stdlib works.
`local` is significantly improved.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The following fields are unsupported:
* BlkioStats: all fields other than IoServiceBytesRecursive
* CPUStats: CPUUsage.PercpuUsage
* MemoryStats: MaxUsage and Failcnt
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
1. Sscanf is very slow, and we don't use the first two fields -- get rid of it.
2. Since the field we search for is at the end of line and prepended by
a space, we can just use strings.HaveSuffix.
3. Error checking for bufio.Scanner should be done after the Scan()
loop, not inside it.
Fixes: 885b29df09
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Looks like a wrong copy-paste using `RuntimeKernel100ns` twice instead of `RuntimeUser100ns`
Signed-off-by: Vincent Boulineau <vincent.boulineau@datadoghq.com>
This enables image lookup when creating a container to fail when the
reference exists but it is for the wrong platform. This prevents trying
to run an image for the wrong platform, as can be the case with, for
example binfmt_misc+qemu.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Switch to moby/sys/mount and mountinfo. Keep the pkg/mount for potential
outside users.
This commit was generated by the following bash script:
```
set -e -u -o pipefail
for file in $(git grep -l 'docker/docker/pkg/mount"' | grep -v ^pkg/mount); do
sed -i -e 's#/docker/docker/pkg/mount"#/moby/sys/mount"#' \
-e 's#mount\.\(GetMounts\|Mounted\|Info\|[A-Za-z]*Filter\)#mountinfo.\1#g' \
$file
goimports -w $file
done
```
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Metrics collectors generally don't need the daemon to prime the stats
with something to compare since they already have something to compare
with.
Before this change, the API does 2 collection cycles (which takes
roughly 2s) in order to provide comparison for CPU usage over 1s. This
was primarily added so that `docker stats --no-stream` had something to
compare against.
Really the CLI should have just made a 2nd call and done the comparison
itself rather than forcing it on all API consumers.
That ship has long sailed, though.
With this change, clients can set an option to just pull a single stat,
which is *at least* a full second faster:
Old:
```
time curl --unix-socket
/go/src/github.com/docker/docker/bundles/test-integration-shell/docker.sock
http://./containers/test/stats?stream=false\&one-shot=false > /dev/null
2>&1
real0m1.864s
user0m0.005s
sys0m0.007s
time curl --unix-socket
/go/src/github.com/docker/docker/bundles/test-integration-shell/docker.sock
http://./containers/test/stats?stream=false\&one-shot=false > /dev/null
2>&1
real0m1.173s
user0m0.010s
sys0m0.006s
```
New:
```
time curl --unix-socket
/go/src/github.com/docker/docker/bundles/test-integration-shell/docker.sock
http://./containers/test/stats?stream=false\&one-shot=true > /dev/null
2>&1
real0m0.680s
user0m0.008s
sys0m0.004s
time curl --unix-socket
/go/src/github.com/docker/docker/bundles/test-integration-shell/docker.sock
http://./containers/test/stats?stream=false\&one-shot=true > /dev/null
2>&1
real0m0.156s
user0m0.007s
sys0m0.007s
```
This fixes issues with downstreams ability to use the stats API to
collect metrics.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Configuration over the API per container is intentionally left out for
the time being, but is supported to configure the default from the
daemon config.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit cbecf48bc352e680a5390a7ca9cff53098cd16d7)
Signed-off-by: Madhu Venugopal <madhu@docker.com>
This supplements any log driver which does not support reads with a
custom read implementation that uses a local file cache.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit d675e2bf2b75865915c7a4552e00802feeb0847f)
Signed-off-by: Madhu Venugopal <madhu@docker.com>
Support cgroup as in Rootless Podman.
Requires cgroup v2 host with crun.
Tested with Ubuntu 19.10 (kernel 5.3, systemd 242), crun v0.12.1.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The namespaces com.docker.*, io.docker.*, org.dockerproject.*
have been documented to be reserved for Docker's internal use.
Co-Authored-By: Sebastiaan van Stijn <thaJeztah@users.noreply.github.com>
Signed-off-by: Jintao Zhang <zhangjintao9020@gmail.com>
Dont assign the --bip value directly to the subnet
for the default bridge. Instead use the network value
from the ParseCIDR output
Addresses: https://github.com/moby/moby/issues/40392
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
This adds a new `fluentd-request-ack` logging option for the Fluentd
logging driver. If enabled, the server will respond with an acknowledgement.
This option improves the reliability of the message transmission. This
change is not versioned, and affects all API versions.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This extracts parsing the driver's configuration to a
function, and uses the same function both when initializing
the driver, and when validating logging options.
Doing so allows validating if the provided options are in
the correct format when calling `ValidateOpts`, instead
of resulting in an error when initializing the logging driver.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
`fuse-overlayfs` provides rootless overlayfs functionality without depending
on any kernel patch.
Aside from rootless, `fuse-overlayfs` could be potentially used for eliminating
`chown()` calls that happen in userns-remap mode, because `fuse-overlayfs` also
provides shiftfs functionality.
System requirements:
* fuse-overlayfs needs to be installed. Tested with 0.7.6.
* kernel >= 4.18
Unit test: `go test -exec sudo -v ./daemon/graphdriver/fuse-overlayfs`
The implementation is based on Podman's `overlay` driver which supports
both kernel-mode overlayfs and fuse-overlayfs in the single driver instance:
https://github.com/containers/storage/blob/39a8d5ed/drivers/overlay/overlay.go
However, Moby's implementation aims to decouple `fuse-overlayfs` driver from the
kernel-mode driver (`overlay2`) for simplicity.
Fix#40218
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Before the collection goroutine wakes up every 1 second (as configured).
This sleep interval is in case there are no stats to collect we don't
end up in a tight loop.
Instead use a condition variable to signal that a collection is needed.
This prevents us from waking the goroutine needlessly when there is no
one looking for stats.
For now I've kept the sleep just moved it to the end of the loop, which
gives some space between collections.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This makes sure that things like `--tmpfs` mounts over an anonymous
volume don't create volumes uneccessarily.
One method only checks mountpoints, the other checks both mountpoints
and tmpfs... the usage of these should likely be consolidated.
Ideally, processing for `--tmpfs` mounts would get merged in with the
rest of the mount parsing. I opted not to do that for this change so the
fix is minimal and can potentially be backported with fewer changes of
breaking things.
Merging the mount processing for tmpfs can be handled in a followup.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Docker Desktop (on MAC and Windows hosts) allows containers
running inside a Linux VM to connect to the host using
the host.docker.internal DNS name, which is implemented by
VPNkit (DNS proxy on the host)
This PR allows containers to connect to Linux hosts
by appending a special string "host-gateway" to --add-host
e.g. "--add-host=host.docker.internal:host-gateway" which adds
host.docker.internal DNS entry in /etc/hosts and maps it to host-gateway-ip
This PR also add a daemon flag call host-gateway-ip which defaults to
the default bridge IP
Docker Desktop will need to set this field to the Host Proxy IP
so DNS requests for host.docker.internal can be routed to VPNkit
Addresses: https://github.com/docker/for-linux/issues/264
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
Adds support for ReplicatedJob and GlobalJob service modes. These modes
allow running service which execute tasks that exit upon success,
instead of daemon-type tasks.
Signed-off-by: Drew Erny <drew.erny@docker.com>
When IPv6 is enabled, make sure fixed-cidr-ipv6 is set
by the user since there is no default IPv6 local subnet
in the IPAM
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
If TINI_COMMIT isn't set, .go-autogen sets an empty value
as the "expected" commit. Attempting to truncate the value
caused a panic in that situation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
* Requires containerd binaries from containerd/containerd#3799 . Metrics are unimplemented yet.
* Works with crun v0.10.4, but `--security-opt seccomp=unconfined` is needed unless using master version of libseccomp
( containers/crun#156, seccomp/libseccomp#177 )
* Doesn't work with master runc yet
* Resource limitations are unimplemented
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
When a container is started in privileged mode, the device mappings
provided by `--device` flag was ignored. Now the device mappings will
be considered even in privileged mode.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
The [gelf payload specification](http://docs.graylog.org/en/2.4/pages/gelf.html#gelf-payload-specification)
demands that the field `short_message` *MUST* be set by the client library.
Since docker logging via the gelf driver sends messages line by line, it can happen that messages with an empty
`short_message` are passed on. This causes strict downstream processors (like graylog) to raise an exception.
The logger now skips messages with an empty line.
Resolves: #40232
See also: #37572
Signed-off-by: Jonas Heinrich <Jonas@JonasHeinrich.com>
Now that we do check if overlay is working by performing an actual
overlayfs mount, there's no need in extra checks for the kernel version
or the filesystem type. Actual mount check is sufficient.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Before this commit, overlay check was performed by looking for
`overlay` in /proc/filesystem. This obviously might not work
for rootless Docker (fs is there, but one can't use it as non-root).
This commit changes the check to perform the actual mount, by reusing
the code previously written to check for multiple lower dirs support.
The old check is removed from both drivers, as well as the additional
check for the multiple lower dirs support in overlay2 since it's now
a part of the main check.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This moves supportsMultipleLowerDir() to overlayutils
so it can be used from both overlay and overlay2.
The only changes made were:
* replace logger with logrus
* don't use workDirName mergedDirName constants
* add mnt var to improve readability a bit
This is a preparation for the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fix tries to address the issue raised in 39353 where
docker crash when creating namespaces with UID in /etc/subuid and /etc/subgid.
The issue was that, mapping to `/etc/sub[u,g]id` in docker does not
allow numeric ID.
This fix fixes the issue by probing other combinations (uid:groupname, username:gid, uid:gid)
when normal username:groupname fails.
This fix fixes 39353.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
WithBlock makes sure that the following containerd request is reliable.
In one edge case with high load pressure, kernel kills dockerd, containerd
and containerd-shims caused by OOM. When both dockerd and containerd
restart, but containerd will take time to recover all the existing
containers. Before containerd serving, dockerd will failed with gRPC
error. That bad thing is that restore action will still ignore the
any non-NotFound errors and returns running state for
already stopped container. It is unexpected behavior. And
we need to restart dockerd to make sure that anything is OK.
It is painful. Add WithBlock can prevent the edge case. And
n common case, the containerd will be serving in shortly.
It is not harm to add WithBlock for containerd connection.
Signed-off-by: Wei Fu <fuweid89@gmail.com>
The validate step in CI was broken, due to a combination of
086b4541cf, fbdd437d29,
and 85733620eb being merged to master.
```
api/types/filters/parse.go:39:1: exported method `Args.Keys` should have comment or be unexported (golint)
func (args Args) Keys() []string {
^
daemon/config/builder.go:19:6: exported type `BuilderGCFilter` should have comment or be unexported (golint)
type BuilderGCFilter filters.Args
^
daemon/config/builder.go:21:1: exported method `BuilderGCFilter.MarshalJSON` should have comment or be unexported (golint)
func (x *BuilderGCFilter) MarshalJSON() ([]byte, error) {
^
daemon/config/builder.go:35:1: exported method `BuilderGCFilter.UnmarshalJSON` should have comment or be unexported (golint)
func (x *BuilderGCFilter) UnmarshalJSON(data []byte) error {
^
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>