People keep running the API unauthenticated over TCP and getting pwned because
they accidentally left it exposed to the internet.
The warning about doing this has been there forever.
This introduces a sleep after the warning is printed.
To disable the extra sleep, users must explicitly specify `--tls=false`
or `--tlsverify=false`.
The warning also specifies that this sleep will be removed in the next release,
where the flag will be required if running unauthenticated.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
`os.RemoveAll()` should never return this error. From the docs:
> If the path does not exist, RemoveAll returns nil (no error).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This pulls in the migration of go-winio/backuptar from the bundled fork
of archive/tar from Go 1.6 to using Go's current archive/tar unmodified.
This fixes the failure to import an OCI layer (tar stream) containing a
file larger than 8 GB.
Fixes: #40444
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Co-Authored-By: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Add all available partial-message metadata to journald log entries, to allow easier reassembly of partial messages in downstream logging systems.
Fixes #41403
Signed-off-by: Christian Becker <christian.becker@sixt.com>
This came up in a review of a5324d6950, but
for some reason that comment didn't find its way to GitHub, and/or I
forgot to push the change.
These files are "copied" by reading their content with ioutil.ReadFile();
resolving the symlinks should therefore not be needed, and paths can be
passed as-is:
```go
func copyFile(src, dst string) error {
	sBytes, err := ioutil.ReadFile(src)
	if err != nil {
		return err
	}
	return ioutil.WriteFile(dst, sBytes, filePerm)
}
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This code assumes that we missed an exit event, since the container is
still marked as running in Docker but an attempt to signal the process in
containerd returns a "process not found" error.
There is a case where the event wasn't missed, just that it hasn't been
processed yet.
This change tries to work around that possibility by waiting to see if
the container is eventually marked as stopped. It uses the container's
configured stop timeout for this.
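A minimal sketch of that wait, assuming a simplified container object (the `Container` type, `isRunning`, and `stopTimeout` below are stand-ins, not the daemon's actual types):
```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Container is a stand-in for the daemon's container object.
type Container struct {
	isRunning   func() bool
	stopTimeout int // seconds; the container's configured stop timeout
}

// waitForStopped polls until the container is marked as stopped, giving up
// once the configured stop timeout has elapsed.
func waitForStopped(ctx context.Context, ctr *Container) bool {
	ctx, cancel := context.WithTimeout(ctx, time.Duration(ctr.stopTimeout)*time.Second)
	defer cancel()

	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return false // still marked as running: treat the exit event as missed
		case <-ticker.C:
			if !ctr.isRunning() {
				return true // the exit event was processed after all
			}
		}
	}
}

func main() {
	stoppedAt := time.Now().Add(300 * time.Millisecond)
	ctr := &Container{
		isRunning:   func() bool { return time.Now().Before(stoppedAt) },
		stopTimeout: 10,
	}
	fmt.Println(waitForStopped(context.Background(), ctr)) // true
}
```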
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
A nil interface in Go is not the same as a nil pointer that satisfies
the interface. libcontainer/user has special handling for missing
/etc/{passwd,group} files but this is all based on nil interface checks,
which were broken by Docker's usage of the API.
When combined with some recent changes in runc that made read errors
actually be returned to the caller, this results in spurious -EINVAL
errors when we should detect the situation as "there is no passwd file".
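A standalone illustration of the Go pitfall involved (not the libcontainer/user code itself):
```go
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	var f *os.File      // typed, nil pointer
	var r io.Reader = f // interface value holding a nil *os.File
	var n io.Reader     // nil interface value

	fmt.Println(r == nil) // false: the interface carries a type, even though the pointer is nil
	fmt.Println(n == nil) // true: no type and no value

	// Handling that checks `reader == nil` to mean "no /etc/passwd file" is
	// therefore never triggered when it is handed a nil *os.File rather than
	// a genuinely nil interface.
}
```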
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Add Ulimits field to the ContainerSpec API type and wire it to Swarmkit.
This is related to #40639.
Signed-off-by: Albin Kerouanton <albin@akerouanton.name>
In this case, we are sending a signal to the container (typically this
would be SIGKILL or SIGTERM, but could be any signal), but containerd
reports that the process does not exist.
At the point where this code runs, dockerd thinks that the container
is running, but containerd reports that it is not.
Since containerd reports that it is not running, try to collect the exit
status of the container from containerd, and mark the container as
stopped in dockerd.
Repro this problem like so:
```
id=$(docker run -d busybox top)
pkill containerd && pkill top
docker stop $id
```
Without this change, `docker stop $id` will first try to send SIGTERM,
wait for exit, then try SIGKILL.
Because the process doesn't exist to begin with, no signal is sent, and
so nothing happens.
Since we won't receive any event here to process, the container can
never be marked as stopped until the daemon is restarted.
With the change, `docker stop` succeeds immediately (since the process is
already stopped) and we mark the container as stopped. We handle the
case as if we missed an exit event.
There are definitely some other places in the stack that could use some
improvement here, but this helps people get out of a sticky situation.
With io.containerd.runc.v2, no event is ever received by docker because
the shim quits trying to send the event.
With io.containerd.runtime.v1.linux, the TaskExit event is sent before
dockerd can reconnect to the event stream, and we miss the event.
No matter what, we shouldn't be reliant on the shim doing the right
thing here, nor can we rely on a steady event stream.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This patch adds a new "prune" event type to indicate that pruning of a resource
type completed.
This event-type can be used on systems that want to perform actions after
resources have been cleaned up. For example, Docker Desktop performs an fstrim
after resources are deleted (https://github.com/linuxkit/linuxkit/tree/v0.7/pkg/trim-after-delete).
While the current (remove, destroy) events can provide information on _most_
resources, there is currently no event triggered after the BuildKit build-cache
is cleaned.
Prune events have a `reclaimed` attribute, indicating the amount of space that
was reclaimed (in bytes). The attribute can be used, for example, as a
threshold for performing fstrim actions. Reclaimed space for `network` events
will always be 0, but the field is added to be consistent with prune events for
other resources.
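As a sketch of how these events could be consumed programmatically, using the Go client SDK; the `reclaimed` attribute name follows the event output shown further below, and the printing format is illustrative:
```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// Subscribe to "prune" events only, mirroring `docker events --filter event=prune`.
	msgs, errs := cli.Events(context.Background(), types.EventsOptions{
		Filters: filters.NewArgs(filters.Arg("event", "prune")),
	})
	for {
		select {
		case m := <-msgs:
			// The reclaimed space (in bytes) is reported as an event attribute.
			fmt.Printf("%s prune reclaimed=%s\n", m.Type, m.Actor.Attributes["reclaimed"])
		case err := <-errs:
			panic(err)
		}
	}
}
```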
To test this patch:
Create some resources:
for i in foo bar baz; do \
    docker network create network_$i \
    && docker volume create volume_$i \
    && docker run -d --name container_$i -v volume_$i:/volume busybox sh -c 'truncate -s 5M somefile; truncate -s 5M /volume/file' \
    && docker tag busybox:latest image_$i; \
done;
docker pull alpine
docker pull nginx:alpine
echo -e "FROM busybox\nRUN truncate -s 50M bigfile" | DOCKER_BUILDKIT=1 docker build -
Start listening for "prune" events in another shell:
docker events --filter event=prune
Prune containers, networks, volumes, and build-cache:
docker system prune -af --volumes
See the events that are returned:
docker events --filter event=prune
2020-07-25T12:12:09.268491000Z container prune (reclaimed=15728640)
2020-07-25T12:12:09.447890400Z network prune (reclaimed=0)
2020-07-25T12:12:09.452323000Z volume prune (reclaimed=15728640)
2020-07-25T12:12:09.517236200Z image prune (reclaimed=21568540)
2020-07-25T12:12:09.566662600Z builder prune (reclaimed=52428841)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After discussing with maintainers, it was decided that putting the burden of
providing the full cap list on the client is not a good design.
Instead we decided to follow along with the container API and use cap
add/drop.
This brings in the changes already merged into swarmkit.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Kernel memory limit is not supported on cgroup v2.
Even on cgroup v1, kernel memory limit (`kmem.limit_in_bytes`) has been deprecated since kernel 5.4.
0158115f70
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The test was looking for the wrong file name.
Since compression happens asynchronously, sometimes the test would
succeed and sometimes fail.
This change makes sure to wait for the compressed version of the file
since we can't know when the compression is going to occur.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This function was removed in the Linux code as part of
f63f73a4a8, but was not removed in
the Windows code.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The previous default runtime `io.containerd.runtime.v1.linux` is being deprecated (https://github.com/containerd/containerd/issues/4365)
`io.containerd.runc.v2` is available since containerd v1.3.0.
Using v1.3.5 or later is recommended; v1.3.0 through v1.3.4 do not pass `TestContainerStartOnDaemonRestart`.
Fix #41107
Replace #41115
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
In dockerd we already have a concept of a "runtime", which specifies the
OCI runtime to use (e.g. runc).
This PR extends that config to add containerd shim configuration.
This option is only exposed within the daemon itself (cannot be
configured in daemon.json).
This is due to issues in supporting unknown shims which will require
more design work.
What this change allows us to do is keep all the runtime config in one
place.
So the default "runc" runtime will just have its already existing shim
config codified within the runtime config alone.
I've also added 2 more "stock" runtimes which are basically runc+shimv1
and runc+shimv2.
These new runtime configurations are:
- io.containerd.runtime.v1.linux - runc + v1 shim using the V1 shim API
- io.containerd.runc.v2 - runc + shim v2
These names coincide with the actual names of the containerd shims.
This allows the user to essentially control what shim is going to be
used by either specifying these as a `--runtime` on container create or
by setting `--default-runtime` on the daemon.
For custom/user-specified runtimes, the default shim config (currently
shim v1) is used.
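Purely as an illustration of keeping the shim next to the rest of the runtime configuration; the type and field names below are assumptions, not the daemon's actual config types:
```go
package sketch

// ShimConfig describes which containerd shim to use for a runtime.
type ShimConfig struct {
	Binary string      // e.g. "io.containerd.runc.v2" or "io.containerd.runtime.v1.linux"
	Opts   interface{} // shim-specific options
}

// Runtime groups the OCI runtime binary and its shim configuration in one place.
type Runtime struct {
	Path string   // path to the OCI runtime binary, e.g. "runc"
	Args []string // extra arguments passed to the runtime
	Shim *ShimConfig
}
```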
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The value comes from `C.sysconf(C._SC_CLK_TCK)`, and on Linux it's a
constant which is safe to hard code. See, for example, the musl
libc source code: https://git.musl-libc.org/cgit/musl/tree/src/conf/sysconf.c#n29
This removes the github.com/opencontainers/runc/libcontainer/system
dependency from this package.
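A small sketch of what the replacement boils down to (the constant and helper arithmetic are illustrative):
```go
package main

import "fmt"

// clockTicksPerSecond is the value sysconf(_SC_CLK_TCK) returns on Linux
// (USER_HZ, fixed at 100), so it can safely be hard coded instead of calling
// into libc via cgo.
const clockTicksPerSecond = 100

func main() {
	// Example: convert a CPU time expressed in clock ticks to nanoseconds.
	var ticks uint64 = 250
	fmt.Println(ticks * 1000000000 / clockTicksPerSecond) // 2500000000
}
```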
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
For some time, we have defined a minimum limit for `--memory` to account for
overhead during startup, and to supply a reasonably functional container.
Changes in the runtime (runc) introduced a higher memory footprint during container
startup, which now leads to obscure error messages that are unfriendly to users:
docker run --rm --memory=4m alpine echo success
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:415: setting cgroup config for procHooks process caused \\\"failed to write \\\\\\\"4194304\\\\\\\" to \\\\\\\"/sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes\\\\\\\": write /sys/fs/cgroup/memory/docker/1254c8d63f85442e599b17dff895f4543c897755ee3bd9b56d5d3d17724b38d7/memory.limit_in_bytes: device or resource busy\\\"\"": unknown.
ERRO[0000] error waiting for container: context canceled
Containers that fail to start because of this limit will not be marked as OOMKilled,
which makes it harder for users to find the cause of the failure.
Note that this memory is only required during startup of the container. _After_
the container was started, the container may not consume this memory, and limits
could (manually) be lowered; for example, an alpine container running only a shell
can run with 512k of memory:
echo 524288 > /sys/fs/cgroup/memory/docker/acdd326419f0898be63b0463cfc81cd17fb34d2dae6f8aa3768ee6a075ca5c86/memory.limit_in_bytes
However, restarting the container will reset that manual limit to the container's
configuration. While `docker container update` would allow for the updated limit to
be persisted, (re)starting the container after updating produces the same error message
again, so we cannot use different limits for `docker run` / `docker create` and `docker update`.
This patch raises the minimum memory limit to 6M, so that a better error message is
produced if a user tries to create a container with a memory limit that is too low:
docker create --memory=4m alpine echo success
docker: Error response from daemon: Minimum memory limit allowed is 6MB.
Possibly, this constraint could be handled by runc, so that different runtimes
could set a best-matching limit (other runtimes may require less overhead).
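A sketch of the kind of check this implies; the constant name, function, and placement are assumptions rather than the daemon's source:
```go
package main

import (
	"errors"
	"fmt"
)

// linuxMinMemory is the new minimum memory limit of 6MB, in bytes.
const linuxMinMemory = 6 * 1024 * 1024

func validateMemoryLimit(limit int64) error {
	if limit != 0 && limit < linuxMinMemory {
		return errors.New("Minimum memory limit allowed is 6MB")
	}
	return nil
}

func main() {
	fmt.Println(validateMemoryLimit(4 * 1024 * 1024)) // Minimum memory limit allowed is 6MB
	fmt.Println(validateMemoryLimit(0))               // <nil>, no limit set
}
```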
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The cgroup v2 mode uses the systemd cgroup driver by default.
Suggesting to set the exec-opt "native.cgroupdriver=systemd" therefore isn't meaningful.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
If the container specified in `--volumes-from` did not exist, the
API returned a 404 status, which was interpreted by the CLI as the
specified _image_ being missing (even if that was not the case).
This patch changes this error to return a 400 (bad request).
Before this change:
# make sure the image is present
docker pull busybox
docker create --volumes-from=nosuchcontainer busybox
# Unable to find image 'busybox:latest' locally
# latest: Pulling from library/busybox
# Digest: sha256:95cf004f559831017cdf4628aaf1bb30133677be8702a8c5f2994629f637a209
# Status: Image is up to date for busybox:latest
# Error response from daemon: No such container: nosuchcontainer
After this change:
# make sure the image is present
docker pull busybox
docker create --volumes-from=nosuchcontainer busybox
# Error response from daemon: No such container: nosuchcontainer
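A sketch of the classification change using the errdefs helpers; `lookupContainer` and `volumesFrom` here are hypothetical stand-ins for the daemon's code:
```go
package main

import (
	"fmt"

	"github.com/docker/docker/errdefs"
)

// lookupContainer is a hypothetical stand-in for the daemon's container lookup.
func lookupContainer(name string) (interface{}, error) {
	return nil, fmt.Errorf("No such container: %s", name)
}

func volumesFrom(name string) (interface{}, error) {
	ctr, err := lookupContainer(name)
	if err != nil {
		// Classify the failure as a bad request (HTTP 400) instead of letting
		// a not-found classification (HTTP 404) reach the client, where the
		// CLI would misread it as a missing image.
		return nil, errdefs.InvalidParameter(err)
	}
	return ctr, nil
}

func main() {
	_, err := volumesFrom("nosuchcontainer")
	fmt.Println(err, errdefs.IsInvalidParameter(err)) // No such container: nosuchcontainer true
}
```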
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Commit 56f77d5ade added code that is doing some very ugly things.
In particular, calling cgroups.FindCgroupMountpointAndRoot() and
daemon.SysInfoRaw() inside a recursively-called initCgroupsPath()
is not a good thing to do.
This commit tries to partially untangle this by moving some expensive
checks and calls earlier, in a minimally invasive way (meaning I
tried hard to not break any logic, however weird it is).
This also removes double call to MkdirAll (not important, but it sticks
out) and renames the function to better reflect what it's doing.
Finally, this wraps some of the errors returned, and fixes the init
function to not ignore the error from itself.
This could be reworked more radically, but at least with this commit
we are calling expensive functions once, and only if necessary.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The CPU CFS cgroup-aware scheduler is one single kernel feature, not
two, so it does not make sense to have two separate booleans
(CPUCfsQuota and CPUCfsPeriod). Merge these into CPUCfs.
Same for CPU realtime.
For compatibility reasons, /info stays the same for now.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The implementation in libcontainer/system is quite complicated,
and we only use it to detect if user-namespaces are enabled.
In addition, the implementation in containerd uses a sync.Once,
so that detection (and reading/parsing `/proc/self/uid_map`) is
only performed once.
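A sketch of that approach, assuming a plain read of `/proc/self/uid_map` (this is not containerd's actual code):
```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
)

var (
	inUserNS bool
	nsOnce   sync.Once
)

// runningInUserNS reports whether the current process runs inside a user
// namespace, performing the detection only once.
func runningInUserNS() bool {
	nsOnce.Do(func() {
		f, err := os.Open("/proc/self/uid_map")
		if err != nil {
			// No uid_map: assume the host's initial user namespace.
			return
		}
		defer f.Close()

		var a, b, c int64
		if _, err := fmt.Fscanf(bufio.NewReader(f), "%d %d %d", &a, &b, &c); err != nil {
			return
		}
		// The initial namespace has the single mapping "0 0 4294967295";
		// anything else means we are inside a user namespace.
		if a == 0 && b == 0 && c == 4294967295 {
			return
		}
		inUserNS = true
	})
	return inUserNS
}

func main() {
	fmt.Println(runningInUserNS())
}
```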
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These warnings were only logged, and could therefore be overlooked
by users. This patch makes these more visible by returning them as
warnings in the API response.
We should probably consider adding "boolean" (?) fields for these
as well, so that they can be consumed in other ways. In addition,
some of these warnings could potentially be grouped to reduce the
number of warnings that are printed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The initial implementation followed the Swarm API, where
PidsLimit is located in ContainerSpec. This is not the
desired place for this property, so the field is moved to
TaskTemplate.Resources in our API.
A similar change should be made in the SwarmKit API (likely
keeping the old field for backward compatibility, because
it was merged some releases back).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Currently default capability CAP_NET_RAW allows users to open ICMP echo
sockets, and CAP_NET_BIND_SERVICE allows binding to ports under 1024.
Both of these are safe operations, and Linux now provides ways that
these can be set, per container, to be allowed without any capabilities
for non-root users. Enable these by default. Users can revert to the
previous behaviour by overriding the sysctl values explicitly.
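For reference, these correspond to the following kernel sysctls; the default values shown are assumptions for illustration and are not quoted from this change:
```go
package main

import "fmt"

// The two kernel knobs the description refers to; the values below are
// assumptions for illustration, not taken from this change.
var defaultSysctls = map[string]string{
	// ICMP echo (ping) sockets without CAP_NET_RAW, for every group ID.
	"net.ipv4.ping_group_range": "0 2147483647",
	// Binding to ports below 1024 without CAP_NET_BIND_SERVICE.
	"net.ipv4.ip_unprivileged_port_start": "0",
}

func main() {
	for name, value := range defaultSysctls {
		fmt.Printf("%s = %s\n", name, value)
	}
}
```
Overriding these per container (for example via `docker run --sysctl`) restores the previous behaviour.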
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
NewIdentityMapping took a group name as an argument, and also used
the group name to parse /etc/sub{uid,gid}. But as per the
Linux man pages, the sub{uid,gid} files map a username or uid,
not a group name.
Therefore, all occurrences where mapping is used need to
consider only username and uid. Code trying to map using gid
and group name in the daemon is also removed.
Signed-off-by: Akhil Mohan <akhil.mohan@mayadata.io>
Commit e353e7e3f0 updated selection of the
`resolv.conf` file to use in situations where systemd-resolvd is used as
a resolver.
If a host uses `systemd-resolvd`, the system's `/etc/resolv.conf` file is
updated to set `127.0.0.53` as DNS, which is the local IP address for
systemd-resolvd. The DNS servers that are configured by the user will now
be stored in `/run/systemd/resolve/resolv.conf`, and systemd-resolvd acts
as a forwarding DNS for those.
Originally, Docker copied the DNS servers as configured in `/etc/resolv.conf`
as default DNS servers in containers, which failed to work if systemd-resolvd
is used (as `127.0.0.53` is not available inside the container's networking
namespace). To resolve this, e353e7e3f0 instead
detected if systemd-resolvd is in use, and in that case copied the "upstream"
DNS servers from the `/run/systemd/resolve/resolv.conf` configuration.
While this worked for most situations, it had some downsides, among which:
- we're skipping systemd-resolvd altogether, which means that we cannot take
advantage of additional functionality provided by it (such as per-interface
DNS servers)
- when updating DNS servers in the system's configuration, those changes were
not reflected in the container configuration, which could be problematic in
"developer" scenarios, when switching between networks.
This patch changes the way we select which resolv.conf to use as a template
for the container's resolv.conf (a short sketch of the resulting selection logic
follows the list below):
- in situations where a custom network is attached to the container, and the
embedded DNS is available, we use `/etc/resolv.conf` unconditionally. If
systemd-resolvd is used, the embedded DNS forwards external DNS lookups to
systemd-resolvd, which in turn is responsible for forwarding requests to
the external DNS servers configured by the user.
- if the container is running in "host mode" networking, we also use the
DNS server that's configured in `/etc/resolv.conf`. In this situation, no
embedded DNS server is available, but the container runs in the host's
networking namespace, and can use the same DNS servers as the host (which
could be systemd-resolvd or DNSMasq).
- if the container uses the default (bridge) network, no embedded DNS is
available, and the container has its own networking namespace. In this
situation we check if systemd-resolvd is used, in which case we skip
systemd-resolvd, and configure the upstream DNS servers as DNS for the
container. This situation is the same as is used currently, which means
that dynamically switching DNS servers won't be supported for these
containers.
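The sketch referenced above, condensing the three cases into one selection function (the function and helper names are illustrative, not the libnetwork implementation):
```go
package main

import "fmt"

const (
	systemResolvConf   = "/etc/resolv.conf"
	resolvedResolvConf = "/run/systemd/resolve/resolv.conf"
)

// chooseResolvConf picks the file to use as template for the container's resolv.conf.
func chooseResolvConf(embeddedDNS, hostNetwork, usesSystemdResolved bool) string {
	switch {
	case embeddedDNS:
		// Custom network: the embedded DNS forwards to systemd-resolvd if present.
		return systemResolvConf
	case hostNetwork:
		// Host networking: the container shares the host's namespace and can
		// reach the host's resolver (systemd-resolvd or DNSMasq) directly.
		return systemResolvConf
	case usesSystemdResolved:
		// Default bridge network: skip systemd-resolvd and copy its upstream servers.
		return resolvedResolvConf
	default:
		return systemResolvConf
	}
}

func main() {
	fmt.Println(chooseResolvConf(false, false, true)) // /run/systemd/resolve/resolv.conf
}
```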
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When failing to destroy a stale sandbox, we logged that the removal
failed, but omitted the original error message.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The defer function was checking the local `err` variable, not
the error that was returned by the function. As a result, the
sandbox would never be cleaned up for containers that used "none"
networking when a failure occurred during setup.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This introduces a new type (`Limit`), which allows Limits
and "Reservations" to have different options, as it's not
possible to make "Reservations" for some kinds of limits.
The `GenericResources` have been removed from the new type;
the API did not handle specifying `GenericResources` as a
_Limit_ (only as _Reservations_), and this field would
therefore always be empty (omitted) in the `Limits` case.
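A sketch of the resulting shape; the field sets follow the description above, and the type bodies are not a verbatim copy of the API types:
```go
package sketch

// GenericResource is a stand-in for the existing API type (details omitted).
type GenericResource struct{}

// Limit carries the limit-only options; GenericResources is deliberately absent.
type Limit struct {
	NanoCPUs    int64
	MemoryBytes int64
}

// Resources keeps serving "Reservations", which still support GenericResources.
type Resources struct {
	NanoCPUs         int64
	MemoryBytes      int64
	GenericResources []GenericResource
}

// ResourceRequirements ties the two together.
type ResourceRequirements struct {
	Limits       *Limit
	Reservations *Resources
}
```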
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Add tests to ensure it's working
- Rename variables for better clarity
- Fix validation test
- Remove wrong filter assertion based on publish filter
- Change port on test
Signed-off-by: Jaime Cepeda <jcepedavillamayor@gmail.com>
This prevents getting into a situation where a container log cannot make
progress because we tried to rotate a file, got an error, and now the
file is closed. The next time we try to write a log entry, it will try
to rotate again, only to fail because the file is already closed.
I wonder if there is more we can do to beef up this rotation logic.
Found this issue while investigating missing logs with errors in the
docker daemon logs like:
```
Failed to log message for json-file: error closing file: close <file>:
file already closed
```
I'm not sure why the original rotation failed since the data was no
longer available.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The "systemd" cgroup driver is always preferred over "cgroupfs" on
systemd-based hosts.
This commit does not affect cgroup v1 hosts.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Support for PidsLimit was added to SwarmKit in docker/swarmkit/pull/2415,
but was never exposed through the Docker remote API.
This patch exposes the feature in the remote API.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>