The existing runtimes reload logic went to great lengths to replace the
directory containing runtime wrapper scripts as atomically as possible
within the limitations of the Linux filesystem ABI. Trouble is,
atomically swapping the wrapper scripts directory solves the wrong
problem! The runtime configuration is "locked in" when a container is
started, including the path to the runC binary. If a container is
started with a runtime which requires a daemon-managed wrapper script
and then the daemon is reloaded with a config which no longer requires
the wrapper script (i.e. some args -> no args, or the runtime is dropped
from the config), that container would become unmanageable. Any attempts
to stop, exec or otherwise perform lifecycle management operations on
the container are likely to fail due to the wrapper script no longer
existing at its original path.
Atomically swapping the wrapper scripts is also incompatible with the
read-copy-update paradigm for reloading configuration. A handler in the
daemon could retain a reference to the pre-reload configuration for an
indeterminate amount of time after the daemon configuration has been
reloaded and updated. It is possible for the daemon to attempt to start
a container using a deleted wrapper script if a request to run a
container races a reload.
Solve the problem of deleting referenced wrapper scripts by ensuring
that all wrapper scripts are *immutable* for the lifetime of the daemon
process. Any given runtime wrapper script must always exist with the
same contents, no matter how many times the daemon config is reloaded,
or what changes are made to the config. This is accomplished by using
everyone's favourite design pattern: content-addressable storage. Each
wrapper script file name is suffixed with the SHA-256 digest of its
contents to (probabilistically) guarantee immutability without needing
any concurrency control. Stale runtime wrapper scripts are only cleaned
up on the next daemon restart.
Split the derived runtimes configuration from the user-supplied
configuration to have a place to store derived state without mutating
the user-supplied configuration or exposing daemon internals in API
struct types. Hold the derived state and the user-supplied configuration
in a single struct value so that they can be updated as an atomic unit.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Ensure data-race-free access to the daemon configuration without
locking by mutating a deep copy of the config and atomically storing
a pointer to the copy into the daemon-wide configStore value. Any
operations which need to read from the daemon config must capture the
configStore value only once and pass it around to guarantee a consistent
view of the config.
Signed-off-by: Cory Snider <csnider@mirantis.com>
- Split these options to a separate struct, so that we can handle them in isolation.
- Change some tests to use subtests, and improve coverage
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The containerd client is very chatty at the best of times. Because the
libcontained API is stateless and references containers and processes by
string ID for every method call, the implementation is essentially
forced to use the containerd client in a way which amplifies the number
of redundant RPCs invoked to perform any operation. The libcontainerd
remote implementation has to reload the containerd container, task
and/or process metadata for nearly every operation. This in turn
amplifies the number of context switches between dockerd and containerd
to perform any container operation or handle a containerd event,
increasing the load on the system which could otherwise be allocated to
workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch
with the containerd client so that the containerd client can be used
more efficiently. Split the API out into container, task and process
interfaces which the consumer is expected to retain so that
libcontainerd can retain state---especially the analogous containerd
client objects---without having to manage any state-store inside the
libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
These tests would panic;
- in WithRLimits(), because HostConfig was not set;
470ae8422f/daemon/oci_linux.go (L46-L47)
- in daemon.mergeUlimits(), because daemon.configStore was not set;
470ae8422f/daemon/oci_linux.go (L1069)
This panic was not discovered because the current version of runc/libcontainer that we vendor
would not always return false for `apparmor.IsEnabled()` when running docker-in-docker or if
`apparmor_parser` is not found. Starting with v1.0.0-rc93 of libcontainer, this is no longer
the case (changed in bfb4ea1b1b)
This patch;
- changes the tests to initialize Daemon.configStore and Container.HostConfig
- Combines TestExecSetPlatformOpt and TestExecSetPlatformOptPrivileged into a new test
(TestExecSetPlatformOptAppArmor)
- Runs the test both if AppArmor is enabled and if not (in which case it tests
that the container's AppArmor profile is left empty).
- Adds a FIXME comment for a possible bug in execSetPlatformOpts, which currently
prefers custom profiles over "privileged".
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The runc/libcontainer apparmor package on master no longer checks if apparmor_parser
is enabled, or if we are running docker-in-docker.
While those checks are not relevant to runc (as it doesn't load the profile), these
checks _are_ relevant to us (and containerd). So switching to use the containerd
apparmor package, which does include the needed checks.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Format the source according to latest goimports.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Exec processes do not automatically inherit AppArmor
profiles from the container.
This patch sets the AppArmor profile for the exec
process.
Before this change:
apparmor_parser -q -r <<EOF
#include <tunables/global>
profile deny-write flags=(attach_disconnected) {
#include <abstractions/base>
file,
network,
deny /tmp/** w,
capability,
}
EOF
docker run -dit --security-opt "apparmor=deny-write" --name aa busybox
docker exec aa sh -c 'mkdir /tmp/test'
(no error)
With this change applied:
docker exec aa sh -c 'mkdir /tmp/test'
mkdir: can't create directory '/tmp/test': Permission denied
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>