The existing runtimes reload logic went to great lengths to replace the
directory containing runtime wrapper scripts as atomically as possible
within the limitations of the Linux filesystem ABI. Trouble is,
atomically swapping the wrapper scripts directory solves the wrong
problem! The runtime configuration is "locked in" when a container is
started, including the path to the runC binary. If a container is
started with a runtime which requires a daemon-managed wrapper script
and then the daemon is reloaded with a config which no longer requires
the wrapper script (i.e. some args -> no args, or the runtime is dropped
from the config), that container would become unmanageable. Any attempts
to stop, exec or otherwise perform lifecycle management operations on
the container are likely to fail due to the wrapper script no longer
existing at its original path.
Atomically swapping the wrapper scripts is also incompatible with the
read-copy-update paradigm for reloading configuration. A handler in the
daemon could retain a reference to the pre-reload configuration for an
indeterminate amount of time after the daemon configuration has been
reloaded and updated. It is possible for the daemon to attempt to start
a container using a deleted wrapper script if a request to run a
container races a reload.
Solve the problem of deleting referenced wrapper scripts by ensuring
that all wrapper scripts are *immutable* for the lifetime of the daemon
process. Any given runtime wrapper script must always exist with the
same contents, no matter how many times the daemon config is reloaded,
or what changes are made to the config. This is accomplished by using
everyone's favourite design pattern: content-addressable storage. Each
wrapper script file name is suffixed with the SHA-256 digest of its
contents to (probabilistically) guarantee immutability without needing
any concurrency control. Stale runtime wrapper scripts are only cleaned
up on the next daemon restart.
Split the derived runtimes configuration from the user-supplied
configuration to have a place to store derived state without mutating
the user-supplied configuration or exposing daemon internals in API
struct types. Hold the derived state and the user-supplied configuration
in a single struct value so that they can be updated as an atomic unit.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Ensure data-race-free access to the daemon configuration without
locking by mutating a deep copy of the config and atomically storing
a pointer to the copy into the daemon-wide configStore value. Any
operations which need to read from the daemon config must capture the
configStore value only once and pass it around to guarantee a consistent
view of the config.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Config reloading has interleaved validations and other fallible
operations with mutating the live daemon configuration. The daemon
configuration could be left in a partially-reloaded state if any of the
operations returns an error. Mutating a copy of the configuration and
atomically swapping the config struct on success is not currently an
option as config values are not copyable due to the presence of
sync.Mutex fields. Introduce a two-phase commit protocol to defer any
mutations of the daemon state until after all fallible operations have
succeeded.
Reload transactions are not yet entirely hermetic. The platform
reloading logic for custom runtimes on *nix could still leave the
directory of generated runtime wrapper scripts in an indeterminate state
if an error is encountered.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Historically, daemon.RegistryHosts() has returned a docker.RegistryHosts
callback function which closes over a point-in-time snapshot of the
daemon configuration. When constructing the BuildKit builder at daemon
startup, the return value of daemon.RegistryHosts() has been used.
Therefore the BuildKit builder would use the registry configuration as
it was at daemon startup for the life of the process, even if the
registry configuration is changed and the configuration reloaded.
Provide BuildKit with a RegistryHosts callback which reflects the
live daemon configuration after reloads so that registry operations
performed by BuildKit always use the same configuration as the rest of
the daemon.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Passing around a bare pointer to the map of configured features in order
to propagate to consumers changes to the configuration across reloads is
dangerous. Map operations are not atomic, so concurrently reading from
the map while it is being updated is a data race as there is no
synchronization. Use a getter function to retrieve the current features
map so the features can be retrieved race-free.
Remove the unused features argument from the build router.
Signed-off-by: Cory Snider <csnider@mirantis.com>
These changes add basic CDI integration to the docker daemon.
A cdi driver is added to handle cdi device requests. This
is gated by an experimental feature flag and is only supported on linux
This change also adds a CDISpecDirs (cdi-spec-dirs) option to the config.
This allows the default values of `/etc/cdi`, /var/run/cdi` to be overridden
which is useful for testing.
Signed-off-by: Evan Lezar <elezar@nvidia.com>
This option was deprecated in 5a922dc162, which
is part of the v24.0.0 release, so we can remove it from master.
This patch;
- adds a check to ValidatePlatformConfig, and produces a fatal error
if oom-score-adjust is set
- removes the deprecated libcontainerd/supervisor.WithOOMScore
- removes the warning from docker info
With this patch:
dockerd --oom-score-adjust=-500 --validate
Flag --oom-score-adjust has been deprecated, and will be removed in the next release.
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" options have been removed.
And when using `daemon.json`:
dockerd --validate
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" options have been removed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
As of Go 1.8, "net/http".Server provides facilities to close all
listeners, making the same facilities in server.Server redundant.
http.Server also improves upon server.Server by additionally providing a
facility to also wait for outstanding requests to complete after closing
all listeners. Leverage those facilities to give in-flight requests up
to five seconds to finish up after all containers have been shut down.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Logging through a dependency-injected interface value was a vestige of
when Trap was in pkg/signal to avoid importing logrus in a reusable
package: cc4da81128.
Now that Trap lives under cmd/dockerd, nobody will be importing this so
we no longer need to worry about minimizing the package's dependencies.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Always calling os.Exit() on clean shutdown may not always be desirable
as deferred functions are not run. Let the cleanup callback decide
whether or not to call os.Exit() itself. Allow the process to exit the
normal way, by returning from func main().
Simplify the trap.Trap implementation. The signal notifications are
buffered in a channel so there is little need to spawn a new goroutine
for each received signal. With all signals being handled in the same
goroutine, there are no longer any concurrency concerns around the
interrupt counter.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The image store sends events when a new image is created/tagged, using
it instead of the reference store makes sure we send the "tag" event
when a new image is built using buildx.
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
The fix to ignore SIGPIPE signals was originally added in the Go 1.4
era. signal.Ignore was first added in Go 1.5.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The `oom-score-adjust` option was added in a894aec8d8,
to prevent the daemon from being OOM-killed before other processes. This
option was mostly added as a "convenience", as running the daemon as a
systemd unit was not yet common.
Having the daemon set its own limits is not best-practice, and something
better handled by the process-manager starting the daemon.
Commit cf7a5be0f2 fixed this option to allow
disabling it, and 2b8e68ef06 removed the default
score adjust.
This patch deprecates the option altogether, recommending users to set these
limits through the process manager used, such as the "OOMScoreAdjust" option
in systemd units.
With this patch:
dockerd --oom-score-adjust=-500 --validate
Flag --oom-score-adjust has been deprecated, and will be removed in the next release.
configuration OK
echo '{"oom-score-adjust":-500}' > /etc/docker/daemon.json
dockerd
INFO[2023-04-12T21:34:51.133389627Z] Starting up
INFO[2023-04-12T21:34:51.135607544Z] containerd not running, starting managed containerd
WARN[2023-04-12T21:34:51.135629086Z] DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" option will be removed in the next release.
docker info
Client:
Context: default
Debug Mode: false
...
DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" option will be removed in the next release
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The GetRepository method interacts directly with the registry, and does
not depend on the snapshotter, but is used for two purposes;
For the GET /distribution/{name:.*}/json route;
dd3b71d17c/api/server/router/distribution/backend.go (L11-L15)
And to satisfy the "executor.ImageBackend" interface as used by Swarm;
58c027ac8b/daemon/cluster/executor/backend.go (L77)
This patch removes the method from the ImageService interface, and instead
implements it through an composite struct that satisfies both interfaces,
and an ImageBackend() method is added to the daemon.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
remove GetRepository from ImageService
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
GRPC is logging a *lot* of garbage at info level.
This configures the GRPC logger such that it is only giving us logs when
at debug level and also adds a log field indicating where the logs are
coming from.
containerd is still currently spewing these same log messages and needs
a separate update.
Without this change `docker build` is extremely noisy in the daemon
logs.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
SearchRegistryForImages does not make sense as part of the image
service interface. The implementation just wraps the search API of the
registry service to filter the results client-side. It has nothing to do
with local image storage, and the implementation of search does not need
to change when changing which backend (graph driver vs. containerd
snapshotter) is used for local image storage.
Filtering of the search results is an implementation detail: the
consumer of the results does not care which actor does the filtering so
long as the results are filtered as requested. Move filtering into the
exported API of the registry service to hide the implementation details.
Only one thing---the registry service implementation---would need to
change in order to support server-side filtering of search results if
Docker Hub or other registry servers were to add support for it to their
APIs.
Use a fake registry server in the search unit tests to avoid having to
mock out the registry API client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
5008409b5c introduced the usage of
`strings.Cut` to help parse listener addresses.
Part of that also made it error out if no addr is specified after the
protocol spec (e.g. `tcp://`).
Before the change a proto spec without an address just used the default
address for that proto.
e.g. `tcp://` would be `tcp://127.0.0.1:2375`, `unix://` would be
`unix:///var/run/docker.sock`.
Critically, socket activation (`fd://`) never has an address.
This change brings back the old behavior but keeps the usage of
`strings.Cut`.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
- Only use the image exporter in build if we don't use containerd
Without this "docker build" fails with:
Error response from daemon: exporter "image" could not be found
- let buildx know we support containerd snapshotter
- Pass the current snapshotter to the buildkit worker
If buildkit uses a different snapshotter we can't list the images any
more because we can't find the snapshot.
builder/builder-next: make ContainerdWorker a minimal wrapper
Note that this makes "Worker" a public field, so technically one could
overwrite it.
builder-next: reenable runc executor
Currently, without special CNI config the builder would
only create host network containers that is a security issue.
Using runc directly instead of shim is faster as well
as builder doesn’t need anything from shim. The overhead
of setting up network sandbox is much slower of course.
builder/builder-next: simplify options handling
Trying to simplify the logic;
- Use an early return if multiple outputs are provided
- Only construct the list of tags if we're using an image (or moby) exporter
- Combine some logic for snapshotter and non-snapshotter handling
Create a constant for the moby exporter
Pass a context when creating a router
The context has a 10 seconds timeout which should be more than enough to
get the answer from containerd.
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Co-authored-by: Tonis Tiigi <tonistiigi@gmail.com>
Co-authored-by: Nicolas De Loof <nicolas.deloof@gmail.com>
Co-authored-by: Paweł Gronowski <pawel.gronowski@docker.com>
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The authorization.Middleware contains a sync.Mutex field, making it
non-copyable. Remove one of the barriers to allowing deep copies of
config.Config values.
Inject the middleware into Daemon as a constructor argument instead.
Signed-off-by: Cory Snider <csnider@mirantis.com>
It's surprising that the method to begin serving requests is named Wait.
And it is unidiomatic: it is a synchronous call, but it sends its return
value to the channel passed in as an argument instead of just returning
the value. And ultimately it is just a trivial wrapper around serveAPI.
Export the ServeAPI method instead so callers can decide how to call and
synchronize around it.
Call ServeAPI synchronously on the main goroutine in cmd/dockerd. The
goroutine and channel which the Wait() API demanded are superfluous
after all. The notifyReady() call was always concurrent and asynchronous
with respect to serving the API (its implementation spawns a goroutine)
so it makes no difference whether it is called before ServeAPI() or
after `go ServeAPI()`.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The Server.cfg field is never referenced by any code in package
"./api/server". "./api/server".Config struct values are used by
DaemonCli code, but only to pass around configuration copied out of the
daemon config within the "./cmd/dockerd" package. Delete the
"./api/server".Config struct definition and refactor the "./cmd/dockerd"
package to pull configuration directly from cli.Config.
Signed-off-by: Cory Snider <csnider@mirantis.com>
dockerd handles SIGQUIT by dumping all goroutine stacks to standard
error and exiting. In contrast, the Go runtime's default SIGQUIT
behaviour... dumps all goroutine stacks to standard error and exits.
The default SIGQUIT behaviour is implemented directly in the runtime's
signal handler, and so is both more robust to bugs in the Go runtime and
does not perturb the state of the process to anywhere near same degree
as dumping goroutine stacks from a user goroutine. The only notable
difference from a user's perspective is that the process exits with
status 2 instead of 128+SIGQUIT.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Since b58de39ca7, this option was now only used
to produce a fatal error when starting the daemon. That change is in the 23.0
release, so we can remove it from the master branch.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This type was added in 677a6b3506, and named
"common", because at the time, the "docker" and "dockerd" (daemon) code
were still in the same repository, and shared this type. Renaming it, now
that's no longer the case.
As there are no external consumers of this type, I'm not adding an alias.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The configDir (and "DOCKER_CONFIG" environment variable) is now only used
for the default location for TLS certificates to secure the daemon API.
It is a leftover from when the "docker" and "dockerd" CLI shared the same
binary, allowing the DOCKER_CONFIG environment variable to set the location
for certificates to be used by both.
This patch merges it into cmd/dockerd, which is where it was used.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The cli package is a leftover from when the "docker" and "dockerd" cli
were both maintained in this repository; the only consumer of this is
now the dockerd CLI, so we can move this code (it should not be imported
by anyone).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This is only used for tests, and the key is not verified anymore, so
instead of creating a key and storing it, we can just use an ad-hoc
one.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Turned out that the loadOrCreateTrustKey() utility was doing exactly the
same as libtrust.LoadOrCreateTrustKey(), so making it a thin wrapped. I kept
the tests to verify the behavior, but we could remove them as we only need this
for our integration tests.
The storage location for the generated key was changed (again as we only need
this for some integration tests), so we can remove the TrustKeyPath from the
config.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
While this was convenient for our use, it's somewhat unexpected for a function
that writes a file to also create all parent directories; even more because
this function may be executed as root.
This patch makes the package more "safe" to use as a generic package by removing
this functionality, and leaving it up to the caller to create parent directories,
if needed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>