For current implementation of Checkpoint Restore (C/R) in docker, it
will write the checkpoint to content store. However, when restoring
libcontainerd uses .Digest().Encoded(), which will remove the info
of alg, leading to error.
Signed-off-by: huang-jl <1046678590@qq.com>
The monitorDaemon() goroutine calls startContainerd() then blocks on
<-daemonWaitCh to wait for it to exit. The startContainerd() function
would (re)initialize the daemonWaitCh so a restarted containerd could be
waited on. This implementation was race-free because startContainerd()
would synchronously initialize the daemonWaitCh before returning. When
the call to start the managed containerd process was moved into the
waiter goroutine, the code to initialize the daemonWaitCh struct field
was also moved into the goroutine. This introduced a race condition.
Move the daemonWaitCh initialization to guarantee that it happens before
the startContainerd() call returns.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Split task creation and start into two separate method calls in the
libcontainerd API. Clients now have the opportunity to inspect the
freshly-created task and customize its runtime environment before
starting execution of the user-specified binary.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The workaround is no longer required. The bug has been fixed in stable
versions of all supported containerd branches.
This reverts commit fb7ec1555c.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The github.com/containerd/containerd/log package was moved to a separate
module, which will also be used by upcoming (patch) releases of containerd.
This patch moves our own uses of the package to use the new module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The DeepEqual ignore required in the daemon tests is a bit ugly, but it
works given the new protoc output.
We also have to ignore lints related to schema1 deprecations; these do
not apply as we must continue to support this schema version.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
This type was introduced in
0a79e67e4f
Make use of it throughout our log-format handling code, and convert back
to a string before we pass it to the containerd client.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
- use local variables and remove some intermediate variables
- handle the events inside the switch itself; this makes all the
switch branches use the same logic, instead of "some" using
a `continue`, and others falling through to have the event handled
outside of the switch.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Synchronize the code to do the same thing as Exec.
reap doesn't need to be called before the start event was sent.
There's already a defer block which cleans up the process in case where
an error occurs.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Error check in defer block used wrong error variable which is always nil
if the flow reaches the defer. This caused the `newProcess.Kill` to be
never called if the subsequent attemp to attach to the stdio failed.
Although this only happens in Exec (as Start does overwrite the error),
this also adjusts the Start to also use the returned error to avoid this
kind of mistake in future changes.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
This uses otel standard environment variables to configure tracing in
the daemon.
It also adds support for propagating trace contexts in the client and
reading those from the API server.
See
https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/
for details on otel environment variables.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Update docker to support a '--log-format' option, which accepts either
'text' (default) or 'json'. Propagate the log format to containerd as
well, to ensure that everything will be logged consistently.
Signed-off-by: Philip K. Warren <pkwarren@gmail.com>
Many of the fields in LinuxResources struct are pointers to scalars for
some reason, presumably to differentiate between set-to-zero and unset
when unmarshaling from JSON, despite zero being outside the acceptable
range for the corresponding kernel tunables. When creating the OCI spec
for a container, the daemon sets the container's OCI spec CPUShares and
BlkioWeight parameters to zero when the corresponding Docker container
configuration values are zero, signifying unset, despite the minimum
acceptable value for CPUShares being two, and BlkioWeight ten. This has
gone unnoticed as runC does not distingiush set-to-zero from unset as it
also uses zero internally to represent unset for those fields. However,
kata-containers v3.2.0-alpha.3 tries to apply the explicit-zero resource
parameters to the container, exactly as instructed, and fails loudly.
The OCI runtime-spec is silent on how the runtime should handle the case
when those parameters are explicitly set to out-of-range values and
kata's behaviour is not unreasonable, so the daemon must therefore be in
the wrong.
Translate unset values in the Docker container's resources HostConfig to
omit the corresponding fields in the container's OCI spec when starting
and updating a container in order to maximize compatibility with
runtimes.
Signed-off-by: Cory Snider <csnider@mirantis.com>
It turns out that the unnecessary serialization removed in
b75246202a happened to work around a bug
in containerd. When many exec processes are started concurrently in the
same containerd task, it takes seconds to minutes for them all to start.
Add the workaround back in, only deliberately this time.
Signed-off-by: Cory Snider <csnider@mirantis.com>
This option was deprecated in 5a922dc162, which
is part of the v24.0.0 release, so we can remove it from master.
This patch;
- adds a check to ValidatePlatformConfig, and produces a fatal error
if oom-score-adjust is set
- removes the deprecated libcontainerd/supervisor.WithOOMScore
- removes the warning from docker info
With this patch:
dockerd --oom-score-adjust=-500 --validate
Flag --oom-score-adjust has been deprecated, and will be removed in the next release.
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" options have been removed.
And when using `daemon.json`:
dockerd --validate
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" options have been removed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The `oom-score-adjust` option was added in a894aec8d8,
to prevent the daemon from being OOM-killed before other processes. This
option was mostly added as a "convenience", as running the daemon as a
systemd unit was not yet common.
Having the daemon set its own limits is not best-practice, and something
better handled by the process-manager starting the daemon.
Commit cf7a5be0f2 fixed this option to allow
disabling it, and 2b8e68ef06 removed the default
score adjust.
This patch deprecates the option altogether, recommending users to set these
limits through the process manager used, such as the "OOMScoreAdjust" option
in systemd units.
With this patch:
dockerd --oom-score-adjust=-500 --validate
Flag --oom-score-adjust has been deprecated, and will be removed in the next release.
configuration OK
echo '{"oom-score-adjust":-500}' > /etc/docker/daemon.json
dockerd
INFO[2023-04-12T21:34:51.133389627Z] Starting up
INFO[2023-04-12T21:34:51.135607544Z] containerd not running, starting managed containerd
WARN[2023-04-12T21:34:51.135629086Z] DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" option will be removed in the next release.
docker info
Client:
Context: default
Debug Mode: false
...
DEPRECATED: The "oom-score-adjust" config parameter and the dockerd "--oom-score-adjust" option will be removed in the next release
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The signatures of functions in containerd's errdefs packages are very
similar to those in our own, and it's easy to accidentally use the wrong
package.
This patch uses a consistent alias for all occurrences of this import.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Closing stdin of a container or exec (a.k.a.: task or process) has been
somewhat broken ever since support for ContainerD 1.0 was introduced
back in Docker v17.11: the error returned from the CloseIO() call was
effectively ignored due to it being assigned to a local variable which
shadowed the intended variable. Serendipitously, that oversight
prevented a data race. In my recent refactor of libcontainerd, I
corrected the variable shadowing issue and introduced the aforementioned
data race in the process.
Avoid deadlocking when closing stdin without swallowing errors or
introducing data races by calling CloseIO() synchronously if the process
handle is available, falling back to an asynchronous close-and-log
strategy otherwise. This solution is inelegant and complex, but looks to
be the best that could be done without changing the libcontainerd API.
Signed-off-by: Cory Snider <csnider@mirantis.com>
In preparation for containerd v1.7 which migrates off gogo/protobuf
and changes the protobuf Any type to one that's not supported by our
vendored version of typeurl.
This fixes a compile error on usages of `typeurl.UnmarshalAny` when
upgrading to containerd v1.7.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
`cp` variable is used later to populate the `info.Checkpoint` field
option used by Task creation.
Previous changes mistakenly changed assignment of the `cp` variable to
declaration of a new variable that's scoped only to the if block.
Restore the old assignment behavior.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Kubernetes only permits RuntimeClass values which are valid lowercase
RFC 1123 labels, which disallows the period character. This prevents
cri-dockerd from being able to support configuring alternative shimv2
runtimes for a pod as shimv2 runtime names must contain at least one
period character. Add support for configuring named shimv2 runtimes in
daemon.json so that runtime names can be aliased to
Kubernetes-compatible names.
Allow options to be set on shimv2 runtimes in daemon.json.
The names of the new daemon runtime config fields have been selected to
correspond with the equivalent field names in cri-containerd's
configuration so that users can more easily follow documentation from
the runtime vendor written for cri-containerd and apply it to
daemon.json.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The ID of the task is known at the time that the FIFOs need to be
created (it's passed into the IO-creator callback, and is also the same
as the container ID) so there is no need to hardcode it to "init". Name
the FIFOs after the task ID to be consistent with the FIFO names of
exec'ed processes. Delete the now-unused InitProcessName constant so it
can never again be used in place of a task/process ID.
Signed-off-by: Cory Snider <csnider@mirantis.com>
ContainerD unconditionally sets the ID of a task to its container's ID.
Emulate this behaviour in the libcontainerd local_windows implementation
so that the daemon can use ProcessID == ContainerID (in libcontainerd
terminology) to identify that an exit event is for the container's task
and not for another process (i.e. an exec) in the same container.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Deleting a containerd task whose status is Created fails with a
"precondition failed" error. This is because (aside from Windows)
a process is spawned when the task is created, and deleting the task
while the process is running would leak the process if it was allowed.
libcontainerd and the containerd plugin executor mistakenly try to clean
up from a failed start by deleting the created task, which will always
fail with the aforementined error. Change them to pass the
`WithProcessKill` delete option so the cleanup has a chance to succeed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
On Linux, when (os/exec.Cmd).SysProcAttr.Pdeathsig is set, the signal
will be sent to the process when the OS thread on which cmd.Start() was
executed dies. The runtime terminates an OS thread when a goroutine
exits after being wired to the thread with runtime.LockOSThread(). If
other goroutines are allowed to be scheduled onto a thread which called
cmd.Start(), an unrelated goroutine could cause the thread to be
terminated and prematurely signal the command. See
https://github.com/golang/go/issues/27505 for more information.
Prevent started subprocesses with Pdeathsig from getting signaled
prematurely by wiring the starting goroutine to the OS thread until the
subprocess has exited. No other goroutines can be scheduled onto a
locked thread so it will remain alive until unlocked or the daemon
process exits.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The existing logic to handle container ID conflicts when attempting to
create a plugin container is not nearly as robust as the implementation
in daemon for user containers. Extract and refine the logic from daemon
and use it in the plugin executor.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The containerd client is very chatty at the best of times. Because the
libcontained API is stateless and references containers and processes by
string ID for every method call, the implementation is essentially
forced to use the containerd client in a way which amplifies the number
of redundant RPCs invoked to perform any operation. The libcontainerd
remote implementation has to reload the containerd container, task
and/or process metadata for nearly every operation. This in turn
amplifies the number of context switches between dockerd and containerd
to perform any container operation or handle a containerd event,
increasing the load on the system which could otherwise be allocated to
workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch
with the containerd client so that the containerd client can be used
more efficiently. Split the API out into container, task and process
interfaces which the consumer is expected to retain so that
libcontainerd can retain state---especially the analogous containerd
client objects---without having to manage any state-store inside the
libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The OOMKilled flag on a container's state has historically behaved
rather unintuitively: it is updated on container exit to reflect whether
or not any process within the container has been OOM-killed during the
preceding run of the container. The OOMKilled flag would be set to true
when the container exits if any process within the container---including
execs---was OOM-killed at any time while the container was running,
whether or not the OOM-kill was the cause of the container exiting. The
flag is "sticky," persisting through the next start of the container;
only being cleared once the container exits without any processes having
been OOM-killed that run.
Alter the behavior of the OOMKilled flag such that it signals whether
any process in the container had been OOM-killed since the most recent
start of the container. Set the flag immediately upon any process being
OOM-killed, and clear it when the container transitions to the "running"
state.
There is an ulterior motive for this change. It reduces the amount of
state the libcontainerd client needs to keep track of and clean up on
container exit. It's one less place the client could leak memory if a
container was to be deleted without going through libcontainerd.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Change the log-level for messages about starting the managed containerd instance
to be the same as for the main API. And remove a redundant debug-log.
With this patch:
dockerd
INFO[2022-08-11T11:46:32.573299176Z] Starting up
INFO[2022-08-11T11:46:32.574304409Z] containerd not running, starting managed containerd
INFO[2022-08-11T11:46:32.575289181Z] started new containerd process address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=5370
cmd/dockerd: initContainerD(): clean-up some logs
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>