Deleting a containerd task whose status is Created fails with a
"precondition failed" error. This is because (aside from Windows)
a process is spawned when the task is created, and deleting the task
while the process is running would leak the process if it was allowed.
libcontainerd and the containerd plugin executor mistakenly try to clean
up from a failed start by deleting the created task, which will always
fail with the aforementined error. Change them to pass the
`WithProcessKill` delete option so the cleanup has a chance to succeed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
On Linux, when (os/exec.Cmd).SysProcAttr.Pdeathsig is set, the signal
will be sent to the process when the OS thread on which cmd.Start() was
executed dies. The runtime terminates an OS thread when a goroutine
exits after being wired to the thread with runtime.LockOSThread(). If
other goroutines are allowed to be scheduled onto a thread which called
cmd.Start(), an unrelated goroutine could cause the thread to be
terminated and prematurely signal the command. See
https://github.com/golang/go/issues/27505 for more information.
Prevent started subprocesses with Pdeathsig from getting signaled
prematurely by wiring the starting goroutine to the OS thread until the
subprocess has exited. No other goroutines can be scheduled onto a
locked thread so it will remain alive until unlocked or the daemon
process exits.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The existing logic to handle container ID conflicts when attempting to
create a plugin container is not nearly as robust as the implementation
in daemon for user containers. Extract and refine the logic from daemon
and use it in the plugin executor.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The containerd client is very chatty at the best of times. Because the
libcontained API is stateless and references containers and processes by
string ID for every method call, the implementation is essentially
forced to use the containerd client in a way which amplifies the number
of redundant RPCs invoked to perform any operation. The libcontainerd
remote implementation has to reload the containerd container, task
and/or process metadata for nearly every operation. This in turn
amplifies the number of context switches between dockerd and containerd
to perform any container operation or handle a containerd event,
increasing the load on the system which could otherwise be allocated to
workloads.
Overhaul the libcontainerd interface to reduce the impedance mismatch
with the containerd client so that the containerd client can be used
more efficiently. Split the API out into container, task and process
interfaces which the consumer is expected to retain so that
libcontainerd can retain state---especially the analogous containerd
client objects---without having to manage any state-store inside the
libcontainerd client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The OOMKilled flag on a container's state has historically behaved
rather unintuitively: it is updated on container exit to reflect whether
or not any process within the container has been OOM-killed during the
preceding run of the container. The OOMKilled flag would be set to true
when the container exits if any process within the container---including
execs---was OOM-killed at any time while the container was running,
whether or not the OOM-kill was the cause of the container exiting. The
flag is "sticky," persisting through the next start of the container;
only being cleared once the container exits without any processes having
been OOM-killed that run.
Alter the behavior of the OOMKilled flag such that it signals whether
any process in the container had been OOM-killed since the most recent
start of the container. Set the flag immediately upon any process being
OOM-killed, and clear it when the container transitions to the "running"
state.
There is an ulterior motive for this change. It reduces the amount of
state the libcontainerd client needs to keep track of and clean up on
container exit. It's one less place the client could leak memory if a
container was to be deleted without going through libcontainerd.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Change the log-level for messages about starting the managed containerd instance
to be the same as for the main API. And remove a redundant debug-log.
With this patch:
dockerd
INFO[2022-08-11T11:46:32.573299176Z] Starting up
INFO[2022-08-11T11:46:32.574304409Z] containerd not running, starting managed containerd
INFO[2022-08-11T11:46:32.575289181Z] started new containerd process address=/var/run/docker/containerd/containerd.sock module=libcontainerd pid=5370
cmd/dockerd: initContainerD(): clean-up some logs
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
the `--log-level` flag overrides whatever is in the containerd configuration file;
f033f6ff85/cmd/containerd/command/main.go (L339-L352)
Given that we set that flag when we start the containerd binary, there is no need
to write it both to the generated config-file and pass it as flag.
This patch also slightly changes the behavior; as both dockerd and containerd use
"info" as default log-level, don't set the log-level if it's the default.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adding a remote.configFile to store the location instead of re-constructing its
location each time. Also fixing a minor inconsistency in the error formats.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Adding a remote.pidFile to store the location instead of re-constructing its
location each time. Also performing a small refactor to use `strconv.Itoa`
instead of `fmt.Sprintf`.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Containerd, like dockerd has a OOMScore configuration option to adjust its own
OOM score. In dockerd, this option was added when default installations were not
yet running the daemon as a systemd unit, which made it more complicated to set
the score, and adding a daemon option was convenient.
A binary adjusting its own score has been frowned upon, as it's more logical to
make that the responsibility of the process manager _starting_ the daemon, which
is what we did for dockerd in 21578530d7.
There have been discussions on deprecating the daemon flag for dockerd, and
similar discussions have been happening for containerd.
This patch changes how we set the OOM score for the containerd child process,
and to have dockerd (supervisor) set the OOM score, as it's acting as process
manager in this case (performing a role similar to systemd otherwise).
With this patch, the score is still adjusted as usual, but not written to the
containerd configuration file;
dockerd --oom-score-adjust=-123
cat /proc/$(pidof containerd)/oom_score_adj
-123
As a follow-up, we may consider to adjust the containerd OOM score based on the
daemon's own score instead of on the `cli.OOMScoreAdjust` configuration so that
we will also adjust the score in situations where dockerd's OOM score was set
through other ways (systemd or manually adjusting the cgroup). A TODO was added
for this.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Consider Address() (Config.GRPC.Addres) to be the source of truth for
the location of the containerd socket.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This RWMutex was added in 9c4570a958, and used in
the `remote.Client()` method. Commit dd2e19ebd5
split the code for client and daemon, but did not remove the mutex.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The existing implementation used a `nil` value for the CRI plugin's configuration
to indicate that the plugin had to be disabled. Effectively, the `Plugins` value
was only used as an intermediate step, only to be removed later on, and to instead
add the given plugin to `DisabledPlugins` in the containerd configuration.
This patch removes the intermediate step; as a result we also don't need to mask
the containerd `Plugins` field, which was added to allow serializing the toml.
A code comment was added as well to explain why we're (currently) disabling the
CRI plugin by default, which may help future visitors of the code to determin
if that default is still needed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This removes the `WithRemoteAddr()`, `WithRemoteAddrUser()`, `WithDebugAddress()`,
and `WithMetricsAddress()` options, added in ddae20c032,
but most of them were never used, and `WithRemoteAddr()` no longer in use since
dd2e19ebd5.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The correct formatting for machine-readable comments is;
//<some alphanumeric identifier>:<options>[,<option>...][ // comment]
Which basically means:
- MUST NOT have a space before `<identifier>` (e.g. `nolint`)
- Identified MUST be alphanumeric
- MUST be followed by a colon
- MUST be followed by at least one `<option>`
- Optionally additional `<options>` (comma-separated)
- Optionally followed by a comment
Any other format will not be considered a machine-readable comment by `gofmt`,
and thus formatted as a regular comment. Note that this also means that a
`//nolint` (without anything after it) is considered invalid, same for `//#nosec`
(starts with a `#`).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Older versions of Go don't format comments, so committing this as
a separate commit, so that we can already make these changes before
we upgrade to Go 1.19.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This allows the postContainersKill() handler to pass values as-is. As part of
the rewrite, I also moved the daemon.GetContainer(name) call later in the
function, so that we can fail early if an invalid signal is passed, before
doing the (heavier) fetching of the container.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Arbitrary here does not include '', best to catch that one early as it's
almost certainly a mistake (possibly an attempt to pass a POSIX path
through this API)
Signed-off-by: Paul "TBBle" Hampson <Paul.Hampson@Pobox.com>
Windows Server 2016 (RS1) reached end of support, and Docker Desktop requires
Windows 10 V19H2 (version 1909, build 18363) as a minimum.
This patch makes Windows Server RS5 / ltsc2019 (build 17763) the minimum version
to run the daemon, and removes some hacks for older versions of Windows.
There is one check remaining that checks for Windows RS3 for a workaround
on older versions, but recent changes in Windows seemed to have regressed
on the same issue, so I kept that code for now to check if we may need that
workaround (again);
085c6a98d5/daemon/graphdriver/windows/windows.go (L319-L341)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The io/ioutil package has been deprecated in Go 1.16. This commit
replaces the existing io/ioutil functions with their new definitions in
io and os packages.
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
Taking the same approach as was taken in containerd
The new library has a slightly different output;
- keys at the same level are sorted alphabetically
- empty sections not omitted (`proxy_plugins`, `stream_processors`, `timeouts`),
which could possibly be be addressed with an "omitempty" in containerd's struct.
- empty slices are not omitted (`imports`, `required_plugins`)
After sorting the "before" configuration the diff looks like this:
```patch
diff --git a/config-before-sorted.toml b/config-after.toml
index cc771ce7ab..43a727f589 100644
--- a/config-before-sorted.toml
+++ b/config-after.toml
@@ -1,6 +1,8 @@
disabled_plugins = ["cri"]
+imports = []
oom_score = 0
plugin_dir = ""
+required_plugins = []
root = "/var/lib/docker/containerd/daemon"
state = "/var/run/docker/containerd/daemon"
version = 0
@@ -37,6 +39,12 @@ version = 0
shim = "containerd-shim"
shim_debug = true
+[proxy_plugins]
+
+[stream_processors]
+
+[timeouts]
+
[ttrpc]
address = ""
gid = 0
```
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
These were added in 94d70d8355 for Windows TP4,
but no longer used after 331c8a86d4 removed
support for TP4.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Checkpoint/Restore is horribly broken all around.
But on the, now default, v2 runtime it's even more broken.
This at least makes checkpoint equally broken on both runtimes.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This patch makes two changes if --oom-score-adj is set to 0
- do not adjust the oom-score-adjust cgroup for dockerd
- do not set the hard-coded -999 score for containerd if
containerd is running as child process
Before this change:
oom-score-adj | dockerd | containerd as child-process
--------------|---------------|----------------------------
- | -500 | -500 (same as dockerd)
-100 | -100 | -100 (same as dockerd)
0 | 0 | -999 (hard-coded default)
With this change:
oom-score-adj | dockerd | containerd as child-process
--------------|---------------|----------------------------
- | -500 | -500 (same as dockerd)
-100 | -100 | -100 (same as dockerd)
0 | not adjusted | not adjusted
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The event subscriber can only be cancelled by cancelling the context.
In the case where we have to restart event processing we are never
cancelling the old subscribiption.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Instead of sleeping an arbitrary amount of time, using the client to
tell us when it's ready so we can start processing events sooner.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This function was removed in the Linux code as part of
f63f73a4a8, but was not removed in
the Windows code.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
In dockerd we already have a concept of a "runtime", which specifies the
OCI runtime to use (e.g. runc).
This PR extends that config to add containerd shim configuration.
This option is only exposed within the daemon itself (cannot be
configured in daemon.json).
This is due to issues in supporting unknown shims which will require
more design work.
What this change allows us to do is keep all the runtime config in one
place.
So the default "runc" runtime will just have it's already existing shim
config codified within the runtime config alone.
I've also added 2 more "stock" runtimes which are basically runc+shimv1
and runc+shimv2.
These new runtime configurations are:
- io.containerd.runtime.v1.linux - runc + v1 shim using the V1 shim API
- io.containerd.runc.v2 - runc + shim v2
These names coincide with the actual names of the containerd shims.
This allows the user to essentially control what shim is going to be
used by either specifying these as a `--runtime` on container create or
by setting `--default-runtime` on the daemon.
For custom/user-specified runtimes, the default shim config (currently
shim v1) is used.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The following fields are unsupported:
* BlkioStats: all fields other than IoServiceBytesRecursive
* CPUStats: CPUUsage.PercpuUsage
* MemoryStats: MaxUsage and Failcnt
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
* Requires containerd binaries from containerd/containerd#3799 . Metrics are unimplemented yet.
* Works with crun v0.10.4, but `--security-opt seccomp=unconfined` is needed unless using master version of libseccomp
( containers/crun#156, seccomp/libseccomp#177 )
* Doesn't work with master runc yet
* Resource limitations are unimplemented
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>