The event filter used two separate filter-conditions for
"namespace" and "topic". As a result, both events matching
"topic" and events matching "namespace" were subscribed to,
causing events to be handled both by the "plugin" client, and
"container" client.
This patch rewrites the filter to match only if both namespace
and topic match.
Thanks to Stephen Day for providing the correct filter :)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
With the contianerd 1.0 migration we now have strongly typed errors that
we can check for process not found.
We also had some bad error checks looking for `ESRCH` which would only
be returned from `unix.Kill` and never from containerd even though we
were checking containerd responses for it.
Fixes some race conditions around process handling and our error checks
that could lead to errors that propagate up to the user that should not.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The `docker info` code was shelling out to obtain the
version of containerd (using the `--version` flag).
Parsing the output of this version string is error-prone,
and not needed, as the containerd API can return the
version.
This patch adds a `Version()` method to the containerd Client
interface, and uses this to get the containerd version.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This also update:
- runc to 3f2f8b84a77f73d38244dd690525642a72156c64
- runtime-specs to v1.0.0
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Current insider builds of Windows have support for mounting individual
named pipe servers from the host to the guest. This allows, for example,
exposing the docker engine's named pipe to a container.
This change allows the user to request such a mount via the normal bind
mount syntax in the CLI:
docker run -v \\.\pipe\docker_engine:\\.\pipe\docker_engine <args>
Signed-off-by: John Starks <jostarks@microsoft.com>
Changes most references of syscall to golang.org/x/sys/
Ones aren't changes include, Errno, Signal and SysProcAttr
as they haven't been implemented in /x/sys/.
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
[s390x] switch utsname from unsigned to signed
per 33267e036f
char in s390x in the /x/sys/unix package is now signed, so
change the buildtags
Signed-off-by: Christopher Jones <tophj@linux.vnet.ibm.com>
During container startup we end up spending a fair amount of time
encoding/decoding json.
This cuts out some of that since we already have the decoded object in
memory.
The old flow looked like:
1. Start container request
2. Create file
3. Encode container spec to json
4. Write to file
5. Close file
6. Open file
7. Read file
8. Decode container spec
9. Close file
10. Send to containerd.
The new flow cuts out steps 6-9 completely, and with it a lot of time
spent in reflect and file IO.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Docker use default GRPC backoff strategy to reconnect to containerd when
connection is lost. and the delay time grows exponentially, until reaches 120s.
So Change the max delay time to 2s to avoid docker and containerd
connection failure.
Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
It has observed defunct containerd processes accumulating over
time while dockerd was permanently failing to restart containerd.
Due to a bug in the runContainerdDaemon() function, dockerd does not clean up
its child process if containerd already exits very soon after the (re)start.
The reproducer and analysis below comes from docker 1.12.x but bug
still applies on latest master.
- from libcontainerd/remote_linux.go:
329 func (r *remote) runContainerdDaemon() error {
:
: // start the containerd child process
:
403 if err := cmd.Start(); err != nil {
404 return err
405 }
:
: // If containerd exits very soon after (re)start, it is
possible
: // that containerd is already in defunct state at the time
when
: // dockerd gets here. The setOOMScore() function tries to
write
: // to /proc/PID_OF_CONTAINERD/oom_score_adj. However, this
fails
: // with errno EINVAL because containerd is defunct. Please see
: // snippets of kernel source code and further explanation
below.
:
407 if err := setOOMScore(cmd.Process.Pid, r.oomScore); err != nil
{
408 utils.KillProcess(cmd.Process.Pid)
:
: // Due to the error from write() we return here. As
the
: // goroutine that would clean up the child has not
been
: // started yet, containerd remains in the defunct
state
: // and never gets reaped.
:
409 return err
410 }
:
417 go func() {
418 cmd.Wait()
419 close(r.daemonWaitCh)
420 }() // Reap our child when needed
:
423 }
This is the kernel function that gets invoked when dockerd tries to
write
to /proc/PID_OF_CONTAINERD/oom_score_adj.
- from fs/proc/base.c:
1197 static ssize_t oom_score_adj_write(struct file *file, ...
1198 size_t count, loff_t
*ppos)
1199 {
:
1223 task = get_proc_task(file_inode(file));
:
: // The defunct containerd process does not have a virtual
: // address space anymore, i.e. task->mm is NULL. Thus the
: // following code returns errno EINVAL to dockerd.
:
1230 if (!task->mm) {
1231 err = -EINVAL;
1232 goto err_task_lock;
1233 }
:
1253 err_task_lock:
:
1257 return err < 0 ? err : count;
1258 }
The purpose of the following program is to demonstrate the behavior of
the oom_score_adj_write() function in connection with a defunct process.
$ cat defunct_test.c
\#include <unistd.h>
main()
{
pid_t pid = fork();
if (pid == 0)
// child
_exit(0);
// parent
pause();
}
$ make defunct_test
cc defunct_test.c -o defunct_test
$ ./defunct_test &
[1] 3142
$ ps -f | grep defunct_test | grep -v grep
root 3142 2956 0 13:04 pts/0 00:00:00 ./defunct_test
root 3143 3142 0 13:04 pts/0 00:00:00 [defunct_test] <defunct>
$ echo "ps 3143" | crash -s
PID PPID CPU TASK ST %MEM VSZ RSS COMM
3143 3142 2 ffff880035def300 ZO 0.0 0 0
defunct_test
$ echo "px ((struct task_struct *)0xffff880035def300)->mm" | crash -s
$1 = (struct mm_struct *) 0x0
^^^ task->mm is NULL
$ cat /proc/3143/oom_score_adj
0
$ echo 0 > /proc/3143/oom_score_adj
-bash: echo: write error: Invalid argument"
---
This patch fixes the above issue by making sure we start the reaper
goroutine as soon as possible.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Description:
Kill docker-containerd continuously, and use kill -SIGUSR1 <dockerpid>
to check docker callstacks. And we will find that event
handler: startEventsMonitor or handleEventStream will exit.
This will only happen when system is busy, containerd need more time to
startup, and the monitor gorotine maybe exit.
Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
Signed-off-by: John Howard <jhoward@microsoft.com>
This ensures that any compute processes in HCS are cleanedup
during daemon restore. Note Windows cannot (currently) reconnect
to containers on restore.
Instead of a timeout the context is cancelled on error to ensure
proper cleanup of the associated fifos' goroutines.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
Keeping the current behavior for exec, i.e., inheriting
variables from main process. New variables will be added
to current ones. If there's already a variable with that
name it will be overwritten.
Example of usage: docker exec -it -e TERM=vt100 <container> top
Closes#24355.
Signed-off-by: Jonh Wendell <jonh.wendell@redhat.com>
The code that handles waiting for the servicing container to complete correctly grabs the exit code and logs a failure, but doesn't return that failure to the caller, mistakenly causing servicing operations to look successful when they really failed during processing.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
During the recent OCI changes, I mistakenly thought LayerFolderPath is only needed for Windows Server containers (isolation=process) and not for Hyper-V Containers, but it turns out it is also required for servicing containers used to finish installing updates. Since the servicing containers need to reuse the container's create options, this change makes it so that LayerFolderPath is always filled in for all containers as part of constructing the create options.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
Microsoft will be distributing non-base layers that have utility VM image
updates. Update libcontainerd to use the top-most utility VM image that is
available in the image chain when launching Hyper-V-isolated container.
Signed-off-by: John Starks <jostarks@microsoft.com>
New grpc uses failfast by default, but that code was written with other
default in mind, so just preserve it for now.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
Fix issue where environment variables with embedded equals signs were
being dropped and not passed to the container.
Fixes#26178.
Signed-off-by: Matt Richardson <matt.richardson@octopus.com>
There exists a race in container servicing on Windows where, during normal operation, the container will begin to shut itself down while docker calls shutdown explicitly. If the former succeeds just as the latter is attempting to communicate with the container to request the shutdown, an error comes back that can cause the servicing to incorrectly register as a failure. Instead, we just wait for the servicing container to shutdown on it's own, using a reasonable timeout to allow for merging in the updates.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
When there is no event for the container it can happen because of a
crash and the container state on the persistent disk will have a
mismatch between what was in `/run` ( machine crash ).
This situation will create an unkillable container in docker because
containerd does not see it and it is not running but docker thinks it is
and you cannot tell it anything different.
This fixes the issue by checking if containerd has the container running
if we do not have an event instead of just returning.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This was preventing the "exit" event to be correctly processed during
the restore process without live-restore enabled.
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
This version introduces the following:
- uses nanosecond timestamps for event
- ensure events are sent once their effect is "live"
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
This adds an `--oom-score-adjust` flag to the daemon so that the value
provided can be set for the docker daemon's process. The default value
for the flag is -500. This will allow the docker daemon to have a
less chance of being killed before containers do. The default value for
processes is 0 with a min/max of -1000/1000.
-500 is a good middle ground because it is less than the default for
most processes and still not -1000 which basically means never kill this
process in an OOM condition on the host machine. The only processes on
my machine that have a score less than -500 are dbus at -900 and sshd
and xfce( my window manager ) at -1000. I don't think docker should be
set lower, by default, than dbus or sshd so that is why I chose -500.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The Windows TP5 image is not compatible with the Hyper-V isolated
container clone feature. Detect old images and pass a flag specifying that
clone should not be enabled.
Signed-off-by: John Starks <jostarks@microsoft.com>
Right now, if we hit an error retrieving the exit code in HCS process.ExitCode, we return that 0 and that error. Golang convention says that if an error is returned the other values should not be used, but the caller of ExitCode in libcontainerd has to fall through if an error is received. Rather than return a success exit code in that failure case, we should return -1 to indicate a generic failure.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>
This flags enables full support of daemonless containers in docker. It
ensures that docker does not stop containers on shutdown or restore and
properly reconnects to the container when restarted.
This is not the default because of backwards compat but should be the
desired outcome for people running containers in prod.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This change adjusts the calling pattern for servcing containers to use waiting on the process instead of expecting start to block. This is safer, as it avoids timeouts in the start code path for the potentially expensive update operation.
Signed-off-by: Stefan J. Wernli <swernli@microsoft.com>