While investigating a test failure, I found this in the logs:
```
time="2019-07-04T15:06:32.622506760Z" level=warning msg="Error while setting daemon root propagation, this is not generally critical but may cause some functionality to not work or fallback to less desirable behavior" dir=/go/src/github.com/docker/docker/bundles/test-integration/d1285b8250308/root error="error writing file to signal mount cleanup on shutdown: open /tmp/dxr/d1285b8250308/unmount-on-shutdown: no such file or directory"
```
This path is generated from the daemon's exec-root, which appears to not
exist yet. This change just makes sure it exists before we try to write
a file.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
(cherry picked from commit 7725b88edc)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Fixes#39427
This always sends the exec exit events even when the exec fails to find
the binary. A standard 127 exit status is sent in this situation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
(cherry picked from commit c08d4da6e5)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Trying to link to a non-existing container is not valid, and should return an
"invalid parameter" (400) error. Returning a "not found" error in this situation
would make the client report the container's image could not be found.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
(cherry picked from commit 422067ba7b)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
When mounting overlays which have children, enforce that
the mount is always performed as read only. Newer versions
of the kernel return a device busy error when a lower directory
is in use as an upper directory in another overlay mount.
Adds committed file to indicate when an overlay is being used
as a parent, ensuring it will no longer be mounted with an
upper directory.
Signed-off-by: Derek McGowan <derek@mcgstyle.net>
(cherry picked from commit 477bf1e413)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Docker daemon always stops healthcheck before sending signal to a
container now. However, when we use "docker kill" to send signals
other than SIGTERM or SIGKILL to a container, such as SIGINT,
daemon still stops container health check though container process
handles the signal normally and continues to work.
Signed-off-by: Ruilin Li <liruilin4@huawei.com>
(cherry picked from commit da574f9343)
Signed-off-by: Dani Louca <dani.louca@docker.com>
1. Use "in-place" variables for if statements to limit their scope to
the respectful `if` block.
2. Report the error returned from sd_journal_* by using CErr().
3. Use errors.New() instead of fmt.Errorf().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 20a0e58a79)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
From the first glance, `docker logs --tail 0` does not make sense,
as it is supposed to produce no output, but `tail -n 0` from GNU
coreutils is working like that, plus there is even a test case
(`TestLogsTail` in integration-cli/docker_cli_logs_test.go).
Now, something like `docker logs --follow --tail 0` makes total
sense, so let's make it work.
(NOTE if --tail is not used, config.Tail is set to -1)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit dd4bfe30a8)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
If we take a long time to process log messages, and during that time
journal file rotation occurs, the journald client library will keep
those rotated files open until sd_journal_process() is called.
By periodically calling sd_journal_process() during the processing
loop we shrink the window of time a client instance has open file
descriptors for rotated (deleted) journal files.
This code is modelled after that of journalctl [1]; the above explanation
as well as the value of 1024 is taken from there.
[v2: fix CErr() argument]
[1] https://github.com/systemd/systemd/blob/dc16327c48d/src/journal/journalctl.c#L2676
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit b73fb8fd5d)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
TL;DR: simplify the code, fix --follow hanging indefinitely
Do the following to simplify the followJournal() code:
1. Use Go-native select instead of C-native polling.
2. Use Watch{Producer,Consumer}Gone(), eliminating the need
to have journald.closed variable, and an extra goroutine.
3. Use sd_journal_wait(). In the words of its own man page:
> A synchronous alternative for using sd_journal_get_fd(),
> sd_journal_get_events(), sd_journal_get_timeout() and
> sd_journal_process() is sd_journal_wait().
Unfortunately, the logic is still not as simple as it
could be; the reason being, once the container has exited,
journald might still be writing some logs from its internal
buffers onto journal file(s), and there is no way to
figure out whether it's done so we are guaranteed to
read all of it back. This bug can be reproduced with
something like
> $ ID=$(docker run -d busybox seq 1 150000); docker logs --follow $ID
> ...
> 128123
> $
(The last expected output line should be `150000`).
To avoid exiting from followJournal() early, add the
following logic: once the container is gone, keep trying
to drain the journal until there's no new data for at
least `waitTimeout` time period.
Should fix https://github.com/docker/for-linux/issues/575
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit f091febc94)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. The journald client library initializes inotify watch(es)
during the first call to sd_journal_get_fd(), and it make sense
to open it earlier in order to not lose any journal file rotation
events.
2. It only makes sense to call this if we're going to use it
later on -- so add a check for config.Follow.
3. Remove the redundant call to sd_journal_get_fd().
NOTE that any subsequent calls to sd_journal_get_fd() return
the same file descriptor, so there's no real need to save it
for later use in wait_for_data_cancelable().
Based on earlier patch by Nalin Dahyabhai <nalin@redhat.com>.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 981c01665b)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case the LogConsumer is gone, the code that sends the message can
stuck forever. Wrap the code in select case, as all other loggers do.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 79039720c8)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case Tail=N parameter is requested, we need to show N lines.
It does not make sense to walk backwards one by one if we can
do it at once. Now, if Since=T is also provided, make sure we
haven't jumped too far (before T), and if we did, move forward.
The primary motivation for this was to make the code simpler.
This also fixes a tiny bug in the "since" implementation.
Before this commit:
> $ docker logs -t --tail=6000 --since="2019-03-10T03:54:25.00" $ID | head
> 2019-03-10T03:54:24.999821000Z 95981
After:
> $ docker logs -t --tail=6000 --since="2019-03-10T03:54:25.00" $ID | head
> 2019-03-10T03:54:25.000013000Z 95982
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit ff3cd167ea)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Minor code simplification.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit e8f6166791)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Clean up a deferred function call in the journal reading logic.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
(cherry picked from commit 1ada3e85bf)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are a few more places, apparently, that List operations against
Swarm exist, besides just in the List methods. This increases the max
received message size in those places.
Signed-off-by: Drew Erny <drew.erny@docker.com>
(cherry picked from commit a84a78e976)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Protect access to q.quotas map, and lock around changing nextProjectID.
Techinically, the lock in findNextProjectID() is not needed as it is
only called during initialization, but one can never be too careful.
Fixes: 52897d1c09 ("projectquota: utility class for project quota controls")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 1ac0a66a64)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This reverts commit 98fc09128b in order to
keep registry v2 schema1 handling and libtrust-key-based engine ID.
Because registry v2 schema1 was not officially deprecated and
registries are still relying on it, this patch puts its logic back.
However, registry v1 relics are not added back since v1 logic has been
removed a while ago.
This also fixes an engine upgrade issue in a swarm cluster. It was relying
on the Engine ID to be the same upon upgrade, but the mentioned commit
modified the logic to use UUID and from a different file.
Since the libtrust key is always needed to support v2 schema1 pushes,
that the old engine ID is based on the libtrust key, and that the engine ID
needs to be conserved across upgrades, adding a UUID-based engine ID logic
seems to add more complexity than it solves the problems.
Hence reverting the engine ID changes as well.
Signed-off-by: Tibor Vass <tibor@docker.com>
(cherry picked from commit f695e98cb7)
Signed-off-by: Tibor Vass <tibor@docker.com>
Before 7a7357da, archive.TarResourceRebase was being used to copy files
and folders from the container. That function splits the source path
into a dirname + basename pair to support copying a file:
if you wanted to tar `dir/file` it would tar from `dir` the file `file`
(as part of the IncludedFiles option).
However, that path splitting logic was kept for folders as well, which
resulted in weird inputs to archive.TarWithOptions:
if you wanted to tar `dir1/dir2` it would tar from `dir1` the directory
`dir2` (as part of IncludedFiles option).
Although it was weird, it worked fine until we started chrooting into
the container rootfs when doing a `docker cp` with container source set
to `/` (cf 3029e765).
The fix is to only do the path splitting logic if the source is a file.
Unfortunately, 7a7357da added support for LCOW by duplicating some of
this subtle logic. Ideally we would need to do more refactoring of the
archive codebase to properly encapsulate these behaviors behind well-
documented APIs.
This fix does not do that. Instead, it fixes the issue inline.
Signed-off-by: Tibor Vass <tibor@docker.com>
(cherry picked from commit 171538c190)
Signed-off-by: Tibor Vass <tibor@docker.com>
retries to attach to a network, it is already connected to
Fixes - https://github.com/docker/for-linux/issues/632
Signed-off-by: Arko Dasgupta <arko.dasgupta@docker.com>
(cherry picked from commit 871acb1c86)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Running a bundled aufs benchmark sometimes results in this warning:
> WARN[0001] Couldn't run auplink before unmount /tmp/aufs-tests/aufs/mnt/XXXXX error="exit status 22" storage-driver=aufs
If we take a look at what aulink utility produces on stderr, we'll see:
> auplink:proc_mnt.c:96: /tmp/aufs-tests/aufs/mnt/XXXXX: Invalid argument
and auplink exits with exit code of 22 (EINVAL).
Looking into auplink source code, what happens is it tries to find a
record in /proc/self/mounts corresponding to the mount point (by using
setmntent()/getmntent_r() glibc functions), and it fails.
Some manual testing, as well as runtime testing with lots of printf
added on mount/unmount, as well as calls to check the superblock fs
magic on mount point (as in graphdriver.Mounted(graphdriver.FsMagicAufs, target)
confirmed that this record is in fact there, but sometimes auplink
can't find it. I was also able to reproduce the same error (inability
to find a mount in /proc/self/mounts that should definitely be there)
using a small C program, mocking what `auplink` does:
```c
#include <stdio.h>
#include <err.h>
#include <mntent.h>
#include <string.h>
#include <stdlib.h>
int main(int argc, char **argv)
{
FILE *fp;
struct mntent m, *p;
char a[4096];
char buf[4096 + 1024];
int found =0, lines = 0;
if (argc != 2) {
fprintf(stderr, "Usage: %s <mountpoint>\n", argv[0]);
exit(1);
}
fp = setmntent("/proc/self/mounts", "r");
if (!fp) {
err(1, "setmntent");
}
setvbuf(fp, a, _IOLBF, sizeof(a));
while ((p = getmntent_r(fp, &m, buf, sizeof(buf)))) {
lines++;
if (!strcmp(p->mnt_dir, argv[1])) {
found++;
}
}
printf("found %d entries for %s (%d lines seen)\n", found, argv[1], lines);
return !found;
}
```
I have also wrote a few other C proggies -- one that reads
/proc/self/mounts directly, one that reads /proc/self/mountinfo instead.
They are also prone to the same occasional error.
It is not perfectly clear why this happens, but so far my best theory
is when a lot of mounts/unmounts happen in parallel with reading
contents of /proc/self/mounts, sometimes the kernel fails to provide
continuity (i.e. it skips some part of file or mixes it up in some
other way). In other words, this is a kernel bug (which is probably
hard to fix unless some other interface to get a mount entry is added).
Now, there is no real fix, and a workaround I was able to come up
with is to retry when we got EINVAL. It usually works on the second
attempt, although I've once seen it took two attempts to go through.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit ae431b10a9)
Do not use filepath.Walk() as there's no requirement to recursively
go into every directory under mnt -- a (non-recursive) list of
directories in mnt is sufficient.
With filepath.Walk(), in case some container will fail to unmount,
it'll go through the whole container filesystem which is both
excessive and useless.
This is similar to commit f1a4592297 ("devmapper.shutdown:
optimize")
While at it, raise the priority of "unmount error" message from debug
to a warning. Note we don't have to explicitly add `m` as unmount error (from
pkg/mount) will have it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 8fda12c607)
In case there are a big number of layers, so that mount data won't fit
into a single memory page (4096 bytes on most platforms, which is good
enough for about 40 layers, depending on how long graphdriver root path
is), we supply additional layers with O_REMOUNT, as described in aufs
documentation.
Problem is, the current implementation does that one layer at a time
(i.e. there is one mount syscall per each additional layer).
Optimize the code to supply as many layers as we can fit in one page
(basically reusing the same code as for the original mount).
Note, per aufs docs, "[a]t remount-time, the options are interpreted
in the given order, e.g. left to right" so we should be good.
Tested on an image with ~100 layers.
Before (35 syscalls):
> [pid 22756] 1556919088.686955 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.000504>
> [pid 22756] 1556919088.687643 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451b0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000105>
> [pid 22756] 1556919088.687851 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c451ba, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000098>
> ..... (~30 lines skipped for clarity)
> [pid 22756] 1556919088.696182 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/a86f8c9dd0ec2486293119c20b0ec026e19bbc4d51332c554f7cf05d777c9866", 0xc000c45310, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.000266>
After (2 syscalls):
> [pid 24352] 1556919361.799889 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", "aufs", 0, "br:/mnt/volume_sfo2_09/docker-au"...) = 0 <0.001717>
> [pid 24352] 1556919361.801761 mount("none", "/mnt/volume_sfo2_09/docker-aufs/aufs/mnt/8e7ba189e347a834e99eea4ed568f95b86cec809c227516afdc7c70286ff9a20", 0xc000dbecb0, MS_REMOUNT, "append:/mnt/volume_sfo2_09/docke"...) = 0 <0.001358>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit d58c434bff)
Apparently there is some kind of race in aufs kernel module code,
which leads to the errors like:
[98221.158606] aufs au_xino_create2:186:dockerd[25801]: aufs.xino create err -17
[98221.162128] aufs au_xino_set:1229:dockerd[25801]: I/O Error, failed creating xino(-17).
[98362.239085] aufs au_xino_create2:186:dockerd[6348]: aufs.xino create err -17
[98362.243860] aufs au_xino_set:1229:dockerd[6348]: I/O Error, failed creating xino(-17).
[98373.775380] aufs au_xino_create:767:dockerd[27435]: open /dev/shm/aufs.xino(-17)
[98389.015640] aufs au_xino_create2:186:dockerd[26753]: aufs.xino create err -17
[98389.018776] aufs au_xino_set:1229:dockerd[26753]: I/O Error, failed creating xino(-17).
[98424.117584] aufs au_xino_create:767:dockerd[27105]: open /dev/shm/aufs.xino(-17)
So, we have to have a lock around mount syscall.
While at it, don't call the whole Unmount() on an error path, as
it leads to bogus error from auplink flush.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 5cd62852fa)
1. Use mount.Unmount() which ignores EINVAL ("not mounted") error,
and provides better error diagnostics (so we don't have to explicitly
add target to error messages).
2. Since we're ignoring "not mounted" error, we can call
multiple unmounts without any locking -- but since "auplink flush"
is still involved and can produce an error in logs, let's keep
the check for fs being mounted (it's just a statfs so should be fast).
2. While at it, improve the "can't unmount" error message in Put().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 4beee98026)
Both mount and unmount calls are already protected by fine-grained
(per id) locks in Get()/Put() introduced in commit fc1cf1911b
("Add more locking to storage drivers"), so there's no point in
having a global lock in mount/unmount.
The only place from which unmount is called without any locking
is Cleanup() -- this is to be addressed in the next patch.
This reverts commit 824c24e680.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit f93750b2c4)