The plugin spec says that plugins can live in one of:
- /var/run/docker/plugins/<name>.sock
- /var/run/docker/plugins/<name>/<name>.sock
- /etc/docker/plugins/<name>.<json,spec>
- /etc/docker/plugins/<name>/<name>.<json,spec>
- /usr/lib/docker/plugins/<name>.<json,spec>
- /usr/lib/docker/plugins/<name>/<name>.<json,spec>
However, the plugin scanner used by the volume list API was doing a
`filepath.Walk`, which walks the entire tree under each of the
supported paths.
This means that even v2 plugins in
`/var/run/docker/plugins/<id>/<name>.sock` were being detected as v1
plugins.
When the v1 plugin loader tried to load such a plugin, it would log an
error saying it couldn't find the plugin, because the path doesn't match
one of the supported patterns... e.g. when in a subdir, the subdir name
must match the plugin name for the socket.
There is no behavior change, as the error only occurs on the `Scan()`
call, which passes names to the plugin registry when someone calls the
volume list API.
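For illustration, here is a minimal sketch of the kind of shallow scan the supported layouts call for, instead of a recursive `filepath.Walk`. The package and function names are hypothetical, not the daemon's actual scanner:

```
// Sketch only: inspect the two documented layouts rather than walking
// the whole tree under each base directory.
package plugins

import (
	"os"
	"path/filepath"
	"strings"
)

// scanPluginDir returns plugin names found under base, accepting only
// base/<name><ext> and base/<name>/<name><ext>.
func scanPluginDir(base, ext string) ([]string, error) {
	entries, err := os.ReadDir(base)
	if err != nil {
		if os.IsNotExist(err) {
			return nil, nil
		}
		return nil, err
	}
	var names []string
	for _, e := range entries {
		if !e.IsDir() {
			// base/<name><ext>
			if strings.HasSuffix(e.Name(), ext) {
				names = append(names, strings.TrimSuffix(e.Name(), ext))
			}
			continue
		}
		// base/<name>/<name><ext>: the subdir name must match the plugin
		// name, so a v2 plugin dir named by ID is not picked up here.
		if _, err := os.Stat(filepath.Join(base, e.Name(), e.Name()+ext)); err == nil {
			names = append(names, e.Name())
		}
	}
	return names, nil
}
```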
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Previously, adding files to vendor/ without adding them to vendor.conf would
not fail the validation.
Also, make the indentation consistent and use tabs.
Signed-off-by: Daniel Nephin <dnephin@gmail.com>
Attachable networks are networks created on the cluster which can then
be attached to by non-swarm containers. These networks are lazily
created on the node that wants to attach to that network.
When no container is currently attached to one of these networks on a
node and multiple containers that want that network are then started
concurrently, a race can occur in the network attachment where
essentially we try to attach the same network to the node twice.
To easily reproduce this issue you must use a multi-node cluster with a
worker node that has lots of CPUs (I used a 36 CPU node).
Repro steps:
1. On manager, `docker network create -d overlay --attachable test`
2. On worker, `docker create --restart=always --network test busybox
top`, many times... 200 is a good number (but not much more due to
subnet size restrictions)
3. Restart the daemon
When the daemon restarts, it will attempt to start all those containers
simultaneously. Note that you could try to do this yourself over the API,
but it's harder to trigger due to the added latency from going over
the API.
The error produced happens when the daemon tries to start the container
upon allocating the network resources:
```
attaching to network failed, make sure your network options are correct and check manager logs: context deadline exceeded
```
What happens here is the worker makes a network attachment request to
the manager. This is an async call which, in the happy case, causes a
task to be placed on the node; the worker waits for that task in order
to get the network configuration.
In the case of this race, the error occurs on the manager like this:
```
task allocation failure" error="failed during network allocation for task n7bwwwbymj2o2h9asqkza8gom: failed to allocate network IP for task n7bwwwbymj2o2h9asqkza8gom network rj4szie2zfauqnpgh4eri1yue: could not find an available IP" module=node node.id=u3489c490fx1df8onlyfo1v6e
```
The task is not created and the worker times out waiting for the task.
---
The mitigation for this is to make sure that only one attachment request
is in flight for a given network at a time *when the network doesn't
already exist on the node*. If the network already exists on the node
there is no need for synchronization because the network is already
allocated and on the node so there is no need to request it from the
manager.
This basically comes down to a race with `Find(network) ||
Create(network)` without any sort of synchronization.
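A minimal sketch of that mitigation, with hypothetical names and signatures (the real change lives in the daemon's cluster attachment path):

```
// Sketch only: serialize attachment requests per network name so that
// concurrent container starts trigger a single manager round-trip.
package network

import "sync"

type attacher struct {
	mu       sync.Mutex
	inFlight map[string]chan struct{} // network name -> closed when the request finishes
}

// attach asks the manager for the network at most once at a time.
// Error propagation to waiters is omitted for brevity.
func (a *attacher) attach(name string, exists func(string) bool, request func(string) error) error {
	if exists(name) {
		// Network is already allocated on this node; no manager call and
		// no synchronization needed.
		return nil
	}

	a.mu.Lock()
	if done, ok := a.inFlight[name]; ok {
		// Another goroutine is already requesting this network; wait for it.
		a.mu.Unlock()
		<-done
		return nil
	}
	done := make(chan struct{})
	if a.inFlight == nil {
		a.inFlight = make(map[string]chan struct{})
	}
	a.inFlight[name] = done
	a.mu.Unlock()

	defer func() {
		a.mu.Lock()
		delete(a.inFlight, name)
		a.mu.Unlock()
		close(done)
	}()

	// Only one goroutine reaches this point per network, so the
	// Find-or-Create against the manager is no longer racy.
	return request(name)
}
```

The existence check stays outside the lock on purpose: the common case where the network is already on the node remains lock-free, and synchronization only kicks in for the first attachment.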
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
This fix tries to address the issue raised in 36139, where
ExitCode and PID do not show up in Task.Status.ContainerStatus.
The issue was caused by `json:",omitempty"` on PID and ExitCode,
which treats 0 as empty and omits it from the serialized output.
This is confusing, as ExitCode 0 does have a meaning.
This fix removes `json:",omitempty"` from ExitCode and PID,
but changes ContainerStatus to a pointer so that ContainerStatus
does not show up at all when it has no content. If ContainerStatus
does have content, then ExitCode and PID will show up (even if
they are 0).
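The shape of the change, sketched (field names follow the Swarm API types, but treat this as an illustration rather than the exact diff):

```
package swarm

// Before: omitempty dropped zero values, so a successful exit (ExitCode 0)
// or PID 0 was indistinguishable from "no status at all".
type containerStatusBefore struct {
	ContainerID string `json:",omitempty"`
	PID         int    `json:",omitempty"`
	ExitCode    int    `json:",omitempty"`
}

// After: ExitCode and PID are always serialized when ContainerStatus is
// present, and the parent struct holds a pointer so the whole block is
// omitted only when there is genuinely nothing to report.
type ContainerStatus struct {
	ContainerID string
	PID         int
	ExitCode    int
}

type TaskStatus struct {
	// ...other fields elided...
	ContainerStatus *ContainerStatus `json:",omitempty"`
}
```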
This fix fixes 36139.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
ReleaseRWLayer can and should only be called once (unless it returns
an error), but might be called twice in case of a failure from
`system.EnsureRemoveAll(container.Root)`. This results in the
following error:
> Error response from daemon: driver "XXX" failed to remove root filesystem for YYY: layer not retained
The obvious fix is to set container.RWLayer to nil as soon as
ReleaseRWLayer() succeeds.
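In outline, the cleanup ordering then looks something like this. The wrapper function is illustrative; `ReleaseRWLayer`, `container.RWLayer`, and `system.EnsureRemoveAll` come from the text above:

```
package daemon

import (
	"github.com/docker/docker/container"
	"github.com/docker/docker/layer"
	"github.com/docker/docker/pkg/system"
)

// Sketch only: release the RW layer once, drop the reference, then remove
// the container root.
func cleanupContainerRoot(ctr *container.Container, layerStore layer.Store) error {
	if ctr.RWLayer != nil {
		if _, err := layerStore.ReleaseRWLayer(ctr.RWLayer); err != nil {
			return err
		}
		// Clear the reference as soon as the release succeeds, so a retried
		// cleanup (e.g. after EnsureRemoveAll fails) cannot release the layer
		// a second time and hit "layer not retained".
		ctr.RWLayer = nil
	}
	return system.EnsureRemoveAll(ctr.Root)
}
```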
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Moves builder/shell_parser into its own subpackage at builder/shell since it
has no dependencies other than the standard library. This will make it
much easier to vendor for downstream libraries, without pulling all the
dependencies of builder/.
Fixes #36154
Signed-off-by: Matt Rickard <mrick@google.com>
This fix tries to address the issue raised in 36142 where
there are discrepancies between Swarm API and swagger.yaml.
This fix adds two recently added states, `REMOVE` and `ORPHANED`, to TaskState.
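For reference, an abridged, from-memory sketch of the Go constants the swagger enum needs to mirror (see api/types/swarm; verify names and values against the source):

```
package swarm

// TaskState represents the state of a task.
type TaskState string

const (
	// ...earlier states ("new", "pending", "running", ...) elided...

	// TaskStateRemove corresponds to the REMOVE state.
	TaskStateRemove TaskState = "remove"
	// TaskStateOrphaned corresponds to the ORPHANED state, used for tasks
	// whose node has been unreachable for too long.
	TaskStateOrphaned TaskState = "orphaned"
)
```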
This fix fixes 36142.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>