A common pattern in libnetwork is to delete an object using
`DeleteAtomic`, i.e. to check the optimistic lock, but wrap the call in
a retry loop that refreshes the data and the version index used by the
optimistic lock.
This commit introduces a new `Delete` method to delete without
checking the optimistic lock. It focuses only on the few places where
it's obvious the calling code doesn't rely on the side-effects of the
retry loop (i.e. refreshing the object to be deleted).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Prior to release 25.0.0, the bridge in an internal network was assigned
an IP address - making the internal network accessible from the host,
giving containers on the network access to anything listening on the
bridge's address (or INADDR_ANY on the host).
This change restores that behaviour. It does not restore the default
route that was configured in the container, because packets sent outside
the internal network's subnet have always been dropped. So, a 'connect()'
to an address outside the subnet will still fail fast.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Since 964ab7158c, we explicitly set the bridge MTU if it was specified.
Unfortunately, kernels older than v4.17 have a check that prevents
manually setting the MTU to anything greater than 1500 if no link is
attached to the bridge, which is how we do things: create the bridge,
set its MTU and, later on, attach veths to it.
Relevant kernel commit: 804b854d37
As we still have to support CentOS/RHEL 7 (and their old v3.10 kernels)
for a few more months, we need to ignore EINVAL if the MTU is > 1500
(but <= 65535).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Containers attached to an 'internal' bridge network are unable to
communicate when the host is running firewalld.
Non-internal bridges are added to a trusted 'docker' firewalld zone, but
internal bridges were not.
DOCKER-ISOLATION iptables rules are still configured for an internal
network; they block traffic to/from addresses outside the network's subnet.
Signed-off-by: Rob Murray <rob.murray@docker.com>
The bbolt library wants exclusive access to the boltdb file and uses
file locking to assure that is the case. The controller and each network
driver that needs persistent storage instantiates its own unique
datastore instance, backed by the same boltdb file. The boltdb kvstore
implementation works around multiple access to the same boltdb file by
aggressively closing the boltdb file between each transaction. This is
very inefficient. Have the controller pass its datastore instance into
the drivers and enable the PersistConnection option to disable closing
the boltdb between transactions.
Set data-dir in unit tests which instantiate libnetwork controllers so
they don't hang trying to lock the default boltdb database file.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Some configuration in a container depends on whether it has support for
IPv6 (including default entries for '::1' etc in '/etc/hosts').
Before this change, the container's support for IPv6 was determined by
whether it was connected to any IPv6-enabled networks. But that can
change over time; it isn't a property of the container itself.
So, instead, detect IPv6 support by looking for '::1' on the container's
loopback interface. It will not be present if the kernel does not have
IPv6 support, or the user has disabled it in new namespaces by other
means.
Once IPv6 support has been determined for the container, its '/etc/hosts'
is re-generated accordingly.
The daemon no longer disables IPv6 on all interfaces during initialisation.
It now disables IPv6 only for interfaces that have not been assigned an
IPv6 address. (But, even if IPv6 is disabled for the container using the
sysctl 'net.ipv6.conf.all.disable_ipv6=1', interfaces connected to IPv6
networks still get IPv6 addresses that appear in the internal DNS. There's
more to-do!)
Signed-off-by: Rob Murray <rob.murray@docker.com>
A check was added to the bridge driver to detect when it was called to
create the default bridge network while a stale default bridge already
existed. In that case, the bridge driver deleted the stale network
before re-creating it. This check was introduced in docker/libnetwork@6b158eac6a
to fix an issue related to the then newly introduced live-restore.
However, since commit docker/docker@ecffb6d58c,
the daemon doesn't even try to create default networks if there are
active sandboxes (i.e. due to live-restore).
Thus, it is now impossible for the default bridge network to be stale and
to exist when the driver's CreateNetwork() method is called. As such,
the check introduced in the first commit mentioned above is dead code
and can be safely removed.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
When mapping a port with the userland-proxy enabled, the daemon would
perform an "exec.LookPath" for every mapped port (which, in case of
a range of ports, would be for every port in the range).
This was inefficient (looking up the binary for each port), inconsistent
(when running in rootless-mode, the binary was looked up once), and
inconvenient, because a missing binary or a misconfigured userland-proxy-path
would not be detected at daemon startup, and would not produce an error until
starting the container;
docker run -d -P nginx:alpine
4f7b6589a1680f883d98d03db12203973387f9061e7a963331776170e4414194
docker: Error response from daemon: driver failed programming external connectivity on endpoint romantic_wiles (7cfdc361821f75cbc665564cf49856cf216a5b09046d3c22d5b9988836ee088d): fork/exec docker-proxy: no such file or directory.
However, the container would still be created (but invalid);
docker ps -a
CONTAINER ID   IMAGE          COMMAND                  CREATED          STATUS    PORTS     NAMES
869f41d7e94f   nginx:alpine   "/docker-entrypoint.…"   10 seconds ago   Created             romantic_wiles
This patch changes how the userland-proxy is configured;
- The path of the userland-proxy is now looked up / configured at daemon
startup; this is similar to how the proxy is configured in rootless-mode.
- A warning is logged when failing to lookup the binary.
- If the daemon is configured with "userland-proxy" enabled and the binary
cannot be found, an error is produced, and the daemon will refuse to start.
- The "proxyPath" argument for newProxyCommand() (in libnetwork/portmapper)
is now required to be set. It no longer looks up the executable, and
produces an error if no path was provided. While this change was not
required, it makes the daemon config the canonical source of truth, instead
of logic spread across multiple locations.
Some of this logic is a change of behavior, but these changes were made with
the assumption that we don't want to support;
- installing the userland proxy _after_ the daemon was started
- moving the userland proxy (or installing a proxy with a higher
preference in PATH)
With this patch:
Validating the config produces an error if the binary is not found:
dockerd --validate
WARN[2023-12-29T11:36:39.748699591Z] failed to lookup default userland-proxy binary error="exec: \"docker-proxy\": executable file not found in $PATH"
userland-proxy is enabled, but userland-proxy-path is not set
Disabling userland-proxy prints a warning, but validates as "OK":
dockerd --userland-proxy=false --validate
WARN[2023-12-29T11:38:30.752523879Z] failed to lookup default userland-proxy binary error="exec: \"docker-proxy\": executable file not found in $PATH"
configuration OK
Specifying a non-absolute path produces an error:
dockerd --userland-proxy-path=docker-proxy --validate
invalid userland-proxy-path: must be an absolute path: docker-proxy
Before this patch, we would not validate this path, which would allow the daemon
to start, but fail to map a port;
docker run -d -P nginx:alpine
4f7b6589a1680f883d98d03db12203973387f9061e7a963331776170e4414194
docker: Error response from daemon: driver failed programming external connectivity on endpoint romantic_wiles (7cfdc361821f75cbc665564cf49856cf216a5b09046d3c22d5b9988836ee088d): fork/exec docker-proxy: no such file or directory.
Specifying an invalid userland-proxy-path produces an error as well:
dockerd --userland-proxy-path=/usr/local/bin/no-such-binary --validate
userland-proxy-path is invalid: stat /usr/local/bin/no-such-binary: no such file or directory
mkdir -p /usr/local/bin/not-a-file
dockerd --userland-proxy-path=/usr/local/bin/not-a-file --validate
userland-proxy-path is invalid: exec: "/usr/local/bin/not-a-file": is a directory
touch /usr/local/bin/not-an-executable
dockerd --userland-proxy-path=/usr/local/bin/not-an-executable --validate
userland-proxy-path is invalid: exec: "/usr/local/bin/not-an-executable": permission denied
Same when using the daemon.json config-file;
echo '{"userland-proxy-path":"no-such-binary"}' > /etc/docker/daemon.json
dockerd --validate
unable to configure the Docker daemon with file /etc/docker/daemon.json: merged configuration validation from file and command line flags failed: invalid userland-proxy-path: must be an absolute path: no-such-binary
dockerd --userland-proxy-path=hello --validate
unable to configure the Docker daemon with file /etc/docker/daemon.json: the following directives are specified both as a flag and in the configuration file: userland-proxy-path: (from flag: hello, from file: /usr/local/bin/docker-proxy)
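The path checks shown in the examples above could be approximated as follows. This is a sketch only: `validateProxyPath` is a hypothetical helper, and the sequence of checks (absolute path, `os.Stat`, then `exec.LookPath`) is inferred from the error messages rather than taken from the daemon's code.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// validateProxyPath is a hypothetical helper mirroring the validation
// errors shown in the commit message.
func validateProxyPath(path string) error {
	if !filepath.IsAbs(path) {
		return fmt.Errorf("invalid userland-proxy-path: must be an absolute path: %s", path)
	}
	// A missing file is reported by os.Stat.
	if _, err := os.Stat(path); err != nil {
		return fmt.Errorf("userland-proxy-path is invalid: %w", err)
	}
	// For a path containing a separator, exec.LookPath skips the PATH search
	// and only checks that the file is executable and not a directory.
	if _, err := exec.LookPath(path); err != nil {
		return fmt.Errorf("userland-proxy-path is invalid: %w", err)
	}
	return nil
}

func main() {
	for _, p := range []string{"docker-proxy", "/usr/local/bin/no-such-binary"} {
		fmt.Printf("%s: %v\n", p, validateProxyPath(p))
	}
}
```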
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
If IPv6 is enabled for a bridge network, by the time configuration
is applied, the bridge will always have an address. Assert that, by
raising an error when the configuration is validated.
Use that to simplify the logic used to calculate which addresses
should be assigned to a bridge. Also remove a redundant check in
setupGatewayIPv6() and the error associated with it.
Fix unit tests that enabled IPv6, but didn't supply an IPv6 IPAM
address/pool. Before this change, these tests passed but silently
left the bridge without an IPv6 address.
(The daemon already ensured there was an IPv6 address, this change
does not add a new restriction on config at that level.)
Signed-off-by: Rob Murray <rob.murray@docker.com>
Some checks in 'networkConfiguration.Validate()' were not running as
expected, they'd always pass - because 'parseNetworkOptions()' called
it before 'config.processIPAM()' had added IP addresses and gateways.
Signed-off-by: Rob Murray <rob.murray@docker.com>
Calculate the IPv6 addresses needed on a bridge, then reconcile them
with the addresses on an existing bridge by deleting then adding as
required.
(Previously, required addresses were added one-by-one, then unwanted
addresses were removed. This meant the daemon failed to start if, for
example, an existing bridge had address '2000:db8::/64' and the config
was changed to '2000:db8::/80'.)
IPv6 addresses are now calculated and applied in one go, so there's no
need for setupVerifyAndReconcile() to check the set of IPv6 addresses on
the bridge. And, it was guarded by !config.InhibitIPv4, which can't have
been right. So, removed its IPv6 parts, and added IPv4 to its name.
Link local addresses, the example given in the original ticket, are now
released when containers are stopped. Not releasing them meant that
when using an LL subnet on the default bridge, no container could be
started after a container was stopped (because the calculated address
could not be re-allocated). In non-default bridge networks using an
LL subnet, addresses leaked.
Linux always uses the standard 'fe80::/64' LL network. So, if a bridge
is configured with an LL subnet prefix that overlaps with it, a config
error is reported. Non-overlapping LL subnet prefixes are allowed.
Signed-off-by: Rob Murray <rob.murray@docker.com>
This is purely cosmetic - if a non-default MTU is configured, the bridge
will have the default MTU=1500 until a container's 'veth' is connected
and an MTU is set on the veth. That's disconcerting, as it looks like the
config has been ignored - so, set the bridge's MTU explicitly.
Fixes #37937
Signed-off-by: Rob Murray <rob.murray@docker.com>
Add a new `com.docker.network.host_ipv6` bridge option to complement
the existing `com.docker.network.host_ipv4` option. When set to an
IPv6 address, this causes the bridge to insert `SNAT` rules instead of
`MASQUERADE` rules (assuming `ip6tables` is enabled). `SNAT` makes it
possible for users to control the source IP address used for outgoing
connections.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Rename all variables/fields/map keys associated with the
`com.docker.network.host_ipv4` option from `HostIP` to `HostIPv4`.
Rationale:
* This makes the variable/field name consistent with the option
name.
* This makes the code more readable because it is clear that the
variable/field does not hold an IPv6 address. This will hopefully
avoid bugs like <https://github.com/moby/moby/issues/46445> in the
future.
* If IPv6 SNAT support is ever added, the names will be symmetric.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Rather than pass an `iptables.IPVersion` value alongside every
`iptRule` parameter, embed the IP version in the `iptRule` struct.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
That field was only used to pass `-t nat` for NAT rules. Now `-t
<tableName>` (where `<tableName>` is one of the `iptables.Table`
values) is always passed, eliminating the need for `preArgs`.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Pass the entire `*networkConfiguration` struct to
`setupIPTablesInternal` to simplify the function signature and improve
code readability.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
The github.com/containerd/containerd/log package was moved to a separate
module, which will also be used by upcoming (patch) releases of containerd.
This patch moves our own uses of the package to use the new module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
So far, internal networks were only isolated from the host by iptables
DROP rules. As a consequence, outbound connections from containers would
time out instead of being "rejected" through an immediate ICMP dest/port
unreachable, a TCP RST or a failing `connect` syscall.
This was visible when internal containers were trying to resolve a
domain that doesn't match any container on the same network (be it a truly
"external" domain, or a container that doesn't exist/is dead). In that
case, the embedded resolver would try to forward DNS queries for the
different values of resolv.conf `search` option, making DNS resolution
slow to return an error, and the slowness being exacerbated by some libc
implementations.
This change makes the `connect` syscall return ENETUNREACH, and thus
solves the broader issue of failing fast when external connections are
attempted.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
None of the code using this function was setting the value, so let's
simplify and remove the argument.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Before this commit, setting the `com.docker.network.host_ipv4` bridge
option when `enable_ipv6` is true and the experimental `ip6tables`
option is enabled would cause Docker to fail to create the network:
> failed to create network `test-network`: Error response from daemon:
> Failed to Setup IP tables: Unable to enable NAT rule: (iptables
> failed: `ip6tables --wait -t nat -I POSTROUTING -s fd01::/64 ! -o
> br-test -j SNAT --to-source 192.168.0.2`: ip6tables
> v1.8.7 (nf_tables): Bad IP address "192.168.0.2"
>
> Try `ip6tables -h` or `ip6tables --help` for more information.
> (exit status 2))
Fix this error by passing nil -- not the `host_ipv4` address -- when
creating the IPv6 rules.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Interface-matching should generally happen on the receiver side, and this
function was only used in a single location, and passed as argument to
Driver.CreateEndpoint, which already matches the interface by accepting
a driverapi.InterfaceInfo.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
InvalidParameter is now compatible with errdefs.InvalidParameter. Thus,
these errors will now return a 400 status code instead of a 500.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
This function has _four_ output variables of the same type, and several
defer statements that checked the error returned (but using the `err`
variable).
This patch names the return variables to make it clearer what's being
returned, and renames the error-return to `retErr` to make it clearer
where we're dealing with the returned error (and not any local err), to
prevent accidentally shadowing.
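The naming convention can be illustrated with a small, hypothetical function. The names `setup`, `hostIf` and `containerIf` are invented for the example and are not taken from the patched code.

```go
package main

import (
	"errors"
	"fmt"
)

// setup returns two values of the same type plus an error. Naming the
// results documents which is which, and the deferred cleanup inspects
// retErr, the value actually returned, rather than a local err.
func setup(failSecond bool) (hostIf, containerIf string, retErr error) {
	hostIf = "veth-host"
	defer func() {
		if retErr != nil {
			// Roll back the partial result on any error return.
			hostIf = ""
		}
	}()

	if failSecond {
		// A local err cannot be confused with the returned error.
		err := errors.New("creating container interface failed")
		return hostIf, "", err
	}
	return hostIf, "veth-container", nil
}

func main() {
	fmt.Println(setup(false))
	fmt.Println(setup(true))
}
```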
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This makes it easier to spot if code is only used on Linux. Note that "all of"
the bridge driver is Linux-only.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Collect a list of all the links we successfully enabled (if any), and
use a single defer to disable them.
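A sketch of the pattern, under the assumption that the deferred cleanup only runs when the function returns an error. The `link` type, `enableAll` and the `failAt` parameter are hypothetical and exist only to make the example self-contained.

```go
package main

import "fmt"

// link is a toy stand-in for a container link.
type link struct {
	name    string
	enabled bool
}

// enableAll enables links in order, remembering each success. A single
// deferred function disables everything that was enabled if an error is
// returned. failAt simulates a failure at the given index (-1 for none).
func enableAll(links []*link, failAt int) (retErr error) {
	var enabled []*link
	defer func() {
		if retErr != nil {
			for _, l := range enabled {
				l.enabled = false
			}
		}
	}()

	for i, l := range links {
		if i == failAt {
			return fmt.Errorf("enabling link %s failed", l.name)
		}
		l.enabled = true
		enabled = append(enabled, l)
	}
	return nil
}

func main() {
	links := []*link{{name: "a"}, {name: "b"}, {name: "c"}}
	fmt.Println(enableAll(links, 2), links[0].enabled, links[1].enabled)
	fmt.Println(enableAll(links, -1), links[0].enabled, links[2].enabled)
}
```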
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>