This is purely cosmetic - if a non-default MTU is configured, the bridge
will have the default MTU=1500 until a container's 'veth' is connected
and an MTU is set on the veth. That's a disconcerting, it looks like the
config has been ignored - so, set the bridge's MTU explicitly.
Fixes#37937
Signed-off-by: Rob Murray <rob.murray@docker.com>
I was trying to find out why `docker info` was sometimes slow so
plumbing a context through to propagate trace data through.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
The forwarding database (fdb) of Linux VXLAN links are restricted to
entries with destination VXLAN tunnel endpoint (VTEP) address of a
single address family. Which address family is permitted is set when the
link is created and cannot be modified. The overlay network driver
creates VXLAN links such that the kernel only allows fdb entries to be
created with IPv4 destination VTEP addresses. If the Swarm is configured
with IPv6 advertise addresses, creating fdb entries for remote peers
fails with EAFNOSUPPORT (address family not supported by protocol).
Make overlay networks functional over IPv6 transport by configuring the
VXLAN links for IPv6 VTEPs if the local node's advertise address is an
IPv6 address. Make encrypted overlay networks secure over IPv6 transport
by applying the iptables rules to the ip6tables when appropriate.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Early return if the iface or its address is nil to make the whole
function slightly easier to read.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
I am finally convinced that, given two netip.Prefix values a and b, the
expression
a.Contains(b.Addr()) || b.Contains(a.Addr())
is functionally equivalent to
a.Overlaps(b)
The (netip.Prefix).Contains method works by masking the address with the
prefix's mask and testing whether the remaining most-significant bits
are equal to the same bits in the prefix. The (netip.Prefix).Overlaps
method works by masking the longer prefix to the length of the shorter
prefix and testing whether the remaining most-significant bits are
equal. This is equivalent to
shorterPrefix.Contains(longerPrefix.Addr()), therefore applying Contains
symmetrically to two prefixes will always yield the same result as
applying Overlaps to the two prefixes in either order.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Add a new `com.docker.network.host_ipv6` bridge option to compliment
the existing `com.docker.network.host_ipv4` option. When set to an
IPv6 address, this causes the bridge to insert `SNAT` rules instead of
`MASQUERADE` rules (assuming `ip6tables` is enabled). `SNAT` makes it
possible for users to control the source IP address used for outgoing
connections.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
While there is nothing inherently wrong with goto statements, their use
here is not helping with readability.
Signed-off-by: Cory Snider <csnider@mirantis.com>
ds.cache is never nil so the uncached code paths are unreachable in
practice. And given how many KVObject deep-copy implementations shallow
copy pointers and other reference-typed values, there is the distinct
possibility that disabling the datastore cache could break things.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The datastore cache only uses the reference to its datastore to get a
reference to the backing store. Modify the cache to take the backing
store reference directly so that methods on the datastore can't get
called, as that might result in infinite recursion between datastore and
cache methods.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Inline the tortured logic for deciding when to skip updating the svc
records to give us a fighting chance at deciphering the logic behind the
logic and spotting logic bugs.
Update the service records synchronously. The only potential for issues
is if this change introduces deadlocks, which should be fixed by
restrucuting the mutexes rather than papering over the issue with
sketchy hacks like deferring the operation to a goroutine.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Its only remaining purpose is to elide removing the endpoint from the
service records if it was not previously added. Deleting the service
records is an idempotent operation so it is harmless to delete service
records which do not exist.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The service db entry for each network is deleted by
(*Controller).cleanupServiceDiscovery() when the network is deleted.
There is no need to also eagerly delete it whenever the network's
endpoint count drops to zero.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The logic to rename an endpoint includes code which would synchronize
the renamed service records to peers through the distributed datastore.
It would trigger the remote peers to pick up the rename by touching a
datastore object which remote peers would have subscribed to events on.
The code also asserts that the local peer is subscribed to updates on
the network associated with the endpoint, presumably as a proxy for
asserting that the remote peers would also be subscribed.
https://github.com/moby/libnetwork/pull/712
Libnetwork no longer has support for distributed datastores or
subscribing to datastore object updates, so this logic can be deleted.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The meaning of the (*Controller).isDistributedControl() method is not
immediately clear from the name, and it does not have any doc comment.
It returns true if and only if the controller is neither a manager node
nor an agent node -- that is, if the daemon is _not_ participating in a
Swarm cluster. The method name likely comes from the old abandoned
datastore-as-IPC control plane architecture for libnetwork. Refactor
c.isDistributedControl() -> !c.isSwarmNode()
to make it easier to understand code which consumes the method.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Rename all variables/fields/map keys associated with the
`com.docker.network.host_ipv4` option from `HostIP` to `HostIPv4`.
Rationale:
* This makes the variable/field name consistent with the option
name.
* This makes the code more readable because it is clear that the
variable/field does not hold an IPv6 address. This will hopefully
avoid bugs like <https://github.com/moby/moby/issues/46445> in the
future.
* If IPv6 SNAT support is ever added, the names will be symmetric.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Rather than pass an `iptables.IPVersion` value alongside every
`iptRule` parameter, embed the IP version in the `iptRule` struct.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
That field was only used to pass `-t nat` for NAT rules. Now `-t
<tableName>` (where `<tableName>` is one of the `iptables.Table`
values) is always passed, eliminating the need for `preArgs`.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
Pass the entire `*networkConfiguration` struct to
`setupIPTablesInternal` to simplify the function signature and improve
code readability.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
On modern kernels this is an alias; however newer code has preferred
ctstate while older code has preferred the deprecated 'state' name.
Prefer the newer name for uniformity in the rules libnetwork creates,
and because some implementations/distributions of the xtables userland
tools may not support the legacy alias.
Signed-off-by: Bjorn Neergaard <bjorn.neergaard@docker.com>
The github.com/containerd/containerd/log package was moved to a separate
module, which will also be used by upcoming (patch) releases of containerd.
This patch moves our own uses of the package to use the new module.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
So far, internal networks were only isolated from the host by iptables
DROP rules. As a consequence, outbound connections from containers would
timeout instead of being "rejected" through an immediate ICMP dest/port
unreachable, a TCP RST or a failing `connect` syscall.
This was visible when internal containers were trying to resolve a
domain that don't match any container on the same network (be it a truly
"external" domain, or a container that don't exist/is dead). In that
case, the embedded resolver would try to forward DNS queries for the
different values of resolv.conf `search` option, making DNS resolution
slow to return an error, and the slowness being exacerbated by some libc
implementations.
This change makes `connect` syscall to return ENETUNREACH, and thus
solves the broader issue of failing fast when external connections are
attempted.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
This change creates a few OTEL spans and plumb context through the DNS
resolver and DNS backends (ie. Sandbox and Network). This should help
better understand how much lock contention impacts performance, and
help debug issues related to DNS queries (we basically have no
visibility into what's happening here right now).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
- use named return variables to make the function more self-describing
- rename variable for readability
- slightly optimize slice initialization, and keep linters happy
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It was only used in a single place, and it was defined far away from
where it was used.
Move the code inline, so that it's clear at a glance what it's doing.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The store field is only mutated by Controller.initStores(), which is
only called inside the cosntructor (libnetwork.New), so there should be
no need to protect the field with a mutex in non-exported functions.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Controller.getNetworkFromStore() already returns a ErrNoSuchNetwork if
no network was found, so we don't need to convert the existing error.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Simplify the lock/unlock cycle, and make the "lookupAlias" branch
more similar to the non-lookupAlias variant.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Skip faster when we're looking for aliases. Also check for the list
of aliases to be empty, not just `nil` (although in practice it should
be equivalent).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- use `nameOrAlias` for the name (or alias) to resolve
- use `lookupAlias` to indicate what the intent is; this function
is either looking up aliases or "regular" names. Ideally we would
split the function, but let's keep that for a future exercise.
- name the `ipv6Miss` output variable. The "ipv6 miss" logic is rather
confusing, and should probably be revisited, but let's start with
giving the variable a name to make it more apparent what it is.
- use `nw` for networks, which is the more common local name
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- remove some intermediate vars, or move them closer to where they're used.
- ResolveService: use strings.SplitN to limit number of elements. This
code is only used to validate the input, results are not used.
- ResolveService: return early instead of breaking the loop. This makes
it clearer from the code that were not returning anything (nil, nil).
- Controller.sandboxCleanup(): rename a var, and slight refactor of
error-handling.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function was used to check if the network is a multi-host, swarm-scoped
network. Part of this check involved a check whether the cluster-agent was
present.
In all places where this function was used, the next step after checking if
the network was "cluster eligible", was to get the agent, and (again) check
if it was not nil.
This patch rewrites the isClusterEligible utility into a clusterAgent utility,
which both checks if the network is cluster-eligible, and returns the agent
(if set). For convenience, an "ok" bool is added, which callers can use to
return early (although just checking for nilness would likely have been
sufficient).
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This removes redundant nil-checks in Endpoint.deleteServiceInfoFromCluster
and Endpoint.addServiceInfoToCluster.
These functions return early if the network is not ["cluster eligible"][1],
and the function used for that (`Network.isClusterEligible`) requires the
[agent to not be `nil`][2].
This check moved around a few times ([3][3], [4][4]), but was originally
added in [libnetwork 1570][5] which, among others, tried to avoid a nil-pointer
exception reported in [moby 28712][6], which accessed the `Controller.agent`
[without locking][7]. That issue was addressed by adding locks, adding a
`Controller.getAgent` accessor, and updating deleteServiceInfoFromCluster
to use a local var. It also sprinkled this `nil` check to be on the safe
side, but as `Network.isClusterEligible` already checks for the agent
to not be `nil`, this should not be redundant.
[1]: 5b53ddfcdd/libnetwork/agent.go (L529-L534)
[2]: 5b53ddfcdd/libnetwork/agent.go (L688-L696)
[3]: f2307265c7
[4]: 6426d1e66f
[5]: 8dcf9960aa
[6]: https://github.com/moby/moby/issues/28712
[7]: 75fd88ba89/vendor/github.com/docker/libnetwork/agent.go (L452)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It's still not "great", but implement a `newInterface()` constructor
to create a new Interface instance, instead of creating a partial
instance and applying "options" after the fact.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
We're only using the results if the interface doesn't have an address
yet, so skip this step if we don't use it.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Flatten some nested "if"-statements, and improve error.
Errors returned by this function are not handled, and only logged, so
make them more informative if debugging is needed.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
They were not consistently used, and the locations where they were
used were already "setters", so we may as well inline the code.
Also updating Namespace.Restore to keep the lock slightly longer,
instead of locking/unlocking for each property individually, although
we should consider to keep the long for the duration of the whole
function to make it more atomic.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Make the mutex internal to the Namespace; locking/unlocking should not
be done externally, and this makes it easier to see where it's used.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Interface.Remove() was directly accessing Namespace "internals", such
as locking/unlocking. Move the code from Interface.Remove() into the
Namespace instead.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
None of the code using this function was setting the value, so let's
simplify and remove the argument.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Fixes#18864, #20648, #33561, #40901.
[This GH comment][1] makes clear network name uniqueness has never been
enforced due to the eventually consistent nature of Classic Swarm
datastores:
> there is no guaranteed way to check for duplicates across a cluster of
> docker hosts.
And this is further confirmed by other comments made by @mrjana in that
same issue, eg. [this one][2]:
> we want to adopt a schema which can pave the way in the future for a
> completely decentralized cluster of docker hosts (if scalability is
> needed).
This decentralized model is what Classic Swarm was trying to be. It's
been superseded since then by Docker Swarm, which has a centralized
control plane.
To circumvent this drawback, the `NetworkCreate` endpoint accepts a
`CheckDuplicate` flag. However it's not perfectly reliable as it won't
catch concurrent requests.
Due to this design decision, API clients like Compose have to implement
workarounds to make sure names are really unique (eg.
docker/compose#9585). And the daemon itself has seen a string of issues
due to that decision, including some that aren't fixed to this day (for
instance moby/moby#40901):
> The problem is, that if you specify a network for a container using
> the ID, it will add that network to the container but it will then
> change it to reference the network by using the name.
To summarize, this "feature" is broken, has no practical use and is a
source of pain for Docker users and API consumers. So let's just remove
it for _all_ API versions.
[1]: https://github.com/moby/moby/issues/18864#issuecomment-167201414
[2]: https://github.com/moby/moby/issues/18864#issuecomment-167202589
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Before this commit, setting the `com.docker.network.host_ipv4` bridge
option when `enable_ipv6` is true and the experimental `ip6tables`
option is enabled would cause Docker to fail to create the network:
> failed to create network `test-network`: Error response from daemon:
> Failed to Setup IP tables: Unable to enable NAT rule: (iptables
> failed: `ip6tables --wait -t nat -I POSTROUTING -s fd01::/64 ! -o
> br-test -j SNAT --to-source 192.168.0.2`: ip6tables
> v1.8.7 (nf_tables): Bad IP address "192.168.0.2"
>
> Try `ip6tables -h` or `ip6tables --help` for more information.
> (exit status 2))
Fix this error by passing nil -- not the `host_ipv4` address -- when
creating the IPv6 rules.
Signed-off-by: Richard Hansen <rhansen@rhansen.org>
- store linkIndex in a local variable so that it can be reused
- remove / rename some intermediate vars that shadowed existing declaration
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This argument was originally added in libnetwork:
03f440667f
At the time, this argument was conditional, but currently it's always set
to "true", so let's remove it.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The code ignores these errors, but will unconditionally print a warning;
> If the kernel deletion fails for the neighbor entry still remote it
> from the namespace cache. Otherwise if the neighbor moves back to the
> same host again, kernel update can fail.
Let's reduce noise if the neighbor wasn't found, to prevent logs like:
Aug 16 13:26:35 master1.local dockerd[4019880]: time="2023-08-16T13:26:35.186662370+02:00" level=warning msg="error while deleting neighbor entry" error="no such file or directory"
Aug 16 13:26:35 master1.local dockerd[4019880]: time="2023-08-16T13:26:35.366585939+02:00" level=warning msg="error while deleting neighbor entry" error="no such file or directory"
Aug 16 13:26:42 master1.local dockerd[4019880]: time="2023-08-16T13:26:42.366658513+02:00" level=warning msg="error while deleting neighbor entry" error="no such file or directory"
While changing this code, also slightly rephrase the code-comment, and
fix a typo ("remote -> remove").
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
libnetwork/osl: Namespace.DeleteNeighbor: rephrase code-comment
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The test-file had a duplicate definition for ErrNotImplemented, which
caused an error in this package, and was not used otherwise, so we can
remove this file.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function either had to create a new StaticRoute, or add the destination
to the list of routes. Skip creating a StaticRoute struct if we're not
gonna use it.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This was introduced in 1980deffae, which
changed the implementation, but forgot to update imports in these.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There is no meaningful distinction between driverapi.Registerer and
drvregistry.DriverNotifyFunc. They are both used to register a network
driver with an interested party. They have the same function signature.
The only difference is that the latter could be satisfied by an
anonymous closure. However, in practice the only implementation of
drvregistry.DriverNotifyFunc is the
(*libnetwork.Controller).RegisterDriver method. This same method also
makes the libnetwork.Controller type satisfy the Registerer interface,
therefore the DriverNotifyFunc type is redundant. Change
drvregistry.Networks to notify a Registerer and drop the
DriverNotifyFunc type.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Add a fast-patch to some functions, to prevent locking/unlocking,
or other operations that would not be needed;
- Network.addDriverInfoToCluster
- Network.deleteDriverInfoFromCluster
- Network.addServiceInfoToCluster
- Network.deleteServiceInfoFromCluster
- Network.addDriverWatches
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- return early when failing to fetch the driver
- store network-ID and controller in a variable to prevent repeatedly
locking/unlocking. We don't expect the network's ID to change
during this operation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This method is not part of any interface, and identical to Endpoint.Iface,
but one returns an Interface-type (driverapi.InterfaceInfo) and the other
returns a concrete type (EndpointInterface).
Interface-matching should generally happen on the receiver side, and this
function was only used in a single location, and passed as argument to
Driver.CreateEndpoint, which already matches the interface by accepting
a driverapi.InterfaceInfo.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Interface-matching should generally happen on the receiver side, and this
function was only used in a single location, and passed as argument to
Driver.CreateEndpoint, which already matches the interface by accepting
a driverapi.InterfaceInfo.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Interface-matching should generally happen on the receiver side, and this
function was only used in a single location, and passed as argument to
Driver.CreateEndpoint, which already matches the interface by accepting
a driverapi.InterfaceInfo.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
commit ab35df454d removed most of the pre-go1.17
build-tags, but for some reason, "go fix" doesn't remove these, so removing
the remaining ones manually
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Copying relevant documentation from the EndpointInfo interface. We should
remove this interface, and the related Info() function, but it's currently
acting as a "gate" to prevent accessing the Endpoint's accessors without
making sure it's fully hydrated.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
There's only one implementation; let's use that.
Also fixing a linting issue;
libnetwork/osl/interface_linux.go:91:2: S1001: should use copy(to, from) instead of a loop (gosimple)
for i, iface := range n.iFaces {
^
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
InterfaceOptions() returned an IfaceOptionSetter interface, which contained
"methods" that returned functional options. Such a construct could have made
sense if the functional options returned would (e.g.) be pre-propagated with
information from the Sandbox (network namespace), but none of that was the case.
There was only one implementation of IfaceOptionSetter (networkNamespace),
which happened to be the same as the only implementation of Sandbox, so remove
the interface as well, to help networkNamespace with its multi-personality
disorder.
This patch:
- removes Sandbox.Bridge() and makes it a regular function (WithIsBridge)
- removes Sandbox.Master() and makes it a regular function (WithMaster)
- removes Sandbox.MacAddress() and makes it a regular function (WithMACAddress)
- removes Sandbox.Address() and makes it a regular function (WithIPv4Address)
- removes Sandbox.AddressIPv6() and makes it a regular function (WithIPv6Address)
- removes Sandbox.LinkLocalAddresses() and makes it a regular function (WithLinkLocalAddresses)
- removes Sandbox.Routes() and makes it a regular function (WithRoutes)
- removes Sandbox.InterfaceOptions().
- removes the IfaceOptionSetter interface.
Note that the IfaceOption signature was changes as well to allow returning
an error. This is not currently used, but will be used for some options
in the near future, so adding that in preparation.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
NeighborOptions() returned an NeighborOptionSetter interface, which
contained "methods" that returned functional options. Such a construct
could have made sense if the functional options returned would (e.g.)
be pre-propagated with information from the Sandbox (network namespace),
but none of that was the case.
There was only one implementation of NeighborOptionSetter (networkNamespace),
which happened to be the same as the only implementation of Sandbox, so
remove the interface as well, to help networkNamespace with its multi-personality
disorder.
This patch:
- removes Sandbox.LinkName() and makes it a regular function (WithLinkName)
- removes Sandbox.Family() and makes it a regular function (WithFamily)
- removes Sandbox.NeighborOptions().
- removes the NeighborOptionSetter interface
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
osl.NewSandbox() always returns a nil interface on Windows (and other non-Linux
platforms). This means that any code that these fields are always nil, and
any code using these fields must be considered Linux-only.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
osl.NewSandbox() always returns a nil interface on Windows (and other non-Linux
platforms). This means that any code that these fields are always nil, and
any code using these fields must be considered Linux-only;
- libnetwork/Controller.defOsSbox
- libnetwork/Sandbox.osSbox
Ideally, these fields would live in Linux-only files, but they're referenced
in various platform-neutral parts of the code, so let's start with moving
the initialization code to Linux-only files.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Copying the descriptions from the Sandbox, Info, NeighborOptionSetter,
and IfaceOptionSetter interfaces that it implements.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Check if firewalld is running before running the function, so that consumers
of the function don't have to check for the status.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Windows uses the container-iD as ID for sandboxes, so it's not needed to
generate an ID when running on Windows.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Prior to moby/moby#44968, libnetwork would happily accept a ChildSubnet
with a bigger mask than its parent subnet. In such case, it was
producing IP addresses based on the parent subnet, and the child subnet
was not allocated from the address pool.
This commit automatically fixes invalid ChildSubnet for networks stored
in libnetwork's datastore.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The Controller.Sandboxes method was used by some SandboxWalkers. Now
that those have been removed, there are no longer any consumers of this
method, so let's remove it for now.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This functionality has been replaced with Controller.GetSandbox, and is
no longer used anywhere.
This patch removes:
- the Controller.WalkSandboxes method
- the SandboxContainerWalker SandboxWalker
- the SandboxWalker type
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>