The only remaining kvstore is BoltDB, which doesn't use TLS connections
or authentication, so we can remove these options.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Remove the intermediate variable, and move the option closer
to where it's used, as in some cases we created the variable,
but could return with an error before it was used.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function was not using the DriverCallback interface, and only
required the Registerer interface.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
A reduced set of the dependency, only taking the parts that are used. Taken from
upstream commit: dfacc563de
# install filter-repo (https://github.com/newren/git-filter-repo/blob/main/INSTALL.md)
brew install git-filter-repo
cd ~/projects
# create a temporary clone of docker
git clone https://github.com/docker/libkv.git temp_libkv
cd temp_libkv
# create branch to work with
git checkout -b migrate_libkv
# remove all code, except for the files we need; rename the remaining ones to their new target location
git filter-repo --force \
--path libkv.go \
--path store/store.go \
--path store/boltdb/boltdb.go \
--path-rename libkv.go:libnetwork/internal/kvstore/kvstore_manage.go \
--path-rename store/store.go:libnetwork/internal/kvstore/kvstore.go \
--path-rename store/boltdb/:libnetwork/internal/kvstore/boltdb/
# go to the target github.com/moby/moby repository
cd ~/projects/docker
# create a branch to work with
git checkout -b integrate_libkv
# add the temporary repository as an upstream and make sure it's up-to-date
git remote add temp_libkv ~/projects/temp_libkv
git fetch temp_libkv
# merge the upstream code, rewriting "pkg/symlink" to "symlink"
git merge --allow-unrelated-histories --signoff -S temp_libkv/migrate_libkv
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Resolver.setupIPTable() checks whether it needs to flush or create the
user chains used for NATing container DNS requests by testing for the
existence of the rules which jump to said user chains. Unfortunately it
does so using the IPTable.RawCombinedOutputNative() method, which
returns a non-nil error if the iptables command returns any output even
if the command exits with a zero status code. While that is fine with
iptables-legacy as it prints no output if the rule exists, iptables-nft
v1.8.7 prints some information about the rule. Consequently,
Resolver.setupIPTable() would incorrectly think that the rule does not
exist during container restore and attempt to create it. This happened
work work by coincidence before 8f5a9a741b
because the failure to create the already-existing table would be
ignored and the new NAT rules would be inserted before the stale rules
left in the table from when the container was last started/restored. Now
that failing to create the table is treated as a fatal error, the
incompatibility with iptables-nft is no longer hidden.
Switch to using IPTable.ExistsNative() to test for the existence of the
jump rules as it correctly only checks the iptables command's exit
status without regard for whether it outputs anything.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The method to restore a network namespace takes a collection of
interfaces to restore with the options to apply. The interface names are
structured data, tuples of (SrcName, DstPrefix) but for whatever reason
are being passed into Restore() serialized to strings. A refactor,
f0be4d126d, accidentally broke the
serialization by dropping the delimiter. Rather than fix the
serialization and leave the time-bomb for someone else to trip over,
pass the interface names as structured data.
Signed-off-by: Cory Snider <csnider@mirantis.com>
While the VXLAN interface and the iptables rules to mark outgoing VXLAN
packets for encryption are configured to use the Swarm data path port,
the XFRM policies for actually applying the encryption are hardcoded to
match packets with destination port 4789/udp. Consequently, encrypted
overlay networks do not pass traffic when the Swarm is configured with
any other data path port: encryption is not applied to the outgoing
VXLAN packets and the destination host drops the received cleartext
packets. Use the configured data path port instead of hardcoding port
4789 in the XFRM policies.
Signed-off-by: Cory Snider <csnider@mirantis.com>
TestProxyNXDOMAIN has proven to be susceptible to failing as a
consequence of unlocked threads being set to the wrong network
namespace. As the failure mode looks a lot like a bug in the test
itself, it seems prudent to add a check for mismatched namespaces to the
test so we will know for next time that the root cause lies elsewhere.
Signed-off-by: Cory Snider <csnider@mirantis.com>
osl.setIPv6 mistakenly captured the calling goroutine's thread's network
namespace instead of the network namespace of the thread getting its
namespace temporarily changed. As this function appears to only be
called from contexts in the process's initial network namespace, this
mistake would be of little consequence at runtime. The libnetwork unit
tests, on the other hand, unshare network namespaces so as not to
interfere with each other or the host's network namespace. But due to
this bug, the isolation backfires and the network namespace of
goroutines used by a test which are expected to be in the initial
network namespace can randomly become the isolated network namespace of
some other test. Symptoms include a loopback network server running in
one goroutine being inexplicably and randomly being unreachable by a
client in another goroutine.
Capture the original network namespace of the thread from the thread to
be tampered with, after locking the goroutine to the thread.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Swapping out the global logger on the fly is causing tests to flake out
by logging to a test's log output after the test function has returned.
Refactor Resolver to use a dependency-injected logger and the resolver
unit tests to inject a private logger instance into the Resolver under
test.
Signed-off-by: Cory Snider <csnider@mirantis.com>
tstwriter mocks the server-side connection between the resolver and the
container, not the resolver and the external DNS server, so returning
the external DNS server's address as w.LocalAddr() is technically
incorrect and misleading. Only the protocols need to match as the
resolver uses the client's choice of protocol to determine which
protocol to use when forwarding the query to the external DNS server.
While this change has no material impact on the tests, it makes the
tests slightly more comprehensible for the next person.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Our resolver is just a forwarder for external DNS so it should act like
it. Unless it's a server failure or refusal, take the response at face
value and forward it along to the client. RFC 8020 is only applicable to
caching recursive name servers and our resolver is neither caching nor
recursive.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Now that most uses of reexec have been replaced with non-reexec
solutions, most of the reexec.Init() calls peppered throughout the test
suites are unnecessary. Furthermore, most of the reexec.Init() calls in
test code neglects to check the return value to determine whether to
exit, which would result in the reexec'ed subprocesses proceeding to run
the tests, which would reexec another subprocess which would proceed to
run the tests, recursively. (That would explain why every reexec
callback used to unconditionally call os.Exit() instead of returning...)
Remove unneeded reexec.Init() calls from test and example code which no
longer needs it, and fix the reexec.Init() calls which are not inert to
exit after a reexec callback is invoked.
Signed-off-by: Cory Snider <csnider@mirantis.com>
- sandbox, endpoint changed in c71555f030, but
missed updating the stubs.
- add missing stub for Controller.cleanupServiceDiscovery()
- While at it also doing some minor (formatting) changes.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
This function included a defer to close the net.Conn if an error occurred,
but the calling function (SetExternalKey()) also had a defer to close it
unconditionally.
Rewrite it to use json.NewEncoder(), which accepts a writer, and inline
the code.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
It's a no-op on Windows and other non-Linux, non-FreeBSD platforms,
so there's no need to register the re-exec.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Just print the error and os.Exit() instead, which makes it more
explicit that we're exiting, and there's no need to decorate the
error.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Split the function into a "backing" function that returns an error, and the
re-exec entrypoint, which handles the error to provide a more idiomatic approach.
This was part of a larger change accross multiple re-exec functions (now removed).
For history's sake; here's the description for that;
The `reexec.Register()` function accepts reexec entrypoints, which are a `func()`
without return (matching a binary's `main()` function). As these functions cannot
return an error, it's the entrypoint's responsibility to handle any error, and to
indicate failures through `os.Exit()`.
I noticed that some of these entrypoint functions had `defer()` statements, but
called `os.Exit()` either explicitly or implicitly (e.g. through `logrus.Fatal()`).
defer statements are not executed if `os.Exit()` is called, which rendered these
statements useless.
While I doubt these were problematic (I expect files to be closed when the process
exists, and `runtime.LockOSThread()` to not have side-effects after exit), it also
didn't seem to "hurt" to call these as was expected by the function.
This patch rewrites some of the entrypoints to split them into a "backing function"
that can return an error (being slightly more iodiomatic Go) and an wrapper function
to act as entrypoint (which can handle the error and exit the executable).
To some extend, I'm wondering if we should change the signatures of the entrypoints
to return an error so that `reexec.Init()` can handle (or return) the errors, so
that logging can be handled more consistently (currently, some some use logrus,
some just print); this would also keep logging out of some packages, as well as
allows us to provide more metadata about the error (which reexec produced the
error for example).
A quick search showed that there's some external consumers of pkg/reexec, so I
kept this for a future discussion / exercise.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
- Verify the content to be equal, not "contains"; this output should be
predictable.
- Also verify the content returned by the function to match.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Looks like the intent is to exclude windows (which wouldn't have /etc/resolv.conf
nor systemd), but most tests would run fine elsewhere. This allows running the
tests on macOS for local testing.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Use t.TempDir() for convenience, and change some t.Fatal's to Errors,
so that all tests can run instead of failing early.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The test was assuming that the "source" file was always "/etc/resolv.conf",
but the `Get()` function uses `Path()` to find the location of resolv.conf,
which may be different.
While at it, also changed some `t.Fatalf()` to `t.Errorf()`, and renamed
some variables for clarity.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
After my last change, I noticed that the hash is used as a []byte in most
cases (other than tests). This patch updates the type to use a []byte, which
(although unlikely very important) also improves performance:
Compared to the previous version:
benchstat new.txt new2.txt
name old time/op new time/op delta
HashData-10 128ns ± 1% 116ns ± 1% -9.77% (p=0.000 n=20+20)
name old alloc/op new alloc/op delta
HashData-10 208B ± 0% 88B ± 0% -57.69% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
HashData-10 3.00 ± 0% 2.00 ± 0% -33.33% (p=0.000 n=20+20)
And compared to the original version:
benchstat old.txt new2.txt
name old time/op new time/op delta
HashData-10 201ns ± 1% 116ns ± 1% -42.39% (p=0.000 n=18+20)
name old alloc/op new alloc/op delta
HashData-10 416B ± 0% 88B ± 0% -78.85% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
HashData-10 6.00 ± 0% 2.00 ± 0% -66.67% (p=0.000 n=20+20)
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The code seemed overly complicated, requiring a reader to be constructed,
where in all cases, the data was already available in a variable. This patch
simplifies the utility to not require a reader, which also makes it a bit
more performant:
go install golang.org/x/perf/cmd/benchstat@latest
GO111MODULE=off go test -run='^$' -bench=. -count=20 > old.txt
GO111MODULE=off go test -run='^$' -bench=. -count=20 > new.txt
benchstat old.txt new.txt
name old time/op new time/op delta
HashData-10 201ns ± 1% 128ns ± 1% -36.16% (p=0.000 n=18+20)
name old alloc/op new alloc/op delta
HashData-10 416B ± 0% 208B ± 0% -50.00% (p=0.000 n=20+20)
name old allocs/op new allocs/op delta
HashData-10 6.00 ± 0% 3.00 ± 0% -50.00% (p=0.000 n=20+20)
A small change was made in `Build()`, which previously returned the resolv.conf
data, even if the function failed to write it. In the new variation, `nil` is
consistently returned on failures.
Note that in various places, the hash is not even used, so we may be able to
simplify things more after this.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Since cc19eba (backported to v23.0.4), the PreferredPool for docker0 is
set only when the user provides the bip config parameter or when the
default bridge already exist. That means, if a user provides the
fixed-cidr parameter on a fresh install or reboot their computer/server
without bip set, dockerd throw the following error when it starts:
> failed to start daemon: Error initializing network controller: Error
> creating default "bridge" network: failed to parse pool request for
> address space "LocalDefault" pool "" subpool "100.64.0.0/26": Invalid
> Address SubPool
See #45356.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Use Linux BPF extensions to locate the offset of the VXLAN header within
the packet so that the same BPF program works with VXLAN packets
received over either IPv4 or IPv6.
Signed-off-by: Cory Snider <csnider@mirantis.com>
This commit removes iptables rules configured for secure overlay
networks when a network is deleted. Prior to this commit, only
CreateNetwork() was taking care of removing stale iptables rules.
If one of the iptables rule can't be removed, the erorr is logged but
it doesn't prevent network deletion.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The (*network).ipamRelease function nils out the network's IPAM info
fields, putting the network struct into an inconsistent state. The
network-restore startup code panics if it tries to restore a network
from a struct which has fewer IPAM config entries than IPAM info
entries. Therefore (*network).delete contains a critical section: by
persisting the network to the store after ipamRelease(), the datastore
will contain an inconsistent network until the deletion operation
completes and finishes deleting the network from the datastore. If for
any reason the deletion operation is interrupted between ipamRelease()
and deleteFromStore(), the daemon will crash on startup when it tries to
restore the network.
Updating the datastore after releasing the network's IPAM pools may have
served a purpose in the past, when a global datastore was used for
intra-cluster communication and the IPAM allocator had persistent global
state, but nowadays there is no global datastore and the IPAM allocator
has no persistent state whatsoever. Remove the vestigial datastore
update as it is no longer necessary and only serves to cause problems.
If the network deletion is interrupted before the network is deleted
from the datastore, the deletion will resume during the next daemon
startup, including releasing the IPAM pools.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Linux kernel prior to v3.16 was not supporting netns for vxlan
interfaces. As such, moby/libnetwork#821 introduced a "host mode" to the
overlay driver. The related kernel fix is available for rhel7 users
since v7.2.
This mode could be forced through the use of the env var
_OVERLAY_HOST_MODE. However this env var has never been documented and
is not referenced in any blog post, so there's little chance many people
rely on it. Moreover, this host mode is deemed as an implementation
details by maintainers. As such, we can consider it dead and we can
remove it without a prior deprecation warning.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Since 0fa873c, there's no function writing overlay networks to some
datastore. As such, overlay network struct doesn't need to implement
KVObject interface.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Since a few commits, subnet's vni don't change during the lifetime of
the subnet struct, so there's no need to lock the network before
accessing it.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Since the previous commit, data from the local store are never read,
thus proving it was only used for Classic Swarm.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The overlay driver in Swarm v2 mode doesn't support live-restore, ie.
the daemon won't even start if the node is part of a Swarm cluster and
live-restore is enabled. This feature was only used by Swarm Classic.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
VNI allocations made by the overlay driver were only used by Classic
Swarm. With Swarm v2 mode, the driver ovmanager is responsible of
allocating & releasing them.
Previously, vxlanIdm was initialized when a global store was available
but since 142b522, no global store can be instantiated. As such,
releaseVxlanID actually does actually nothing and iptables rules are
never removed.
The last line of dead code detected by golangci-lint is now gone.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Prior to 0fa873c, the serf-based event loop was started when a global
store was available. Since there's no more global store, this event loop
and all its associated code is dead.
Most dead code detected by golangci-lint in prior commits is now gone.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
- LocalKVProvider, LocalKVProviderURL, LocalKVProviderConfig,
GlobalKVProvider, GlobalKVProviderURL and GlobalKVProviderConfig
are all unused since moby/libnetwork@be2b6962 (moby/libnetwork#908).
- GlobalKVClient is unused since 0fa873c and c8d2c6e.
- MakeKVProvider, MakeKVProviderURL and MakeKVProviderConfig are unused
since 96cfb076 (moby/moby#44683).
- MakeKVClient is unused since 142b5229 (moby/moby#44875).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The overlay driver was creating a global store whenever
netlabel.GlobalKVClient was specified in its config argument. This
specific label is unused anymore since 142b522 (moby/moby#44875).
It was also creating a local store whenever netlabel.LocalKVClient was
specificed in its config argument. This store is unused since
moby/libnetwork@9e72136 (moby/libnetwork#1636).
Finally, the sync.Once properties are never used and thus can be
deleted.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The overlay driver was creating a global store whenever
netlabel.GlobalKVClient was specified in its config argument. This
specific label is not used anymore since 142b522 (moby/moby#44875).
golangci-lint now detects dead code. This will be fixed in subsequent
commits.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
This command was useful when overlay networks based on external KV store
was developed but is unused nowadays.
As the last reference to OverlayBindInterface and OverlayNeighborIP
netlabels are in the ovrouter cmd, they're removed too.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Drop support for platforms which only have xt_u32 but not xt_bpf. No
attempt is made to clean up old xt_u32 iptables rules left over from a
previous daemon instance.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Encrypted overlay networks are unique in that they are the only kind of
network for which libnetwork programs an iptables rule to explicitly
accept incoming packets. No other network driver does this. The overlay
driver doesn't even do this for unencrypted networks!
Because the ACCEPT rule is appended to the end of INPUT table rather
than inserted at the front, the rule can be entirely inert on many
common configurations. For example, FirewallD programs an unconditional
REJECT rule at the end of the INPUT table, so any ACCEPT rules appended
after it have no effect. And on systems where the rule is effective, its
presence may subvert the administrator's intentions. In particular,
automatically appending the ACCEPT rule could allow incoming traffic
which the administrator was expecting to be dropped implicitly with a
default-DROP policy.
Let the administrator always have the final say in how incoming
encrypted overlay packets are filtered by no longer automatically
programming INPUT ACCEPT iptables rules for them.
Signed-off-by: Cory Snider <csnider@mirantis.com>
This is a follow-up of 48ad9e1. This commit removed the function
ElectInterfaceAddresses from utils_linux.go but not their FreeBSD &
Windows counterpart. As these functions are never called, they can be
safely removed.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The ExtDNS2 field was added in
aad1632c15
to migrate existing state from < 1.14 to a new type. As it's unlikely
that installations still have state from before 1.14, rename ExtDNS2
back to ExtDNS and drop the migration code.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Co-authored-by: Cory Snider <csnider@mirantis.com>
Signed-off-by: Cory Snider <csnider@mirantis.com>
Allow SetMatrix to be used as a value type with a ready-to-use zero
value. SetMatrix values are already non-copyable by virtue of having a
mutex field so there is no harm in allowing non-pointer values to be
used as local variables or struct fields. Any attempts to pass around
by-value copies, e.g. as function arguments, will be flagged by go vet.
Signed-off-by: Cory Snider <csnider@mirantis.com>
FirewallD creates the root INPUT chain with a default-accept policy and
a terminal rule which rejects all packets not accepted by any prior
rule. Any subsequent rules appended to the chain are therefore inert.
The administrator would have to open the VXLAN UDP port to make overlay
networks work at all, which would result in all VXLAN traffic being
accepted and defeating our attempts to enforce encryption on encrypted
overlay networks.
Insert the rule to drop unencrypted VXLAN packets tagged for encrypted
overlay networks at the top of the INPUT chain so that enforcement of
mandatory encryption takes precedence over any accept rules configured
by the administrator. Continue to append the accept rule to the bottom
of the chain so as not to override any administrator-configured drop
rules.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Some newer distros such as RHEL 9 have stopped making the xt_u32 kernel
module available with the kernels they ship. They do ship the xt_bpf
kernel module, which can do everything xt_u32 can and more. Add an
alternative implementation of the iptables match rule which uses xt_bpf
to implement exactly the same logic as the u32 filter using a BPF
program. Try programming the BPF-powered rules as a fallback when
programming the u32-powered rules fails.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The iptables rule clause used to match on the VNI of VXLAN datagrams
looks like line noise to the uninitiated. It doesn't help that the
expression is repeated twice and neither copy has any commentary.
DRY out the rule builder to a common function, and document what the
rule does and how it works.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The iptables rules which make encryption mandatory on an encrypted
overlay network are only programmed once there is a second node
participating in the network. This leaves single-node encrypted overlay
networks vulnerable to packet injection. Furthermore, failure to program
the rules is not treated as a fatal error.
Program the iptables rules to make encryption mandatory before creating
the VXLAN link to guarantee that there is no window of time where
incoming cleartext VXLAN packets for the network would be accepted, or
outgoing cleartext packets be transmitted. Only create the VXLAN link if
programming the rules succeeds to ensure that it fails closed.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The overlay-network encryption code is woefully under-documented, which
is especially problematic as it operates on under-documented kernel
interfaces. Document what I have puzzled out of the implementation for
the benefit of the next poor soul to touch this code.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Funneling the peer operations into an unbuffered channel only serves to
achieve the same result as a mutex, using a lot more boilerplate and
indirection. Get rid of the boilerplate and unnecessary indirection by
using a mutex and calling the operations directly.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The netip types can be used as map keys, unlike net.IP and friends,
which is a very useful property to have for this application.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The address spaces are orthogonal. There is no shared state between them
logically so there is no reason for them to share any in-memory data
structures. addrSpace is responsible for allocating subnets and
addresses, while Allocator is responsible for implementing the IPAM API.
Lower all the implementation details of allocation into addrSpace.
There is no longer a need to include the name of the address space in
the map keys for subnets now that each addrSpace holds its own state
independently from other addrSpaces. Remove the AddressSpace field from
the struct used for map keys within an addrSpace so that an addrSpace
does not need to know its own name.
Pool allocations were encoded in a tree structure, using parent
references and reference counters. This structure affords for pools
subdivided an arbitrary number of times to be modeled, in theory. In
practice, the max depth is only two: master pools and sub-pools. The
allocator data model has also been heavily influenced by the
requirements and limitations of Datastore persistence, which are no
longer applicable.
Address allocations are always associated with a master pool. Sub-pools
only serve to restrict the range of addresses within the master pool
which could be allocated from. Model pool allocations within an address
space as a two-level hierarchy of maps.
Signed-off-by: Cory Snider <csnider@mirantis.com>
There is only one call site, in (*Allocator).RequestPool. The logic is
tightly coupled between the caller and callee, and having them separate
makes it harder to reason about.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Only two address spaces are supported: LocalDefault and GlobalDefault.
Support for non-default address spaces in the IPAM Allocator is
vestigial, from a time when IPAM state was stored in a persistent shared
datastore. There is no way to create non-default address spaces through
the IPAM API so there is no need to retain code to support the use of
such address spaces. Drop all pretense that more address spaces can
exist, to the extent that the IPAM API allows.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The two-phase commit dance serves no purpose with the current IPAM
allocator implementation. There are no fallible operations between the
call to aSpace.updatePoolDBOnAdd() and invoking the returned closure.
Allocate the subnet in the address space immediately when called and get
rid of the closure return.
Signed-off-by: Cory Snider <csnider@mirantis.com>
PTR queries with domain names unknown to us are not necessarily invalid.
Act like a well-behaved middlebox and fall back to forwarding
externally, same as we do with the other query types.
Signed-off-by: Cory Snider <csnider@mirantis.com>
...for limiting concurrent external DNS requests with
"golang.org/x/sync/semaphore".Weighted. Replace the ad-hoc rate limiter
for when the concurrency limit is hit (which contains a data-race bug)
with "golang.org/x/time/rate".Sometimes.
Immediately retrying with the next server if the concurrency limit has
been hit just further compounds the problem. Wait on the semaphore and
refuse the query if it could not be acquired in a reasonable amount of
time.
Signed-off-by: Cory Snider <csnider@mirantis.com>
It handles figuring out the UDP receive buffer size and setting IO
timeouts, which simplifies our code. It is also more robust to receiving
UDP replies to earlier queries which timed out.
Log failures to perform a client exchange at level error so they are
more visible to operators and administrators.
Signed-off-by: Cory Snider <csnider@mirantis.com>
forwardExtDNS() will now continue with the next external DNS sever if
co.ReadMsg() returns (nil, nil). Previously it would abort resolving the
query and not reply to the container client. The implementation of
ReadMsg() in the currently- vendored version of miekg/dns cannot return
(nil, nil) so the difference is immaterial in practice.
Signed-off-by: Cory Snider <csnider@mirantis.com>
(*dns.Msg).Truncate() is more intelligent and standards-compliant about
truncating DNS response messages than our hand-rolled version. Fix a
silly fencepost error the max TCP message size: the limit is
dns.MaxMsgSize (65535), full stop.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The TC flag in a DNS message indicates that the sender had to
truncate it to fit within the length limit of the transmission channel.
It does NOT indicate that part of the message was lost before reaching
the recipient. Older versions of github.com/miekg/dns conflated the two
cases by returning ErrTruncated from ReadMsg() if the message was parsed
without error but had the TC flag set. The version of miekg/dns
currently vendored no longer returns an error when a well-formed DNS
message is received which has its TC flag set, but there was some
confusion on how to update libnetwork to deal with this behaviour
change. Truncated DNS replies are no longer different from any other
reply message: they are normal replies which do not need any special-
case handling to proxy back to the client.
Signed-off-by: Cory Snider <csnider@mirantis.com>
TestRequestReleaseAddressDuplicate gets flagged by go test -race because
the same err variable inside the test is assigned to from multiple
goroutines without synchronization, which obscures whether or not there
are any data races in the code under test.
Trouble is, the test _depends on_ the data race to exit the loop if an
error occurs inside a spawned goroutine. And the test contains a logical
concurrency bug (not flagged by the Go race detector) which can result
in false-positive test failures. Because a release operation is logged
after the IP is released, the other goroutine could reacquire the
address and log that it was reacquired before the release is logged.
Fix up the test so it is no longer subject to data races or
false-positive test failures, i.e. flakes.
Signed-off-by: Cory Snider <csnider@mirantis.com>
If the resolver encounters an error before it attempts to forward the
request to external DNS, do not try to log information about the
external connection, because at this point `extConn` is `nil`. This
makes sure `dockerd` won't panic and crash from a nil pointer
dereference when it sees an invalid DNS query.
fixes#44979
Signed-off-by: er0k <er0k@er0k.net>
(cherry picked from commit 6c2637be11)
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
"math/rand".Seed
- Migrate to using local RNG instances.
"archive/tar".TypeRegA
- The deprecated constant tar.TypeRegA is the same value as
tar.TypeReg and so is not needed at all.
Signed-off-by: Cory Snider <csnider@mirantis.com>
DNS servers in the loopback address range should always be resolved in
the host network namespace when the servers are configured by reading
from the host's /etc/resolv.conf. The daemon mistakenly conflated the
presence of DNS options (docker run --dns-opt) with user-supplied DNS
servers, treating the list of servers loaded from the host as a user-
supplied list and attempting to resolve in the container's network
namespace. Correct this oversight so that loopback DNS servers are only
resolved in the container's network namespace when the user provides the
DNS server list, irrespective of other DNS configuration.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The per-network statistics counters are loaded and incremented without
any concurrency control. Use atomic integers to prevent data races
without having to add any synchronization.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*bitmap.Handle).String() method can be rather expensive to call. It
is all the more tragic when the expensively-constructed string is
immediately discarded because the log level is not high enough. Let
logrus stringify the arguments to debug logs so they are only
stringified when the log level is high enough.
# Before
ok github.com/docker/docker/libnetwork/ipam 10.159s
# After
ok github.com/docker/docker/libnetwork/ipam 2.484s
Signed-off-by: Cory Snider <csnider@mirantis.com>
IPVLAN networks created on Moby v20.10 do not have the IpvlanFlag
configuration value persisted in the libnetwork database as that config
value did not exist before v23.0.0. Gracefully migrate configurations on
unmarshal to prevent type-assertion panics at daemon start after upgrade.
Fixes#44925
Signed-off-by: Cory Snider <csnider@mirantis.com>
The (*bitseq.Handle).Destroy() method deletes the persisted KVObject
from the datastore. This is a no-op on all the bitseq handles in package
ipam as they are not persisted in any datastore.
Signed-off-by: Cory Snider <csnider@mirantis.com>
That method was only referenced by ipam.Allocator, but as it no longer
stores any state persistently there is no possibility for it to load an
inconsistent bit-sequence from Docker 1.9.x.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The DatastoreConfig discovery type is unused. Remove the constant and
any resulting dead code. Today's biggest loser is the IPAM Allocator:
DatastoreConfig was the only type of discovery event it was listening
for, and there was no other place where a non-nil datastore could be
passed into the allocator. Strip out all the dead persistence code from
Allocator, leaving it as purely an in-memory implementation.
There is no more need to check the consistency of the allocator's
bit-sequences as there is no persistent storage for inconsistent bit
sequences to be loaded from.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Per the Interface Segregation Principle, network drivers should not have
to depend on GetPluginGetter methods they do not use. The remote network
driver is the only one which needs a PluginGetter, and it is already
special-cased in Controller so there is no sense warping the interfaces
to achieve a foolish consistency. Replace all other network drivers' Init
functions with Register functions which take a driverapi.Registerer
argument instead of a driverapi.DriverCallback. Add back in Init wrapper
functions for only the drivers which Swarmkit references so that
Swarmkit can continue to build.
Refactor the libnetwork Controller to use the new drvregistry.Networks
and drvregistry.IPAMs driver registries in place of the legacy ones.
Signed-off-by: Cory Snider <csnider@mirantis.com>
There is no benefit to having a single registry for both IPAM drivers
and network drivers. IPAM drivers are registered in a separate namespace
from network drivers, have separate registration methods, separate
accessor methods and do not interact with network drivers within a
DrvRegistry in any way. The only point of commonality is
interface { GetPluginGetter() plugingetter.PluginGetter }
which is only used by the respective remote drivers and therefore should
be outside of the scope of a driver registry.
Create new, separate registry types for network drivers and IPAM
drivers, respectively. These types are "legacy-free". Neither type has
GetPluginGetter methods. The IPAMs registry does not have an
IPAMDefaultAddressSpaces method as that information can be queried
directly from the driver using its GetDefaultAddressSpaces method.
The Networks registry does not have an AddDriver method as that method
is a trivial wrapper around calling one of its arguments with its other
arguments.
Refactor DrvRegistry in terms of the new IPAMs and Networks registries
so that existing code in libnetwork and Swarmkit will continue to work.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The datastore arguments to the IPAM driver Init() functions are always
nil, even in Swarmkit. The only IPAM driver which consumed the
datastores was builtin; all others (null, remote, windowsipam) simply
ignored it. As the signatures of the IPAM driver init functions cannot
be changed without breaking the Swarmkit build, they have to be left
with the same signatures for the time being. Assert that nil datastores
are always passed into the builtin IPAM driver's init function so that
there is no ambiguity the datastores are no longer respected.
Add new Register functions for the IPAM drivers which are free from the
legacy baggage of the Init functions. (The legacy Init functions can be
removed once Swarmkit is migrated to using the Register functions.) As
the remote IPAM driver is the only one which depends on a PluginGetter,
pass it in explicitly as an argument to Register. The other IPAM drivers
should not be forced to depend on a GetPluginGetter() method they do not
use (Interface Segregation Principle).
Signed-off-by: Cory Snider <csnider@mirantis.com>
Without (*Controller).ReloadConfiguration, the only way to configure
datastore scopes would be by passing config.Options to libnetwork.New.
The only options defined which relate to datastore scopes are limited to
configuring the local-scope datastore. Furthermore, the default
datastore config only defines configuration for the local-scope
datastore. The local-scope datastore is therefore the only datastore
scope possible in libnetwork. Start removing code which is only
needed to support multiple datastore scopes.
Signed-off-by: Cory Snider <csnider@mirantis.com>
ConfigLocalScopeDefaultNetworks is now dead code, thank goodness! Make
sure it stays dead by deleting the function. Refactor package ipamutils
to simplify things given its newly-reduced (ahem) scope.
Signed-off-by: Cory Snider <csnider@mirantis.com>
ipam.Allocator is not a singleton, but it references mutable singleton
state. Address that deficiency by refactoring it to instead take the
predefined address spaces as constructor arguments. Unfortunately some
work is needed on the Swarmkit side before the mutable singleton state
can be completely eliminated from the IPAMs.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The function references global shared, mutable state and is no longer
needed. Deleting it brings us one step closer to getting rid of that
pesky shared state.
Signed-off-by: Cory Snider <csnider@mirantis.com>
These package-level variables were copied over from the Linux
implementation; drop them for clarity's sake.
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
netlink offers the netlink.LinkNotFoundError type, which we can use with
errors.As() to detect a unused link name.
Additionally, early return if GenerateRandomName fails, as reading
random bytes should be a highly reliable operation, and otherwise the
error would be swallowed by the fall-through return.
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
GenerateRandomName now uses length to represent the overall length of
the string; this will help future users avoid creating interface names
that are too long for the kernel to accept by mistake. The test coverage
is increased and cleaned up using gotest.tools.
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
*bitseq.Handle should not implement sync.Locker. The mutex is a private
implementation detail which external consumers should not be able to
manipulate.
Signed-off-by: Cory Snider <csnider@mirantis.com>
The byte-slice temporary is fully overwritten on each loop iteration so
it can be safely reused to reduce GC pressure.
Signed-off-by: Cory Snider <csnider@mirantis.com>
There is a solid bit-vector datatype hidden inside bitseq.Handle, but
it's obscured by all the intrusive datastore and KVObject nonsense.
It can be used without a datastore, but the datastore baggage limits its
utility inside and outside libnetwork. Extract the datatype goodness
into its own package which depends only on the standard library so it
can be used in more situations.
Signed-off-by: Cory Snider <csnider@mirantis.com>
...in preparation for separating the bit-sequence datatype from the
datastore persistence (KVObject) concerns. The new package's contents
are identical to those of package libnetwork/bitseq to assist in
reviewing the changes made on each side of the split.
Signed-off-by: Cory Snider <csnider@mirantis.com>
This reverts commit 2d397beb00.
moby#44706 and moby#44805 were both merged, and both refactored the same
file. The combination broke the build, and was not detected in CI as
only the combination of the two, applied to the same parent commit,
caused the failure.
moby#44706 should be carried forward, based on the current master, in
order to resolve this conflict.
Signed-off-by: Bjorn Neergaard <bneergaard@mirantis.com>
Embedded structs are part of the exported surface of a struct type.
Boxing a struct value into an interface value does not erase that;
any code could gain access to the embedded struct value with a simple
type assertion. The mutex is supposed to be a private implementation
detail, but *network implements sync.Locker because the mutex is
embedded. Change the mutex to an unexported field so *network no
longer spuriously implements the sync.Locker interface.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Embedded structs are part of the exported surface of a struct type.
Boxing a struct value into an interface value does not erase that;
any code could gain access to the embedded struct value with a simple
type assertion. The mutex is supposed to be a private implementation
detail, but *endpoint implements sync.Locker because the mutex is
embedded. Change the mutex to an unexported field so *endpoint no
longer spuriously implements the sync.Locker interface.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Basically every exported method which takes a libnetwork.Sandbox
argument asserts that the value's concrete type is *sandbox. Passing any
other implementation of the interface is a runtime error! This interface
is a footgun, and clearly not necessary. Export and use the concrete
type instead.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Embedded structs are part of the exported surface of a struct type.
Boxing a struct value into an interface value does not erase that;
any code could gain access to the embedded struct value with a simple
type assertion. The mutex is supposed to be a private implementation
detail, but *sandbox implements sync.Locker because the mutex is
embedded. Change the mutex to an unexported field so *sandbox no
longer spuriously implements the sync.Locker interface.
Signed-off-by: Cory Snider <csnider@mirantis.com>
Embedded structs are part of the exported surface of a struct type.
Boxing a struct value into an interface value does not erase that;
any code could gain access to the embedded struct value with a simple
type assertion. The mutex is supposed to be a private implementation
detail, but *controller implements sync.Locker because the mutex is
embedded.
c, _ := libnetwork.New()
c.(sync.Locker).Lock()
Change the mutex to an unexported field so *controller no longer
spuriously implements the sync.Locker interface.
Signed-off-by: Cory Snider <csnider@mirantis.com>
unshare.Go() is not used as an existing network namespace needs to be
entered, not a new one created. Explicitly lock main() to the initial
thread so as not to depend on the side effects of importing the
internal/unshare package to achieve the same.
Signed-off-by: Cory Snider <csnider@mirantis.com>
- The oldest kernel version currently supported is v3.10. Bridge
parameters can be set through netlink since v3.8 (see
torvalds/linux@25c71c7). As such, we don't need to fallback to sysfs to
set hairpin mode.
- `scanInterfaceStats()` is never called, so no need to keep it alive.
- Document why `default_pvid` is set through sysfs
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
When userland-proxy is turned off and on again, the iptables nat rule
doing hairpinning isn't properly removed. This fix makes sure this nat
rule is removed whenever the bridge is torn down or hairpinning is
disabled (through setting userland-proxy to true).
Unlike for ip masquerading and ICC, the `programChainRule()` call
setting up the "MASQ LOCAL HOST" rule has to be called unconditionally
because the hairpin parameter isn't restored from the driver store, but
always comes from the driver config.
For the "SKIP DNAT" rule, things are a bit different: this rule is
always deleted by `removeIPChains()` when the bridge driver is
initialized.
Fixes#44721.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
Trying to remove the "docker.io" domain from locations where it's not relevant.
In these cases, this domain was used as a "random" domain for testing or example
purposes.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
Conntrack entries are created for UDP flows even if there's nowhere to
route these packets (ie. no listening socket and no NAT rules to
apply). Moreover, iptables NAT rules are evaluated by netfilter only
when creating a new conntrack entry.
When Docker adds NAT rules, netfilter will ignore them for any packet
matching a pre-existing conntrack entry. In such case, when
dockerd runs with userland proxy enabled, packets got routed to it and
the main symptom will be bad source IP address (as shown by #44688).
If the publishing container is run through Docker Swarm or in
"standalone" Docker but with no userland proxy, affected packets will
be dropped (eg. routed to nowhere).
As such, Docker needs to flush all conntrack entries for published UDP
ports to make sure NAT rules are correctly applied to all packets.
- Fixes#44688
- Fixes#8795
- Fixes#16720
- Fixes#7540
- Fixesmoby/libnetwork#2423
- and probably more.
As a precautionary measure, those conntrack entries are also flushed
when revoking external connectivity to avoid those entries to be reused
when a new sandbox is created (although the kernel should already
prevent such case).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
iptables -C flag was introduced in v1.4.11, which was released ten
years ago. Thus, there're no more Linux distributions supported by
Docker using this version. As such, this commit removes the old way of
checking if an iptables rule exists (by using substring matching).
Signed-off-by: Albin Kerouanton <albinker@gmail.com>
The former was doing some checks and logging warnings, whereas
the latter was doing the same checks but to set some internal variables.
As both are called only once and from the same place, there're now
merged together.
Signed-off-by: Albin Kerouanton <albinker@gmail.com>