Currently, if a node experiences a transient gossip failure, the
recovery process depends on other nodes propagating the information
indirectly. If such transient failures affect all the nodes that this
node has in its memberlist, then this node will be permanently cut off
from the gossip channel. Added node state management code in networkdb
to address this by attempting to rejoin the cluster via the failed
nodes when a failure occurs. This also necessitates new messages,
called node event messages, to differentiate between node leave and
node failure.
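A minimal sketch of the rejoin idea, with hypothetical names (not the actual networkdb types): nodes that failed are remembered and periodically retried until we are back in the gossip cluster.
```go
// Hypothetical sketch, not the real networkdb implementation.
package sketch

import "time"

type nodeDB struct {
	failedNodes []string                   // addresses of nodes that failed
	join        func(addrs []string) error // e.g. a memberlist join
}

// retryJoin periodically tries to rejoin the cluster through nodes that
// previously failed, so a transient failure does not cut us off forever.
func (db *nodeDB) retryJoin(stop <-chan struct{}) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			if len(db.failedNodes) == 0 {
				continue
			}
			if err := db.join(db.failedNodes); err == nil {
				db.failedNodes = nil // back in the cluster
			}
		}
	}
}
```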
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
When dynamic networks are created and there is a race between two
different tasks creating the same network, one of them will fail while
the other succeeds. For service tasks this is not a big problem
because they will be rescheduled. But for attachment tasks this can be
a problem, since they won't be recreated, causing the whole connection
to fail. Fixed it by serializing creation of networks with the same id
and checking whether the id is already present after coming out of the
wait.
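A minimal sketch of the serialization described above; the types and field names are assumptions, not the actual libnetwork controller.
```go
package sketch

import (
	"fmt"
	"sync"
)

type netController struct {
	mu       sync.Mutex
	creating map[string]chan struct{} // in-flight creations keyed by network id
	networks map[string]bool          // stands in for the created-network store
}

// createNetwork lets the first caller for an id do the real work while
// racing callers wait and then re-check whether the id is present.
func (c *netController) createNetwork(id string, create func() error) error {
	c.mu.Lock()
	if c.networks[id] {
		c.mu.Unlock()
		return nil // already created by the racing task
	}
	if ch, ok := c.creating[id]; ok {
		c.mu.Unlock()
		<-ch // wait for the in-flight creation to finish
		c.mu.Lock()
		defer c.mu.Unlock()
		if c.networks[id] {
			return nil
		}
		return fmt.Errorf("creation of network %s failed elsewhere", id)
	}
	ch := make(chan struct{})
	c.creating[id] = ch
	c.mu.Unlock()

	err := create() // the actual (slow) network creation

	c.mu.Lock()
	if err == nil {
		c.networks[id] = true
	}
	delete(c.creating, id)
	close(ch)
	c.mu.Unlock()
	return err
}
```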
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
This reverts commit b042dbe312.
The original commit breaks s390x, for example Docker build fails:
* https://github.com/docker/docker/issues/26440
As discussed in the above issue:
Even though char is unsigned by default on s390x, (gcc)go forces the type
of RawSockaddr.Data to be signed.
It makes no practical difference if these fields are signed or unsigned,
it's just an API issue.
The (assumed) reason for the original commit:
For a while RawSockaddr.Data was unsigned during development of the gcc
s390x port (not in an upstream release though). The patch was probably
developed in that time frame.
Signed-off-by: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Since docker/docker removed the mflag package and libnetwork relies on
it, create a copy of the mflag package in the libnetwork project.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
Currently the endpoint count is decremented before the driver cleanup
and, more importantly, before releasing the IP address. This is racy,
as it creates a time window where the endpoint count has already been
decremented and so the network can be deleted, but we haven't released
the IP address yet and the pool is already gone. Although no real harm
is done since the pool is already gone, it generates unnecessary error
messages about not being able to release the address. Also, if the
driver cleanup fails we really should not decrement the endpoint count.
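Illustrative ordering only; the function names are assumptions, not libnetwork's API. The point is that the count is decremented last, and only if the earlier steps succeeded.
```go
package sketch

type endpoint struct {
	driverCleanup func() error // release driver resources
	releaseIP     func() error // return the address to the IPAM pool
	decrCount     func() error // decrement the network's endpoint count
}

// delete performs cleanup in an order that avoids the race: the endpoint
// count (which can allow network deletion) is decremented only after the
// driver cleanup succeeded and the IP address has been released.
func (ep *endpoint) delete() error {
	if err := ep.driverCleanup(); err != nil {
		return err // do not decrement the count if cleanup failed
	}
	if err := ep.releaseIP(); err != nil {
		return err
	}
	return ep.decrCount()
}
```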
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
When stale delete notifications are received, we still need to purge
the sandbox neighbor cache. These stale deletes are most typically
out-of-order delete notifications: if an add for the peermac was
received before the delete of the old peermac/vtep pair, we process it
and replace the kernel state, but the old neighbor state in the sandbox
cache remains. That state needs to be purged when we finally get the
out-of-order delete notification.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
When the libnetwork controller is not in distributed control mode,
avoid retaining stale sandboxes when the network cannot be retrieved
from the store. This retaining logic is only applicable to an
independent k/v store which manages libnetwork state. In that case the
k/v store may be temporarily unavailable, so there is a need to retain
the sandbox so that resource cleanup happens properly.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
This also allows published services to be accessible from containers
on bridge networks on the host.
Signed-off-by: Santhosh Manohar <santhosh@docker.com>
Avoid the issue by reinitializing the channel immediately after closing
it, within the lock. Also change the wait code to cache the channel on
the stack by retrieving it from the controller and to wait on that
stack copy of the channel.
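Hedged sketch of the pattern described above (the names are illustrative, not the actual libnetwork fields): close and recreate the channel under the lock, and have waiters grab a local copy before blocking.
```go
package sketch

import "sync"

type controller struct {
	mu     sync.Mutex
	doneCh chan struct{}
}

// signal closes the current channel and immediately reinitializes it,
// all while holding the lock, so no caller can observe a closed/nil gap.
func (c *controller) signal() {
	c.mu.Lock()
	close(c.doneCh)
	c.doneCh = make(chan struct{})
	c.mu.Unlock()
}

// wait caches the channel on the stack before blocking, so a concurrent
// signal() that replaces c.doneCh cannot strand this waiter.
func (c *controller) wait() {
	c.mu.Lock()
	ch := c.doneCh
	c.mu.Unlock()
	<-ch
}
```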
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
This script gathers some basic information from a system that might
be useful to help troubleshoot problems. If added into an image
including the proper binaries, running looks something like this:
    docker run --rm \
      -v /var/run/docker.sock:/var/run/docker.sock \
      -v /var/run/docker/netns:/var/run/docker/netns \
      --privileged --net=host nwsupport /bin/support
Signed-off-by: Daniel Hiltgen <daniel.hiltgen@docker.com>
Avoid the whole store endpoint update logic when running in swarm mode
and the endpoint is part of a global-scope network. Currently no store
update happens for global-scope networks in swarm mode, but this code
path will delete the svcRecords database when the last endpoint on the
network is removed, which is not required.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
When leaving the entire gossip cluster or a network-specific gossip
cluster, we may not have had a chance to clean up service bindings via
gossip updates due to premature closure of the gossip channel. Make
sure to clean up all service bindings since we are no longer
participating in the cluster.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
Currently the initDone notification is provided immediately after
initializing the cluster. This may be fine for the first manager, but
all subsequent nodes which join the cluster need to wait until the node
completes joining the gossip cluster, in order to synchronize the
gossip network clock with the other nodes. If we don't have an
up-to-date clock, the updates that this node provides to the cluster
may be discarded by the other nodes if they have entries which are yet
to be reaped but have a better clock.
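A sketch of the ordering described, with assumed names (not the actual libnetwork agent): initDone is signalled only after the gossip join completes, so the network clock is synchronized before any updates are published.
```go
package sketch

type agent struct {
	initCluster func() error               // local cluster initialization
	joinGossip  func(peers []string) error // blocks until the gossip join completes
	initDone    chan struct{}
}

func (a *agent) start(peers []string) error {
	if err := a.initCluster(); err != nil {
		return err
	}
	if len(peers) > 0 {
		// Subsequent nodes: wait for the gossip join (and hence the
		// network clock sync) before declaring initialization done.
		if err := a.joinGossip(peers); err != nil {
			return err
		}
	}
	close(a.initDone)
	return nil
}
```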
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
In cases where a node leaves the cluster and quickly rejoins before the
node entry is expired by the other nodes in the cluster, we fail to add
it to the quick lookup database when it rejoins. Fixed it.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
In networkdb we should ignore delete events for entries which don't
exist in the db. This is always safe, because if the entry does not
exist it was removed much earlier and got purged after the reap timer,
so the notification is very stale.
Also, duplicate delete notifications were being sent to the clients:
one when the actual delete event was received from gossip, and a second
when the entry was getting reaped. The second notification is
unnecessary and may cause issues with clients if they are not coded for
idempotency.
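Illustrative sketch (simplified, not the real networkdb structures) of ignoring deletes for unknown entries and notifying clients only once.
```go
package sketch

type entry struct {
	deleting bool // already marked for reaping
}

type table map[string]*entry

// handleDelete returns true only when a client notification should be sent.
func (t table) handleDelete(key string) bool {
	e, ok := t[key]
	if !ok {
		// Entry never existed or was already reaped: the event is stale,
		// so drop it without notifying clients.
		return false
	}
	if e.deleting {
		// Delete already processed; reaping it later must not notify again.
		return false
	}
	e.deleting = true
	return true
}
```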
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
When a node leaves the swarm cluster, we should clean up the ingress
network and sandbox. This makes sure that the next time the node joins
the swarm it will be able to update the cluster with the right
information.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
The SNAT rules added for LB egress are too broad and break load
balancing if the service is connected to multiple networks. Make them
conditional on the subnet to which the network belongs, so that the
right SNAT rule is matched when egressing the corresponding network.
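Sketch only (the rule layout is an assumption, not the exact libnetwork rule): scoping the egress SNAT rule to the network's subnet lets a service attached to multiple networks match the correct rule per network.
```go
package sketch

// egressSNATRule builds iptables arguments for a per-network SNAT rule.
func egressSNATRule(subnet, sourceIP string) []string {
	return []string{
		"-t", "nat", "-A", "POSTROUTING",
		"-m", "ipvs", "--ipvs", // only load-balanced traffic
		"-d", subnet, // restrict the rule to this network's subnet
		"-j", "SNAT", "--to-source", sourceIP,
	}
}
```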
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
Fixed certain spurious overlay errors which were not errors at all but
were showing up every time service tasks are started in the engine.
Also added a check to make sure a delete is valid, by comparing the
incoming endpoint id with the one in peerdb, to make sure the delete
from gossip is not stale.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
Make service load balancing work from within one of the containers of
the service. Currently this only works when the load balancer selects
the current container. If another container of the same service is
chosen, the connection times out. This fix adds a SNAT rule to change
the source IP to the container's primary IP so that responses can be
routed back to this container.
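A hedged sketch of installing the SNAT rule described above; the exact match libnetwork uses may differ.
```go
package sketch

import (
	"fmt"
	"os/exec"
)

// addSelfLBSNAT rewrites the source of load-balanced traffic to the
// container's primary IP so that a backend task of the same service can
// route its replies back to this container instead of timing out.
func addSelfLBSNAT(primaryIP string) error {
	args := []string{
		"-t", "nat", "-A", "POSTROUTING",
		"-m", "ipvs", "--ipvs", // only connections handled by IPVS
		"-j", "SNAT", "--to-source", primaryIP,
	}
	if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
		return fmt.Errorf("iptables %v failed: %v (%s)", args, err, out)
	}
	return nil
}
```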
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
The ingress loadbalancer only needs to be plumbed in the ingress
sandboxes of the nodes, which are the only mechanism to get traffic
from outside the cluster to the tasks. Since the tasks are part of the
ingress network, these loadbalancers were getting added to all tasks
which expose ports, which is totally unnecessary resource usage. This
PR avoids that.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
If not enough keys are provided to SetKeys, this may cause a panic. This
should not cause problems with the current integration in Docker 1.12.0,
but the panic might happen loading data created by an earlier version,
or data that is corrupted somehow. Add a length check to be defensive.
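A defensive length check of the kind described, sketched with assumed types and an assumed minimum count (not the actual libnetwork SetKeys signature).
```go
package sketch

import "fmt"

type key struct {
	Key []byte
}

// setKeys validates the input before indexing into the slice, returning
// an error on short or corrupted input instead of panicking.
func setKeys(keys []*key) error {
	const minKeys = 3 // assumed minimum for illustration
	if len(keys) < minKeys {
		return fmt.Errorf("expected at least %d keys, got %d", minKeys, len(keys))
	}
	// ... proceed to install the keys ...
	return nil
}
```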
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Sometimes you may get stale backend removal notices from gossip due to
some lingering state. If a stale backend notice is received and it has
already been processed on this node, ignore it rather than processing
it again.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
The CopyTo function for joininfo is not copying the driver table
entries, which are then missing when the endpoint is re-read from the
store cache.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
If a remote plugin returns an empty string in response to RequestAddress(),
the internal helper will return nil which will crash libnetwork in several
places.
Treat an empty string as a new error ipamapi.ErrNoIPReturned.
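A minimal sketch of the guard described; the error value and helper name here are illustrative, and the CIDR parsing of the plugin response is an assumption.
```go
package sketch

import (
	"errors"
	"net"
)

var ErrNoIPReturned = errors.New("ipam driver returned no IP address")

// parseRequestedAddress turns the plugin's response into a net.IP, treating
// an empty string as an explicit error instead of returning nil.
func parseRequestedAddress(addr string) (net.IP, error) {
	if addr == "" {
		return nil, ErrNoIPReturned
	}
	ip, _, err := net.ParseCIDR(addr)
	if err != nil {
		return nil, err
	}
	return ip, nil
}
```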
Signed-off-by: Thomas Graf <tgraf@suug.ch>
- We need to compare the node notification IP with the advertise
  address; otherwise, when the advertise address is different from the
  local address (this is the case for a public address outside of the
  host that maps 1-to-1 to the local private address), the local IP
  will be accounted as an ipsec host and extra states will be
  programmed for it.
Signed-off-by: Alessandro Boch <aboch@docker.com>
- When creating a non-encrypted overlay network, make sure no
  encryption-related mangle rule left over from a stale network is in
  the way.
Signed-off-by: Alessandro Boch <aboch@docker.com>
- Because of a bug in the netlink xfrm code, our code will fail to find
  and remove the states. While we could wait for the netlink library
  fix, there is no longer a need to convert the parsed IP addresses to
  the canonical notation, given that the previous SPI computation
  (which relied on the 4-byte address assumption) has been replaced by
  the fnv hash (see the sketch below).
- Also rename the driver option that enables ipsec to "encrypted"
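Hedged sketch of an fnv-based SPI computation as mentioned above; the exact inputs libnetwork hashes may differ.
```go
package sketch

import (
	"hash/fnv"
	"net"
)

// computeSPI derives a 32-bit SPI from the endpoint IPs using FNV-1a,
// which works for both 4-byte and 16-byte address representations, so no
// canonical-notation conversion of the parsed IPs is needed.
func computeSPI(srcIP, dstIP net.IP) uint32 {
	h := fnv.New32a()
	h.Write(srcIP)
	h.Write(dstIP)
	return h.Sum32()
}
```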
Signed-off-by: Alessandro Boch <aboch@docker.com>
With this change, all auto-detection of addresses is removed from
libnetwork and the caller takes the responsibility of providing a
proper advertise-addr in various scenarios (including an externally
facing public advertise-addr with an internally facing private
listen-addr).
Signed-off-by: Madhu Venugopal <madhu@docker.com>
The UDS sock is a unique file whose lifetime lasts until the docker
daemon dies (gracefully). Hence there is no need for it to be under
/var/lib, and it is not mandatory for it to be configurable either.
Signed-off-by: Madhu Venugopal <madhu@docker.com>
A network is added to the `d.networks` map before it's fully initialized. That
is, it's possible for a network in `d.networks` to exist without having
`bridgeIPv4` populated yet. If multiple networks are spun up close to the same
time, a panic can occur.
Example:
```
panic(0x1a75d20, 0xc82000e090)
/usr/local/go/src/runtime/panic.go:443 +0x4e9
net.networkNumberAndMask(0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/ip.go:433 +0x42
net.(*IPNet).Contains(0x0, 0xc82084dbd0, 0x4, 0x4, 0xc820010200)
/usr/local/go/src/net/ip.go:457 +0x25
github.com/docker/libnetwork/drivers/bridge.(*networkConfiguration).conflictsWithNetworks(0xc822249360, 0xc822761380, 0x40, 0xc820866a60, 0x4, 0x4, 0x0, 0x0)
/root/rpmbuild/BUILD/docker-engine/vendor/src/github.com/docker/libnetwork/drivers/bridge/bridge.go:334 +0x40b
```
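Illustrative guard only (simplified signature, not the actual bridge driver code): skip networks whose bridgeIPv4 has not been populated yet instead of dereferencing a nil *net.IPNet inside Contains().
```go
package sketch

import "net"

type bridgeNetwork struct {
	bridgeIPv4 *net.IPNet
}

func conflictsWithNetworks(candidate *net.IPNet, others []*bridgeNetwork) bool {
	for _, nw := range others {
		if nw.bridgeIPv4 == nil {
			// Network is still initializing; nothing to compare against yet.
			continue
		}
		if nw.bridgeIPv4.Contains(candidate.IP) || candidate.Contains(nw.bridgeIPv4.IP) {
			return true
		}
	}
	return false
}
```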
Signed-off-by: Andy Lindeman <alindeman@salesforce.com>
Rather than re-execing docker as the proxy, create a new command,
docker-proxy, that is much smaller, to save memory in the case where
there are a lot of proxies being created. This also allows the proxy to
be replaced; for example, in Docker for Mac we have a proxy that
proxies to osx instead of locally.
This is the vendoring pull for https://github.com/docker/docker/pull/23312
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
When deleting entries or when learning about deleted entries, remember
them for a longer time to avoid excessive delete duplicates in the
gossip cluster. Also added code changes to ignore event messages
originating from the source node so that they don't get added to the
rebroadcast queue.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
While scaling down, we currently remove the service record even if the
LB entry for the vip is not fully removed. This causes resolution
issues when scaling down. Fixed it by removing the service record only
if the LB for the vip is going away.
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
Currently ovmanager simply logs an error when there is a vni allocation
failure. Instead it should error out and free all the previously
allocated vnis.
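Sketch of the cleanup-on-failure idea; the allocator interface here is an assumption, not ovmanager's actual API.
```go
package sketch

type vniAllocator interface {
	Allocate() (uint32, error)
	Release(vni uint32)
}

// allocateVNIs rolls back everything allocated so far if any single
// allocation fails, and returns the error instead of just logging it.
func allocateVNIs(a vniAllocator, count int) ([]uint32, error) {
	vnis := make([]uint32, 0, count)
	for i := 0; i < count; i++ {
		vni, err := a.Allocate()
		if err != nil {
			for _, v := range vnis {
				a.Release(v)
			}
			return nil, err
		}
		vnis = append(vnis, vni)
	}
	return vnis, nil
}
```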
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
While trying to update loadbalancer state, index the service both on id
and portconfig. From the libnetwork point of view a service is not just
defined by its id but also by the ports it exposes. When a service
updates its ports, its id remains the same but its portconfigs change;
this should be treated as a new service in libnetwork in order to
ensure proper cleanup of the old LB state and creation of the new LB
state.
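Hedged sketch of a composite service key; the hashing scheme is an assumption, not libnetwork's actual implementation.
```go
package sketch

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

type portConfig struct {
	Protocol      string
	TargetPort    uint32
	PublishedPort uint32
}

// serviceKey combines the service id with a digest of its port configs so
// that a service whose ports changed is treated as a new LB entry.
func serviceKey(id string, ports []portConfig) string {
	parts := make([]string, 0, len(ports))
	for _, p := range ports {
		parts = append(parts, fmt.Sprintf("%s/%d/%d", p.Protocol, p.TargetPort, p.PublishedPort))
	}
	sort.Strings(parts) // order-independent digest
	h := sha256.New()
	for _, s := range parts {
		h.Write([]byte(s))
	}
	return id + ":" + hex.EncodeToString(h.Sum(nil))
}
```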
Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>
- If an endpoint is forcibly removed, it should not
matter whether the locator info is present. If
the daemon was started w/o the --cluster-advertise
option (the option is not mandatory), then the
locator would be empty for any endpoint.
Signed-off-by: Alessandro Boch <aboch@docker.com>
If xfrm modules cannot be loaded:
- Create netlink.Handle only for ROUTE socket
- Reject local join on overlay secure network
Signed-off-by: Alessandro Boch <aboch@docker.com>