0ct0pu5/moby

Author	SHA1	Message	Date
Flavio Crisciani	d6440c9139	optimize the rebroadcast for failure case Previously, when a node failed, all the nodes would bump the Lamport time of all their entries. This means that if a node flaps, there is a storm of updates for all the entries. Building on the previous logic, this commit guarantees that only the node that rejoins readvertises its own entries; the other nodes don't need to advertise again. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 14:08:54 -07:00
Flavio Crisciani	a3ecb8902a	fix join/leave join/leave fixes: - when a node leaves the network, it deletes all the other nodes' entries but keeps track of its own, so that other nodes that are TCP syncing become aware of their deletion (a node that has not yet received the network leave can potentially TCP sync). - add network reapTime, which was not being set locally Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 14:08:45 -07:00
Flavio Crisciani	e77c245e45	2x faster to converge - Introduced back the Invalidate - optimized the rebroadcast logic Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 13:47:18 -07:00
Flavio Crisciani	585964bf32	NetworkDB testing infra - Diagnose framework that exposes REST API for db interaction - Dockerfile to build the test image - Periodic print of stats regarding queue size - Client and server side for integration with testkit - Added write-delete-leave-join - Added test write-delete-wait-leave-join - Added write-wait-leave-join Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-27 08:50:43 -07:00
Flavio Crisciani	60b5add4af	NetworkDB allow setting PacketSize - Introduce the possibility to specify the max buffer length in network DB. This allows using the whole MTU limit of the interface - Add queue stats per network, which can be handy to identify the node's throughput per network and to spot an imbalance between nodes that can point to an MTU misconfiguration Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-26 13:44:33 -07:00
Flavio Crisciani	051a0d5ce9	NetworkDB incorrect number of entries in networkNodes A rapid (within networkReapTime 30min) leave/join network can corrupt the list of nodes per network with multiple copies of the same nodes. The fix makes sure that each node is present only once Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-18 16:57:49 -07:00
Sebastiaan van Stijn	3dd1fb1217	Make node join event logging less noisy Commit `ca9a768d80` added a number of debugging messages for node join/leave events. This patch checks if a node already was listed, and otherwise skips the logging to make the logs a bit less noisy. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2017-07-10 17:25:14 -07:00
Santhosh Manohar	6bd57f977d	Fix go generate for protobuf Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-07-05 16:31:12 -07:00
Flavio Crisciani	39d2204896	Service discovery logic rework Changed the ipMap to SetMatrix to allow transient states. Compacted addSvc and deleteSvc into a single method. Updated the data structure for backends to allow storing all the information needed to clean up properly during cleanupServiceBindings. Removed the enable/disable Service logic that was racing with the sbLeave/sbJoin logic. Added some debug logs to track further race conditions. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-06-11 20:49:29 -07:00
Madhu Venugopal	78a910ee17	Merge pull request #1787 from fcrisciani/goroutine_leak Fix leak of handleTableEvents	2017-06-06 13:17:17 -07:00
Madhu Venugopal	59994bbb15	Merge pull request #1775 from sanimej/gossip Handle single manager reload by having workers reconnect	2017-05-31 14:57:34 -07:00
Santhosh Manohar	ca9a768d80	Handle single manager reload by having workers reconnect Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-05-31 14:36:23 -07:00
Flavio Crisciani	6d768ef73c	Fix leak of handleTableEvents The channel ch.C is never closed. Added the listen of the ch.Done() to guarantee that the goroutine is exiting once the event channel is closed Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-31 11:04:19 -07:00
Flavio Crisciani	f585f33042	Node failure timeout fix The time to keep a failed node in the failed node list was originally supposed to be 24h. If a node leaves explicitly, it is removed from the list of nodes and put into the leftNodes list. This way the NotifyLeave event won't insert it into the retry list. NOTE: if the event is lost, the behavior is instead the same as for a failed node. If a node fails, NotifyLeave inserts it into the failedNodes list with a reapTime of 24h. This means that the node is checked for 24h before being completely forgotten. The current check runs every 1 second and is done by the reconnectNode function. The failed node list is updated every 2h instead. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-22 17:19:31 -07:00
Santhosh Manohar	06c3489bb8	retry once on a bulk sync failure Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-05-11 21:13:18 -07:00
Flavio Crisciani	da9ac65ea6	Remove explicit set of memberlist protocol Memberlist does a full validation of the protocol version (min, current, max) among all the nodes of the cluster. The previous code was setting the protocol version to max version. That made the upgrade incompatible. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-08 16:58:53 -07:00
Madhu Venugopal	1624c61ef2	Merge pull request #1727 from sanimej/cphard control-plane hardening: Avoid nDB stale entries	2017-04-25 11:04:13 -07:00
Santhosh Manohar	1693144ae2	Merge pull request #1713 from aboch/nse On clusterLeave, notify only if there are peers	2017-04-23 16:31:46 -07:00
Alessandro Boch	1323730eca	On send node events, notify only if there are peers - Otherwise operation will unnecessarily block for five seconds. - This is particularly noticeable on graceful shutdown of daemon in one node cluster. Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-04-21 10:19:08 -07:00
Santhosh Manohar	102f9d230d	Avoid nDB stale entries because of intermittent nw issues. Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-04-19 14:01:28 -07:00
Santhosh Manohar	69ad7ef244	control-plane hardening: cleanup local state on peer leaving a network Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-03-31 01:49:03 -07:00
Santhosh Manohar	539888412b	Merge pull request #1689 from aboch/inv Do not invalidate table event messages	2017-03-16 13:47:01 -07:00
Alessandro Boch	9c3c86a931	Do not invalidate table event messages - Do not run the risk of suppressing meaningful messages for the rest of the cluster, as many services depend on it, like the service records and the distributed load balancers. Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-03-16 00:49:58 -07:00
Alessandro Boch	4b306ee83d	Fix panic in networkdb test code fatal error: concurrent map read and map write goroutine 264 [running]: runtime.throw(0x90043c, 0x21) /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc4203d1d68 sp=0xc4203d1d48 runtime.mapaccess2_faststr(0x86df20, 0xc4203f5470, 0xc42044afc0, 0x5, 0xc4203d1e40, 0x4ed6b8) /usr/local/go/src/runtime/hashmap_fast.go:306 +0x52b fp=0xc4203d1dc8 sp=0xc4203d1d68 github.com/docker/libnetwork/networkdb.(*NetworkDB).verifyNodeExistence(0xc42007e160, 0xc42008a240, 0xc42044afc0, 0x5, 0x1) /go/src/github.com/docker/libnetwork/networkdb/networkdb_test.go:58 +0x6c fp=0xc4203d1e50 sp=0xc4203d1dc8 Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-03-15 23:26:32 -07:00
Santhosh Manohar	bfab379411	swarm mode network inspect should provide cluster-wide task details Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-03-10 19:12:00 -08:00
Madhu Venugopal	bb560a1f44	Generating node discovery events to the drivers from networkdb With the introduction of networkdb, the node discovery events were not sent to the drivers. This commit generates the node discovery events and sends them to the drivers interested in them. Signed-off-by: Madhu Venugopal <madhu@docker.com>	2017-02-01 17:54:51 -08:00
Alessandro Boch	595246bdfb	Merge pull request #1568 from likel/refactor Remove unnecessary string formats	2016-12-29 12:18:06 -08:00
Santhosh Manohar	176088a742	Merge pull request #968 from aboch/ed6 Control IPv6 on container's interface	2016-12-22 18:15:15 -08:00
Santhosh Manohar	0c2b4b267c	Check for node's presence in networkDB's node map before accessing. Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-12-05 00:58:59 -08:00
Madhu Venugopal	224a73d60b	Merge pull request #1576 from daehyeok/misspell Fixed misspelling	2016-12-02 16:02:23 -08:00
Aaron Lehmann	bb8b9a6040	networkdb: Properly format memberlist logs Right now, items logged by memberlist end up as a complete log line embedded inside another log line, like the following: Nov 22 16:34:16 hostname dockerd: time="2016-11-22T16:34:16.802103258-08:00" level=info msg="2016/11/22 16:34:16 [INFO] memberlist: Marking xyz-1d1ec2dfa053 as failed, suspect timeout reached\n" This has two time and date stamps, and an escaped newline inside the "msg" field of the outer log message. To fix this, define a custom logger that only prints the message itself. Capture this message in logWriter, strip off the log level (added directly by memberlist), and route to the appropriate logrus method. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>	2016-12-01 19:08:07 -08:00
Alessandro Boch	fac86cf69a	Add missing locks in agent and service code Signed-off-by: Alessandro Boch <aboch@docker.com>	2016-11-29 13:58:06 -08:00
Daehyeok Mun	f89d6b0073	Fixed misspelling Signed-off-by: Daehyeok Mun <daehyeok@gmail.com>	2016-11-28 11:46:52 -07:00
Alessandro Boch	f195563a4e	Control IPv6 on container's interface - Disable ipv6 on all interface by default at sandbox creation. Enable IPv6 per interface basis if the interface has an IPv6 address. In case sandbox has an IPv6 interface, also enable IPv6 on loopback interface. Signed-off-by: Alessandro Boch <aboch@docker.com>	2016-11-22 15:38:24 -08:00
Ke Li	23ac56fdd0	Remove unnecessary string formats Signed-off-by: Ke Li <kel@splunk.com>	2016-11-22 09:29:53 +08:00
Santhosh Manohar	27500b1e35	Separate service LB & SD from network plumbing Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-11-17 13:09:14 -08:00
Victor Vieux	236dc57a9e	fix unsafe access on arm Signed-off-by: Victor Vieux <vieux@docker.com>	2016-11-10 23:05:11 -08:00
Santhosh Manohar	31dd4362a8	Merge pull request #1542 from allencloud/change-reapNode-interval update reapNode interval	2016-11-08 11:14:23 -08:00
allencloud	0b4f68390d	remove unused mConfig Signed-off-by: allencloud <allen.sun@daocloud.io>	2016-11-08 18:18:55 +08:00
allencloud	99f84ff5a7	update reapNode interval Signed-off-by: allencloud <allen.sun@daocloud.io>	2016-11-08 15:28:42 +08:00
Alessandro Boch	c5ca82daf4	Merge pull request #1519 from sanimej/newlb Add sandbox API for task insertion to service LB and service discovery	2016-11-03 13:31:46 -07:00
Santhosh Manohar	c52c8ca6eb	Add NetworkDB API to fetch the per network peer (gossip cluster) list Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-11-02 13:58:15 -07:00
Santhosh Manohar	a7e1718800	Add sandbox API for task insertion to service LB and service discovery Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-10-25 05:41:44 -07:00
Santhosh Manohar	e98b152bac	Reap failed nodes after 24 hours Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-10-20 11:24:04 -07:00
Alessandro Boch	6b74a8d479	Merge pull request #1476 from sanimej/time Use monotonic clock source to reap networkDB entries	2016-10-20 07:30:41 -07:00
Santhosh Manohar	0a2537eea3	Use monotonic clock for reaping networkDB entries Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2016-10-19 22:30:47 -07:00
Alexander Morozov	c772d14e58	networkdb: fix race in deleteNetwork There are multiple places that read from that slice (e.g. bulkSync). Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>	2016-10-12 08:42:05 -07:00
Alexander Morozov	03088ace1b	networkdb: fix race in access to nodes len Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>	2016-10-04 12:19:25 -07:00
Jana Radhakrishnan	f649d5ae61	Do not hold ack channel in ack table after closing Once the bulksync ack channel is closed remove it from the ack table right away. There is no reason to keep it in the ack table and later delete it in the ack waiter. Ack waiter anyways has reference to the channel on which it is waiting. Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>	2016-10-03 09:50:02 -07:00
Jana Radhakrishnan	22c322dded	Avoid returning early on agent join failures When a gossip join failure happens, do not return early in the call chain, because a join failure is most likely transient and the retry logic built into networkdb is going to retry and succeed. Returning early prevents the initialization of the ingress network/sandbox, which causes a problem even after the gossip join succeeds on retry. Signed-off-by: Jana Radhakrishnan <mrjana@docker.com>	2016-09-27 08:36:10 -07:00

79 commits