beenull/moby

Author	SHA1	Message	Date
Euan Harris	96c7cba64c	networkdb, drivers: Regenerate protocol buffers agent.pb.go is unchanged, but the files in networkdb and drivers are slightly different when regenerated using the current versions of protoc and gogoproto. This is probably because agent.pb.go was last regenerated quite recently, in February 2018, whereas networkdb.pb.go and overlay/overlay.pb.go were last changed in 2017, and windows/overlay/overlay.pb.go was last changed in 2016. Signed-off-by: Euan Harris <euan.harris@docker.com>	2018-06-22 15:03:12 +01:00
Flavio Crisciani	48196df4a2	Further makefile cleanup - cleaned the make check - local build do not require context Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2018-06-16 11:03:11 -07:00
Flavio Crisciani	65e8971ffd	Merge pull request #2134 from dani-docker/esc-532 Adding a recovery mechanism for a split gossip cluster	2018-04-23 13:14:27 -07:00
Dani Louca	96472cdaea	Adding a recovery mechanism for a split gossip cluster Signed-off-by: Dani Louca <dani.louca@docker.com>	2018-04-23 14:18:46 -04:00
Brian Goff	bc465326fe	networkdb: Use write lock in handleNodeEvent `handleNodeEvent` is calling `changeNodeState` which writes to various maps on the ndb object. Using a write lock prevents a panic on concurrent read/write access on these maps. Signed-off-by: Brian Goff <cpuguy83@gmail.com>	2018-04-11 21:28:29 -04:00
Flavio Crisciani	9b7922ff6e	Fix README flag and expose orphan network peers - Readme example was using wrong flag - Network peers were not exposed properly Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2018-03-23 10:19:02 -07:00
Flavio Crisciani	a59ecd9537	Change diagnose module name to diagnostic Align it to the moby/moby external api Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2018-01-25 16:09:29 -08:00
Flavio Crisciani	64da6b8889	Avoid delay on node rejoin, avoid useless witness Avoid waiting for a double notification once a node rejoin, just put it back to active state. Waiting for a further message does not really add anything to the safety of the operation, the source of truth for the node status resided inside memberlist. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2018-01-23 16:21:18 -08:00
Flavio Crisciani	b190ee3ccf	Cleanup node management logic Created method to handle the node state change with cleanup operation associated. Realign testing client with the new diagnostic interface Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-12-13 09:40:38 -08:00
Flavio Crisciani	3e544bc500	Avoid extra notification on node leave If a node leave, avoid to notify the upper layer for entries that are already marked for deletion Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-12-01 16:19:38 -08:00
Flavio Crisciani	b578cdce86	Diagnose framework for networkDB This commit introduces the possibility to enable a debug mode for the networkDB, this will allow the opening of a tcp port on localhost that will expose the networkDB api for debugging purposes. The API can be discovered using curl localhost:<port>/help It support json output if passed json as URL query parameter option and pretty printing if passing json=pretty All the binaries values are serialized in base64 encoding, this can be skip passing the unsafe option as url query parameter A simple go client will follow up Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-12-01 16:19:35 -08:00
Flavio Crisciani	f0fcb0bbe6	Fixed race on quick node fail/join The previous logic was not properly handling the case of a node that was failing and oining back in short period of time. The issue was in the handling of the network messages. When a node joins it sync with other nodes, these are passing the whole list of nodes that at best of their knowledge are part of a network. At this point if the node receives that node A is part of the network it saves it before having received the notification that node A is actually alive (coming from memberlist). If node A failed the source node will receive the notification while the new joined node won't because memberlist never advertise node A as available. In this case the new node will never purge node A from its state but also worse, will accept any table notification where node A is the owner and so will end up in a out of sync state with the rest of the cluster. This commit contains also some code cleanup around the area of node management Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-11-27 14:38:06 -08:00
Flavio Crisciani	4037132b33	Fix listen port for test infra Update Dockerfile, curl is used for the healthcheck Add /dump for creating the routine stack trace Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-11-16 16:23:44 -08:00
Flavio Crisciani	a41f623b10	Merge pull request #1957 from fcrisciani/netdb-gc-test Add test to confirm garbage collection	2017-11-08 16:25:47 -08:00
Flavio Crisciani	7fbaf6de2c	Add test to confirm garbage collection - Create a test to verify that a node that joins in an async way is not going to extend the life of a already deleted object Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-10-23 09:58:57 +02:00
Flavio Crisciani	1732ab426d	Handle cleanup DNS for attachable container Attachable containers they are tasks with no service associated their cleanup was not done properly so it was possible to have a leak of their name resolution if that was the last container on the network. Cleanupservicebindings was not able to do the cleanup because there is no service, while also the notification of the delete arrives after that the network is already being cleaned Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-10-12 21:41:29 -07:00
Flavio Crisciani	ad577a25fe	Changed ipMask to string Avoid error logs in case of local peer case, there is no need for deleteNeighbor Avoid the network leave to readvertise already deleted entries to upper layer Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-10-02 17:29:18 -07:00
Flavio Crisciani	ef2e91707d	Merge pull request #1958 from ityangchen/test-libnetwork Repair (*Broadcaster).run goroutine leak	2017-09-30 10:44:12 -07:00
Flavio Crisciani	b92d91d6a1	Fix comparison against wrong constant The comparison was against the wrong constant value. As described in the comment the check is there to guarantee to not propagate events realted to stale deleted elements Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-09-29 21:05:24 -07:00
yangchenliang	955c532735	Repair (Broadcaster).run goroutine leak When execute 'docker swarm init' and 'docker swarm leave -f' on a node repeatedly, the (Broadcaster).run goroutine leak. Signed-off-by: yangchenliang <yangchenliang@huawei.com>	2017-09-29 18:56:16 +08:00
Flavio Crisciani	8c31217a44	NetworkDB create NodeID for cluster nodes Separate the hostname from the node identifier. All the messages that are exchanged on the network are containing a nodeName field that today was hostname-uniqueid. Now being encoded as strings in the protobuf without any length restriction they plays a role on the effieciency of protocol itself. If the hostname is very long the overhead will increase and will degradate the performance of the database itself that each single cycle by default allows 1400 bytes payload Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-09-26 10:48:04 -07:00
Flavio Crisciani	a4e64d05c1	Avoid alignment of reapNetwork and tableEntries Make sure that the network is garbage collected after the entries. Entries to be deleted requires that the network is present. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-09-22 10:57:47 -07:00
Flavio Crisciani	053a534ab1	Changed ReapTable logic - Changed the loop per network. Previous implementation was taking a ReadLock to update the reapTime but now with the residualReapTime also the bulkSync is using the same ReadLock creating possible issues in concurrent read and update of the value. The new logic fetches the list of networks and proceed to the cleanup network by network locking the database and releasing it after each network. This should ensure a fair locking avoiding to keep the database blocked for too much time. Note: The ticker does not guarantee that the reap logic runs precisely every reapTimePeriod, actually documentation says that if the routine is too long will skip ticks. In case of slowdown of the process itself it is possible that the lifetime of the deleted entries increases, it still should not be a huge problem because now the residual reaptime is propagated among all the nodes a slower node will let the deleted entry being repropagate multiple times but the state will still remain consistent. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-09-21 09:37:47 -07:00
Flavio Crisciani	2d2a2bc568	Fix reapTime logic in NetworkDB - Added remainingReapTime field in the table event. Wihtout it a node that did not have a state for the element was marking the element for deletion setting the max reapTime. This was creating the possibility to keep the entry being resync between nodes forever avoding the purpose of the reap time itself. - On broadcast of the table event the node owner was rewritten with the local node name, this was not correct because the owner should continue to remain the original one of the message Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-09-21 09:37:37 -07:00
Derek McGowan	710e0664c4	Update logrus to v1.0.1 Fix case sensitivity issue Update docker and runc vendors Signed-off-by: Derek McGowan <derek@mcgstyle.net>	2017-08-07 11:20:47 -07:00
Flavio Crisciani	2e38c53def	PeerInit for the sandbox init Move the sandbox init logic into the go routine that handles peer operations. This is to avoid deadlocks in the use of the pMap.Lock for the network Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-05 12:07:31 -07:00
Flavio Crisciani	d6440c9139	optimize the rebroadcast for failure case Before when a node was failing, all the nodes would bump the lamport time of all their entries. This means that if a node flap, there will be a storm of update of all the entries. This commit on the base of the previous logic guarantees that only the node that joins back will readvertise its own entries, the other nodes won't need to advertise again. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 14:08:54 -07:00
Flavio Crisciani	a3ecb8902a	fix join/leave join/leave fixes: - when a node leaves the network will deletes all the other nodes entries but will keep track of its to make sure that other nodes if they are tcp syncing will be aware of them being deleted. (a node that did not yet receive the network leave will potentially tcp/sync) add network reapTime, was not being set locally Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 14:08:45 -07:00
Flavio Crisciani	e77c245e45	2x faster to converge - Introduced back the Invalidate - optimized the rebroadcast logic Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-08-01 13:47:18 -07:00
Flavio Crisciani	585964bf32	NetworkDB testing infra - Diagnose framework that exposes REST API for db interaction - Dockerfile to build the test image - Periodic print of stats regarding queue size - Client and server side for integration with testkit - Added write-delete-leave-join - Added test write-delete-wait-leave-join - Added write-wait-leave-join Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-27 08:50:43 -07:00
Flavio Crisciani	60b5add4af	NetworkDB allow setting PacketSize - Introduce the possibility to specify the max buffer length in network DB. This will allow to use the whole MTU limit of the interface - Add queue stats per network, it can be handy to identify the node's throughput per network and identify unbalance between nodes that can point to an MTU missconfiguration Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-26 13:44:33 -07:00
Flavio Crisciani	051a0d5ce9	NetworkDB incorrect number of entries in networkNodes A rapid (within networkReapTime 30min) leave/join network can corrupt the list of nodes per network with multiple copies of the same nodes. The fix makes sure that each node is present only once Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-07-18 16:57:49 -07:00
Sebastiaan van Stijn	3dd1fb1217	Make node join event logging less noisy Commit `ca9a768d80` added a number of debugging messages for node join/leave events. This patch checks if a node already was listed, and otherwise skips the logging to make the logs a bit less noisy. Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2017-07-10 17:25:14 -07:00
Santhosh Manohar	6bd57f977d	Fix go generate for protobuf Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-07-05 16:31:12 -07:00
Flavio Crisciani	39d2204896	Service discovery logic rework changed the ipMap to SetMatrix to allow transient states Compacted the addSvc and deleteSvc into a one single method Updated the datastructure for backends to allow storing all the information needed to cleanup properly during the cleanupServiceBindings Removed the enable/disable Service logic that was racing with sbLeave/sbJoin logic Add some debug logs to track further race conditions Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-06-11 20:49:29 -07:00
Madhu Venugopal	78a910ee17	Merge pull request #1787 from fcrisciani/goroutine_leak Fix leak of handleTableEvents	2017-06-06 13:17:17 -07:00
Madhu Venugopal	59994bbb15	Merge pull request #1775 from sanimej/gossip Handle single manager reload by having workers reconnect	2017-05-31 14:57:34 -07:00
Santhosh Manohar	ca9a768d80	Handle single manager reload by having workers reconnect Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-05-31 14:36:23 -07:00
Flavio Crisciani	6d768ef73c	Fix leak of handleTableEvents The channel ch.C is never closed. Added the listen of the ch.Done() to guarantee that the goroutine is exiting once the event channel is closed Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-31 11:04:19 -07:00
Flavio Crisciani	f585f33042	Node failure timeout fix The time to keep a node failed into the failed node list was originally supposed to be 24h. If a node leaves explicitly it will be removed from the list of nodes and put into the leftNodes list. This way the NotifyLeave event won't insert it into the retry list. NOTE: if the event is lost instead the behavior will be the same as a failed node. If a node fails, the NotifyLeave will insert it into the failedNodes list with a reapTime of 24h. This means that the node will be checked for 24h before being completely forgot. The current check time is every 1 second and is done by the reconnectNode function. The failed node list is updated every 2h instead. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-22 17:19:31 -07:00
Santhosh Manohar	06c3489bb8	retry once on a bulk sync failure Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-05-11 21:13:18 -07:00
Flavio Crisciani	da9ac65ea6	Remove explicit set of memberlist protocol Memberlist does a full validation of the protocol version (min, current, max) amoung all the ndoes of the cluster. The previous code was setting the protocol version to max version. That made the upgrade incompatible. Signed-off-by: Flavio Crisciani <flavio.crisciani@docker.com>	2017-05-08 16:58:53 -07:00
Madhu Venugopal	1624c61ef2	Merge pull request #1727 from sanimej/cphard control-plane hardening: Avoid nDB stale entries	2017-04-25 11:04:13 -07:00
Santhosh Manohar	1693144ae2	Merge pull request #1713 from aboch/nse On clusterLeave, notify only if there are peers	2017-04-23 16:31:46 -07:00
Alessandro Boch	1323730eca	On send node envents, notify only if there are peers - Otherwise operation will unnecessarely block for five seconds. - This is particularly noticeable on graceful shutdown of daemon in one node cluster. Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-04-21 10:19:08 -07:00
Santhosh Manohar	102f9d230d	Avoid nDB stale entries because of intermittent nw issues. Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-04-19 14:01:28 -07:00
Santhosh Manohar	69ad7ef244	control-plane hardning: cleanup local state on peer leaving a network Signed-off-by: Santhosh Manohar <santhosh@docker.com>	2017-03-31 01:49:03 -07:00
Santhosh Manohar	539888412b	Merge pull request #1689 from aboch/inv Do not invalidate table event messages	2017-03-16 13:47:01 -07:00
Alessandro Boch	9c3c86a931	Do not invalidate table event messages - Do not run the risk of suppressing meaningful messages for the rest of the cluster, as a many services depend on it, like the service records and the distributed load balancers. Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-03-16 00:49:58 -07:00
Alessandro Boch	4b306ee83d	Fix panic in networkdb test code fatal error: concurrent map read and map write goroutine 264 [running]: runtime.throw(0x90043c, 0x21) /usr/local/go/src/runtime/panic.go:566 +0x95 fp=0xc4203d1d68 sp=0xc4203d1d48 runtime.mapaccess2_faststr(0x86df20, 0xc4203f5470, 0xc42044afc0, 0x5, 0xc4203d1e40, 0x4ed6b8) /usr/local/go/src/runtime/hashmap_fast.go:306 +0x52b fp=0xc4203d1dc8 sp=0xc4203d1d68 github.com/docker/libnetwork/networkdb.(*NetworkDB).verifyNodeExistence(0xc42007e160, 0xc42008a240, 0xc42044afc0, 0x5, 0x1) /go/src/github.com/docker/libnetwork/networkdb/networkdb_test.go:58 +0x6c fp=0xc4203d1e50 sp=0xc4203d1dc8 Signed-off-by: Alessandro Boch <aboch@docker.com>	2017-03-15 23:26:32 -07:00

1 2 3

105 commits