
Ensure task reaper stopChan is closed no more than once #2491

Merged
1 commit merged into moby:master from the stop-task-reaper-once branch on Feb 1, 2018

Conversation

nishanttotla
Contributor

Signed-off-by: Nishant Totla <nishanttotla@gmail.com>

@stevvooe
Contributor

LGTM but add some assurance that run cannot be called twice.

// TODO(dperny) calling stop on the task reaper twice will cause a panic
// because we try to close a channel that will already have been closed.
close(tr.stopChan)
tr.closeOnce.Do(func() {
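
For reference, a minimal sketch of what the guarded Stop looks like with this change (a sketch, not the verbatim merged code): stopChan and closeOnce are the fields visible in the diff above, and the final wait on a doneChan is assumed from the existing doc comment that Stop "waits for the main loop to exit".

// Stop stops the TaskReaper and waits for the main loop to exit.
// Sketch: assumes stopChan/doneChan are chan struct{} fields and
// closeOnce is a sync.Once field on TaskReaper.
func (tr *TaskReaper) Stop() {
	// Close stopChan at most once, even if Stop is called concurrently.
	tr.closeOnce.Do(func() {
		close(tr.stopChan)
	})
	// Wait for the run loop to signal that it has exited.
	<-tr.doneChan
}
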
Contributor

What is the reason to have multiple invocations of this function?

Contributor

Kicked off a rebuild. Let's close this out soon? @nishanttotla

@codecov

codecov bot commented Jan 30, 2018

Codecov Report

Merging #2491 into master will decrease coverage by 0.55%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2491      +/-   ##
==========================================
- Coverage   62.09%   61.53%   -0.56%     
==========================================
  Files          49      129      +80     
  Lines        6898    21315   +14417     
==========================================
+ Hits         4283    13117    +8834     
- Misses       2174     6796    +4622     
- Partials      441     1402     +961

@@ -282,8 +286,8 @@ func (tr *TaskReaper) tick() {

// Stop stops the TaskReaper and waits for the main loop to exit.
func (tr *TaskReaper) Stop() {
// TODO(dperny) calling stop on the task reaper twice will cause a panic
Contributor Author

@dperny do you remember the reason for this? I'm looking into it too.

Collaborator

It's never happened, IIRC; it's just possible for it to happen. Other components have Stop channels that are guarded against double close.
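
For illustration (this is not code from swarmkit), one guard sometimes seen elsewhere is a non-blocking select before the close. It prevents a double close from strictly sequential callers, but it is not safe when Stop can be reached from two goroutines at once, which is exactly the scenario described below; a sync.Once does not have that problem.

// Illustrative guard only, not the pattern adopted in this PR:
select {
case <-tr.stopChan:
	// Channel already closed; nothing to do.
default:
	// Two goroutines can both reach this branch concurrently and
	// both call close, which still panics, so this guard alone is
	// not enough for the race discussed in this thread.
	close(tr.stopChan)
}
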

@nishanttotla
Contributor Author

I noticed that taskreaper.Stop() is called both in func (m *Manager) Stop(ctx context.Context, clearData bool) and in func (m *Manager) becomeFollower() in manager/manager.go. So in a scenario where an existing leader was shutting down, and a leader election happened before it could finish shutting down (making it become a follower), taskreaper.Stop() could be called twice.

This is very much a corner case, but it seems realistic because Stop and becomeFollower run in separate goroutines. We should look out for this in all components shut down by becomeFollower:

// becomeFollower shuts down the subsystems that are only run by the leader.
func (m *Manager) becomeFollower() {
	m.dispatcher.Stop()
	m.logbroker.Stop()
	m.caserver.Stop()

	if m.allocator != nil {
		m.allocator.Stop()
		m.allocator = nil
	}

	m.constraintEnforcer.Stop()
	m.constraintEnforcer = nil

	m.replicatedOrchestrator.Stop()
	m.replicatedOrchestrator = nil

	m.globalOrchestrator.Stop()
	m.globalOrchestrator = nil

	m.taskReaper.Stop()
	m.taskReaper = nil

	m.scheduler.Stop()
	m.scheduler = nil

	m.roleManager.Stop()
	m.roleManager = nil

	if m.keyManager != nil {
		m.keyManager.Stop()
		m.keyManager = nil
	}
}

WDYT @anshulpundir @dperny @stevvooe?

@anshulpundir
Contributor

Makes sense. Let's add that comment to TaskReaper.Stop() and land this change @nishanttotla

@dperny
Collaborator

dperny commented Feb 1, 2018

Actually, I do have a change request: can we log an error or warning message if we try to stop the taskreaper twice? It's a corner case, and it's better to log an error than to panic the program, but it's an undesirable occurrence. Something like

log.L.Warn("Stop called twice on taskreaper")

@nishanttotla nishanttotla force-pushed the stop-task-reaper-once branch 2 times, most recently from 6751f66 to 4657b27 on February 1, 2018 18:07
Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
// shutting down, and the other when the manager (the leader) is
// becoming a follower. Since these two instances could race with
// each other, we use closeOnce here to ensure that TaskReaper.Stop()
// is called only once to avoid a panic.
Contributor Author

@anshulpundir added a comment here.

@nishanttotla
Contributor Author

I think the test flakiness is related; I'm investigating.

@nishanttotla
Contributor Author

@dperny @anshulpundir have we seen this kind of flakiness in other PRs?

time="2018-02-01T18:21:57Z" level=warning msg="no certificate expiration specified, using default" cluster.id=tqntuxu76xeetoxmyjsl43zuq method="(*Server).UpdateRootCA" module=ca testHasExternalCA=false testname=TestLoadSecurityConfigLoadFromDisk 
time="2018-02-01T18:21:57Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:21:57Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:21:57Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:21:59Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:21:59Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:21:59Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:01Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:01Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:01Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:03Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:03Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:03Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:05Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial tcp 127.0.0.1:12: getsockopt: connection refused\"; Reconnecting to {127.0.0.1:12 <nil>}" module=grpc 
time="2018-02-01T18:22:05Z" level=info msg="Failed to dial 127.0.0.1:12: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:05Z" level=error msg="failed to retrieve remote root CA certificate" error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" 
time="2018-02-01T18:22:07Z" level=warning msg="no certificate expiration specified, using default" cluster.id=wsgzortbrionw33zytsshw288 method="(*Server).UpdateRootCA" module=ca testHasExternalCA=false testname=TestLoadSecurityConfigDownloadAllCerts 
time="2018-02-01T18:22:08Z" level=info msg="Listening for local connections" addr="/tmpfs/manager-root-ca-update106457598/control.sock" module=node node.id=ka7utsr50amzqxcmy1ent66ju proto=unix 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became follower at term 0" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="newRaft 4e1464b1d868fe2c [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became follower at term 1" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c is starting a new election at term 1" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became candidate at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c received MsgVoteResp from 4e1464b1d868fe2c at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="4e1464b1d868fe2c became leader at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="raft.node: 4e1464b1d868fe2c elected leader 4e1464b1d868fe2c at term 2" module=raft node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Creating default ingress network" module=node node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher starting" module=dispatcher node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 0s before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="worker ka7utsr50amzqxcmy1ent66ju was successfully registered" method="(*Dispatcher).register" 
time="2018-02-01T18:22:08Z" level=info msg="Stopping manager" 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher stopping" method="(*Dispatcher).Stop" module=dispatcher node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="dispatcher session dropped, marking node ka7utsr50amzqxcmy1ent66ju down" method="(*Dispatcher).Session" node.id=ka7utsr50amzqxcmy1ent66ju node.session=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=error msg="failed to remove node" error="rpc error: code = Aborted desc = dispatcher is stopped" method="(*Dispatcher).Session" node.id=ka7utsr50amzqxcmy1ent66ju node.session=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=info msg="transport: http2Server.HandleStreams failed to read frame: read unix /tmpfs/manager-root-ca-update106457598/control.sock->@: use of closed network connection" module=grpc 
time="2018-02-01T18:22:08Z" level=info msg="transport: http2Client.notifyError got notified that the client transport was broken EOF." module=grpc 
time="2018-02-01T18:22:08Z" level=info msg="grpc: addrConn.resetTransport failed to create client transport: connection error: desc = \"transport: dial unix /tmpfs/manager-root-ca-update106457598/control.sock: connect: no such file or directory\"; Reconnecting to {/tmpfs/manager-root-ca-update106457598/control.sock <nil>}" module=grpc 
time="2018-02-01T18:22:08Z" level=error msg="heartbeat to manager { } failed" error="rpc error: code = Internal desc = transport is closing" method="(*session).heartbeat" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju session.id=rb7x451ox253wcqcfu7mdabt8 sessionID=rb7x451ox253wcqcfu7mdabt8 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=100ms error="rpc error: code = Internal desc = transport is closing" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 288.042µs before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=300ms error="rpc error: code = Unavailable desc = grpc: the connection is unavailable" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="manager selected by agent for new session: { }" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="waiting 66.168097ms before registering session" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Manager shut down" 
time="2018-02-01T18:22:08Z" level=info msg="Stopping manager" module=node node.id=ka7utsr50amzqxcmy1ent66ju 
time="2018-02-01T18:22:08Z" level=info msg="Failed to dial /tmpfs/manager-root-ca-update106457598/control.sock: context canceled; please retry." module=grpc 
time="2018-02-01T18:22:08Z" level=error msg="agent: session failed" backoff=700ms error="rpc error: code = FailedPrecondition desc = grpc: the client connection is closing" module="node/agent" node.id=ka7utsr50amzqxcmy1ent66ju 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0xe1b128]

goroutine 296 [running]:
github.com/docker/swarmkit/connectionbroker.(*Conn).Peer(...)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/connectionbroker/broker.go:104
github.com/docker/swarmkit/agent.newSession(0x2b967133a250, 0xc4209a6270, 0xc4209a8000, 0x184395bd, 0xc420370820, 0x19, 0xc4209a62a0, 0xc42021e400)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/session.go:68 +0x588
github.com/docker/swarmkit/agent.(*Agent).run(0xc4209a8000, 0x2b967133a250, 0xc4209a6270)
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/agent.go:365 +0x13f4
created by github.com/docker/swarmkit/agent.(*Agent).Start.func1
	/home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/agent/agent.go:83 +0xb2
FAIL	github.com/docker/swarmkit/node	10.945s
make: *** [coverage] Error 1

@nishanttotla
Contributor Author

nishanttotla commented Feb 1, 2018

Okay, tests seem to pass now. The flakiness is strange, but I can't say it's related to this PR anymore.

As far as @dperny's comment about logging is concerned, it's not possible to do it with a sync.Once without using locking, which would be overkill.

Ping @dperny @anshulpundir

@anshulpundir
Contributor

The flakiness is fixed in 7d507f6

@anshulpundir anshulpundir merged commit 4641f9e into moby:master Feb 1, 2018
@nishanttotla nishanttotla deleted the stop-task-reaper-once branch February 1, 2018 21:15
@anshulpundir
Contributor

I merged it; not sure if you were planning to address #2491 (comment).

If so, I can add it to my next PR @nishanttotla

@nishanttotla
Contributor Author

nishanttotla commented Feb 1, 2018

@anshulpundir I responded to that comment briefly in my last one. It's not possible to add just a simple log message with sync.Once, because it only lets you ensure that the callback provided to it is called once. It doesn't exactly let you find out if a second call was made.
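
For illustration, one hypothetical way to get both behaviors (close exactly once and warn on later calls) would be to replace the bare sync.Once with an atomic flag. This is not what the PR does; the isStopped field is made up for this sketch, doneChan is assumed as in the sketch above, the snippet relies on sync/atomic, and log.L is the logger referenced in the earlier comment.

// Hypothetical alternative, not the merged code.
func (tr *TaskReaper) Stop() {
	// CompareAndSwapInt32 returns false if the flag was already set,
	// i.e. if Stop has been called before.
	if !atomic.CompareAndSwapInt32(&tr.isStopped, 0, 1) {
		log.L.Warn("Stop called twice on taskreaper")
		return
	}
	close(tr.stopChan)
	// Wait for the run loop to exit.
	<-tr.doneChan
}
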
