From: Jack Burton Subject: Re: [patch 2 of 3] iscsid(8): avoid double free & race during cleanup To: Claudio Jeker Cc: tech@openbsd.org Date: Thu, 16 Jan 2025 13:37:13 +1030 On Wed, 15 Jan 2025 14:46:02 +0100 Claudio Jeker wrote: > On Tue, Dec 31, 2024 at 06:33:08PM +1030, Jack Burton wrote: > > Fixing that one unmasked a latent race condition during shutdown. > > Depending on whether session reestablishment attempts result in any > > new connections entering the login phase before ISCSI_EXIT_WAIT > > expires, EITHER conn_free() would get called multiple times on the > > same connection OR there'd be an outright access after free. > > > > To avoid adding extra states to the fsm, simply check exit_rounds > > (which gets set in shutdown_cb) before freeing connections in > > sess_do_free(): if it's set then we're about to exit anyway, so it > > doesn't matter that some connections may get reallocated without > > freeing the old ones. > > Uhm. I don't understand how conn_free() can be called multiple times. sess_do_free() calls conn_free() in the loop for each connection in the TAILQ. Then sess_do_conn_fail() calls conn_free() because connection state is CONN_FREE. I don't really understand why sess_do_conn_fail() is getting called at all, as after sess_do_free() the current session state should be SESS_FREE (not SESS_RUNNING) ... but somehow it does get called. Here's what we should see when a single session is active and SIGTERM is received (but this only happens when the target is the netbsd iscsi target from ports): ===CUT HERE=== initiator_shutdown: going down session[netbsd disc1] going down conn_fsm[netbsd disc1]: LOGGED_IN ev logout conn_fsm[netbsd disc1]: new state IN_LOGOUT initiator_logout_cb: response 0, Time2Wait 0, Time2Retain 0 conn_fsm[netbsd disc1]: IN_LOGOUT ev logged out conn_fsm[netbsd disc1]: new state FREE session_fsm[netbsd disc1]: LOGGED_IN ev session closed timeout 0 sess_fsm[netbsd disc1]: LOGGED_IN ev session closed conn_free sess_fsm[netbsd disc1]: new state FREE sess_fsm: done exiting. sym0 detached ===CUT HERE=== Now here's what happens when a single session is active to any other target and SIGTERM is received: ===CUT HERE=== initiator_shutdown: going down session[iet disc2] going down conn_fsm[iet disc2]: LOGGED_IN ev logout sd1 detached conn_fsm[iet disc2]: new state IN_LOGOUT initiator_logout_cb: response 0, Time2Wait 0, Time2Retain 0 conn_fsm[iet disc2]: IN_LOGOUT ev logged out conn_fsm[iet disc2]: new state FREE session_fsm[iet disc2]: LOGGED_IN ev session closed timeout 0 conn_fsm[iet disc2]: FREE ev closed c_do_fail session_fsm[iet disc2]: LOGGED_IN ev connection fail timeout 0 conn_fsm[iet disc2]: new state FREE sess_fsm[iet disc2]: LOGGED_IN ev session closed conn_free sess_fsm[iet disc2]: new state FREE sess_fsm: done sess_fsm[iet disc2]: FREE ev connection fail conn_free iscsid(34499) in free(): bogus pointer (double free?) 0xd1f9419000 Abort trap ===CUT HERE=== If any sessions are in a failed state (and there are no logged in sessions) when SIGTERM is received, either of the two behaviours could be seen (with just a single failed session, shutdown remains clean; but with multiple failed sessions the double free sometimes happens). With my patch #3 (but not #2) applied, if there are logged in sessions to other targets (which don't work at all without my patch #3) when SIGTERM is received, sometimes I see the same double free as above and other times I see an access after free instead (but I can't reproduce that failure mode with either netbsd or iet as the target and I don't have any other targets that work without patch #3). > It means that the s->connections TAILQ is corrupted in some ways. Yes. With the structures we have it doesn't seem possible to remove the last remaining connection from connections in a struct session cleanly. That's what I was tying to work around here. The alternative (restructuring the connection queue) would be a much more intrusive change, which I was trying to avoid. > One issue is that sess_do_reinstatement() does a double > TAILQ_REMOVE(). So maybe that is the root cause. Could you give the > below diff a try? With that diff alone, iscsid(8) still aborts with a double free during cleanup after receiving SIGTERM. > I'm not sure if the code should retrigger shutdowns or not in the > shutdown_cb. I don't think so. There's always going to be *something* of a race here, if any of the connections is actually doing anything when SIGTERM is received. That seems to be the point of shutdown_cb & ISCSI_EXIT_WAIT. If a target has taken more than a second to respond already, chances are it's heavily loaded so the last thing we want to do is resend a PDU to it and make things worse. Checking exit_rounds in sess_do_free() struck me as the simplest way to avoid the double free (i.e. without adding a bunch more states to the fsm). If you can think of a better way I'm all ears! > Also something that is strange is that sess_do_free is actually not > freeing the session. Another thing to look into (its been a long time > I looked at the iscsi RFC). As best I can tell, that was to ensure that the session still exists when a retry triggers sess_do_start() again. Again, the cause seems to be that as things stand it's not possible for a session to have an empty tailq of connections more than once.