Index | Thread | Search

From:
Jack Burton <jack@saosce.com.au>
Subject:
Re: [patch 2 of 3] iscsid(8): avoid double free & race during cleanup
To:
Claudio Jeker <cjeker@diehard.n-r-g.com>
Cc:
tech@openbsd.org
Date:
Wed, 22 Jan 2025 14:09:44 +1030

Download raw body.

Thread
On Mon, 20 Jan 2025 13:54:20 +0100
Claudio Jeker <cjeker@diehard.n-r-g.com> wrote:
> On Fri, Jan 17, 2025 at 06:04:32PM +1030, Jack Burton wrote:
> > On Thu, 16 Jan 2025 21:27:34 +0100
> > Claudio Jeker <cjeker@diehard.n-r-g.com> wrote:
> >   
> > > On Thu, Jan 16, 2025 at 01:37:13PM +1030, Jack Burton wrote:  
> > > > On Wed, 15 Jan 2025 14:46:02 +0100
> > > > Claudio Jeker <cjeker@diehard.n-r-g.com> wrote:    
> > > > > On Tue, Dec 31, 2024 at 06:33:08PM +1030, Jack Burton wrote:
> > > > >   
> > > > > > Fixing that one unmasked a latent race condition during
> > > > > > shutdown. Depending on whether session reestablishment
> > > > > > attempts result in any new connections entering the login
> > > > > > phase before ISCSI_EXIT_WAIT expires, EITHER conn_free()
> > > > > > would get called multiple times on the same connection OR
> > > > > > there'd be an outright access after free.
> > > > > > 
> > > > > > To avoid adding extra states to the fsm, simply check
> > > > > > exit_rounds (which gets set in shutdown_cb) before freeing
> > > > > > connections in sess_do_free(): if it's set then we're about
> > > > > > to exit anyway, so it doesn't matter that some connections
> > > > > > may get reallocated without freeing the old ones.      
> > > > > 
> > > > > Uhm. I don't understand how conn_free() can be called multiple
> > > > > times.    
> > > > 
> > > > sess_do_free() calls conn_free() in the loop for each
> > > > connection in the TAILQ.  Then sess_do_conn_fail() calls
> > > > conn_free() because connection state is CONN_FREE.
> > > > 
> > > > I don't really understand why sess_do_conn_fail() is getting
> > > > called at all, as after sess_do_free() the current session
> > > > state should be SESS_FREE (not SESS_RUNNING) ... but somehow it
> > > > does get called. 
> > > 
> > > Ah, I see. The problem is that sessev has another pointer to the
> > > connection and the sessev is not removed in conn_free().
> > > I need to reread the code to see how we can fix that mess.
> > > But what happens is that we have a event queued with a pointer to
> > > a connection that was already freed.  
> > 
> > Thanks, that's just the clue I needed!
> > 
> > You're right, it is a bit of a mess.  Not your fault though: most of
> > that mess is necessary if we want to keep the fsm reasonably true
> > to the standard.
> > 
> > Simplest way I could think of to avoid the double free (*without*
> > the sledge-hammer approach of my original patch #2) was to keep
> > track of any one-shot event queued for each connection, dequeuing
> > it (if any) in conn_free().
> > 
> > [I am of course assuming that no connection ever gets more than one
> > one-shot event pending at the same time ... but in my testing I
> > have not been able to produce any scenario that violates that
> > assumption.]
> > 
> > The patch below does just that.  With this patch applied I get clean
> > shutdowns on SIGTERM, regardless of which target(s) iscsid is
> > connected to and regardless of whether each of them is in login,
> > running, or failed when I send the signal.
> > 
> > Is this a reasonable approach for the time being?
> >   
> 
> I took the liberty and cleaned up this a bit further. I added a struct
> sessev to the session and connection structs and use those for the
> session_fsm. So now a pending sessev event can be removed in
> connection_free.
> 
> What you think about that? Can you test this against the non-netbsd
> targets?

It's a bit more intrusive ... but it does make the code paths clearer
to anyone reading the source, so yes I like your approach better.

Testing gave the same results as my last iteration: clean shutdown on
SIGTERM regardless of what targets we're connected to and their current
state(s).

As an added bonus, in the special case where all connections are
in a failed state when SIGTERM is received, shutdown is much faster
with your patch than with my last one.

Many thanks.