From: Nick Owens <mischief@offblast.org>
Subject: Re: relayd reload race crash
To: Rafael Sadowski <rafael@sizeofvoid.org>
Cc: tech@openbsd.org, Claudio Jeker <cjeker@diehard.n-r-g.com>
Date: Tue, 3 Mar 2026 05:53:17 -0800

On Mon, Mar 2, 2026 at 10:54 PM Rafael Sadowski <rafael@sizeofvoid.org> wrote:
>
> On Sun Mar 01, 2026 at 05:50:55AM -0800, Nick Owens wrote:
> > hello rafael,
> >
> > On Sun, Mar 01, 2026 at 09:47:54AM +0100, Rafael Sadowski wrote:
> > > Hi Nick Owens,
> > >
> > > do you have a minimal test setup to to reproduce that? Comments below.
> >
> > my apologies. i did not inclue what i originally wrote to bugs@ in one
> > of my replies..
>
> Thanks, that always makes it easier!
>
> >
> > https://marc.info/?l=openbsd-bugs&m=177018029002894&w=2
> >
> > doas watch -s 0.1 relayctl -s /tmp/relayd.sock reload
> >
> > watch -s 0 curl -kso /dev/null https://127.0.0.1:444
> >
> > fugu$ cat x.conf
> > socket "/tmp/relayd.sock"
> > table <www> { 127.0.0.1 }
> > http protocol "proxy" {
> >       tls keypair "test"
> >       return error
> > }
> >
> > relay "https" {
> >       listen on 0.0.0.0 port 444 tls
> >       protocol "proxy"
> >       forward to <www> port 80
> > }
> >
> > fugu$ doas ./obj/relayd -d -f x.conf
> > startup
> > adding 1 hosts from table www:80 (no check)
> > adding 1 hosts from table www:80 (no check)
> > adding 1 hosts from table www:80 (no check)
> > <snip>
> > adding 1 hosts from table www:80 (no check)
> > adding 1 hosts from table www:80 (no check)
> > adding 1 hosts from table www:80 (no check)
> > ca: ca_dispatch_relay: invalid relay hash 'SHA256:764076891ced15921fb2ec023df3232c194dccac8c6b9fde2565ca01ed13874d'
> > hce exiting, pid 71488
> > pfe exiting, pid 40741
> > ca exiting, pid 48707
> > relay: pipe closed
> > lost child: pid 39177 exited abnormally
> > ca exiting, pid 20355
> > lost child: pid 71618 exited abnormally
> > relay exiting, pid 83235
> > relay exiting, pid 71571
> > parent terminating, pid 32006
> >
> > >
> > > On Sat Feb 28, 2026 at 10:59:22PM -0800, Nick Owens wrote:
> > > > On Tue, Feb 10, 2026 at 04:34:08PM -0800, Nick Owens wrote:
> > > > > On Tue, Feb 03, 2026 at 10:15:20PM -0800, Nick Owens wrote:
> > > > > > On Wed, Feb 04, 2026 at 03:51:11PM +1000, Paul W. Rankin wrote:
> > > > > > > > On 04/02/2026 1:58 PM AEST mischief () offblast ! org wrote:
> > > > > > > >   this was brought up by rnkn on IRC,
> > > > > > >
> > > > > > > I'm just looping myself in here. Thanks.
> > > > > > >
> > > > > >
> > > > > > one approach is to return negative cko_tlen back to rsae_send_imsg. this
> > > > > > might result in some broken tls connections, but its better than taking
> > > > > > down all of relayd.
> > > > > >
> > > > > > a better fix would probably be to temporarily stop processing rsa ops
> > > > > > altogether while reload is happening, but i'm not sure of a good
> > > > > > approach for that.
> > > > > >
> > > > > > Index: ca.c
> > > > > > ===================================================================
> > > > > > RCS file: /cvs/src/usr.sbin/relayd/ca.c,v
> > > > > > diff -u -p -r1.45 ca.c
> > > > > > --- ca.c        21 Nov 2024 13:21:34 -0000      1.45
> > > > > > +++ ca.c        4 Feb 2026 06:03:33 -0000
> > > > > > @@ -234,9 +234,15 @@ ca_dispatch_relay(int fd, struct privsep
> > > > > >                         fatalx("%s: invalid relay proc", __func__);
> > > > > >                 if (IMSG_DATA_SIZE(imsg) != (sizeof(cko) + cko.cko_flen))
> > > > > >                         fatalx("%s: invalid key operation", __func__);
> > > > > > -               if ((pkey = pkey_find(env, cko.cko_hash)) == NULL)
> > > > > > -                       fatalx("%s: invalid relay hash '%s'",
> > >
> > > Yes I think this is to aggressive during a reload process.
> > >
> > > > > > +               if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) {
> > > > > > +                       log_warnx("%s: invalid relay hash '%s'",
> > > > > >                             __func__, cko.cko_hash);
> > > > > > +                       cko.cko_tlen = -1;
> > > > > > +                       if (proc_compose_imsg(env->sc_ps, PROC_RELAY, cko.cko_proc,
> > > > > > +                           imsg->hdr.type, -1, -1, &cko, sizeof(cko)) == -1)
> > > > > > +                               log_warn("%s: proc_composev_imsg", __func__);
> > > > > > +                       break;
> > > > > > +               }
> > > > > >                 if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL)
> > > > > >                         fatalx("%s: invalid relay key", __func__);
> > >
> > > If we go this way, we need to handle this too.
> > >
> > > > > >
> > > > > >
> > > > >
> > > > > ping. here is an alternate approach to stop processing imsgs from PROC_RELAY by removing the events during reload, but when testing i still had client connections get empty replies, closed early, or the rsa op timed out. however, relayd also does not crash.
> > >
> > > The fix replaces a crash with a broken connection?
> > >
> > > >
> > > > ping again. i hope this will be fixed for the next release.
> > > >
> > > > >
> > > > >
> > > > > diff --git a/usr.sbin/relayd/ca.c b/usr.sbin/relayd/ca.c
> > > > > index e54259c5971..89f7f48685e 100644
> > > > > --- a/usr.sbin/relayd/ca.c
> > > > > +++ b/usr.sbin/relayd/ca.c
> > > > > @@ -204,9 +204,11 @@ ca_dispatch_parent(int fd, struct privsep_proc *p, struct imsg *imsg)
> > > > >                 break;
> > > > >         case IMSG_CTL_START:
> > > > >                 ca_launch();
> > > > > +               proc_unblock(p->p_ps, PROC_RELAY);
> > > > >                 break;
> > > > >         case IMSG_CTL_RESET:
> > > > >                 config_getreset(env, imsg);
> > > > > +               proc_block(p->p_ps, PROC_RELAY);
> > > > >                 break;
> > > > >         default:
> > > > >                 return -1;
> > > > > diff --git a/usr.sbin/relayd/proc.c b/usr.sbin/relayd/proc.c
> > > > > index 3eb00aa0381..56a49120cb4 100644
> > > > > --- a/usr.sbin/relayd/proc.c
> > > > > +++ b/usr.sbin/relayd/proc.c
> > > > > @@ -682,6 +682,38 @@ proc_dispatch_null(int fd, struct privsep_proc *p, struct imsg *imsg)
> > > > >         return (-1);
> > > > >  }
> > > > >
> > > > > +void
> > > > > +proc_block(struct privsep *ps, enum privsep_procid id)
> > > > > +{
> > > > > +       struct imsgev *iev;
> > > > > +       int n, m;
> > > > > +
> > > > > +       n = -1;
> > > > > +
> > > > > +       proc_range(ps, id, &n, &m);
> > > > > +
> > > > > +       for(; n < m; n++){
> > > > > +               iev = &ps->ps_ievs[id][n];
> > > > > +               event_del(&iev->ev);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > > +void
> > > > > +proc_unblock(struct privsep *ps, enum privsep_procid id)
> > > > > +{
> > > > > +       struct imsgev *iev;
> > > > > +       int n, m;
> > > > > +
> > > > > +       n = -1;
> > > > > +
> > > > > +       proc_range(ps, id, &n, &m);
> > > > > +
> > > > > +       for(; n < m; n++){
> > > > > +               iev = &ps->ps_ievs[id][n];
> > > > > +               event_add(&iev->ev, NULL);
> > > > > +       }
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * imsg helper functions
> > > > >   */
> > > > > diff --git a/usr.sbin/relayd/relayd.h b/usr.sbin/relayd/relayd.h
> > > > > index 3b5c3987f93..697c3bd8b48 100644
> > > > > --- a/usr.sbin/relayd/relayd.h
> > > > > +++ b/usr.sbin/relayd/relayd.h
> > > > > @@ -1451,6 +1451,8 @@ int        imsg_compose_event(struct imsgev *, uint16_t, uint32_t,
> > > > >             pid_t, int, void *, uint16_t);
> > > > >  int     imsg_composev_event(struct imsgev *, uint16_t, uint32_t,
> > > > >             pid_t, int, const struct iovec *, int);
> > > > > +void    proc_block(struct privsep *, enum privsep_procid);
> > > > > +void    proc_unblock(struct privsep *, enum privsep_procid);
> > > > >
> > > > >  /* config.c */
> > > > >  int     config_init(struct relayd *);
> > > >
> > >
> > > Is a workaround and not a fix for the root of the problem. Are you
> > > agree?
> >
> > yes, broken client connections aren't ideal, but i feel it's better than
> > relayd giving up the ghost entirely if for example in the extremely
> > common case, one is using crontab like i do:
> >
> > @weekly acme-client example.com && rcctl reload relayd
> >
> > today if you do this you are simply playing russian roulette with your
> > webserver.
> >
> > thanks,
> > nick
> >
>
> As described above, I think that ending a direct reply with an error
> (-1) to the sender is the right approach to prevent a fatal crash during
> configuration reloads. It ensures the relay worker is unblocked
> immediately (not wait for the timeout) even if a key is temporarily
> unavailable (as in our race, due to the reload).
>
> OK?

not sure my reply carries much weight, but ok by me :-)

>
> diff --git a/ca.c b/ca.c
> index 167fb1c..3515977 100644
> --- a/ca.c
> +++ b/ca.c
> @@ -234,11 +234,33 @@ ca_dispatch_relay(int fd, struct privsep_proc *p, struct imsg *imsg)
>                         fatalx("%s: invalid relay proc", __func__);
>                 if (IMSG_DATA_SIZE(imsg) != (sizeof(cko) + cko.cko_flen))
>                         fatalx("%s: invalid key operation", __func__);
> -               if ((pkey = pkey_find(env, cko.cko_hash)) == NULL)
> -                       fatalx("%s: invalid relay hash '%s'",
> +
> +               if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) {
> +                       log_warnx("%s: invalid relay hash '%s'",
>                             __func__, cko.cko_hash);
> -               if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL)
> -                       fatalx("%s: invalid relay key", __func__);
> +                       /* Signal failure to the waiting relay worker. */
> +                       cko.cko_tlen = -1;
> +                       iov[c].iov_base = &cko;
> +                       iov[c++].iov_len = sizeof(cko);
> +                       if (proc_composev_imsg(env->sc_ps, PROC_RELAY,
> +                           cko.cko_proc, imsg->hdr.type, -1, -1, iov,
> +                            c) == -1)
> +                               log_warn("%s: proc_composev_imsg", __func__);
> +                       break;
> +               }
> +
> +               if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL) {
> +                       log_warnx("%s: invalid relay key", __func__);
> +                       /* Signal failure to the waiting relay worker. */
> +                       cko.cko_tlen = -1;
> +                       iov[c].iov_base = &cko;
> +                       iov[c++].iov_len = sizeof(cko);
> +                       if (proc_composev_imsg(env->sc_ps, PROC_RELAY,
> +                           cko.cko_proc, imsg->hdr.type, -1, -1, iov,
> +                            c) == -1)
> +                               log_warn("%s: proc_composev_imsg", __func__);
> +                       break;
> +               }
>
>                 DPRINTF("%s:%d: key hash %s proc %d",
>                     __func__, __LINE__, cko.cko_hash, cko.cko_proc);
> @@ -392,7 +414,11 @@ rsae_send_imsg(int flen, const u_char *from, u_char *to, RSA *rsa,
>                                 fatalx("invalid response");
>
>                         ret = cko.cko_tlen;
> -                       if (ret > 0) {
> +                       if (ret == -1) {
> +                               log_warnx("%s: priv%s failed for key %s",
> +                                   __func__, cmd == IMSG_CA_PRIVENC ?
> +                                   "enc" : "dec", cko.cko_hash);
> +                       } else if (ret > 0) {
>                                 if (IMSG_DATA_SIZE(&imsg) !=
>                                     (sizeof(cko) + ret))
>                                         fatalx("data size");