From: Nick Owens Subject: Re: relayd reload race crash To: Rafael Sadowski Cc: tech@openbsd.org, Claudio Jeker Date: Tue, 3 Mar 2026 05:53:17 -0800 On Mon, Mar 2, 2026 at 10:54 PM Rafael Sadowski wrote: > > On Sun Mar 01, 2026 at 05:50:55AM -0800, Nick Owens wrote: > > hello rafael, > > > > On Sun, Mar 01, 2026 at 09:47:54AM +0100, Rafael Sadowski wrote: > > > Hi Nick Owens, > > > > > > do you have a minimal test setup to to reproduce that? Comments below. > > > > my apologies. i did not inclue what i originally wrote to bugs@ in one > > of my replies.. > > Thanks, that always makes it easier! > > > > > https://marc.info/?l=openbsd-bugs&m=177018029002894&w=2 > > > > doas watch -s 0.1 relayctl -s /tmp/relayd.sock reload > > > > watch -s 0 curl -kso /dev/null https://127.0.0.1:444 > > > > fugu$ cat x.conf > > socket "/tmp/relayd.sock" > > table { 127.0.0.1 } > > http protocol "proxy" { > > tls keypair "test" > > return error > > } > > > > relay "https" { > > listen on 0.0.0.0 port 444 tls > > protocol "proxy" > > forward to port 80 > > } > > > > fugu$ doas ./obj/relayd -d -f x.conf > > startup > > adding 1 hosts from table www:80 (no check) > > adding 1 hosts from table www:80 (no check) > > adding 1 hosts from table www:80 (no check) > > > > adding 1 hosts from table www:80 (no check) > > adding 1 hosts from table www:80 (no check) > > adding 1 hosts from table www:80 (no check) > > ca: ca_dispatch_relay: invalid relay hash 'SHA256:764076891ced15921fb2ec023df3232c194dccac8c6b9fde2565ca01ed13874d' > > hce exiting, pid 71488 > > pfe exiting, pid 40741 > > ca exiting, pid 48707 > > relay: pipe closed > > lost child: pid 39177 exited abnormally > > ca exiting, pid 20355 > > lost child: pid 71618 exited abnormally > > relay exiting, pid 83235 > > relay exiting, pid 71571 > > parent terminating, pid 32006 > > > > > > > > On Sat Feb 28, 2026 at 10:59:22PM -0800, Nick Owens wrote: > > > > On Tue, Feb 10, 2026 at 04:34:08PM -0800, Nick Owens wrote: > > > > > On Tue, Feb 03, 2026 at 10:15:20PM -0800, Nick Owens wrote: > > > > > > On Wed, Feb 04, 2026 at 03:51:11PM +1000, Paul W. Rankin wrote: > > > > > > > > On 04/02/2026 1:58 PM AEST mischief () offblast ! org wrote: > > > > > > > > this was brought up by rnkn on IRC, > > > > > > > > > > > > > > I'm just looping myself in here. Thanks. > > > > > > > > > > > > > > > > > > > one approach is to return negative cko_tlen back to rsae_send_imsg. this > > > > > > might result in some broken tls connections, but its better than taking > > > > > > down all of relayd. > > > > > > > > > > > > a better fix would probably be to temporarily stop processing rsa ops > > > > > > altogether while reload is happening, but i'm not sure of a good > > > > > > approach for that. > > > > > > > > > > > > Index: ca.c > > > > > > =================================================================== > > > > > > RCS file: /cvs/src/usr.sbin/relayd/ca.c,v > > > > > > diff -u -p -r1.45 ca.c > > > > > > --- ca.c 21 Nov 2024 13:21:34 -0000 1.45 > > > > > > +++ ca.c 4 Feb 2026 06:03:33 -0000 > > > > > > @@ -234,9 +234,15 @@ ca_dispatch_relay(int fd, struct privsep > > > > > > fatalx("%s: invalid relay proc", __func__); > > > > > > if (IMSG_DATA_SIZE(imsg) != (sizeof(cko) + cko.cko_flen)) > > > > > > fatalx("%s: invalid key operation", __func__); > > > > > > - if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) > > > > > > - fatalx("%s: invalid relay hash '%s'", > > > > > > Yes I think this is to aggressive during a reload process. > > > > > > > > > + if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) { > > > > > > + log_warnx("%s: invalid relay hash '%s'", > > > > > > __func__, cko.cko_hash); > > > > > > + cko.cko_tlen = -1; > > > > > > + if (proc_compose_imsg(env->sc_ps, PROC_RELAY, cko.cko_proc, > > > > > > + imsg->hdr.type, -1, -1, &cko, sizeof(cko)) == -1) > > > > > > + log_warn("%s: proc_composev_imsg", __func__); > > > > > > + break; > > > > > > + } > > > > > > if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL) > > > > > > fatalx("%s: invalid relay key", __func__); > > > > > > If we go this way, we need to handle this too. > > > > > > > > > > > > > > > > > > > > > > > > > ping. here is an alternate approach to stop processing imsgs from PROC_RELAY by removing the events during reload, but when testing i still had client connections get empty replies, closed early, or the rsa op timed out. however, relayd also does not crash. > > > > > > The fix replaces a crash with a broken connection? > > > > > > > > > > > ping again. i hope this will be fixed for the next release. > > > > > > > > > > > > > > > > > > > diff --git a/usr.sbin/relayd/ca.c b/usr.sbin/relayd/ca.c > > > > > index e54259c5971..89f7f48685e 100644 > > > > > --- a/usr.sbin/relayd/ca.c > > > > > +++ b/usr.sbin/relayd/ca.c > > > > > @@ -204,9 +204,11 @@ ca_dispatch_parent(int fd, struct privsep_proc *p, struct imsg *imsg) > > > > > break; > > > > > case IMSG_CTL_START: > > > > > ca_launch(); > > > > > + proc_unblock(p->p_ps, PROC_RELAY); > > > > > break; > > > > > case IMSG_CTL_RESET: > > > > > config_getreset(env, imsg); > > > > > + proc_block(p->p_ps, PROC_RELAY); > > > > > break; > > > > > default: > > > > > return -1; > > > > > diff --git a/usr.sbin/relayd/proc.c b/usr.sbin/relayd/proc.c > > > > > index 3eb00aa0381..56a49120cb4 100644 > > > > > --- a/usr.sbin/relayd/proc.c > > > > > +++ b/usr.sbin/relayd/proc.c > > > > > @@ -682,6 +682,38 @@ proc_dispatch_null(int fd, struct privsep_proc *p, struct imsg *imsg) > > > > > return (-1); > > > > > } > > > > > > > > > > +void > > > > > +proc_block(struct privsep *ps, enum privsep_procid id) > > > > > +{ > > > > > + struct imsgev *iev; > > > > > + int n, m; > > > > > + > > > > > + n = -1; > > > > > + > > > > > + proc_range(ps, id, &n, &m); > > > > > + > > > > > + for(; n < m; n++){ > > > > > + iev = &ps->ps_ievs[id][n]; > > > > > + event_del(&iev->ev); > > > > > + } > > > > > +} > > > > > + > > > > > +void > > > > > +proc_unblock(struct privsep *ps, enum privsep_procid id) > > > > > +{ > > > > > + struct imsgev *iev; > > > > > + int n, m; > > > > > + > > > > > + n = -1; > > > > > + > > > > > + proc_range(ps, id, &n, &m); > > > > > + > > > > > + for(; n < m; n++){ > > > > > + iev = &ps->ps_ievs[id][n]; > > > > > + event_add(&iev->ev, NULL); > > > > > + } > > > > > +} > > > > > + > > > > > /* > > > > > * imsg helper functions > > > > > */ > > > > > diff --git a/usr.sbin/relayd/relayd.h b/usr.sbin/relayd/relayd.h > > > > > index 3b5c3987f93..697c3bd8b48 100644 > > > > > --- a/usr.sbin/relayd/relayd.h > > > > > +++ b/usr.sbin/relayd/relayd.h > > > > > @@ -1451,6 +1451,8 @@ int imsg_compose_event(struct imsgev *, uint16_t, uint32_t, > > > > > pid_t, int, void *, uint16_t); > > > > > int imsg_composev_event(struct imsgev *, uint16_t, uint32_t, > > > > > pid_t, int, const struct iovec *, int); > > > > > +void proc_block(struct privsep *, enum privsep_procid); > > > > > +void proc_unblock(struct privsep *, enum privsep_procid); > > > > > > > > > > /* config.c */ > > > > > int config_init(struct relayd *); > > > > > > > > > > Is a workaround and not a fix for the root of the problem. Are you > > > agree? > > > > yes, broken client connections aren't ideal, but i feel it's better than > > relayd giving up the ghost entirely if for example in the extremely > > common case, one is using crontab like i do: > > > > @weekly acme-client example.com && rcctl reload relayd > > > > today if you do this you are simply playing russian roulette with your > > webserver. > > > > thanks, > > nick > > > > As described above, I think that ending a direct reply with an error > (-1) to the sender is the right approach to prevent a fatal crash during > configuration reloads. It ensures the relay worker is unblocked > immediately (not wait for the timeout) even if a key is temporarily > unavailable (as in our race, due to the reload). > > OK? not sure my reply carries much weight, but ok by me :-) > > diff --git a/ca.c b/ca.c > index 167fb1c..3515977 100644 > --- a/ca.c > +++ b/ca.c > @@ -234,11 +234,33 @@ ca_dispatch_relay(int fd, struct privsep_proc *p, struct imsg *imsg) > fatalx("%s: invalid relay proc", __func__); > if (IMSG_DATA_SIZE(imsg) != (sizeof(cko) + cko.cko_flen)) > fatalx("%s: invalid key operation", __func__); > - if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) > - fatalx("%s: invalid relay hash '%s'", > + > + if ((pkey = pkey_find(env, cko.cko_hash)) == NULL) { > + log_warnx("%s: invalid relay hash '%s'", > __func__, cko.cko_hash); > - if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL) > - fatalx("%s: invalid relay key", __func__); > + /* Signal failure to the waiting relay worker. */ > + cko.cko_tlen = -1; > + iov[c].iov_base = &cko; > + iov[c++].iov_len = sizeof(cko); > + if (proc_composev_imsg(env->sc_ps, PROC_RELAY, > + cko.cko_proc, imsg->hdr.type, -1, -1, iov, > + c) == -1) > + log_warn("%s: proc_composev_imsg", __func__); > + break; > + } > + > + if ((rsa = EVP_PKEY_get1_RSA(pkey)) == NULL) { > + log_warnx("%s: invalid relay key", __func__); > + /* Signal failure to the waiting relay worker. */ > + cko.cko_tlen = -1; > + iov[c].iov_base = &cko; > + iov[c++].iov_len = sizeof(cko); > + if (proc_composev_imsg(env->sc_ps, PROC_RELAY, > + cko.cko_proc, imsg->hdr.type, -1, -1, iov, > + c) == -1) > + log_warn("%s: proc_composev_imsg", __func__); > + break; > + } > > DPRINTF("%s:%d: key hash %s proc %d", > __func__, __LINE__, cko.cko_hash, cko.cko_proc); > @@ -392,7 +414,11 @@ rsae_send_imsg(int flen, const u_char *from, u_char *to, RSA *rsa, > fatalx("invalid response"); > > ret = cko.cko_tlen; > - if (ret > 0) { > + if (ret == -1) { > + log_warnx("%s: priv%s failed for key %s", > + __func__, cmd == IMSG_CA_PRIVENC ? > + "enc" : "dec", cko.cko_hash); > + } else if (ret > 0) { > if (IMSG_DATA_SIZE(&imsg) != > (sizeof(cko) + ret)) > fatalx("data size");