xhci: recover halted endpoints on USB Transaction Errors
Hi tech,
On Supermicro X10/X11 boards (tested on X10SLL-F and X11) the emulated
USB keyboard and mouse exposed by the BMC/iKVM stop working after a
BMC reset until the host is rebooted.
Reproducer: press the "Reset" button in the BMC web UI.
When the device re-appears the HID's INTR IN endpoint answers every
poll with a USB Transaction Error:
xhci0: txerr? code 4 (with XHCI_DEBUG)
Per xHCI r1.1 section 4.10.2.6 a Transaction Error completion leaves
the endpoint in the Halted state. The current xhci_event_xfer_generic()
just sets xfer->status = USBD_IOERROR and breaks, so every subsequent
xfer queued on the pipe is silently dropped by the halted endpoint --
the keyboard dies for good.
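
To make the failure mode concrete, here is a toy model of the endpoint
states involved. It is only a sketch of my reading of the spec, not
driver code: struct ep_model and the ep_*() helpers are made-up names,
and the real state machine has more states and transitions than this.

```c
#include <stdbool.h>

/* Simplified subset of the xHCI endpoint states (hypothetical model). */
enum ep_state { EP_RUNNING, EP_HALTED, EP_STOPPED };

struct ep_model {
	enum ep_state	state;
	unsigned int	executed;	/* transfers the endpoint actually ran */
};

/* A Transaction Error completion moves a running endpoint to Halted. */
static void
ep_txerr(struct ep_model *ep)
{
	if (ep->state == EP_RUNNING)
		ep->state = EP_HALTED;
}

/*
 * Queue a transfer and ring the doorbell.  A halted endpoint does not
 * execute anything; a stopped or running one runs the transfer.
 */
static bool
ep_doorbell(struct ep_model *ep)
{
	if (ep->state == EP_HALTED)
		return (false);
	ep->state = EP_RUNNING;
	ep->executed++;
	return (true);
}

/* Reset Endpoint is only valid on a halted endpoint: Halted -> Stopped. */
static bool
ep_reset(struct ep_model *ep)
{
	if (ep->state != EP_HALTED)
		return (false);
	ep->state = EP_STOPPED;
	return (true);
}
```

In this model, once ep_txerr() has run, every ep_doorbell() returns
false until ep_reset() is called, which is the behaviour seen with the
keyboard.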
The diff below does two things:
1) Treats XHCI_CODE_TXERR / XHCI_CODE_SPLITERR like XHCI_CODE_STALL
and issues an async reset-ep, so the usb stack can restart the
pipe on a clean endpoint.
2) Caps the number of consecutive TXERR-driven resets per pipe with
a small counter in struct xhci_pipe (cleared on any successful or
short completion). Once more than XHCI_TXERR_RETRIES consecutive
failures have been seen the pipe is clearly wedged, so we complete
the xfer with USBD_IOERROR and call usb_needs_reattach() -- the hub
explore task then detaches the stuck device, resets the port and
re-enumerates it.
On these boards the BMC has stabilised by then and the device
comes back in its proper topology (ATEN hub with the HID behind
it) and the keyboard works again without a host reboot.
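
The retry policy from point 2 can be looked at in isolation. The
sketch below is a standalone model, not the driver code: the enum
names, struct pipe_model and pipe_on_completion() are invented for
illustration, and only the counter logic mirrors the diff.

```c
/* Hypothetical stand-ins for the completion codes handled in the diff. */
enum completion { COMP_SUCCESS, COMP_SHORT, COMP_TXERR };

/* What the model asks the caller to do next. */
enum action { ACT_COMPLETE_OK, ACT_RESET_EP, ACT_REATTACH };

#define TXERR_RETRIES	3	/* same value as XHCI_TXERR_RETRIES */

struct pipe_model {
	unsigned int	txerr_count;	/* consecutive transaction errors */
};

static enum action
pipe_on_completion(struct pipe_model *p, enum completion c)
{
	switch (c) {
	case COMP_SUCCESS:
	case COMP_SHORT:
		/* Any good completion clears the error streak. */
		p->txerr_count = 0;
		return (ACT_COMPLETE_OK);
	case COMP_TXERR:
		/* Too many errors in a row: give up and re-enumerate. */
		if (++p->txerr_count > TXERR_RETRIES) {
			p->txerr_count = 0;
			return (ACT_REATTACH);
		}
		/* Otherwise reset the endpoint and let the pipe restart. */
		return (ACT_RESET_EP);
	}
	return (ACT_COMPLETE_OK);
}
```

With the limit at 3 this gives three endpoint resets, and the fourth
consecutive error triggers the reattach. A single good completion in
between starts the count over.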
Please note that I used AI to understand the problem. I tested the patch
on two machines and it works for me, but I understand that it might be
totally wrong and that someone more capable than me might have a better
approach. I'll be glad to provide more details or do some extra testing.
Best wishes,
Atanas
Index: dev/usb/xhci.c
===================================================================
--- dev/usb/xhci.c
+++ dev/usb/xhci.c
@@ -70,6 +70,7 @@ struct xhci_pipe {
struct usbd_xfer *pending_xfers[XHCI_MAX_XFER];
struct usbd_xfer *aborted_xfer;
int halted;
+ unsigned int txerr_count;
size_t free_trbs;
int skip;
#define TRB_PROCESSED_NO 0
@@ -78,6 +79,8 @@ struct xhci_pipe {
uint8_t trb_processed[XHCI_MAX_XFER];
};
+#define XHCI_TXERR_RETRIES 3
+
int xhci_reset(struct xhci_softc *);
void xhci_suspend(struct xhci_softc *);
int xhci_intr1(struct xhci_softc *);
@@ -953,6 +956,7 @@ xhci_event_xfer_generic(struct xhci_softc *sc, struct
usbd_xfer_isread(xfer) ?
BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
xfer->status = USBD_NORMAL_COMPLETION;
+ xp->txerr_count = 0;
break;
case XHCI_CODE_SHORT_XFER:
/*
@@ -977,12 +981,31 @@ xhci_event_xfer_generic(struct xhci_softc *sc, struct
usbd_xfer_isread(xfer) ?
BUS_DMASYNC_POSTREAD : BUS_DMASYNC_POSTWRITE);
xfer->status = USBD_NORMAL_COMPLETION;
+ xp->txerr_count = 0;
break;
case XHCI_CODE_TXERR:
case XHCI_CODE_SPLITERR:
DPRINTF(("%s: txerr? code %d\n", DEVNAME(sc), code));
- xfer->status = USBD_IOERROR;
- break;
+ /* Prevent any timeout to kick in. */
+ timeout_del(&xfer->timeout_handle);
+ usb_rem_task(xfer->device, &xfer->abort_task);
+
+ /*
+ * A USB Transaction Error leaves the endpoint Halted
+ * (xHCI r1.1 4.10.2.6); reset it. If the endpoint
+ * keeps failing, ask the hub to re-enumerate the
+ * device rather than spinning forever.
+ */
+ if (++xp->txerr_count > XHCI_TXERR_RETRIES) {
+ xp->txerr_count = 0;
+ xfer->status = USBD_IOERROR;
+ usb_needs_reattach(xfer->device);
+ break;
+ }
+ xp->halted = USBD_IOERROR;
+ xp->aborted_xfer = xfer;
+ xhci_cmd_reset_ep_async(sc, slot, dci);
+ return (1);
case XHCI_CODE_STALL:
case XHCI_CODE_BABBLE:
DPRINTF(("%s: babble code %d\n", DEVNAME(sc), code));
@@ -1623,6 +1646,7 @@ xhci_pipe_init(struct xhci_softc *sc, struct usbd_pip
xp->free_trbs = xp->ring.ntrb;
xp->halted = 0;
+ xp->txerr_count = 0;
sdev->pipes[xp->dci - 1] = xp;