From: Jan Klemkow
Subject: Re: Device errors with Xeon w5-2545
To: tech@openbsd.org
Date: Fri, 14 Mar 2025 01:00:05 +0100

On Thu, Feb 06, 2025 at 11:35:00AM GMT, Jan Klemkow wrote:
> I get some troubles trying OpenBSD on newer Intel Systems.
>
> I have two system with an ASRockRack W790D8UD-1L1N2T mainboard, but with two
> different CPUs w5-2545 and w5-3525. The system with the w5-3525 CPU just works
> as expected.
>
> The system with the w5-2545 CPU does not. It hangs during boot and I got
> several devices with errors while debugging this:
>
> ahci(4) reports failures on the first command timeout, due to a busy
> controller. xhci(4) dies on the first interrupt due to 0xffffffff in the
> status register. em(4) reports invalid checksum of the EEPROM during
> initialization. Also the nvme(4) it not responsive.
>
> While the hang, its not possible to trap into ddb with db_console=1 and break
> via serial console. Its possible to workaround the hand by disabling the
> xhci(4) driver via UKC. But, the other devices still don't work.
>
> A boot of Debian/Linux-stable shows that all device operate normally here.
>
> I flashed the last BIOS from Vendor on both boards and compared all bios
> configuration options.
>
> Dmesgs of both systems are below.
>
> It looks like some trouble in the PCIe bus, to me. Because, all PCIe
> devices have problems. But, I can't find a debugging approach to start
> with, while looking in the code.

Finally, I found the bug. Turns out all PCIe devices are attaching 4 times
instead of just once.
Example from the dmesgs before:

Non-working CPU:

em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:42:b8:fb
em1 at pci29 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
em2 at pci44 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
em3 at pci59 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space

nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
nvme1 at pci35 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
nvme2 at pci50 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
nvme3 at pci65 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers

xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
xhci1 at pci22 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
xhci2 at pci37 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
xhci3 at pci52 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1

Working CPU:

em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:78:8a:52

nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R

xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20

The reason for this behavior is three additional "phantom" PCIe root bridges,
which are copies of the real one. They are advertised via ACPI.
A part of the ACPI node tree looks like this:

    \_SB_.PC00._HID
    \_SB_.PC01._HID
    \_SB_.PC02._HID
    \_SB_.PC03._HID
    \_SB_.PC04._HID
    \_SB_.PC08._HID
    \_SB_.PC08.DIN0._HID
    \_SB_.PC09._HID
    \_SB_.PC09.DIN0._HID
    \_SB_.PC0A._HID
    \_SB_.PC0A.DIN0._HID
    \_SB_.PC0B._HID
    \_SB_.PC0B.DIN0._HID

The nodes PC00, PC01, PC02, PC03, PC04 and PC08.DIN0 are valid and attach
properly via acpipci(4) and so on.

The nodes PC09, PC0A and PC0B are NOT valid. Their _STA methods return just 0.
But PC09.DIN0, PC0A.DIN0 and PC0B.DIN0 look valid. These nodes return _STA
with the bits STA_PRESENT and STA_ENABLED set, even though their parent
devices are NOT present and enabled.

The diff below just extends the parent check to ALL ancestors of the node:
every one of them has to be present and enabled before we work with the node.

I also checked the Linux code path on this hardware. Linux filters out the
three problematic nodes during its run through acpi_bus_attach() and doesn't
even look at their DIN0 nodes.

Big thanks to Tristan Kundrat for his debugging support here!

With the diff below, everything just works fine on this hardware.

ok?

Bye,
Jan

Index: dev/acpi/acpi.c
===================================================================
RCS file: /cvs/src/sys/dev/acpi/acpi.c,v
diff -u -p -r1.443 acpi.c
--- dev/acpi/acpi.c	11 Feb 2025 16:22:37 -0000	1.443
+++ dev/acpi/acpi.c	13 Mar 2025 23:18:28 -0000
@@ -3224,6 +3224,7 @@ acpi_foundhid(struct aml_node *node, voi
 {
 	struct acpi_softc	*sc = (struct acpi_softc *)arg;
 	struct device		*self = (struct device *)arg;
+	struct aml_node		*parent;
 	char			 cdev[32];
 	char			 dev[32];
 	struct acpi_attach_args	 aaa;
@@ -3236,9 +3237,11 @@ acpi_foundhid(struct aml_node *node, voi
 	if (acpi_parsehid(node, arg, cdev, dev, sizeof(dev)) != 0)
 		return (0);
 
-	sta = acpi_getsta(sc, node->parent);
-	if ((sta & (STA_PRESENT | STA_ENABLED)) != (STA_PRESENT | STA_ENABLED))
-		return (0);
+	for (parent = node->parent; parent != NULL; parent = parent->parent) {
+		sta = acpi_getsta(sc, parent);
+		if (!ISSET(sta, STA_PRESENT) || !ISSET(sta, STA_ENABLED))
+			return (0);
+	}
 
 	if (aml_evalinteger(sc, node->parent, "_CCA", 0, NULL, &cca))
 		cca = 1;
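
For anyone who wants to play with the logic outside the kernel, here is a
rough userland sketch of the old and the new check. It is only a toy model,
not the kernel code: struct toy_node, toy_parent_ok() and toy_ancestors_ok()
are made-up names, the _STA values are canned instead of evaluated from AML,
and I simply assume that \_SB_ itself counts as present and enabled. As in
acpi_foundhid(), the node handed in is the _HID node, so its parent is the
device itself.

```c
#include <assert.h>
#include <stddef.h>

/* _STA result bits 0 and 1, as defined by the ACPI specification. */
#define STA_PRESENT	0x01
#define STA_ENABLED	0x02
#define STA_OK		(STA_PRESENT | STA_ENABLED)

/* Hypothetical stand-in for struct aml_node with a canned _STA value. */
struct toy_node {
	const char	*name;
	struct toy_node	*parent;
	int		 sta;	/* ignored for _HID leaf nodes */
};

/* Old behaviour: only the device owning the _HID node is checked. */
static int
toy_parent_ok(const struct toy_node *hid)
{
	assert(hid != NULL && hid->parent != NULL);
	return ((hid->parent->sta & STA_OK) == STA_OK);
}

/* New behaviour: the device and all of its ancestors must be usable. */
static int
toy_ancestors_ok(const struct toy_node *hid)
{
	const struct toy_node *p;

	assert(hid != NULL);
	for (p = hid->parent; p != NULL; p = p->parent) {
		if ((p->sta & STA_OK) != STA_OK)
			return (0);
	}
	return (1);
}

/* Small part of the tree from above: PC08 is real, PC09 is a phantom. */
struct toy_node toy_sb = { "_SB_", NULL, STA_OK };	/* assumption */
struct toy_node toy_pc08 = { "PC08", &toy_sb, STA_OK };
struct toy_node toy_pc08_din0 = { "DIN0", &toy_pc08, STA_OK };
struct toy_node toy_pc08_din0_hid = { "_HID", &toy_pc08_din0, 0 };
struct toy_node toy_pc09 = { "PC09", &toy_sb, 0 };
struct toy_node toy_pc09_hid = { "_HID", &toy_pc09, 0 };
struct toy_node toy_pc09_din0 = { "DIN0", &toy_pc09, STA_OK };
struct toy_node toy_pc09_din0_hid = { "_HID", &toy_pc09_din0, 0 };
```

In this model the old check accepts \_SB_.PC09.DIN0._HID, because DIN0 itself
claims to be present and enabled, while the ancestor walk rejects it as soon
as it reaches PC09.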