From: Jan Klemkow
Subject: Re: Device errors with Xeon w5-2545
To: Mark Kettenis
Cc: tech@openbsd.org
Date: Fri, 14 Mar 2025 11:22:58 +0100

On Fri, Mar 14, 2025 at 10:45:49AM +0100, Mark Kettenis wrote:
> > Date: Fri, 14 Mar 2025 01:00:05 +0100
> > From: Jan Klemkow
> >
> > On Thu, Feb 06, 2025 at 11:35:00AM GMT, Jan Klemkow wrote:
> > > I get some troubles trying OpenBSD on newer Intel systems.
> > >
> > > I have two systems with an ASRockRack W790D8UD-1L1N2T mainboard, but with two
> > > different CPUs, w5-2545 and w5-3525. The system with the w5-3525 CPU just works
> > > as expected.
> > >
> > > The system with the w5-2545 CPU does not. It hangs during boot and I got
> > > several devices with errors while debugging this:
> > >
> > > ahci(4) reports failures on the first command timeout, due to a busy
> > > controller. xhci(4) dies on the first interrupt due to 0xffffffff in the
> > > status register. em(4) reports an invalid checksum of the EEPROM during
> > > initialization. Also, nvme(4) is not responsive.
> > >
> > > During the hang, it's not possible to trap into ddb with db_console=1 and break
> > > via serial console. It's possible to work around the hang by disabling the
> > > xhci(4) driver via UKC. But, the other devices still don't work.
> > >
> > > A boot of Debian/Linux-stable shows that all devices operate normally here.
> > >
> > > I flashed the latest BIOS from the vendor on both boards and compared all BIOS
> > > configuration options.
> > >
> > > Dmesgs of both systems are below.
> > >
> > > It looks like some trouble in the PCIe bus, to me. Because, all PCIe
> > > devices have problems. But, I can't find a debugging approach to start
> > > with, while looking in the code.
> >
> > Finally, I found the bug. Turns out all PCIe devices are attaching 4
> > times instead of just once.
> > Example from the dmesgs before:
> >
> > Non-working CPU:
> > em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:42:b8:fb
> > em1 at pci29 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
> > em2 at pci44 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
> > em3 at pci59 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
> >
> > nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
> > nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
> > nvme1 at pci35 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
> > nvme2 at pci50 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
> > nvme3 at pci65 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
> >
> > xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
> > xhci1 at pci22 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
> > xhci2 at pci37 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
> > xhci3 at pci52 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
> >
> >
> > Working CPU:
> > em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:78:8a:52
> >
> > nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
> > nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
> >
> > xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
> >
> > The reason for this behavior is three additional "phantom" PCIe root
> > bridges, which are copies of the real one. They are advertised via
> > ACPI.
> >
> > A part of the ACPI node tree is this:
> >
> >     \_SB_.PC00._HID
> >     \_SB_.PC01._HID
> >     \_SB_.PC02._HID
> >     \_SB_.PC03._HID
> >     \_SB_.PC04._HID
> >     \_SB_.PC08._HID
> >     \_SB_.PC08.DIN0._HID
> >     \_SB_.PC09._HID
> >     \_SB_.PC09.DIN0._HID
> >     \_SB_.PC0A._HID
> >     \_SB_.PC0A.DIN0._HID
> >     \_SB_.PC0B._HID
> >     \_SB_.PC0B.DIN0._HID
> >
> > The nodes PC00, PC01, PC02, PC03, PC04 and PC08.DIN0 are valid and
> > attach properly via acpipci(4) and so on.
> >
> > The nodes PC09, PC0A and PC0B are NOT valid. Their _STA methods return
> > just 0. But, PC09.DIN0, PC0A.DIN0 and PC0B.DIN0 are valid. These nodes
> > return _STA with the bits STA_PRESENT and STA_ENABLED set, even when
> > their parent devices are NOT present and enabled.
> >
> > The diff below just expands the parent check to ALL parents of the node,
> > which have to be valid before working with them.
> >
> > I also checked the Linux code path on this hardware. Linux filters out
> > the three problematic nodes during its run through acpi_bus_attach() and
> > doesn't even look at their DIN0 nodes.
> >
> > Big thanks to Tristan Kundrat for his debugging support here!
> >
> > With the diff below, everything just works fine on this hardware.
> >
> > ok?
>
> This is not quite right.
>
> When we check _STA we should return 1 to prevent enumeration of
> children. The condition for that is somewhat complex and the
> documentation isn't very clear (see 6.3.7 in the ACPI spec), but I
> think the following is what we would need:
>
>     if ((sta & STA_PRESENT) == 0 && (sta & STA_DEV_OK) == 0)
>         return (1);
>     if ((sta & STA_ENABLED) == 0)
>         return (0);

Oh, yeah. It's way smarter and avoids the extra loops.

> In any case a change in this area needs thorough testing.

That's right. I guess this will affect other strange hardware out
there. Any tests are welcome!
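
For anyone who wants to reason about the cases before testing on real
hardware, here is a minimal standalone sketch of the decision, not the
kernel code itself. The STA_* values follow the _STA bit layout in the
ACPI spec, and ISSET() is redefined locally. The enum and sta_decide()
are names made up for this illustration only.

```c
/* _STA bits as laid out in the ACPI spec (bit 0, bit 1, bit 3). */
#define STA_PRESENT	(1L << 0)
#define STA_ENABLED	(1L << 1)
#define STA_DEV_OK	(1L << 3)

#define ISSET(t, f)	((t) & (f))

/* Hypothetical names for the three outcomes of the parent check. */
enum sta_action {
	STA_ACT_ATTACH,		/* go on and attach the device */
	STA_ACT_SKIP_NODE,	/* like "return (0)": skip only this node */
	STA_ACT_SKIP_SUBTREE	/* like "return (1)": don't visit children */
};

/*
 * Sketch of the proposed check on the parent's _STA value.
 * Assumption: a non-zero return from the walk callback stops the
 * enumeration of children, as described in the review comment.
 */
static enum sta_action
sta_decide(long sta)
{
	/* Neither present nor functioning: prune the whole subtree. */
	if (!ISSET(sta, STA_PRESENT) && !ISSET(sta, STA_DEV_OK))
		return (STA_ACT_SKIP_SUBTREE);
	/* Present or functioning, but not enabled: skip this node only. */
	if (!ISSET(sta, STA_ENABLED))
		return (STA_ACT_SKIP_NODE);
	return (STA_ACT_ATTACH);
}
```

With this, a parent like PC09 whose _STA returns 0 lands in the
"skip subtree" case, so its DIN0 child would not be looked at.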

Thanks,
Jan

Index: dev/acpi/acpi.c
===================================================================
RCS file: /cvs/src/sys/dev/acpi/acpi.c,v
diff -u -p -r1.443 acpi.c
--- dev/acpi/acpi.c	11 Feb 2025 16:22:37 -0000	1.443
+++ dev/acpi/acpi.c	14 Mar 2025 10:13:54 -0000
@@ -3237,7 +3237,9 @@ acpi_foundhid(struct aml_node *node, voi
 		return (0);
 
 	sta = acpi_getsta(sc, node->parent);
-	if ((sta & (STA_PRESENT | STA_ENABLED)) != (STA_PRESENT | STA_ENABLED))
+	if (!ISSET(sta, STA_PRESENT) && !ISSET(sta, STA_DEV_OK))
+		return (1);
+	if (!ISSET(sta, STA_ENABLED))
 		return (0);
 
 	if (aml_evalinteger(sc, node->parent, "_CCA", 0, NULL, &cca))