Device errors with Xeon w5-2545
> Date: Fri, 14 Mar 2025 01:00:05 +0100
> From: Jan Klemkow <jan@openbsd.org>
>
> On Thu, Feb 06, 2025 at 11:35:00AM GMT, Jan Klemkow wrote:
> > I get some troubles trying OpenBSD on newer Intel Systems.
> >
> > I have two systems with an ASRockRack W790D8UD-1L1N2T mainboard, but with two
> > different CPUs w5-2545 and w5-3525. The system with the w5-3525 CPU just works
> > as expected.
> >
> > The system with the w5-2545 CPU does not. It hangs during boot and I got
> > several devices with errors while debugging this:
> >
> > ahci(4) reports failures on the first command timeout, due to a busy
> > controller. xhci(4) dies on the first interrupt due to 0xffffffff in the
> > status register. em(4) reports invalid checksum of the EEPROM during
> > initialization. Also, nvme(4) is not responsive.
> >
> > During the hang, it's not possible to trap into ddb with db_console=1 and break
> > via serial console. It's possible to work around the hang by disabling the
> > xhci(4) driver via UKC. But the other devices still don't work.
> >
> > A boot of Debian/Linux-stable shows that all devices operate normally here.
> >
> > I flashed the latest BIOS from the vendor on both boards and compared all BIOS
> > configuration options.
> >
> > Dmesgs of both systems are below.
> >
> > It looks like some trouble in the PCIe bus, to me. Because, all PCIe
> > devices have problems. But, I can't find a debugging approach to start
> > with, while looking in the code.
>
> Finally, I found the bug. Turns out all PCIe devices are attaching 4
> times instead of just once. Example from the dmesgs before:
>
> Non-working CPU:
> em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:42:b8:fb
> em1 at pci29 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
> em2 at pci44 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
> em3 at pci59 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
>
> nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
> nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
> nvme1 at pci35 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
> nvme2 at pci50 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
> nvme3 at pci65 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
>
> xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
> xhci1 at pci22 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
> xhci2 at pci37 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
> xhci3 at pci52 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
>
>
> Working CPU:
> em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:78:8a:52
>
> nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
> nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
>
> xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
>
> The reason for this behavior is 3 additional "phantom" PCIe root
> bridges, which are copies of the real one. They are advertised via
> ACPI.
>
> A part of the ACPI nodes tree is this:
>
> \_SB_.PC00._HID
> \_SB_.PC01._HID
> \_SB_.PC02._HID
> \_SB_.PC03._HID
> \_SB_.PC04._HID
> \_SB_.PC08._HID
> \_SB_.PC08.DIN0._HID
> \_SB_.PC09._HID
> \_SB_.PC09.DIN0._HID
> \_SB_.PC0A._HID
> \_SB_.PC0A.DIN0._HID
> \_SB_.PC0B._HID
> \_SB_.PC0B.DIN0._HID
>
> The nodes PC00, PC01, PC02, PC03, PC04 and PC08.DIN0 are valid and
> attaching properly via acpipci(4) and so on.
>
> The nodes PC09, PC0A and PC0B are NOT valid. Their _STA method returns
> just 0. But PC09.DIN0, PC0A.DIN0 and PC0B.DIN0 are valid. These nodes
> return _STA with the bits STA_PRESENT and STA_ENABLED set, even though
> their parent devices are NOT present and enabled.
>
> The diff below just expands the parent check to ALL parents of the node,
> which have to be valid before working with them.
>
> I also checked the Linux code path on this hardware. Linux filters out
> the three problematic nodes during its run through acpi_bus_attach() and
> doesn't even look at their DIN0 nodes.
>
> Big thanks to Tristan Kundrat for his debugging support here!
>
> With the diff below, everything just works fine on this hardware.
>
> ok?

This is not quite right.

When we check _STA we should return 1 to prevent enumeration of
children. The condition for that is somewhat complex and the
documentation isn't very clear (see 6.3.7 in the ACPI spec), but I
think the following is what we would need:

	if ((sta & STA_PRESENT) == 0 && (sta & STA_DEV_OK) == 0)
		return (1);
	if ((sta & STA_ENABLED) == 0)
		return (0);

In any case a change in this area needs thorough testing.

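
To make the effect of that condition concrete, here is a minimal userland
sketch, not kernel code. It assumes a toy tree type (struct toy_node) and
hypothetical helpers (toy_check_sta, toy_walk, toy_demo) that are not part of
acpi.c, and it checks each node's own _STA rather than its parent's as
acpi_foundhid() does. The STA_* values follow the _STA bit positions in the
ACPI spec. The tree is shaped like the PC08/PC09 part of the report: PC09
returns 0, while its DIN0 child claims to be present and enabled.

```c
#include <stddef.h>

/* _STA bit positions as defined by the ACPI specification. */
#define STA_PRESENT	(1 << 0)
#define STA_ENABLED	(1 << 1)
#define STA_DEV_OK	(1 << 3)

/* Value used for "valid" nodes in this sketch (an assumption). */
#define TOY_STA_OK	(STA_PRESENT | STA_ENABLED)

/* Hypothetical stand-in for struct aml_node; not the kernel type. */
struct toy_node {
	const char	*name;
	int		 sta;		/* value _STA would return */
	struct toy_node	*child;		/* first child */
	struct toy_node	*sibling;	/* next sibling */
};

/*
 * The proposed condition: return 1 to stop enumeration of children,
 * 0 to continue.  *attach reports whether the node itself is usable.
 */
static int
toy_check_sta(int sta, int *attach)
{
	*attach = 0;
	if ((sta & STA_PRESENT) == 0 && (sta & STA_DEV_OK) == 0)
		return (1);
	if ((sta & STA_ENABLED) == 0)
		return (0);
	*attach = 1;
	return (0);
}

/* Depth-first walk; returns the number of nodes that would attach. */
static int
toy_walk(struct toy_node *n)
{
	int attach, count = 0;

	for (; n != NULL; n = n->sibling) {
		if (toy_check_sta(n->sta, &attach))
			continue;	/* prune the whole subtree */
		count += attach;
		count += toy_walk(n->child);
	}
	return (count);
}

/* Builds the PC08/PC09 shaped tree and counts attachable nodes. */
int
toy_demo(void)
{
	struct toy_node din9 = { "PC09.DIN0", TOY_STA_OK, NULL, NULL };
	struct toy_node pc09 = { "PC09", 0, &din9, NULL };
	struct toy_node din8 = { "PC08.DIN0", TOY_STA_OK, NULL, NULL };
	struct toy_node pc08 = { "PC08", TOY_STA_OK, &din8, &pc09 };

	return (toy_walk(&pc08));
}
```

In this sketch only PC08 and PC08.DIN0 are counted. PC09 is pruned by the
first test, so PC09.DIN0 is never visited even though its own _STA looks
fine. A node that is present but not enabled would be skipped itself while
its children are still walked.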
> Index: dev/acpi/acpi.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/acpi/acpi.c,v
> diff -u -p -r1.443 acpi.c
> --- dev/acpi/acpi.c 11 Feb 2025 16:22:37 -0000 1.443
> +++ dev/acpi/acpi.c 13 Mar 2025 23:18:28 -0000
> @@ -3224,6 +3224,7 @@ acpi_foundhid(struct aml_node *node, voi
>  {
>  	struct acpi_softc *sc = (struct acpi_softc *)arg;
>  	struct device *self = (struct device *)arg;
> +	struct aml_node *parent;
>  	char cdev[32];
>  	char dev[32];
>  	struct acpi_attach_args aaa;
> @@ -3236,9 +3237,11 @@ acpi_foundhid(struct aml_node *node, voi
>  	if (acpi_parsehid(node, arg, cdev, dev, sizeof(dev)) != 0)
>  		return (0);
>
> -	sta = acpi_getsta(sc, node->parent);
> -	if ((sta & (STA_PRESENT | STA_ENABLED)) != (STA_PRESENT | STA_ENABLED))
> -		return (0);
> +	for (parent = node->parent; parent != NULL; parent = parent->parent) {
> +		sta = acpi_getsta(sc, parent);
> +		if (!ISSET(sta, STA_PRESENT) || !ISSET(sta, STA_ENABLED))
> +			return (0);
> +	}
>
>  	if (aml_evalinteger(sc, node->parent, "_CCA", 0, NULL, &cca))
>  		cca = 1;
>
>