
From: Jan Klemkow <jan@openbsd.org>
Subject: Re: Device errors with Xeon w5-2545
To: tech@openbsd.org
Date: Fri, 14 Mar 2025 01:00:05 +0100

  • On Thu, Feb 06, 2025 at 11:35:00AM GMT, Jan Klemkow wrote:
    > I am having trouble running OpenBSD on newer Intel systems.
    > 
    > I have two systems with an ASRockRack W790D8UD-1L1N2T mainboard, but with two
    > different CPUs w5-2545 and w5-3525.  The system with the w5-3525 CPU just works
    > as expected.
    > 
    > The system with the w5-2545 CPU does not.  It hangs during boot and I got
    > several devices with errors while debugging this:
    > 
    > ahci(4) reports a failure when the first command times out, due to a busy
    > controller.  xhci(4) dies on the first interrupt due to 0xffffffff in the
    > status register.  em(4) reports an invalid EEPROM checksum during
    > initialization.  nvme(4) is not responsive either.
    > 
    > During the hang, it's not possible to trap into ddb with db_console=1 and a
    > break via the serial console.  It's possible to work around the hang by
    > disabling the xhci(4) driver via UKC, but the other devices still don't work.
    > 
    > A boot of Debian/Linux-stable shows that all devices operate normally there.
    > 
    > I flashed the latest BIOS from the vendor on both boards and compared all BIOS
    > configuration options.
    > 
    > Dmesgs of both systems are below.
    > 
    > To me, it looks like trouble on the PCIe bus, because all PCIe devices
    > have problems.  But I can't find a debugging approach to start with
    > while looking at the code.
    
    Finally, I found the bug.  It turns out all PCIe devices are attaching 4
    times instead of just once.  Examples from the earlier dmesgs:
    
    Non-working CPU:
    	em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:42:b8:fb
    	em1 at pci29 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
    	em2 at pci44 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
    	em3 at pci59 dev 0 function 0 "Intel I210" rev 0x03: cannot find mem space
    
    	nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
    	nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
    	nvme1 at pci35 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
    	nvme2 at pci50 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
    	nvme3 at pci65 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: unable to map registers
    
    	xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
    	xhci1 at pci22 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
    	xhci2 at pci37 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
    	xhci3 at pci52 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11intr_establish: pic msi pin -2147442688: can't share type 1 with 1
    
    
    Working CPU:
    	em0 at pci7 dev 0 function 0 "Intel I210" rev 0x03: msi, address 9c:6b:00:78:8a:52
    
    	nvme0 at pci13 dev 0 function 0 "Samsung SM981/PM981 NVMe" rev 0x00: msix, NVMe 1.3
    	nvme0: Samsung SSD 970 EVO Plus 250GB, firmware 2B2QEXM7, serial S4EUNS0X303722R
    
    	xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
    
    The reason for this behavior is three additional "phantom" PCIe root
    bridges, which are copies of the real one.  They are advertised via
    ACPI.
    
    This is part of the ACPI node tree:
    
    	\_SB_.PC00._HID
    	\_SB_.PC01._HID
    	\_SB_.PC02._HID
    	\_SB_.PC03._HID
    	\_SB_.PC04._HID
    	\_SB_.PC08._HID
    	\_SB_.PC08.DIN0._HID
    	\_SB_.PC09._HID
    	\_SB_.PC09.DIN0._HID
    	\_SB_.PC0A._HID
    	\_SB_.PC0A.DIN0._HID
    	\_SB_.PC0B._HID
    	\_SB_.PC0B.DIN0._HID
    
    The nodes PC00, PC01, PC02, PC03, PC04 and PC08.DIN0 are valid and
    attach properly via acpipci(4) and so on.
    
    The nodes PC09, PC0A and PC0B are NOT valid.  Their _STA methods
    return just 0.  But PC09.DIN0, PC0A.DIN0 and PC0B.DIN0 claim to be
    valid: these nodes return _STA with the bits STA_PRESENT and
    STA_ENABLED set, even though their parent devices are NOT present and
    enabled.
    
    The diff below just extends the parent check to ALL ancestors of the
    node: every one of them has to be present and enabled before we work
    with the node.
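
To illustrate the difference outside the kernel, here is a minimal userland sketch.  It is a toy model, not the acpi.c code: struct toy_node, the toy_* functions and the stored sta field are made up for this example and stand in for struct aml_node and acpi_getsta().  The tree mirrors \_SB_.PC09.DIN0._HID from the list above, assuming that nodes without a failing _STA simply count as present and enabled.

```c
#include <assert.h>
#include <stddef.h>

/* _STA bits 0 and 1 as defined by the ACPI specification. */
#define STA_PRESENT	(1 << 0)
#define STA_ENABLED	(1 << 1)
#define STA_OK		(STA_PRESENT | STA_ENABLED)	/* toy shorthand */

/* Hypothetical stand-in for struct aml_node; sta replaces acpi_getsta(). */
struct toy_node {
	struct toy_node	*parent;
	const char	*name;
	int		 sta;
};

/* Toy copy of \_SB_.PC09.DIN0._HID: PC09 returns 0, DIN0 claims to be fine. */
static struct toy_node toy_root = { NULL,      "\\",   STA_OK };
static struct toy_node toy_sb   = { &toy_root, "_SB_", STA_OK };
static struct toy_node toy_pc09 = { &toy_sb,   "PC09", 0 };
static struct toy_node toy_din0 = { &toy_pc09, "DIN0", STA_OK };
static struct toy_node toy_hid  = { &toy_din0, "_HID", 0 };	/* sta unused */

static int
toy_sta_ok(int sta)
{
	return ((sta & STA_OK) == STA_OK);
}

/* Old check: look only at the device that owns the _HID node. */
static int
toy_parent_ok(const struct toy_node *node)
{
	return (toy_sta_ok(node->parent->sta));
}

/* New check: every ancestor up to the root must be present and enabled. */
static int
toy_ancestors_ok(const struct toy_node *node)
{
	const struct toy_node *p;

	for (p = node->parent; p != NULL; p = p->parent) {
		if (!toy_sta_ok(p->sta))
			return (0);
	}
	return (1);
}

/* The old check accepts the phantom DIN0, the full walk rejects it. */
void
toy_selfcheck(void)
{
	assert(toy_parent_ok(&toy_hid));
	assert(!toy_ancestors_ok(&toy_hid));
}
```

In this model the old check stops at DIN0, which looks fine on its own, while the walk continues up to PC09 and rejects the node there.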
    
    I also checked the Linux code path on this hardware.  Linux filters out
    the three problematic nodes during its run through acpi_bus_attach() and
    doesn't even look at their DIN0 nodes.
    
    Big thanks to Tristan Kundrat for his debugging support here!
    
    With the diff below, everything just works fine on this hardware.
    
    ok?
    
    Bye,
    Jan
    
    Index: dev/acpi/acpi.c
    ===================================================================
    RCS file: /cvs/src/sys/dev/acpi/acpi.c,v
    diff -u -p -r1.443 acpi.c
    --- dev/acpi/acpi.c	11 Feb 2025 16:22:37 -0000	1.443
    +++ dev/acpi/acpi.c	13 Mar 2025 23:18:28 -0000
    @@ -3224,6 +3224,7 @@ acpi_foundhid(struct aml_node *node, voi
     {
     	struct acpi_softc	*sc = (struct acpi_softc *)arg;
     	struct device		*self = (struct device *)arg;
    +	struct aml_node		*parent;
     	char		 	 cdev[32];
     	char		 	 dev[32];
     	struct acpi_attach_args	 aaa;
    @@ -3236,9 +3237,11 @@ acpi_foundhid(struct aml_node *node, voi
     	if (acpi_parsehid(node, arg, cdev, dev, sizeof(dev)) != 0)
     		return (0);
     
    -	sta = acpi_getsta(sc, node->parent);
    -	if ((sta & (STA_PRESENT | STA_ENABLED)) != (STA_PRESENT | STA_ENABLED))
    -		return (0);
    +	for (parent = node->parent; parent != NULL; parent = parent->parent) {
    +		sta = acpi_getsta(sc, parent);
    +		if (!ISSET(sta, STA_PRESENT) || !ISSET(sta, STA_ENABLED))
    +			return (0);
    +	}
     
     	if (aml_evalinteger(sc, node->parent, "_CCA", 0, NULL, &cca))
     		cca = 1;
    
    