From: Martin Pieuchot
Subject: Re: [PATCH] amd64: import optimized memcmp from FreeBSD
To: Mark Kettenis
Cc: Mateusz Guzik, tech@openbsd.org
Date: Sat, 30 Nov 2024 15:45:54 +0100

Thanks for all the nice inputs! I'd appreciate an effort to bring more
optimized routines to the kernel/libc/etc.

On 29/11/24(Fri) 23:08, Mark Kettenis wrote:
> > From: Mateusz Guzik
> > Date: Fri, 29 Nov 2024 02:01:45 +0100
> >
> > The rep-prefixed cmps is incredibly slow even on modern CPUs.
> >
> > The new implementation uses regular cmp to do it.
> >
> > The code got augmented to account for retguard, otherwise it matches FreeBSD.
> >
> > On Sapphire Rapids open1 (open + close in a loop) from will-it-scale (ops/s):
> > before: 436177
> > after: 464670 (+6.5%)
>
> That is a nice speedup, but you'd need to present more than a single
> microbenchmark on a single microarchitecture to convince me that this
> is worth the extra complication of the assembly code.

Here are numbers from my favorite microbenchmark, on a 24 CPUs
ThinkCentre M90s Gen 5, dmesg below.

vanilla
=======
kernel     1m07.05s real     8m37.97s user     6m14.12s system
libLLVM   12m41.54s real   195m02.44s user    80m11.87s system

with memcpy
===========
kernel     1m00.66s real     8m28.24s user     5m56.32s system
libLLVM   12m27.16s real   193m46.84s user    75m29.95s system


OpenBSD 7.6-current (GENERIC.MP) #31: Sat Nov 30 14:52:25 CET 2024
    root@caju.grenadille.net:/sys/arch/amd64/compile/GENERIC.MP
real mem = 33942417408 (32370MB)
avail mem = 32635703296 (31123MB)
random: good seed from bootblocks
mpath0 at root
scsibus0 at mpath0: 256 targets
mainbus0 at root
bios0 at mainbus0: SMBIOS rev. 3.6 @ 0x71357000 (102 entries)
bios0: vendor LENOVO version "M4SKT67A" date 05/13/2024
bios0: LENOVO 12V4CTO1WW
efi0 at bios0: UEFI 2.8
efi0: American Megatrends rev 0x5001b
acpi0 at bios0: ACPI 6.5
acpi0: sleep states S0 S3 S4 S5
acpi0: tables DSDT FACP SSDT FIDT SSDT SSDT SSDT SSDT SSDT HPET APIC MCFG SSDT UEFI NHLT LPIT SSDT SSDT DBGP DBG2 SSDT DMAR FPDT SSDT SSDT BGRT LUFT TPM2 PHAT ASF! WSMT
acpi0: wakeup devices PEG1(S4) PEGP(S4) PEGP(S4) PEGP(S4) SIO1(S3) RP09(S4) PXSX(S4) RP10(S4) PXSX(S4) RP11(S4) PXSX(S4) RP12(S4) PXSX(S4) RP13(S4) PXSX(S4) RP14(S4) [...]
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpihpet0 at acpi0: 19200000 Hz
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: Intel(R) Core(TM) i9-14900, 5488.56 MHz, 06-b7-01, patch 0000012b
cpu0: cpuid 1 edx=bfebfbff ecx=77fafbff
cpu0: cpuid 6 eax=dfcff7 ecx=409
cpu0: cpuid 7.0 ebx=239c27eb ecx=98c027ac edx=fc1cc410
cpu0: cpuid a vers=5, gp=6, gpwidth=48, ff=3, ffwidth=48
cpu0: cpuid d.1 eax=f
cpu0: cpuid 80000001 edx=2c100800 ecx=121
cpu0: cpuid 80000007 edx=100
cpu0: msr 10a=1488fd6b
cpu0: 48KB 64b/line 12-way D-cache, 32KB 64b/line 8-way I-cache, 2MB 64b/line 16-way L2 cache, 36MB 64b/line 12-way L3 cache
cpu0: smt 0, core 0, package 0
mtrr: Pentium Pro MTRR support, 10 var ranges, 88 fixed ranges
cpu0: apic clock running at 38MHz
cpu0: mwait min=64, max=64, C-substates=0.2.0.2.0.1.0.1, IBE
cpu1 at mainbus0: apid 8 (application processor)
cpu1: Intel(R) Core(TM) i9-14900, 5489.15 MHz, 06-b7-01, patch 0000012b
cpu1: smt 0, core 4, package 0
cpu2 at mainbus0: apid 16 (application processor)
cpu2: Intel(R) Core(TM) i9-14900, 5488.76 MHz, 06-b7-01, patch 0000012b
cpu2: smt 0, core 8, package 0
cpu3 at mainbus0: apid 24 (application processor)
cpu3: Intel(R) Core(TM) i9-14900, 5488.84 MHz, 06-b7-01, patch 0000012b
cpu3: smt 0, core 12, package 0
cpu4 at mainbus0: apid 32 (application processor)
cpu4: Intel(R) Core(TM) i9-14900, 5488.65 MHz, 06-b7-01, patch 0000012b
cpu4: smt 0, core 16, package 0
cpu5 at mainbus0: apid 40 (application processor)
cpu5: Intel(R) Core(TM) i9-14900, 5488.63 MHz, 06-b7-01, patch 0000012b
cpu5: smt 0, core 20, package 0
cpu6 at mainbus0: apid 48 (application processor)
cpu6: Intel(R) Core(TM) i9-14900, 5488.74 MHz, 06-b7-01, patch 0000012b
cpu6: smt 0, core 24, package 0
cpu7 at mainbus0: apid 56 (application processor)
cpu7: Intel(R) Core(TM) i9-14900, 5488.88 MHz, 06-b7-01, patch 0000012b
cpu7: smt 0, core 28, package 0
cpu8 at mainbus0: apid 64 (application processor)
cpu8: Intel(R) Core(TM) i9-14900, 4290.12 MHz, 06-b7-01, patch 0000012b
cpu8: 32KB 64b/line 8-way D-cache, 64KB 64b/line 8-way I-cache, 4MB 64b/line 16-way L2 cache, 36MB 64b/line 12-way L3 cache
cpu8: smt 0, core 32, package 0
cpu9 at mainbus0: apid 66 (application processor)
cpu9: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu9: smt 0, core 33, package 0
cpu10 at mainbus0: apid 68 (application processor)
cpu10: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu10: smt 0, core 34, package 0
cpu11 at mainbus0: apid 70 (application processor)
cpu11: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu11: smt 0, core 35, package 0
cpu12 at mainbus0: apid 72 (application processor)
cpu12: Intel(R) Core(TM) i9-14900, 4290.12 MHz, 06-b7-01, patch 0000012b
cpu12: smt 0, core 36, package 0
cpu13 at mainbus0: apid 74 (application processor)
cpu13: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu13: smt 0, core 37, package 0
cpu14 at mainbus0: apid 76 (application processor)
cpu14: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu14: smt 0, core 38, package 0
cpu15 at mainbus0: apid 78 (application processor)
cpu15: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu15: smt 0, core 39, package 0
cpu16 at mainbus0: apid 80 (application processor)
cpu16: Intel(R) Core(TM) i9-14900, 4290.07 MHz, 06-b7-01, patch 0000012b
cpu16: smt 0, core 40, package 0
cpu17 at mainbus0: apid 82 (application processor)
cpu17: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu17: smt 0, core 41, package 0
cpu18 at mainbus0: apid 84 (application processor)
cpu18: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu18: smt 0, core 42, package 0
cpu19 at mainbus0: apid 86 (application processor)
cpu19: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu19: smt 0, core 43, package 0
cpu20 at mainbus0: apid 88 (application processor)
cpu20: Intel(R) Core(TM) i9-14900, 4290.07 MHz, 06-b7-01, patch 0000012b
cpu20: smt 0, core 44, package 0
cpu21 at mainbus0: apid 90 (application processor)
cpu21: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu21: smt 0, core 45, package 0
cpu22 at mainbus0: apid 92 (application processor)
cpu22: Intel(R) Core(TM) i9-14900, 4290.11 MHz, 06-b7-01, patch 0000012b
cpu22: smt 0, core 46, package 0
cpu23 at mainbus0: apid 94 (application processor)
cpu23: Intel(R) Core(TM) i9-14900, 4190.34 MHz, 06-b7-01, patch 0000012b
cpu23: smt 0, core 47, package 0
ioapic0 at mainbus0: apid 2 pa 0xfec00000, version 20, 120 pins
acpimcfg0 at acpi0
acpimcfg0: addr 0xc0000000, bus 0-255
acpiprt0 at acpi0: bus 0 (PC00)
acpiprt1 at acpi0: bus -1 (PEG1)
acpiprt2 at acpi0: bus -1 (RP09)
acpiprt3 at acpi0: bus -1 (RP10)
acpiprt4 at acpi0: bus -1 (RP11)
acpiprt5 at acpi0: bus -1 (RP12)
acpiprt6 at acpi0: bus -1 (RP13)
acpiprt7 at acpi0: bus -1 (RP14)
acpiprt8 at acpi0: bus -1 (RP15)
acpiprt9 at acpi0: bus -1 (RP16)
acpiprt10 at acpi0: bus -1 (RP01)
acpiprt11 at acpi0: bus -1 (RP02)
acpiprt12 at acpi0: bus -1 (RP03)
acpiprt13 at acpi0: bus -1 (RP04)
acpiprt14 at acpi0: bus -1 (RP05)
acpiprt15 at acpi0: bus -1 (RP06)
acpiprt16 at acpi0: bus -1 (RP07)
acpiprt17 at acpi0: bus -1 (RP08)
acpiprt18 at acpi0: bus -1 (RP17)
acpiprt19 at acpi0: bus -1 (RP18)
acpiprt20 at acpi0: bus -1 (RP19)
acpiprt21 at acpi0: bus -1 (RP20)
acpiprt22 at acpi0: bus -1 (RP21)
acpiprt23 at acpi0: bus -1 (RP22)
acpiprt24 at acpi0: bus -1 (RP23)
acpiprt25 at acpi0: bus -1 (RP24)
acpiprt26 at acpi0: bus 1 (RP25)
acpiprt27 at acpi0: bus -1 (RP26)
acpiprt28 at acpi0: bus -1 (RP27)
acpiprt29 at acpi0: bus -1 (RP28)
acpiec0 at acpi0: not present
acpipci0 at acpi0 PC00: 0x00000000 0x00000011 0x00000001
com0 at acpi0 UAR1 addr 0x3f8/0x8 irq 4: ns16550a, 16 byte fifo
com0: console
com1 at acpi0 UAR2 addr 0x2f8/0x8 irq 3: ns16550a, 16 byte fifo
"ACPI000E" at acpi0 not configured
acpibtn0 at acpi0: SLPB
acpicpu0 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu1 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu2 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu3 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu4 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu5 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu6 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu7 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu8 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu9 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu10 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu11 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu12 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu13 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu14 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu15 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu16 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu17 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu18 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu19 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu20 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu21 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu22 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
acpicpu23 at acpi0: C3(200@1048 mwait.1@0x60), C2(350@127 mwait.1@0x21), C1(1000@1 mwait.1), PSS
"PNP0C14" at acpi0 not configured
"PNP0C14" at acpi0 not configured
intelpmc0 at acpi0: PEPD
state 0: 0x7f:1:2:0x00:0x0000000000000060
counter: 0x7f:64:0:0x00:0x0000000000000632
frequency: 0
state 1: 0x7f:1:2:0x00:0x0000000000000060
counter: 0x00:32:0:0x03:0x00000000fe001098
frequency: 32768
acpibtn1 at acpi0: PWRB
tpm0 at acpi0 TPM_ 2.0 (TIS) addr 0xfed40000/0x5000, device 0x001d15d1 rev 0x36
"INTC10A0" at acpi0 not configured
"PNP0C0B" at acpi0 not configured
"PNP0C0B" at acpi0 not configured
"PNP0C0B" at acpi0 not configured
"PNP0C0B" at acpi0 not configured
"PNP0C0B" at acpi0 not configured
"PNP0C14" at acpi0 not configured
"PNP0C14" at acpi0 not configured
"PNP0C14" at acpi0 not configured
acpipwrres0 at acpi0: BTRT
acpipwrres1 at acpi0: DBTR
acpipwrres2 at acpi0: WRST
acpipwrres3 at acpi0: FN00, resource for FAN0
acpipwrres4 at acpi0: FN01, resource for FAN1
acpipwrres5 at acpi0: FN02, resource for FAN2
acpipwrres6 at acpi0: FN03, resource for FAN3
acpipwrres7 at acpi0: FN04, resource for FAN4
acpitz0 at acpi0: critical temperature is 105 degC
acpipwrres8 at acpi0: PIN_
acpivideo0 at acpi0: GFX0
acpivout0 at acpivideo0: DD1F
acpivout1 at acpivideo0: DD2F
cpu0: using VERW MDS workaround
cpu0: Enhanced SpeedStep 5488 MHz: speeds: 2001, 2000, 1900, 1800, 1700, 1600, 1500, 1400, 1300, 1200, 1100, 1000, 900, 800 MHz
pci0 at mainbus0 bus 0
0:31:5: mem address conflict 0xfe010000/0x1000
pchb0 at pci0 dev 0 function 0 "Intel Core 13G Host" rev 0x01
inteldrm0 at pci0 dev 2 function 0 "Intel Graphics" rev 0x04
drm0 at inteldrm0
inteldrm0: msi, ALDERLAKE_S, gen 12
"Intel Core 13G DTT" rev 0x01 at pci0 dev 4 function 0 not configured
xhci0 at pci0 dev 20 function 0 "Intel 600 Series xHCI" rev 0x11: msi, xHCI 1.20
usb0 at xhci0: USB revision 3.0
uhub0 at usb0 configuration 1 interface 0 "Intel xHCI root hub" rev 3.00/1.00 addr 1
"Intel 600 Series Shared SRAM" rev 0x11 at pci0 dev 20 function 2 not configured
iwx0 at pci0 dev 20 function 3 "Intel Wi-Fi 6 AX211" rev 0x11, msix
"Intel 600 Series HECI" rev 0x11 at pci0 dev 22 function 0 not configured
puc0 at pci0 dev 22 function 3 "Intel 600 Series KT" rev 0x11: ports: 16 com
com4 at puc0 port 0 apic 2 int 19: ns16550a, 16 byte fifo
ahci0 at pci0 dev 23 function 0 "Intel 600 Series AHCI" rev 0x11: msi, AHCI 1.3.1
ahci0: PHY offline on port 4
ahci0: PHY offline on port 5
ahci0: PHY offline on port 6
ahci0: PHY offline on port 7
scsibus1 at ahci0: 32 targets
ppb0 at pci0 dev 26 function 0 "Intel 600 Series PCIE" rev 0x11: msi
pci1 at ppb0 bus 1
nvme0 at pci1 dev 0 function 0 "Samsung PM9C1a NVMe" rev 0x00: msix, NVMe 2.0
nvme0: SAMSUNG MZVL81T0HDLB-00BLL, firmware 4L2QKXD7, serial S7G8NF0X420031
scsibus2 at nvme0: 2 targets, initiator 0
sd0 at scsibus2 targ 1 lun 0:
sd0: 976762MB, 512 bytes/sector, 2000409264 sectors
pcib0 at pci0 dev 31 function 0 vendor "Intel", unknown product 0x7a83 rev 0x11
azalia0 at pci0 dev 31 function 3 "Intel 600 Series HD Audio" rev 0x11: msi
azalia0: codecs: Realtek ALC897, Intel/0x2818, using Realtek ALC897
audio0 at azalia0
ichiic0 at pci0 dev 31 function 4 "Intel 600 Series SMBus" rev 0x11: apic 2 int 18
iic0 at ichiic0
iic0: addr 0x49 15=2c 16=20 19=04 1b=05 1c=60 1e=60 1f=60 20=89 21=78 22=63 25=78 26=63 27=78 28=63 29=80 2a=88 2b=42 2c=20 2d=22 2e=04 2f=5a 32=80 34=0e 3b=06 3c=0b 3d=2a words 00=0000 01=0000 02=0000 03=0000 04=0000 05=0000 06=0000 07=0000
iic0: addr 0x4b 15=2c 16=20 19=04 1b=05 1c=60 1e=60 1f=60 20=89 21=78 22=63 25=78 26=63 27=78 28=63 29=80 2a=88 2b=42 2c=20 2d=22 2e=04 2f=5a 32=80 34=0e 3b=06 3c=0b 3d=2a words 00=0000 01=0000 02=0000 03=0000 04=0000 05=0000 06=0000 07=0000
"Intel 600 Series SPI" rev 0x11 at pci0 dev 31 function 5 not configured
em0 at pci0 dev 31 function 6 "Intel I219-LM" rev 0x11: msi, address 04:7c:16:fb:15:1d
isa0 at pcib0
isadma0 at isa0
pckbc0 at isa0 port 0x60/5 irq 1 irq 12
pckbd0 at pckbc0 (kbd slot)
wskbd0 at pckbd0 mux 1
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
vmm0 at mainbus0: VMX/EPT
efifb at mainbus0 not configured
ugen0 at uhub0 port 14 "Intel Bluetooth" rev 2.01/0.02 addr 2
vscsi0 at root
scsibus3 at vscsi0: 256 targets
softraid0 at root
scsibus4 at softraid0: 256 targets
root on sd0a (3fed1616e6e06415.a) swap on sd0b dump on sd0b
drm:pid0:ct_send *ERROR* [drm] *ERROR* GT0: GUC: CT: No response for request 0x4000 (fence 1)
drm:pid0:intel_guc_ct_send *ERROR* [drm] *ERROR* GT0: GUC: CT: Sending action 0x4000 failed (0xffffffffffffffc4e) status=0
drm:pid0:intel_huc_auth *ERROR* [drm] *ERROR* GT0: HuC: all workloads authentication failed 0xffffffffffffffc4e
drm:pid14050:ct_handle_response *ERROR* [drm] *ERROR* GT0: GUC: CT: Unsolicited response message: len 1, data 0xf0000000 (fence 1, last 1)
drm:pid14050:ct_handle_hxg *ERROR* [drm] *ERROR* GT0: GUC: CT: Failed to handle HXG message (0xfffffffffffffffee) 0xffff800000a79418h
drm:pid14050:ct_handle_msg *ERROR* [drm] *ERROR* GT0: GUC: CT: Failed to process CT message (0xfffffffffffffffee) 0xffff800000a79414h
inteldrm0: 1024x768, 32bpp
wsdisplay0 at inteldrm0 mux 1
pckbd_enable: command error
wsdisplay0: screen 0-5 added (std, vt100 emulation)
iwx0: hw rev 0x430, fw 77.f92b5fed.0, address 6c:f6:da:d6:63:84
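
For anyone who wants the relative differences rather than raw wall-clock times, here is a minimal sketch that turns the time(1)-style figures quoted before the dmesg into percentage changes. The helper names (`to_seconds`, `percent_change`) and the `TIMINGS` layout are made up for illustration; only the timing strings come from the measurements above, and a single run per configuration says nothing about run-to-run noise.

```python
# Sketch only: helper names are hypothetical, timings are copied from the mail.

def to_seconds(stamp: str) -> float:
    """Convert a time(1)-style string such as '12m41.54s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def percent_change(before: str, after: str) -> float:
    """Relative change from before to after, in percent (negative = faster)."""
    b = to_seconds(before)
    a = to_seconds(after)
    return (a - b) / b * 100.0


# (before, after) pairs: "vanilla" vs. the patched run, per workload and column.
TIMINGS = {
    "kernel": {
        "real": ("1m07.05s", "1m00.66s"),
        "user": ("8m37.97s", "8m28.24s"),
        "system": ("6m14.12s", "5m56.32s"),
    },
    "libLLVM": {
        "real": ("12m41.54s", "12m27.16s"),
        "user": ("195m02.44s", "193m46.84s"),
        "system": ("80m11.87s", "75m29.95s"),
    },
}

if __name__ == "__main__":
    for workload, columns in TIMINGS.items():
        for column, (before, after) in columns.items():
            delta = percent_change(before, after)
            print(f"{workload:8s} {column:6s} {delta:+.1f}%")
```

Negative values mean the patched run took less time in that column.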