From: Landry Breuil
Subject: Re: kernel rwlocks vs the scheduler
To: tech@openbsd.org
Date: Fri, 15 Nov 2024 09:42:58 +0100

On Wed, Nov 13, 2024 at 11:41:10PM +1000, David Gwynne wrote:
> i've been hitting this diff hard, but would appreciate more tests.
> rwlocks are a very fundamental part of the kernel so they need to
> work. benchmarks and witness tests are also welcome.

some numbers from my favourite benchmark, ie building firefox, this time
on the arm64 omnibook with 12 cores.

without the diff
================

only changing the number of threads allocated to lld, so MAKE_JOBS=12
(real/user/sys are the time for lld libxul, in seconds)

lld threads  took (s)  real     user     sys
1            6684      3828.96  3347.41  451.48
2            4968      2037.03  3436.02  519.43
3            4886      2037.39  3980.19  1336.71
4            4418      1563.20  4209.59  1277.40
5            4632      1620.22  4730.67  2141.68
6            4237      1293.13  4739.12  1854.86
7            4082      1204.65  4979.91  2136.25
8            4154      1251.00  5328.76  2997.02

all  640m04.18s real  3040m01.82s user  1285m49.42s system

fixed 5 threads for lld, changing MAKE_JOBS, in two batches, first 1 to 8
then 9 to 12:

MAKE_JOBS 1->8  1043m35.20s real  2507m44.56s user  747m08.12s system
MAKE_JOBS 9-12   297m41.30s real  1482m53.38s user  584m43.49s system

(real/user/system are the time for the whole build)

MAKE_JOBS  took (s)  real        user        system
1          17974     299m33.63s  299m25.49s  63m59.88s
2          9751      162m31.17s  296m23.57s  60m19.34s
3          7218      120m17.75s  298m07.04s  67m27.90s
4          7196      119m56.49s  317m28.81s  131m05.36s
5          5756      95m55.22s   314m53.99s  101m20.27s
6          5348      89m07.90s   320m45.14s  110m36.11s
7          4533      75m32.59s   324m56.65s  96m35.54s
8          4486      74m45.29s   335m40.38s  112m00.79s
9          4803      80m02.76s   352m22.68s  145m18.24s
10         4294      71m33.42s   360m47.93s  132m11.32s
11         4208      70m07.91s   370m18.35s  141m51.38s
12         4380      72m59.61s   399m21.73s  163m25.45s

with the rwlocks diff & the futex diff:
=======================================

changing the number of threads, MAKE_JOBS=12
(real/user/sys are the time for lld libxul, in seconds)

lld threads  took (s)  real     user     sys
1            6764      3843.77  3360.01  460.74
2            4998      2040.58  3442.94  520.94
3            4554      1515.08  3664.65  635.89
4            4276      1213.37  3742.56  728.96
5            4293      1241.74  4261.79  1146.74
6            4121      1129.83  4689.45  1186.21
7            4101      1085.05  4877.47  1408.21
8            4479      1536.75  5639.95  4266.23

all  632m09.59s real  3164m59.79s user  1278m32.98s system

one can clearly see sys time going down..

fixed 5 threads for lld and varying MAKE_JOBS (sadly it rebooted at the
11th step so had a cold start..)
(real/user/system are the time for the whole build)

MAKE_JOBS  took (s)  real        user        system
1          19645     327m25.37s  328m36.28s  66m23.38s
2          10545     175m45.04s  322m25.95s  62m29.85s
3          8262      137m42.58s  329m12.21s  91m00.57s
4          6552      109m12.23s  319m55.94s  84m52.69s
5          5638      93m57.18s   317m59.67s  90m56.84s
6          4983      83m03.73s   311m47.01s  93m32.52s
7          4946      82m25.85s   334m05.81s  107m57.99s
8          4664      77m44.18s   336m03.62s  122m17.10s
9          4451      74m11.22s   351m13.66s  122m04.53s
10         4183      69m43.65s   358m26.97s  130m38.95s
11         4495      74m54.52s   396m28.60s  153m21.54s
12         4599      76m38.98s   373m50.41s  170m15.96s

i can't make much of those numbers. the machine was only building
firefox, so besides top running in tmux there should have been no
variation from the external environment, yet there are sometimes
inconsistencies i can't explain.
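
to put a number on the sys time change instead of eyeballing the columns,
a small helper along these lines could be used. this is only a sketch:
the function names to_seconds and pct_change are made up for
illustration, and it assumes each value is either plain seconds (the lld
columns) or the full XmY.Zs form printed by the shell's time keyword.

```shell
# Convert a time value to seconds.
# Accepts plain seconds ("1336.71") or minutes+seconds ("1285m49.42s").
to_seconds() {
	echo "$1" | awk -F'[ms]' '{
		if (NF > 1)
			printf "%.2f\n", $1 * 60 + $2	# XmY.Zs form
		else
			printf "%.2f\n", $1		# already seconds
	}'
}

# Print the relative change from the first value to the second,
# as a signed percentage (negative means the second is smaller).
pct_change() {
	awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
	    'BEGIN { printf "%+.1f%%\n", (b - a) / a * 100 }'
}
```

for example, 'pct_change 1336.71 635.89' compares the sys time for lld
with 3 threads before and after the diffs, and
'pct_change 1285m49.42s 1278m32.98s' does the same for the "all" totals.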

note that all those numbers are 'time make package', in which the make
extract/configure/fake/package steps are 100% sequential, so i'll rerun
the experiment timing only the make build step, which should give
'better' numbers as that step takes full advantage of parallelism.

i can also 100% see that it makes no sense to default MAKE_JOBS to
ncpuonline; it's way better to set PARALLEL_MAKE_JOBS to no more than
half of ncpuonline, at least that's what i ended up with on my build vm.

if one wants to reproduce the thread experiment, here's more or less the
script:

cd /usr/ports/www/mozilla-firefox
for N in $(jot 12) ; do
	make clean=all
	# match any number of digits so that two-digit values get replaced whole
	sed -i -e "s/--threads=[0-9]*/--threads=$N/" Makefile
	touch build-${N}-start
	time make package 2>&1 | tee build-${N}-end
done
make clean=all

and use 'sh -c "time env MAKE_JOBS=$N make package"' to measure build
time.

Landry