timing lld --threads for fun and profit
hi,
so www/mozilla-firefox spends a considerable amount of time linking
libxul.so, which is the monster library containing everything.
right now the port uses:
CONFIGURE_ENV += LDFLAGS="-Wl,--threads=${MAKE_JOBS} --ld-path=${WRKDIR}/bin/ld"
which tells lld "use as many threads as you have cpus". mpi@ made me
look at top -H during linking on 8-, 12- and 16-core machines, and we saw
that only 20% of the time spent by all cores was actual "user" time; the
rest was wasted in %sys and %spin.
so i instrumented the ld wrapper to actually time the lld process via a
simple bsd.port.mk patch:
-_LD_PROGRAM = /usr/bin/ld.lld
+_LD_PROGRAM = /usr/bin/time /usr/bin/ld.lld
and ran a firefox build loop, giving lld from 1 to 8 threads.
this gave me these numbers (thread count, then the /usr/bin/time output in seconds):
- on an amd64 kvm VM with 16 cores on slow/old hw:
1 7148.53 real 6003.69 user 1037.50 sys
2 4365.36 real 6755.63 user 1665.59 sys
3 3263.71 real 7589.08 user 1613.45 sys
4 2626.83 real 7903.39 user 1841.40 sys
5 2847.14 real 9818.02 user 2420.12 sys
6 2883.57 real 11454.10 user 3338.28 sys
7 2426.03 real 10754.16 user 3358.06 sys
8 2334.40 real 11463.60 user 3788.33 sys
- on the 8-core arm64 x13s:
1 5173.76 real 4538.51 user 613.66 sys
2 2750.27 real 4693.69 user 688.53 sys
3 1768.95 real 4382.10 user 717.15 sys
4 1407.99 real 4566.04 user 759.40 sys
5 1146.45 real 4438.68 user 856.93 sys
6 1152.81 real 4964.03 user 1228.58 sys
7 1466.82 real 6393.65 user 2460.71 sys
8 1235.53 real 6000.14 user 2420.20 sys
- on the 12-core arm64 omnibook X14:
1 3984.44 real 3493.92 user 468.64 sys
2 2119.35 real 3582.04 user 536.62 sys
3 1522.39 real 3660.79 user 663.06 sys
4 1572.70 real 4227.21 user 1279.55 sys
5 1097.43 real 4091.75 user 876.22 sys
6 1085.70 real 4326.69 user 1289.30 sys
7 1369.48 real 4988.69 user 2934.72 sys
8 1222.77 real 5150.22 user 2945.51 sys
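To make the comparison explicit, the timing lines can be turned into a wall-clock speedup and a parallel efficiency (speedup divided by thread count) relative to the 1-thread run. This is a minimal Python sketch using the x13s numbers above; the function names are illustrative, and it assumes the first column is the thread count given to lld:

```python
# Timing lines for the 8-core arm64 x13s, copied from the list above:
# "<threads> <real> real <user> user <sys> sys", times in seconds.
X13S = """\
1 5173.76 real 4538.51 user 613.66 sys
2 2750.27 real 4693.69 user 688.53 sys
3 1768.95 real 4382.10 user 717.15 sys
4 1407.99 real 4566.04 user 759.40 sys
5 1146.45 real 4438.68 user 856.93 sys
6 1152.81 real 4964.03 user 1228.58 sys
7 1466.82 real 6393.65 user 2460.71 sys
8 1235.53 real 6000.14 user 2420.20 sys
"""


def parse(block):
    """Map thread count -> (real, user, sys) seconds."""
    runs = {}
    for line in block.splitlines():
        fields = line.split()
        if not fields:
            continue
        # fields: [threads, real, "real", user, "user", sys, "sys"]
        runs[int(fields[0])] = (float(fields[1]), float(fields[3]), float(fields[5]))
    return runs


def speedups(runs):
    """Wall-clock speedup of each run relative to the 1-thread run."""
    base = runs[1][0]
    return {n: base / real for n, (real, _user, _sys) in sorted(runs.items())}


if __name__ == "__main__":
    runs = parse(X13S)
    for n, s in speedups(runs).items():
        print(f"{n} threads: {s:.2f}x speedup, {s / n:.0%} efficiency")
    print("fastest run:", min(runs, key=lambda n: runs[n][0]), "threads")
```

On the x13s data this picks 5 threads, at roughly 90% efficiency; the 6-, 7- and 8-thread runs are all slower than the 5-thread one.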
from those numbers, the 'sweet spot' for lld apparently is to use no
more than 5 threads: above that there's too much cpu contention, sys
time grows much faster than the real time saved, and we're wasting too
much time.
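The contention shows up in the sys column: as threads are added, total cpu time (user + sys) keeps growing and a larger share of it is spent in sys. A rough Python sketch computing both from the omnibook X14 numbers above (names are illustrative):

```python
# Timing lines for the 12-core arm64 omnibook X14, copied from the list above.
OMNIBOOK = """\
1 3984.44 real 3493.92 user 468.64 sys
2 2119.35 real 3582.04 user 536.62 sys
3 1522.39 real 3660.79 user 663.06 sys
4 1572.70 real 4227.21 user 1279.55 sys
5 1097.43 real 4091.75 user 876.22 sys
6 1085.70 real 4326.69 user 1289.30 sys
7 1369.48 real 4988.69 user 2934.72 sys
8 1222.77 real 5150.22 user 2945.51 sys
"""


def cpu_breakdown(block):
    """Map thread count -> (user + sys seconds, fraction of that spent in sys)."""
    out = {}
    for line in block.splitlines():
        fields = line.split()
        if not fields:
            continue
        threads, user, sys_time = int(fields[0]), float(fields[3]), float(fields[5])
        cpu = user + sys_time
        out[threads] = (cpu, sys_time / cpu)
    return out


if __name__ == "__main__":
    for n, (cpu, share) in sorted(cpu_breakdown(OMNIBOOK).items()):
        print(f"{n} threads: {cpu:8.2f}s cpu, {share:.0%} in sys")
```

Going from 1 to 8 threads roughly doubles the total cpu time (about 3963s to 8096s), while the sys share rises from about 12% to about 36%.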
someone(tm) should look into patching lld to avoid using more than
MIN(ncpu,5) threads? In the meantime, i'll probably fix the firefox
ports to stop passing MAKE_JOBS to lld and cap it at 5 instead.
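Capping means taking the minimum: MIN(ncpu, 5) gives 5 on a 16-cpu machine, whereas MAX(ncpu, 5) would still give 16. Here is a minimal Python sketch of deriving the value for lld's --threads option. The function names and the cap constant are hypothetical, os.cpu_count() stands in for MAKE_JOBS as an assumption, and an actual port change would live in the port's Makefile rather than in Python:

```python
import os

LLD_THREAD_CAP = 5  # hypothetical constant, the 'sweet spot' measured above


def lld_threads(jobs=None, cap=LLD_THREAD_CAP):
    """Threads to hand to lld: MIN(jobs, cap), and never fewer than 1.

    jobs defaults to os.cpu_count(), used here as a stand-in for MAKE_JOBS.
    """
    if jobs is None:
        jobs = os.cpu_count() or 1
    return max(1, min(jobs, cap))


def lld_ldflag(jobs=None):
    """Compiler-driver flag forwarding the capped thread count to lld."""
    return f"-Wl,--threads={lld_threads(jobs)}"


if __name__ == "__main__":
    for ncpu in (4, 8, 12, 16):
        print(ncpu, "cpus ->", lld_ldflag(ncpu))
```

Machines with 5 or fewer cpus keep using all of them; anything larger gets 5 threads.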
Landry