From: Stuart Henderson
Subject: Re: btrace, identifying syscall+0x5c4
To: tech
Date: Mon, 2 Dec 2024 23:43:09 +0000

On 2024/12/02 17:47, Martin Pieuchot wrote:
> On 02/12/24(Mon) 14:49, Stuart Henderson wrote:
> > On 2024/12/02 14:35, Stuart Henderson wrote:
> > > On 2024/12/02 15:12, Claudio Jeker wrote:
> > > > On Mon, Dec 02, 2024 at 01:53:02PM +0000, Stuart Henderson wrote:
> > > > > Thought I'd take a look at a kernel profile from a machine which is
> > > > > fairly "interesting" in terms of kernel spin. (Running -current).
> > > > >
> > > > > In the attached output from running "btrace kprofile.bt" I have a
> > > > > lot of time accounted for under syscall+0x5c4, which has _kernel_lock
> > > > > directly next to it in the traces, and I was wondering what that is.
> > > > > Bit confused by disasm of trap.o but my guess is that it might be
> > > > > a lock relating to dt(4) itself, is that right?
> > > >
> > > > I think this is the bit of syscall that grabs the kernel lock for any
> > > > syscall not marked NOLOCK. In your case this is mostly fstat*, open*
> > > > or the bits under syscall+0x5e7.
> > >
> > > Thank you.
> > >
> > > > Your system does a lot of parallel disk IO and that is all still using the
> > > > KERNEL_LOCK.
> > >
> > > oh, I should try "VFS syscalls & KERNEL_LOCK()" on this machine!
> >
> > ...and fwiw that looks like this:
>
> Interesting. What kind of workload is that? You have the same amount
> of contention spread across open(2), fstat(2), read(2) and close(2).

It's a monitoring box; the main thing causing load is librenms (php web +
cli programs). The main load generator is the every-5-minutes polling:
a php cli program run from cron which calls snmpget and snmpbulkwalk to
gather stats, using them to update mariadb and rrdtool databases
(mariadb/rrdtool are running on the same machine).

It also runs smokeping (fping and again rrdtool updates, but on a much
smaller scale) and icinga2 (various small check programs, mariadb updates,
redis updates), but those are lighter weight (and run by their own
schedulers with time delays spreading the load around more, rather than
being directly wall-clock synced as librenms is).

hw.ncpu=24, hw.ncpuonline=12 (Xeon E5-2630 v2).

> If fstat(2) is done on a pipe, socket or kqueue the updated diff below
> might help. It pushes the KERNEL_LOCK() down to vn_statfile() in this
> syscall.

The attached profile was created with this diff (the machine had been up
for a few hours before I applied it).

I then wondered what was running fstat(2) and tried ktrace on one of the
php librenms poller processes to see if it was that - quickly resulting
in a crash with

*cpu1: kernel diagnostic assertion "_kernel_lock_held()" failed: file
"/sys/kern/kern_ktrace.c", line 660

I haven't tried to trigger it again, but I suspect it will not be too
difficult - probably best if I don't trash the filesystems on this box
too badly if possible ;) If you can't reproduce it, let me know and I can
try to provoke it on a WITNESS kernel. (I didn't try ktrace with the
earlier diff, before the fstat unlock.)

I have screenshots with the rest of the details from the traces if wanted;
I didn't upload them yet. I didn't save other bits from ddb this time
(no 'show all procs', etc).

I'll leave it running overnight with that same kernel, including the fstat
unlock, and see if I run into any other issues (without poking it with
ktrace).
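For anyone reading along, my (possibly wrong) reading of why ktrace trips
this: the MI syscall entry only takes the kernel lock for syscalls whose
sysent entry isn't flagged SY_NOLOCK, while the ktrace write path still
asserts that the lock is held, so once fstat(2) runs unlocked, ktracing a
process that calls it can reach that assertion. Roughly, paraphrasing
sys/sys/syscall_mi.h and sys/kern/kern_ktrace.c from memory - this is a
sketch, not a diff, and the exact names and line numbers are approximate:

    /*
     * mi_syscall(): take the kernel lock around any syscall whose
     * sysent entry is not flagged SY_NOLOCK - this is the
     * _kernel_lock showing up under syscall+0x5c4 in the profile.
     */
    int lock = !(callp->sy_flags & SY_NOLOCK);

    if (lock)
            KERNEL_LOCK();
    error = (*callp->sy_call)(p, argp, retval);
    if (lock)
            KERNEL_UNLOCK();

    /*
     * ktrwriteraw() in kern_ktrace.c still expects its caller to hold
     * the kernel lock:
     */
    KERNEL_ASSERT_LOCKED();         /* KASSERT(_kernel_lock_held()) */

With the diff, sys_fstat() presumably reaches the ktrace KTR_STRUCT
record (ktrstat()/ktrstruct() -> ktrwriteraw()) without KERNEL_LOCK(),
hence the assertion firing as soon as the poller was ktraced.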
WaAnRiNcI:N Gk:e rSnNeLl NdOiTa gLnOoWsEtRiEcD aOsNs eSrYtSiCoAnL L" _5k e-d 3n3e1l8_2l9o7c6k0_ hEeXlIdT( )0" 9f i
Stopped at      savectx+0xae:   movl    $0,%gs:0x680

procs on cpu: (PRFLAGS all 0x2, PFLAGS 0 except for rrdtool 0x1)
  5 snmpget
  4 snmpget
  2 snmpbulkwalk
  6 snmpbulkwalk
* 3 snmpget
  7K snmpbulkwalk
 10 snmpget
  8 snmpget
 11 snmpget
  0 snmpbulkwalk
  9 php-8.3
  1 rrdtool

ddb{3}> show panic
*cpu1: kernel diagnostic assertion "_kernel_lock_held()" failed: file
"/sys/kern/kern_ktrace.c", line 660

ddb{3}> tr
savectx() at savectx+0xae
end of kernel
...

ddb{0}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
end of kernel

ddb{1}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
x86_bus_space_io_write_1+0x1d
pcdisplay_cursor+0xa6
wsemul_vt100_output+0xc7
wsdisplay_cnputc+0x65
cnputc+0x3b
db_putchar+0x485
kprintf+0x1351
db_printf+0x6d
panic+0xc2
__assert+0x29
ktrwriteraw+0x28f

ddb{2}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
alltraps+0x135
end of kernel

ddb{3} -> above

ddb{4}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xb2
doopenat+0x1ef
syscall+0x620
Xsyscall+0x128
end of kernel

ddb{5}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xa0
sys_fstat+0x98
syscall+0x620
Xsyscall+0x128
end of kernel

ddb{6}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
pagezero+0x21
uvm_pagealloc+0xe2
uvm_fault_lower+0x159
uvm_fault+0x1f7
upageflttrap+0x6c
usertrap+0x217
recall_trap+0x8

ddb{7}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
uvm_pmr_cache_put+0x206
uvm_km_pgremove_intrsafe+0x48
uvm_unmap_kill_entry_withlock+0xf2
uvm_unmap_remove+0x32f
uvm_unmap+0x7d
free+0x172
ufs_readdir+0x2c7
VOP_READDIR+0x46
sys_getdents+0x17a
syscall+0x5e7
Xsyscall+0x128
end of kernel

ddb{8}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xa9
doopenat+0x1ef
syscall+0x620
Xsyscall+0x128
end of kernel

ddb{9}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
uvm_fault+0x4c
upageflttrap+0x6c
usertrap+0x217
recall_trap+0x8
end of kernel

ddb{10}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xa6
vn_read+0x44
dofilereadv+0x163
sys_read+0x55
syscall+0x620
Xsyscall+0x128
end of kernel

ddb{11}> tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
end of kernel

ddb{12}> tr (same for 12-23, smt-disabled)
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
acpicpu_idle+0x239
end of kernel