Download raw body.
btrace, identifying syscall+0x5c4
On 2024/12/02 17:47, Martin Pieuchot wrote:
> On 02/12/24(Mon) 14:49, Stuart Henderson wrote:
> > On 2024/12/02 14:35, Stuart Henderson wrote:
> > > On 2024/12/02 15:12, Claudio Jeker wrote:
> > > > On Mon, Dec 02, 2024 at 01:53:02PM +0000, Stuart Henderson wrote:
> > > > > Thought I'd take a look at a kernel profile from a machine which is
> > > > > fairly "interesting" in terms of kernel spin. (Running -current).
> > > > >
> > > > > In the attached output from running "btrace kprofile.bt" I have a
> > > > > lot of time accounted for under syscall+0x5c4, which has _kernel_lock
> > > > > directly next to it in the traces, and I was wondering what that is.
> > > > > Bit confused by disasm of trap.o but my guess is that it might be
> > > > > a lock relating to dt(4) itself, is that right?
> > > >
> > > > I think this is the bit of syscall that grabs the kernel lock for any
> > > > syscall not marked NOLOCK. In your case this is mostly fstat*, open*
> > > > or the bits under syscall+0x5e7.
> > >
> > > Thank you.
> > >
> > > > Your system does a lot of parallel disk IO and that is all still using the
> > > > KERNEL_LOCK.
> > >
> > > oh, I should try "VFS syscalls & KERNEL_LOCK()" on this machine!
> >
> > ...and fwiw that looks like this:
>
> Interesting. What kind of workload is that? You have the same amount
> of contention spread across open(2), fstat(2), read(2) and close(2).
It's a monitoring box, the main thing causing load is librenms (php
web + cli programs). The main load generator is the every-5-mins
polling, this is a php cli program run from cron which calls
snmpget and snmpbulkwalk to gather stats, using them to update
mariadb and rrdtool databases (mariadb/rrdtool are running on the
same machine).
It also runs smokeping (fping and again rrdtool updates but on a much
smaller scale) and icinga2 (various small check programs, mariadb
updates, redis updates) but those are lighter weight (and run by their
own schedulers with time delays spreading around more, rather than
being directly wall-clock synced as with librenms).
hw.ncpu=24 hw.ncpuonline=12 (Xeon E5-2630 v2).
> If fstat(2) is done on a pipe, socket or kqueue the updated diff below
> might help. It pushes the KERNEL_LOCK() down to vn_statfile() in this
> syscall.
Attached profile was created with this diff (machine had been up for
a few hours before I did that).
I then wondered what was running fstat(2) and tried ktrace on one of
the php librenms poller processes to see if it was that - quickly
resulting in a crash with
*cpu1: kernel diagnostic assertion "_kernel_lock_held()" failed: file
"/sys/kern/kern_ktrace.c", line 660
I haven't tried to trigger again but I suspect it will not be too
difficult - probably best if I don't trash the filesystems on this
box too badly if possible ;) If you can't reproduce it let me know
and I can try to provoke it on a WITNESS kernel.
(I didn't try ktrace with the earlier diff before fstat unlock).
I have screenshots with the rest of the details from traces if
wanted, didn't upload them yet. I didn't save other bits from ddb
this time (no 'show all procs', etc).
I'll leave it running overnight with that same kernel including the
fstat unlock and see if I run into any other issues (without
poking it with ktrace).
WaAnRiNcI:N Gk:e rSnPeLl NdOiTa gLnOoWsEtRiEcD aOsNs eSrYtSiCoAnL L"
_5k e-d
3n3e1l8_2l9o7c6k0_ hEeXlIdT( )0" 9f
iStopped at savectx a
+0xae: movl $0,%gs:0x680
procs on cpu: (PRFLAGS all 0x2, PFLAGS 0 except for rrdtool 0x1)
5 snmpget
4 snmpget
2 snmpbulkwalk
6 snmpbulkwalk
* 3 snmpget
7K snmpbulkwalk
10 snmpget
8 snmpget
11 snmpget
0 snmpbulkwalk
9 php-8.3
1 rrdtool
ddb{3}> show panic
*cpu1: kernel diagnostic asserton "_kernel_lock_held()" failed: file
"/sys/kern/kern_ktrace.c", line 660
ddb{3}> tr
savectx() at savectx+0xae
end of kernel
...
ddb{0} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
end of kernel
ddb{1} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
x86_bus_space_io_write_1+0x1d
pcdisplay_cursor+0xa6
wsemul_vt100_output+0xc7
wsdisplay_cnputc+0x65
cnputc+0x3b
db_putchar+0x485
kprintf+0x1351
db_printf+0x6d
panic+0xc2
__asert+0x29
ktrwriteraw+0x28f
ddb{2} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
alltraps+0x135
end of kernel
ddb{3} -> above
ddb{4} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xb2
doopenat+0x1ef
syscall+0x620
Xsyscall+0x128
end of kernel
ddb{5} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xa0
sys_fstat+0x98
syscall+0x620
Xsyscall+0x128
end of kernel
ddb{6} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
pagezero+0x21
uvm_pagealloc+0xe2
uvm_fault_lower+0x159
uvm_fault+0x1f7
upageflttrap+0x6c
usertrap+0x217
recall_trap+0x8
ddb{7} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
uvm_pmr_cache_put+0x206
uvm_km_pgremove_intrsafe+0x48
uvm_unmap_kill_entry_withlock+0xf2
uvm_unmap_remove+0x32f
uvm_unmap+0x7d
free+0x172
ufs_readdir+0x2c7
VOP_READDIR+0x46
sys_getdents+0x17a
syscall+0x5e7
Xsyscall+0x128
end of kernel
ddb{8} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
_kernel_lock+0xa9
doopenat+0x1ef
syscall+0x620
Xsyscall+0x128
end of kernel
ddb{9} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
uvm_fault+0x4c
upageflttrap+0x6c
usertrap+0x217
recall_trap+0x8
end of kernel
ddb{10} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
krernel_lock+0xa6
vn_read+0x44
dofilereadv+0x163
sys_read+0x55
syscall+0x620
Xsyscall+0x128
end of kernel
ddb{11} tr
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
end of kernel
ddb{12} tr, same for 12-23 (smp-disabled)
x86_ipi_db+0x16
x86_ipi_handler+0x80
Xresume_lapic_ipi+0x27
acpicpu_idle+0x239
end of kernel
btrace, identifying syscall+0x5c4