Download raw body.
wired pages vs sysctl(2)
On 04/03/25(Tue) 14:45, Martin Pieuchot wrote:
> On 27/02/25(Thu) 14:44, Martin Pieuchot wrote:
> > Since the 90s sysctl(2) has been wiring pages that contain the syscall
> > arguments.
> >
> > In uvm/uvm_fault.c r1.129 kettenis@ fixed one of the races exposed by
> > syzkaller where a sibling thread munmap(2)ed the range currently wired
> > by sysctl(2).
> >
> > Sadly the fix is incomplete and the reproducer attached triggers the
> > following panic:
> >
> > panic: uvm_fault_unwire_locked: address not in map
> > Stopped at db_enter+0x14: popq %rbp
> > TID PID UID PRFLAGS PFLAGS CPU COMMAND
> > *129832 2744 0 0x3 0 0 sysctl_wired
> > 84116 96377 0 0x10 0x8 2 sshd-session
> > 6897 50184 0 0x14000 0x200 1 softnet0
> > db_enter() at db_enter+0x14
> > panic(ffffffff8261d529) at panic+0xdd
> > uvm_fault_unwire_locked(fffffd878156ab58,20000000,20001000) at uvm_fault_unwire
> > _locked+0x246
> > uvm_fault_unwire(fffffd878156ab58,20000000,20001000) at uvm_fault_unwire+0x44
> > sysctl_vsunlock(20000040,0) at sysctl_vsunlock+0x46
> > net_sysctl(ffff800055d411b4,3,20000040,ffff800055d411e8,20000080,c,cf641d03c835
> > bbeb) at net_sysctl+0x309
> > sys_sysctl(ffff800055c99c78,ffff800055d41300,ffff800055d41270) at sys_sysctl+0x
> > 1f8
> > syscall(ffff800055d41300) at syscall+0x5ec
> > Xsyscall() at Xsyscall+0x128
> >
> > Syzkaller has two reports for this:
> > https://syzkaller.appspot.com/bug?extid=d9b926edfd5f64a66c58
> > https://syzkaller.appspot.com/bug?extid=a12540517e3a76a6a490
> >
> > It is not clear to my why vslock(9) is needed today. That said the diff
> > below completes kettenis@'s approach by making vslock similar to mlock(2).
> > With it, vsunlock() now silently fails without crashing the kernel and I
> > can no longer reproduce the panic.
>
> Note that there's another issue with the logic that wires pages during
> sysctl(2). If a page has been previously mlock(2)ed and we pass it as
> arguments to sysctl(2), it will be munlock(2)ed when returning to
> userland.
What I said is not true. The page will still be wired because map
entries are properly refcounted in uvm_map_pageable_wire().
With this diff the race left is the possibility to unwire a page wired
by the kernel via sysctl(2).
> As long as we need vslock(9), which is something that I don't know, might
> want to distinguish between userland vs kernel wiring like FreeBSD do.
I'm not so sure this is strictly needed. If thread A calls sysctl(2)
and thread B does munlock(2) on the same page, then the output of
sysctl(2) might not be coherent. I believe this race is acceptable.
Here's the diff again, so ok?
Index: uvm/uvm_fault.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_fault.c,v
diff -u -p -r1.165 uvm_fault.c
--- uvm/uvm_fault.c 10 Mar 2025 14:13:58 -0000 1.165
+++ uvm/uvm_fault.c 12 Mar 2025 14:18:22 -0000
@@ -1737,16 +1737,12 @@ uvm_fault_unwire_locked(vm_map_t map, va
* find the map entry for the current address.
*/
KASSERT(va >= entry->start);
- while (entry && va >= entry->end) {
+ while (va >= entry->end) {
next = RBT_NEXT(uvm_map_addr, entry);
+ KASSERT(next != NULL && next->start <= entry->end);
entry = next;
}
- if (entry == NULL)
- return;
- if (va < entry->start)
- continue;
-
/*
* lock it.
*/
@@ -1776,7 +1772,7 @@ uvm_fault_unwire_locked(vm_map_t map, va
}
if (oentry != NULL) {
- uvm_map_unlock_entry(oentry);
+ uvm_map_unlock_entry(entry);
}
}
Index: uvm/uvm_glue.c
===================================================================
RCS file: /cvs/src/sys/uvm/uvm_glue.c,v
diff -u -p -r1.87 uvm_glue.c
--- uvm/uvm_glue.c 28 Oct 2024 08:25:32 -0000 1.87
+++ uvm/uvm_glue.c 12 Mar 2025 14:18:22 -0000
@@ -114,7 +114,7 @@ uvm_vslock(struct proc *p, caddr_t addr,
if (end <= start)
return (EINVAL);
- return uvm_fault_wire(map, start, end, access_type);
+ return uvm_map_pageable(map, start, end, FALSE, 0);
}
/*
@@ -125,13 +125,14 @@ uvm_vslock(struct proc *p, caddr_t addr,
void
uvm_vsunlock(struct proc *p, caddr_t addr, size_t len)
{
+ struct vm_map *map = &p->p_vmspace->vm_map;
vaddr_t start, end;
start = trunc_page((vaddr_t)addr);
end = round_page((vaddr_t)addr + len);
KASSERT(end > start);
- uvm_fault_unwire(&p->p_vmspace->vm_map, start, end);
+ uvm_map_pageable(map, start, end, TRUE, 0);
}
/*
wired pages vs sysctl(2)