From: "Theo de Raadt" Subject: openat(2) is mostly useless, sadly To: tech@cvs.openbsd.org Date: Wed, 28 May 2025 08:03:29 -0600 The family of system calls related to openat(2) are mostly useless in practice, rarely used. When they are used it is often ineffectively or even with performance-reducing results. int openat(int fd, const char *path, int flags, ...); These are the others: sys_fstatat sys_utimensat sys_chflagsat sys_pathconfat sys_faccessat sys_fchmodat sys_fchownat sys_linkat sys_mkdirat sys_mkfifoat sys_mknodat sys_readlinkat sys_renameat sys_symlinkat sys_unlinkat The idea is that you can open a directory as fd, typically using O_DIRECTORY, and then do relative accesses. This will reduce lookups, and corresponding locking operations in the kernel. In practice two things get in the way, as POSIX specs say: The openat() function shall be equivalent to the open() function except in the case where path specifies a relative path. 1) What if it is not a relative path, meaning /etc/passwd? openat(herefd, "/etc/passwd, O_RDONLY) will open that file and completely ignore herefd. 2) What if the relative path is upwards, meaning "../../../../something". It walks up the path, and opens it. To keep it simple, these calls were not designed to assist any security model. Both FreeBSD and Linux have designed variations which do this. Since all the *at(2) functions have a flags parameter, their strategy was to add an additional flag which didn't allow upwards traversal. I think that misses the point, and have a different proposal. Let's create directory fd's which cannot traverse upwards. Mark the object, instead of requiring a programmer to put a flag on every system call acting upon the object. Two operational flags are added, O_BELOW and F_BELOW. Creating such a locked directory fd is done with either dirfd = open("path", O_DIRECTORY | O_BELOW); or you can lock a pre-existing dirfd: fcntl(dirfd, F_BELOW); This dirfd has two charactistics. Absolute accesses always fail with ENOENT. Relative accesses that attempt to traverse upwards fail with ENOENT. You can openat(dirfd, ".") but you cannot openat(dirfd, ".."). Code using readdir() or similar must be careful because they will be provided with "." and ".." but operations on ".." will now fail. --- An interesting use case shows up that this is a tiny bit like a chroot() system call allowed for non-root users. You can dirfd = open("path", O_DIRECTORY | O_BELOW); fchdir(dirfd); Your process is now contained inside that directory. This does not have the classic risks that prevented providing chroot() to regular processes (meaning, the opening of absolute paths inside the chroot could confuse library functions because they are now inspecting the user-created files, and the consequences of this were considered too grave). Absolute paths accessses with open() start at the process current directory, and now fail. I have not explored this regular user chroot-like thing extensively yet. Some semantic changes maybe be desired. There's a chance that this becomes something we want to use in many daemons instead of chroot(). This is just a draft. The main idea comes out of review one program which uses openat() strangely, and wondering if we can do pathname containment better in the kernel. This can work nicely alongside unveil(), but it is cheaper because the kernel doesn't need to hold references to vnodes like unveil() does. Index: lib/libc/sys/fcntl.2 =================================================================== RCS file: /cvs/src/lib/libc/sys/fcntl.2,v diff -u -p -u -r1.36 fcntl.2 --- lib/libc/sys/fcntl.2 29 Dec 2022 02:12:41 -0000 1.36 +++ lib/libc/sys/fcntl.2 24 May 2025 04:12:48 -0000 @@ -61,6 +61,14 @@ by others (see below), and ignored by th .Pp The commands are: .Bl -tag -width F_DUPFD_CLOEXEC +.It Dv F_BELOW +Locks a directory file descriptor +.Ar fd +so that +.Xr openat 2 +and related functions cannot perform relative accesses which are above +the directory itself. +Absolute path accesses are also blocked on this mode. .It Dv F_DUPFD Return a new descriptor as follows: .Pp Index: lib/libc/sys/open.2 =================================================================== RCS file: /cvs/src/lib/libc/sys/open.2,v diff -u -p -u -r1.51 open.2 --- lib/libc/sys/open.2 31 Mar 2022 17:27:16 -0000 1.51 +++ lib/libc/sys/open.2 24 May 2025 04:21:29 -0000 @@ -85,6 +85,15 @@ Any combination of the following flags m Do not block on open or for data to become available. .It Dv O_APPEND Append on each write. +.It Dv O_BELOW +Used alongside +.Dv O_DIRECTORY , +sets the +.Dv F_BELOW +flag which constrains relative accesses with +.Xr openat 2 . +Described further in +.Xr fcntl 2 . .It Dv O_CREAT Create file if it does not exist. An additional argument of type Index: sys/kern/kern_descrip.c =================================================================== RCS file: /cvs/src/sys/kern/kern_descrip.c,v diff -u -p -u -r1.211 kern_descrip.c --- sys/kern/kern_descrip.c 10 May 2025 09:44:39 -0000 1.211 +++ sys/kern/kern_descrip.c 24 May 2025 04:23:46 -0000 @@ -421,6 +421,14 @@ restart: return (EBADF); switch (SCARG(uap, cmd)) { + case F_BELOW: + vp = fp->f_data; + if (fp->f_type == DTYPE_VNODE && (vp->v_type == VDIR)) + fp->f_flag |= O_BELOW; + else + return ENOTTY; + break; + case F_DUPFD: case F_DUPFD_CLOEXEC: newmin = (long)SCARG(uap, arg); Index: sys/kern/vfs_lookup.c =================================================================== RCS file: /cvs/src/sys/kern/vfs_lookup.c,v diff -u -p -u -r1.88 vfs_lookup.c --- sys/kern/vfs_lookup.c 6 Jan 2023 19:08:36 -0000 1.88 +++ sys/kern/vfs_lookup.c 24 May 2025 05:06:00 -0000 @@ -125,6 +125,7 @@ namei(struct nameidata *ndp) int error, linklen; struct componentname *cnp = &ndp->ni_cnd; struct proc *p = cnp->cn_proc; + struct file *fp; ndp->ni_cnd.cn_cred = ndp->ni_cnd.cn_proc->p_ucred; #ifdef DIAGNOSTIC @@ -204,10 +205,30 @@ namei(struct nameidata *ndp) goto fail; } + if (ndp->ni_dirfd != AT_FDCWD) { + fp = fd_getfile(fdp, ndp->ni_dirfd); + if (fp == NULL) { + error = EBADF; + goto fail; + } + /* O_BELOW directory only allows "." or below */ + if (fp->f_flag & O_BELOW) { + /* absolute '/' paths blocked immediately */ + if (cnp->cn_pnbuf[0] == '/') { + error = ENOENT; + goto fail; + } + /* relative paths crossing upwards blocked in vfs_lookup */ + ndp->ni_rootdir = fp->f_data; + ndp->ni_cnd.cn_flags |= STAYBELOW; + } + } + /* * Check if starting from root directory or current directory. */ - if (cnp->cn_pnbuf[0] == '/') { + if ((ndp->ni_cnd.cn_flags & STAYBELOW) == 0 && + cnp->cn_pnbuf[0] == '/') { dp = ndp->ni_rootdir; vref(dp); if (cnp->cn_flags & REALPATH && cnp->cn_rpi == 0) { @@ -221,11 +242,7 @@ namei(struct nameidata *ndp) unveil_start_relative(p, ndp, dp); unveil_check_component(p, ndp, dp); } else { - struct file *fp = fd_getfile(fdp, ndp->ni_dirfd); - if (fp == NULL) { - error = EBADF; - goto fail; - } + /* fp figured out earlier */ dp = (struct vnode *)fp->f_data; if (fp->f_type != DTYPE_VNODE || dp->v_type != VDIR) { FRELE(fp, p); @@ -322,6 +339,10 @@ badlink: */ if (cnp->cn_pnbuf[0] == '/') { vrele(dp); + if (ndp->ni_cnd.cn_flags & STAYBELOW) { + error = ENOENT; + goto fail; + } dp = ndp->ni_rootdir; vref(dp); ndp->ni_unveil_match = NULL; @@ -536,6 +557,11 @@ dirloop: */ if (cnp->cn_flags & ISDOTDOT) { for (;;) { + if (dp == ndp->ni_rootdir && + (ndp->ni_cnd.cn_flags & STAYBELOW)) { + error = ENOENT; + goto bad; + } if (dp == ndp->ni_rootdir || dp == rootvnode) { ndp->ni_dvp = dp; ndp->ni_vp = dp; Index: sys/kern/vfs_vnops.c =================================================================== RCS file: /cvs/src/sys/kern/vfs_vnops.c,v diff -u -p -u -r1.125 vfs_vnops.c --- sys/kern/vfs_vnops.c 6 Jan 2025 08:57:23 -0000 1.125 +++ sys/kern/vfs_vnops.c 24 May 2025 04:08:04 -0000 @@ -99,6 +99,8 @@ vn_open(struct nameidata *ndp, int fmode return (EINVAL); if ((fmode & (O_TRUNC | FWRITE)) == O_TRUNC) return (EINVAL); + if ((fmode & (O_BELOW | O_DIRECTORY)) == O_BELOW) + return (EINVAL); if (fmode & O_CREAT) { ndp->ni_cnd.cn_nameiop = CREATE; ndp->ni_cnd.cn_flags |= LOCKPARENT | LOCKLEAF; Index: sys/sys/fcntl.h =================================================================== RCS file: /cvs/src/sys/sys/fcntl.h,v diff -u -p -u -r1.22 fcntl.h --- sys/sys/fcntl.h 21 Jan 2019 18:09:21 -0000 1.22 +++ sys/sys/fcntl.h 24 May 2025 00:33:56 -0000 @@ -84,6 +84,7 @@ #define O_ASYNC 0x0040 /* signal pgrp when data ready */ #define O_FSYNC 0x0080 /* backwards compatibility */ #define O_NOFOLLOW 0x0100 /* if path is a symlink, don't follow */ +#define O_BELOW 0x40000 /* openat(2) cannot open above */ #endif #if __POSIX_VISIBLE >= 199309 || __XPG_VISIBLE >= 420 #define O_SYNC 0x0080 /* synchronous writes */ @@ -116,7 +117,7 @@ #define OFLAGS(fflags) (((fflags) & ~O_ACCMODE) | (((fflags) - 1) & O_ACCMODE)) /* bits to save after open */ -#define FMASK (FREAD|FWRITE|FAPPEND|FASYNC|FFSYNC|FNONBLOCK) +#define FMASK (FREAD|FWRITE|FAPPEND|FASYNC|FFSYNC|FNONBLOCK|O_BELOW) /* bits settable by fcntl(F_SETFL, ...) */ #define FCNTLFLAGS (FAPPEND|FASYNC|FFSYNC|FNONBLOCK) #endif @@ -157,6 +158,7 @@ #endif #if __BSD_VISIBLE #define F_ISATTY 11 /* used by isatty(3) */ +#define F_BELOW 12 /* openat(2) cannot open above */ #endif /* file descriptor flags (F_GETFD, F_SETFD) */ Index: sys/sys/namei.h =================================================================== RCS file: /cvs/src/sys/sys/namei.h,v diff -u -p -u -r1.50 namei.h --- sys/sys/namei.h 11 Jan 2022 23:59:55 -0000 1.50 +++ sys/sys/namei.h 24 May 2025 05:05:29 -0000 @@ -144,6 +144,7 @@ struct nameidata { #define ISLASTCN 0x008000 /* this is last component of pathname */ #define ISSYMLINK 0x010000 /* symlink needs interpretation */ #define REALPATH 0x020000 /* save pathname buffer for realpath */ +#define STAYBELOW 0x040000 /* on a dir, prohibit going below ".." of root */ #define REQUIREDIR 0x080000 /* must be a directory */ #define STRIPSLASHES 0x100000 /* strip trailing slashes */ #define PDIRUNLOCK 0x200000 /* vfs_lookup() unlocked parent dir */