Index | Thread | Search

From:
"Theo de Raadt" <deraadt@openbsd.org>
Subject:
openat(2) is mostly useless, sadly
To:
tech@cvs.openbsd.org
Date:
Wed, 28 May 2025 08:03:29 -0600

Download raw body.

Thread
The family of system calls related to openat(2) are mostly useless in
practice, rarely used. When they are used it is often ineffectively or
even with performance-reducing results.

     int
     openat(int fd, const char *path, int flags, ...);

These are the others:

    sys_fstatat sys_utimensat sys_chflagsat sys_pathconfat sys_faccessat
    sys_fchmodat sys_fchownat sys_linkat sys_mkdirat sys_mkfifoat
    sys_mknodat sys_readlinkat sys_renameat sys_symlinkat sys_unlinkat

The idea is that you can open a directory as fd, typically using O_DIRECTORY,
and then do relative accesses.  This will reduce lookups, and corresponding
locking operations in the kernel.  In practice two things get in the way, as
POSIX specs say:

    The openat() function shall be equivalent to the open() function except
    in the case where path specifies a relative path.

1) What if it is not a relative path, meaning /etc/passwd?
   openat(herefd, "/etc/passwd, O_RDONLY) will open that file and completely
   ignore herefd.

2) What if the relative path is upwards, meaning "../../../../something".
   It walks up the path, and opens it.

To keep it simple, these calls were not designed to assist any security
model.

Both FreeBSD and Linux have designed variations which do this.  Since all
the *at(2) functions have a flags parameter, their strategy was to add an
additional flag which didn't allow upwards traversal.  I think that misses
the point, and have a different proposal.

Let's create directory fd's which cannot traverse upwards.  Mark the object,
instead of requiring a programmer to put a flag on every system call acting
upon the object.  Two operational flags are added, O_BELOW and F_BELOW.

Creating such a locked directory fd is done with either

     dirfd = open("path", O_DIRECTORY | O_BELOW);

or you can lock a pre-existing dirfd:

     fcntl(dirfd, F_BELOW);

This dirfd has two charactistics.  Absolute accesses always fail with ENOENT.
Relative accesses that attempt to traverse upwards fail with ENOENT.
You can openat(dirfd, ".") but you cannot openat(dirfd, "..").

Code using readdir() or similar must be careful because they will be provided
with "." and ".." but operations on ".." will now fail.

---
An interesting use case shows up that this is a tiny bit like a chroot()
system call allowed for non-root users.  You can
       
     dirfd = open("path", O_DIRECTORY | O_BELOW);
     fchdir(dirfd);

Your process is now contained inside that directory.  This does not
have the classic risks that prevented providing chroot() to regular
processes (meaning, the opening of absolute paths inside the chroot
could confuse library functions because they are now inspecting the
user-created files, and the consequences of this were considered too
grave).  Absolute paths accessses with open() start at the process
current directory, and now fail.  I have not explored this regular
user chroot-like thing extensively yet.  Some semantic changes maybe
be desired.  There's a chance that this becomes something we want
to use in many daemons instead of chroot().

This is just a draft.  The main idea comes out of review one program
which uses openat() strangely, and wondering if we can do pathname
containment better in the kernel.  This can work nicely alongside unveil(),
but it is cheaper because the kernel doesn't need to hold references to
vnodes like unveil() does.

Index: lib/libc/sys/fcntl.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/fcntl.2,v
diff -u -p -u -r1.36 fcntl.2
--- lib/libc/sys/fcntl.2	29 Dec 2022 02:12:41 -0000	1.36
+++ lib/libc/sys/fcntl.2	24 May 2025 04:12:48 -0000
@@ -61,6 +61,14 @@ by others (see below), and ignored by th
 .Pp
 The commands are:
 .Bl -tag -width F_DUPFD_CLOEXEC
+.It Dv F_BELOW
+Locks a directory file descriptor
+.Ar fd
+so that
+.Xr openat 2
+and related functions cannot perform relative accesses which are above
+the directory itself.
+Absolute path accesses are also blocked on this mode.
 .It Dv F_DUPFD
 Return a new descriptor as follows:
 .Pp
Index: lib/libc/sys/open.2
===================================================================
RCS file: /cvs/src/lib/libc/sys/open.2,v
diff -u -p -u -r1.51 open.2
--- lib/libc/sys/open.2	31 Mar 2022 17:27:16 -0000	1.51
+++ lib/libc/sys/open.2	24 May 2025 04:21:29 -0000
@@ -85,6 +85,15 @@ Any combination of the following flags m
 Do not block on open or for data to become available.
 .It Dv O_APPEND
 Append on each write.
+.It Dv O_BELOW
+Used alongside
+.Dv O_DIRECTORY ,
+sets the
+.Dv F_BELOW
+flag which constrains relative accesses with
+.Xr openat 2 .
+Described further in
+.Xr fcntl 2 .
 .It Dv O_CREAT
 Create file if it does not exist.
 An additional argument of type
Index: sys/kern/kern_descrip.c
===================================================================
RCS file: /cvs/src/sys/kern/kern_descrip.c,v
diff -u -p -u -r1.211 kern_descrip.c
--- sys/kern/kern_descrip.c	10 May 2025 09:44:39 -0000	1.211
+++ sys/kern/kern_descrip.c	24 May 2025 04:23:46 -0000
@@ -421,6 +421,14 @@ restart:
 		return (EBADF);
 	switch (SCARG(uap, cmd)) {
 
+	case F_BELOW:
+		vp = fp->f_data;
+	        if (fp->f_type == DTYPE_VNODE && (vp->v_type == VDIR))
+			fp->f_flag |= O_BELOW;
+		else
+			return ENOTTY;
+		break;
+
 	case F_DUPFD:
 	case F_DUPFD_CLOEXEC:
 		newmin = (long)SCARG(uap, arg);
Index: sys/kern/vfs_lookup.c
===================================================================
RCS file: /cvs/src/sys/kern/vfs_lookup.c,v
diff -u -p -u -r1.88 vfs_lookup.c
--- sys/kern/vfs_lookup.c	6 Jan 2023 19:08:36 -0000	1.88
+++ sys/kern/vfs_lookup.c	24 May 2025 05:06:00 -0000
@@ -125,6 +125,7 @@ namei(struct nameidata *ndp)
 	int error, linklen;
 	struct componentname *cnp = &ndp->ni_cnd;
 	struct proc *p = cnp->cn_proc;
+	struct file *fp;
 
 	ndp->ni_cnd.cn_cred = ndp->ni_cnd.cn_proc->p_ucred;
 #ifdef DIAGNOSTIC
@@ -204,10 +205,30 @@ namei(struct nameidata *ndp)
 			goto fail;
 	}
 
+	if (ndp->ni_dirfd != AT_FDCWD) {
+		fp = fd_getfile(fdp, ndp->ni_dirfd);
+		if (fp == NULL) {
+			error = EBADF;
+			goto fail;
+		}
+		/* O_BELOW directory only allows "." or below */ 
+		if (fp->f_flag & O_BELOW) {
+			/* absolute '/' paths blocked immediately */
+			if (cnp->cn_pnbuf[0] == '/') {
+				error = ENOENT;
+				goto fail;
+			}
+			/* relative paths crossing upwards blocked in vfs_lookup */
+			ndp->ni_rootdir = fp->f_data;
+			ndp->ni_cnd.cn_flags |= STAYBELOW;
+		}
+	}
+
 	/*
 	 * Check if starting from root directory or current directory.
 	 */
-	if (cnp->cn_pnbuf[0] == '/') {
+	if ((ndp->ni_cnd.cn_flags & STAYBELOW) == 0 &&
+	    cnp->cn_pnbuf[0] == '/') {
 		dp = ndp->ni_rootdir;
 		vref(dp);
 		if (cnp->cn_flags & REALPATH && cnp->cn_rpi == 0) {
@@ -221,11 +242,7 @@ namei(struct nameidata *ndp)
 		unveil_start_relative(p, ndp, dp);
 		unveil_check_component(p, ndp, dp);
 	} else {
-		struct file *fp = fd_getfile(fdp, ndp->ni_dirfd);
-		if (fp == NULL) {
-			error = EBADF;
-			goto fail;
-		}
+		/* fp figured out earlier */ 
 		dp = (struct vnode *)fp->f_data;
 		if (fp->f_type != DTYPE_VNODE || dp->v_type != VDIR) {
 			FRELE(fp, p);
@@ -322,6 +339,10 @@ badlink:
 		 */
 		if (cnp->cn_pnbuf[0] == '/') {
 			vrele(dp);
+			if (ndp->ni_cnd.cn_flags & STAYBELOW) {
+				error = ENOENT;
+				goto fail;
+			}
 			dp = ndp->ni_rootdir;
 			vref(dp);
 			ndp->ni_unveil_match = NULL;
@@ -536,6 +557,11 @@ dirloop:
 	 */
 	if (cnp->cn_flags & ISDOTDOT) {
 		for (;;) {
+			if (dp == ndp->ni_rootdir &&
+			    (ndp->ni_cnd.cn_flags & STAYBELOW)) {
+				error = ENOENT;
+				goto bad;
+			}
 			if (dp == ndp->ni_rootdir || dp == rootvnode) {
 				ndp->ni_dvp = dp;
 				ndp->ni_vp = dp;
Index: sys/kern/vfs_vnops.c
===================================================================
RCS file: /cvs/src/sys/kern/vfs_vnops.c,v
diff -u -p -u -r1.125 vfs_vnops.c
--- sys/kern/vfs_vnops.c	6 Jan 2025 08:57:23 -0000	1.125
+++ sys/kern/vfs_vnops.c	24 May 2025 04:08:04 -0000
@@ -99,6 +99,8 @@ vn_open(struct nameidata *ndp, int fmode
 		return (EINVAL);
 	if ((fmode & (O_TRUNC | FWRITE)) == O_TRUNC)
 		return (EINVAL);
+	if ((fmode & (O_BELOW | O_DIRECTORY)) == O_BELOW)
+		return (EINVAL);
 	if (fmode & O_CREAT) {
 		ndp->ni_cnd.cn_nameiop = CREATE;
 		ndp->ni_cnd.cn_flags |= LOCKPARENT | LOCKLEAF;
Index: sys/sys/fcntl.h
===================================================================
RCS file: /cvs/src/sys/sys/fcntl.h,v
diff -u -p -u -r1.22 fcntl.h
--- sys/sys/fcntl.h	21 Jan 2019 18:09:21 -0000	1.22
+++ sys/sys/fcntl.h	24 May 2025 00:33:56 -0000
@@ -84,6 +84,7 @@
 #define	O_ASYNC		0x0040		/* signal pgrp when data ready */
 #define	O_FSYNC		0x0080		/* backwards compatibility */
 #define	O_NOFOLLOW	0x0100		/* if path is a symlink, don't follow */
+#define	O_BELOW		0x40000		/* openat(2) cannot open above */
 #endif
 #if __POSIX_VISIBLE >= 199309 || __XPG_VISIBLE >= 420
 #define	O_SYNC		0x0080		/* synchronous writes */
@@ -116,7 +117,7 @@
 #define	OFLAGS(fflags)	(((fflags) & ~O_ACCMODE) | (((fflags) - 1) & O_ACCMODE))
 
 /* bits to save after open */
-#define	FMASK		(FREAD|FWRITE|FAPPEND|FASYNC|FFSYNC|FNONBLOCK)
+#define	FMASK		(FREAD|FWRITE|FAPPEND|FASYNC|FFSYNC|FNONBLOCK|O_BELOW)
 /* bits settable by fcntl(F_SETFL, ...) */
 #define	FCNTLFLAGS	(FAPPEND|FASYNC|FFSYNC|FNONBLOCK)
 #endif
@@ -157,6 +158,7 @@
 #endif
 #if __BSD_VISIBLE
 #define F_ISATTY	11		/* used by isatty(3) */
+#define F_BELOW		12		/* openat(2) cannot open above */
 #endif
 
 /* file descriptor flags (F_GETFD, F_SETFD) */
Index: sys/sys/namei.h
===================================================================
RCS file: /cvs/src/sys/sys/namei.h,v
diff -u -p -u -r1.50 namei.h
--- sys/sys/namei.h	11 Jan 2022 23:59:55 -0000	1.50
+++ sys/sys/namei.h	24 May 2025 05:05:29 -0000
@@ -144,6 +144,7 @@ struct nameidata {
 #define ISLASTCN	0x008000      /* this is last component of pathname */
 #define ISSYMLINK	0x010000      /* symlink needs interpretation */
 #define REALPATH	0x020000      /* save pathname buffer for realpath */
+#define STAYBELOW	0x040000      /* on a dir, prohibit going below ".." of root */
 #define	REQUIREDIR	0x080000      /* must be a directory */
 #define STRIPSLASHES    0x100000      /* strip trailing slashes */
 #define PDIRUNLOCK	0x200000      /* vfs_lookup() unlocked parent dir */