From: Chris Cunningham
Subject: vmd(8): refcount lazily-allocated refcount blocks in qcow2 writer
To: "tech@openbsd.org"
Date: Sun, 14 Jun 2026 14:27:03 +0000

Hello again,

This is the second of the two patches I mentioned, split into its own
email.

I think I've found a small qcow2 metadata leak in vmd(8)'s qcow2
writer, though I may be misreading the intent of the code, so please
correct me if so.

As far as I can tell, when a guest grows a qcow2 image past a 2GB
refcount-block boundary, vmd lazily allocates a new refcount block and
installs it in the refcount table, but doesn't mark the new refcount
block's own cluster as allocated. qemu-img check then reports one
leaked metadata cluster per 2GB of allocated image data.

For example, on one affected 40GB image:

    ERROR cluster 32769 refcount=0 reference=1
    ERROR cluster 65537 refcount=0 reference=1
    ERROR cluster 98305 refcount=0 reference=1
    ERROR cluster 131073 refcount=0 reference=1
    ERROR cluster 163841 refcount=0 reference=1
    ERROR cluster 196609 refcount=0 reference=1
    ERROR cluster 229377 refcount=0 reference=1
    ERROR cluster 262145 refcount=0 reference=1

The pattern is N * 32768 + 1 with the default 64KB cluster size and
16-bit refcounts. Each refcount block covers 32768 clusters, or 2GB.

The first refcount block created by vmctl create is fine. The problem
seems to be only in the runtime lazy allocation path in inc_refs().

When the refcount table entry is zero, inc_refs() allocates a new
refcount block at disk->end and writes that offset into the refcount
table. The new block is then reachable from qcow2 metadata, but its
own refcount entry remains zero.

The diff below marks the newly allocated refcount block itself in use
after installing it in the refcount table.
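
For reference, the 2GB-per-refcount-block figure and the N * 32768 + 1
pattern quoted above can be checked with a short sketch. The helper
names below are hypothetical and are not part of vioqcow2.c. The "+ 1"
offset is taken from the qemu-img check output as observed, not derived
from the allocator.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 * A refcount block occupies one cluster, so it holds
 * cluster_size * 8 / refcount_bits entries, one per cluster covered.
 */
static uint64_t
refblock_clusters(uint64_t cluster_size, unsigned int refcount_bits)
{
	assert(cluster_size != 0 && refcount_bits != 0);
	return (cluster_size * 8 / refcount_bits);
}

/* Bytes of image covered by a single refcount block. */
static uint64_t
refblock_coverage(uint64_t cluster_size, unsigned int refcount_bits)
{
	return (refblock_clusters(cluster_size, refcount_bits) *
	    cluster_size);
}

/*
 * Cluster index expected to be reported as leaked for the n-th
 * lazily allocated refcount block, following the observed
 * N * clusters_per_block + 1 pattern.
 */
static uint64_t
leaked_cluster(uint64_t n, uint64_t cluster_size,
    unsigned int refcount_bits)
{
	return (n * refblock_clusters(cluster_size, refcount_bits) + 1);
}
```

With 64KB clusters and 16-bit refcounts, refblock_clusters(65536, 16)
gives 32768 and refblock_coverage(65536, 16) gives 2^31 bytes, which
matches the 2GB spacing between the reported clusters.
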
The refcount-table entry is written before the recursive call, so each
refcount region enters the allocation path at most once and the
recursion terminates (depth 1 in the common case, where the new block
lands in the same region as the cluster that triggered the
allocation). The recursive call into the allocator was the part I was
least sure about, so I'd welcome a closer look there.

The corruption is metadata-only as far as I can see: data and the
L1/L2 mapping tables are unaffected and guests keep running. qemu-img
check -r all repairs the leaks, but qemu-img resize refuses the image
until it is repaired.

I reproduced this both with real vmd-written qcow2 images and with a
small standalone harness that drives vioqcow2.c past the 2GB boundary.
Before this change, the harness reports self_refs=0 for the lazy
refcount block. After this change, it reports self_refs=1.

Thanks,
Chris

diff --git a/usr.sbin/vmd/vioqcow2.c b/usr.sbin/vmd/vioqcow2.c
index 917cba2cbc0..79a481f3ee5 100644
--- a/usr.sbin/vmd/vioqcow2.c
+++ b/usr.sbin/vmd/vioqcow2.c
@@ -621,6 +621,13 @@ inc_refs(struct qcdisk *disk, off_t off, int newcluster)
 		buf = htobe64(l2cluster);
 		if (pwrite(disk->fd, &buf, sizeof(buf), l1off) != 8)
 			fatal("%s: failed to write ref block", __func__);
+		/*
+		 * The newly allocated refcount block cluster must itself
+		 * be marked in use, or qcow2 metadata leaks one cluster
+		 * per refcount block (one per 2 GiB of allocated data at
+		 * the default 64K cluster / 16-bit refcount sizing).
+		 */
+		inc_refs(disk, l2cluster, 1);
 	}
 
 	refs = 1;