vmd(8): refcount lazily-allocated refcount blocks in qcow2 writer
Hello again,
This is the second of the two patches I mentioned, split into its own
email. I think I've found a small refcount accounting bug in vmd(8)'s
qcow2 writer, though I may be misreading the intent of the code, so
please correct me if so.
As far as I can tell, when a guest grows a qcow2 image past a 2GB
refcount-block boundary, vmd lazily allocates a new refcount block and
installs it in the refcount table, but doesn't mark the new refcount
block's own cluster as allocated.
qemu-img check then reports one metadata cluster that is referenced but
has a refcount of zero per 2GB of allocated image data. For example, on
one affected 40GB image:

    ERROR cluster 32769 refcount=0 reference=1
    ERROR cluster 65537 refcount=0 reference=1
    ERROR cluster 98305 refcount=0 reference=1
    ERROR cluster 131073 refcount=0 reference=1
    ERROR cluster 163841 refcount=0 reference=1
    ERROR cluster 196609 refcount=0 reference=1
    ERROR cluster 229377 refcount=0 reference=1
    ERROR cluster 262145 refcount=0 reference=1

The pattern is N * 32768 + 1 with the default 64KB cluster size and
16-bit refcounts. Each refcount block holds 65536 / 2 = 32768 entries,
so it covers 32768 clusters, or 2GB.
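For reference, the arithmetic above can be sketched as follows. This is
a standalone illustration rather than code from vioqcow2.c: the function
names are hypothetical, and it assumes the refcount width is a whole
number of bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: refcount entries that fit in one refcount block. */
static uint64_t
refs_per_block(uint64_t clustersz, unsigned int refcount_bits)
{
	/* Assumption: refcounts are byte-aligned (8, 16, 32 or 64 bits). */
	assert(clustersz > 0 && refcount_bits >= 8 && refcount_bits % 8 == 0);
	return clustersz / (refcount_bits / 8);
}

/* Hypothetical helper: bytes of image covered by one refcount block. */
static uint64_t
bytes_per_block(uint64_t clustersz, unsigned int refcount_bits)
{
	return refs_per_block(clustersz, refcount_bits) * clustersz;
}

/* Hypothetical helper: refcount table index covering byte offset off. */
static uint64_t
refblock_index(uint64_t off, uint64_t clustersz, unsigned int refcount_bits)
{
	return (off / clustersz) / refs_per_block(clustersz, refcount_bits);
}
```

With 64KB clusters and 16-bit refcounts, cluster 32769 falls in the
second refcount region (table index 1), which fits the first ERROR line
above.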
The first refcount block created by vmctl create is fine. The problem
seems to be only in the runtime lazy allocation path in inc_refs().
When the refcount table entry is zero, inc_refs() allocates a new
refcount block at disk->end and writes that offset into the refcount
table. The new block is then reachable from qcow2 metadata, but its own
refcount entry remains zero.
The diff below marks the newly allocated refcount block itself in use
after installing it in the refcount table. The refcount-table entry is
written before the recursive call, so each refcount region enters the
allocation path at most once and the recursion terminates (depth 1 in
the common case, where the new block lands in the same region as the
cluster that triggered the allocation). The recursive call into the
allocator was the part I was least sure about, so I'd welcome a closer
look there.
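To make the termination argument easier to check, here is a toy
in-memory model of the allocation path. It is only a sketch under
stated assumptions, not the vmd code: struct model, model_inc_refs()
and model_run() are hypothetical names, the region size is shrunk to 4
clusters, and refcounts live in one flat array instead of inside the
refcount blocks.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPER	4		/* toy value: refcount entries per block */
#define NREG	8		/* toy value: refcount table entries */
#define NCLUST	(NPER * NREG)

/* Hypothetical model; refcounts are kept in a flat array for brevity. */
struct model {
	uint32_t table[NREG];	/* region -> cluster of its block, 0 = none */
	uint16_t refs[NCLUST];	/* refcount of each cluster */
	uint32_t end;		/* next free cluster */
	int fix;		/* nonzero: refcount new blocks as well */
	int maxdepth;		/* deepest recursion observed */
};

static void
model_inc_refs(struct model *m, uint32_t cluster, int depth)
{
	uint32_t blk, region = cluster / NPER;

	assert(cluster < NCLUST);
	if (depth > m->maxdepth)
		m->maxdepth = depth;
	if (m->table[region] == 0) {
		blk = m->end++;
		/* Install the entry first so the recursive call finds it. */
		m->table[region] = blk;
		if (m->fix)
			model_inc_refs(m, blk, depth + 1);
	}
	m->refs[cluster]++;
}

/* Allocate nalloc data clusters; return blocks left with refcount 0. */
static int
model_run(struct model *m, int fix, int nalloc)
{
	uint32_t c;
	int i, bad = 0;

	memset(m, 0, sizeof(*m));
	m->fix = fix;
	/* Cluster 0: header. Cluster 1: first block, refcounted at create. */
	m->table[0] = 1;
	m->refs[0] = m->refs[1] = 1;
	m->end = 2;
	for (i = 0; i < nalloc; i++) {
		c = m->end++;
		model_inc_refs(m, c, 0);
	}
	for (i = 0; i < NREG; i++)
		if (m->table[i] != 0 && m->refs[m->table[i]] == 0)
			bad++;
	return bad;
}
```

In this model the lazily allocated blocks land at clusters N * NPER + 1,
mirroring the N * 32768 + 1 pattern, and the recursion never goes deeper
than one level because the table entry is already set when the
recursive call looks it up. Keep nalloc small (20 or fewer) so the toy
image does not overflow NCLUST.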
The corruption is metadata-only as far as I can see: data and the L1/L2
mapping tables are unaffected and guests keep running. qemu-img check
-r all repairs the refcounts, but qemu-img resize refuses the image
until it is repaired.
I reproduced this both with real vmd-written qcow2 images and with a
small standalone harness that drives vioqcow2.c past the 2GB boundary.
Before this change, the harness reports self_refs=0 for the lazy
refcount block. After this change, it reports self_refs=1.
Thanks,
Chris
diff --git a/usr.sbin/vmd/vioqcow2.c b/usr.sbin/vmd/vioqcow2.c
index 917cba2cbc0..79a481f3ee5 100644
--- a/usr.sbin/vmd/vioqcow2.c
+++ b/usr.sbin/vmd/vioqcow2.c
@@ -621,6 +621,13 @@ inc_refs(struct qcdisk *disk, off_t off, int newcluster)
 		buf = htobe64(l2cluster);
 		if (pwrite(disk->fd, &buf, sizeof(buf), l1off) != 8)
 			fatal("%s: failed to write ref block", __func__);
+		/*
+		 * The newly allocated refcount block cluster must itself
+		 * be marked in use, or qcow2 metadata leaks one cluster
+		 * per refcount block (one per 2 GiB of allocated data at
+		 * the default 64K cluster / 16-bit refcount sizing).
+		 */
+		inc_refs(disk, l2cluster, 1);
 	}
 
 	refs = 1;