How Vitastor handles image snapshots
2023-08-30
In fact, there are only 3 ways to implement snapshots: “undo”, “redo”, and “cow”. Also, proper snapshots should always be atomic. What does all of this mean?
Way 1 — “back” or “undo”
During a write after a snapshot is created, the old version of the data block is internally copied into the “snapshot” entity, and the new version is then written as usual, in the same place where the old one was stored. In this implementation, only the newest image version is self-contained, and snapshots are “undo logs”: the changes required to roll the image back to an older state.
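To make the mechanics concrete, here is a minimal toy sketch of the “undo” scheme in Python. All names here are hypothetical illustrations, not Vitastor code: the image is always written in place, and the snapshot lazily collects pre-write copies of overwritten blocks.

```python
class UndoSnapshotImage:
    def __init__(self, nblocks):
        self.blocks = [b''] * nblocks   # current image data, one entry per block
        self.undo = None                # block index -> old data, filled lazily

    def create_snapshot(self):
        self.undo = {}                  # start collecting "undo" records

    def write(self, i, data):
        # Copy the old version into the snapshot before overwriting it
        if self.undo is not None and i not in self.undo:
            self.undo[i] = self.blocks[i]
        self.blocks[i] = data           # new data goes to the same place as before

    def rollback(self):
        # Replay the undo log to restore the pre-snapshot state
        for i, old in self.undo.items():
            self.blocks[i] = old
        self.undo = None
```

Note the asymmetry: rollback has to rewrite every saved block, while deleting the snapshot is just dropping the undo log. This is exactly what the pros and cons below are about.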
Way 2 — “forward” or “redo”
When a snapshot is created, the image is internally renamed to the snapshot and replaced by a new empty object representing the new image state. Old data isn’t moved anywhere; new data is written into the new, initially empty image object. Reads are served from the new image first and fall back to the snapshot when the new image contains nothing at the given offset. In this implementation, only the oldest image version is self-contained, and snapshots are “redo logs”: the changes required to bring it forward to each newer version.
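The same toy model can illustrate the “redo” scheme (again, hypothetical code, not the actual Vitastor implementation): a snapshot freezes the current layer, writes go into a fresh empty layer on top, and reads walk the chain from newest to oldest.

```python
class RedoLayer:
    def __init__(self, parent=None):
        self.blocks = {}                # only blocks written into this layer
        self.parent = parent            # the older snapshot, or None for the base

class RedoImage:
    def __init__(self):
        self.top = RedoLayer()

    def create_snapshot(self):
        # The old top layer becomes a read-only snapshot,
        # new writes go into a fresh empty layer
        self.top = RedoLayer(parent=self.top)

    def write(self, i, data):
        self.top.blocks[i] = data

    def read(self, i):
        layer = self.top
        while layer is not None:
            if i in layer.blocks:
                return layer.blocks[i]
            layer = layer.parent        # fall through to older snapshots
        return b''                      # block was never written
```

Here creating a snapshot and rolling back (just discarding the top layer) are trivial, but every read may have to check the whole chain of parents.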
Way 3 — “copy-on-write” or “redirect-write”
Data blocks are stored independently, each with a reference counter. When a snapshot is created, all references to data blocks are copied and their reference counters are increased by 1. During a write, the new block version is written as a new block, the reference is redirected to it, the previous version’s counter is decreased by 1, and the old block is deleted when its counter reaches 0.
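A sketch of refcounted copy-on-write in the same toy model (hypothetical, for illustration only): blocks live in a shared pool, and each image or snapshot is just an index of references into it.

```python
class CowStore:
    def __init__(self):
        self.pool = {}                  # block id -> data
        self.refs = {}                  # block id -> reference counter
        self.next_id = 0

    def put(self, data):
        bid = self.next_id
        self.next_id += 1
        self.pool[bid] = data
        self.refs[bid] = 1
        return bid

    def unref(self, bid):
        self.refs[bid] -= 1
        if self.refs[bid] == 0:         # last reference gone: free the block
            del self.pool[bid]
            del self.refs[bid]

class CowImage:
    def __init__(self, store):
        self.store = store
        self.index = {}                 # block offset -> block id

    def create_snapshot(self):
        # Copy the index and bump every referenced block's counter by 1
        snap = CowImage(self.store)
        snap.index = dict(self.index)
        for bid in snap.index.values():
            self.store.refs[bid] += 1
        return snap

    def write(self, i, data):
        new_bid = self.store.put(data)  # the new version is written as a new block
        old_bid = self.index.get(i)
        self.index[i] = new_bid         # redirect the reference to the new version
        if old_bid is not None:
            self.store.unref(old_bid)   # old version dies when its counter hits 0
```

Deleting a snapshot is just unref’ing everything in its index, and rollback is just switching back to the snapshot’s index, which is why both are cheap; the price is maintaining the per-block index and counters on every I/O.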
Pros and cons
- Way 1 is optimized for quick snapshot deletion, but rollback is slow and it doesn’t allow storing a tree of snapshots/clones.
- Way 2 is optimized for quick rollback, at the cost of a slight read overhead: you have to check the whole chain of parents.
- Way 3 makes both rollback and deletion fast, at the cost of a permanent overhead on all reads and writes: metadata indexes quickly grow and start requiring extra disk reads or extra memory, plus additional disk writes during garbage collection.
Vitastor always uses (2), both for snapshots and clones. Classic SAN systems often implement (1). Ceph has both (1) and (2), but both implementations are questionable. ZFS uses (3), because its snapshots can be cloned and because it’s internally a CoW file system, with the corresponding overhead: where a raw device does 168k iops of Q128 random writes, ZFS only gives 28-38k iops, depending on settings.
Snapshot atomicity
Every proper snapshot implementation implies atomicity. Why? Because atomic snapshots can be taken online without risk: for an application, restoring from an atomic snapshot is identical to restoring after an abrupt power outage, and most applications handle that without problems. The only exceptions are systems like ClickHouse, where not fully surviving such a crash is an intentional trade-off.
But what is snapshot atomicity? Atomicity is the requirement that there is a moment in time T which splits all writes into “old” and “new”: the snapshot must contain all writes issued before T and no writes issued after T. In other words, a snapshot must not contain an interleaved mess of old and new writes, because that breaks fsync and write ordering and leads to inconsistent data on restore.
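As a toy illustration of the requirement (hypothetical code, not how a real client does it): a single lock can serve as the moment T, so every write completes entirely before or entirely after the snapshot point, and no write can straddle it.

```python
import threading

class AtomicImage:
    def __init__(self):
        self.blocks = {}
        self.snap = None
        self.lock = threading.Lock()    # orders writes against the snapshot point

    def write(self, i, data):
        with self.lock:                 # each write is entirely "old" or "new"
            self.blocks[i] = data

    def create_snapshot(self):
        with self.lock:                 # this critical section is the moment T
            self.snap = dict(self.blocks)
```

A real client can’t afford a global lock like this, so it would typically achieve the same effect by ordering or draining in-flight requests around the snapshot point.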
Vitastor snapshots are atomic within a single client, i.e. within one disk attachment to a VM, or within one VDUSE or NBD daemon. Atomic snapshots of multiple images will also be supported, but that’s a spoiler for the future :-).