How to Enlarge Your IOPS Again
Vitaliy Filippov
Examples:
0.6-0.7 — A lot of crash/hang fixes
0.8.9 — Test stability and CI
1.3.0 — RDMA without ODP
1.4.0 — VDUSE in CSI
1.6.0 — ENOSPC handling
1.10.1 — Failover speedup
0.7.0 — NFS
1.0.0 — Checksums
1.5.0 — VitastorFS & KV
1.7.0 — I/O threads, Antietcd
1.9.0 — OpenNebula
2.0.0 — S3
2.2.0 — Local reads
0.8.0 — udev & vitastor-disk
0.8.7 — Online configuration
1.4.0 — Rebalance auto-tuning
1.6.0 — Hierarchical failure domains
1.7.0 — Prometheus
1.9.x — dd, resize, block_size auto-guess
1.10.0 — RDMA auto-configuration
Block storage layer based on the American blueprints 🤣
Append to the end
Amortized writes ⇒ 1x RAM, WA ↓↓
Needs compaction (LSM?..) ⇒ WA ↑↑, CPU ↑↑
We can just track garbage
Put each object's entries into a single block...
...But that's not LSMeta anymore, and WA ↑↑
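The "append to the end, just track garbage" idea can be sketched as follows. This is a toy model, not Vitastor's code: `AppendLog`, `BLOCK_ENTRIES` and the in-memory list standing in for the on-disk log are all hypothetical names chosen for illustration.

```python
BLOCK_ENTRIES = 4  # hypothetical: number of metadata entries per on-disk block


class AppendLog:
    """Toy append-only metadata log that counts dead entries per block."""

    def __init__(self):
        self.entries = []   # the log itself: (object_id, value), append-only
        self.latest = {}    # object_id -> position of its live entry
        self.garbage = {}   # block number -> number of dead entries in it

    def put(self, object_id, value):
        old = self.latest.get(object_id)
        if old is not None:
            # The previous entry becomes garbage; charge it to its block.
            block = old // BLOCK_ENTRIES
            self.garbage[block] = self.garbage.get(block, 0) + 1
        self.latest[object_id] = len(self.entries)
        self.entries.append((object_id, value))

    def get(self, object_id):
        return self.entries[self.latest[object_id]][1]

    def reclaimable_blocks(self):
        """Full blocks in which every entry is dead: reusable without copying."""
        full_blocks = len(self.entries) // BLOCK_ENTRIES
        return [b for b in range(full_blocks)
                if self.garbage.get(b, 0) == BLOCK_ENTRIES]
```

Blocks that are only partly dead still need their live entries copied before reuse, which is where the extra write amplification of compaction comes from.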
However, the idea attracted me
Decided to rewrite it once more
WA = 1 is worth the invalidation
Buffer area instead of the journal
Hashmap and linked lists in RAM
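One possible reading of "hashmap and linked lists in RAM", as a minimal sketch: a hash map keyed by object ID whose value is the head of a singly linked list of that object's writes, newest first. `Entry`, `ObjectIndex` and the `location` field are hypothetical, and the real layout may differ.

```python
class Entry:
    """One write to an object; `prev` links to the next older write."""
    __slots__ = ("version", "location", "prev")

    def __init__(self, version, location, prev=None):
        self.version = version
        self.location = location  # hypothetical: offset in the buffer or data area
        self.prev = prev


class ObjectIndex:
    def __init__(self):
        self.heads = {}  # object_id -> newest Entry

    def add(self, object_id, version, location):
        # O(1): the new entry becomes the head and points to the old head.
        self.heads[object_id] = Entry(version, location, self.heads.get(object_id))

    def latest(self, object_id):
        head = self.heads.get(object_id)
        return None if head is None else (head.version, head.location)

    def history(self, object_id):
        """All known writes of the object, newest first."""
        node = self.heads.get(object_id)
        result = []
        while node is not None:
            result.append((node.version, node.location))
            node = node.prev
        return result

    def forget_older(self, object_id):
        """Drop everything except the newest write, e.g. after it is flushed."""
        head = self.heads.get(object_id)
        if head is not None:
            head.prev = None
```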
It's absolutely pointless if you don't optimize CPU usage!
Memory copying, (de)serialization, threads, mutexes...
Linux 6.11 To Introduce Block Atomic Writes - Including NVMe & SCSI Support
Atomic Write Unit Power Fail (AWUPF): This field indicates the size of the write
operation guaranteed to be written atomically to the NVM across all namespaces with
any supported namespace format during a power fail or error condition.
— that's what we need!
⇒ 4 KB writes on SSD/NVMe are always atomic!
...cool, Vitastor already relies on it 🤣
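To see what a given device advertises, the block layer's atomic write limits can be read from sysfs. A minimal sketch, assuming a kernel with block atomic write support (6.11 or later) that exposes the `atomic_write_*` queue attributes; the function name and the `sysfs_root` parameter are hypothetical, and missing attributes are reported as `None`.

```python
from pathlib import Path

# Queue attributes added together with block atomic writes (assumed present
# on Linux 6.11+; older kernels simply do not have these files).
ATOMIC_ATTRS = (
    "atomic_write_unit_min_bytes",
    "atomic_write_unit_max_bytes",
    "atomic_write_max_bytes",
    "atomic_write_boundary_bytes",
)


def atomic_write_limits(device, sysfs_root="/sys/block"):
    """Return {attribute: int or None} for a block device such as 'nvme0n1'."""
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in ATOMIC_ATTRS:
        try:
            limits[name] = int((queue / name).read_text().strip())
        except (OSError, ValueError):
            limits[name] = None  # attribute missing or unreadable
    return limits
```

Usage: `atomic_write_limits("nvme0n1")` on a real host; a maximum unit of at least 4096 bytes would match the 4 KB claim above.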
122 TB D5-5536 — 16 KB block
Write xxx KB blocks → pull the cable → check
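The "check" step needs blocks that reveal a torn write. A minimal sketch, assuming each 512-byte sector of a block carries the same sequence number plus a CRC32: after power loss, a block whose sectors disagree on the sequence number was written only partially. `SECTOR`, `make_block` and `classify_block` are hypothetical names, and actual device I/O is left out.

```python
import struct
import zlib

SECTOR = 512  # hypothetical granularity at which a write could be torn


def make_block(seq, size):
    """Build a block in which every sector stores `seq` and a CRC32 of itself."""
    assert size % SECTOR == 0
    body = struct.pack("<Q", seq) + bytes(SECTOR - 12)
    sector = body + struct.pack("<I", zlib.crc32(body))
    return sector * (size // SECTOR)


def classify_block(block):
    """Return ('intact', seq), ('torn', [seqs...]) or ('corrupt', None)."""
    seqs = set()
    for off in range(0, len(block), SECTOR):
        body = block[off:off + SECTOR - 4]
        (crc,) = struct.unpack("<I", block[off + SECTOR - 4:off + SECTOR])
        if crc != zlib.crc32(body):
            return ("corrupt", None)
        seqs.add(struct.unpack_from("<Q", body)[0])
    if len(seqs) == 1:
        return ("intact", seqs.pop())
    return ("torn", sorted(seqs))
```

A test run would overwrite generation N with generation N+1 in a loop, cut power, and then expect every block to classify as intact with either N or N+1.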
MySQL: innodb_doublewrite=OFF
Vitastor: ???
Write Intents
| | What do we store |
|---|---|
| 2.4.1 | Raw blocks + cpp-btree + journal |
| heap | Blocks with 20% reserved space + hashmap |
| lsmeta | Separate malloc's + hashmap + linked lists |
| | SSD | SSD (256 KB) | HDD (1 MB) | |
|---|---|---|---|---|
| 2.4.1 | 663 MB | 371 MB | 152 MB | |
| heap | 740 MB | 412 MB | 165 MB | +11% |
| ...but in the worst case, the whole metadata area 🤣 | | | | |
| lsmeta | 785 MB | 456 MB | 154 MB | +18% |
* without checksums (checksums need 128 MB .. 1 GB more RAM per 1 TB)
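The 128 MB .. 1 GB range is what simple arithmetic gives per 1 TB of data, assuming a 4-byte checksum per checksum block and all checksums held in RAM. Both are assumptions of this sketch; `checksum_ram_bytes` is a hypothetical helper, not a Vitastor function.

```python
def checksum_ram_bytes(data_bytes, csum_block_size, csum_size=4):
    """Extra RAM needed to hold one `csum_size`-byte checksum per block."""
    return data_bytes // csum_block_size * csum_size


TIB = 1024 ** 4
MIB = 1024 ** 2

# 4 KB checksum blocks: 2^28 blocks per TiB * 4 bytes = 1 GiB
print(checksum_ram_bytes(TIB, 4096) // MIB, "MiB")    # 1024 MiB
# 32 KB checksum blocks: 2^25 blocks per TiB * 4 bytes = 128 MiB
print(checksum_ram_bytes(TIB, 32768) // MIB, "MiB")   # 128 MiB
```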
