Vitastor has two file system implementations. Both can be used via vitastor-nfs.

Commands: mount, start, upgrade, defrag (see the Commands section below).

⚠️ Important: if you use EC and HDD and mount your NFS from Linux, follow the instructions in the "Linux NFS write size" section below for optimal Vitastor NFS performance.

Pseudo-FS

The simplified pseudo-FS proxy emulates file-based access to block images. It's not suitable as a full-featured file system: it lacks many FS features and stores all file/image metadata in memory and in etcd. So it's fine for hundreds or thousands of large files/images, but not for millions.

The pseudo-FS proxy is intended for environments where other block volume access methods can't be used or impose additional restrictions - for example, VMWare. NFS suits VMWare better than, for example, iSCSI: with iSCSI, VMWare puts all VM images into one large shared block image inside its own VMFS file system, while with NFS it skips VMFS and stores each VM disk in a regular file that maps to exactly one Vitastor block image, just as originally intended.

To use Vitastor pseudo-FS locally, run vitastor-nfs mount --block /mnt/vita.
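
A block image created with vitastor-cli then appears as a regular file under the mount point. For example (testpool and vm-disk-1 are illustrative names):

# create a 10 GB image and check that it shows up as a file
vitastor-cli create -s 10G -p testpool vm-disk-1
ls -l /mnt/vita/vm-disk-1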

You can also start a network server:

vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool

To mount the FS exported by this server, run:

mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp

VitastorFS

VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports most POSIX features like hierarchical organization, symbolic links, hard links, quick renames and so on.

VitastorFS metadata is stored in a Parallel Optimistic B-Tree key-value database, implemented over a regular Vitastor block volume. Directory entries and inodes are stored in a simple human-readable JSON format in the B-Tree. The vitastor-kv tool can be used to inspect the database.

To use VitastorFS:

  1. Create a pool or choose an existing empty pool for FS data
  2. Create an image for FS metadata, preferably in a faster (SSD or replica-HDD) pool, but you can create it in the data pool too if you want (image size doesn’t matter): vitastor-cli create -s 10G -p fastpool testfs
  3. Mark data pool as an FS pool: vitastor-cli modify-pool --used-for-app fs:testfs data-pool
  4. Either mount the FS: vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita
  5. Or start the NFS server: vitastor-nfs start --fs testfs --pool data-pool
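
Putting the steps together (fastpool and data-pool are the example pool names from above):

# create the metadata image, mark the data pool as an FS pool, then mount
vitastor-cli create -s 10G -p fastpool testfs
vitastor-cli modify-pool --used-for-app fs:testfs data-pool
vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita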

Supported POSIX features

  • Read-after-write semantics (read returns new data immediately after write)
  • Linear and random read and write
  • Writing outside current file size
  • Hierarchical structure, immediate rename of files and directories
  • File size change support (truncate)
  • Permissions (chmod/chown)
  • Flushing data to stable storage (if required) (fsync)
  • Symbolic links
  • Hard links
  • Special files (devices, sockets, named pipes)
  • File modification and attribute change time tracking (mtime and ctime)
  • Modification time (mtime) and last access time (atime) change support (utimes)
  • Correct handling of directory listing during file creation/deletion

Limitations

POSIX features currently not implemented in VitastorFS:

  • File locking is not supported
  • Actually used space is not counted, so du always reports apparent file sizes instead of actually allocated space
  • Access times (atime) are not tracked (like -o noatime)
  • Modification time (mtime) is updated lazily every second (like -o lazytime)

Other notable missing features which should be addressed in the future:

  • Inode ID reuse. Inode IDs currently always grow; the limit is 2^48 inodes, so in theory you may hit it if you create and delete a very large number of files
  • Compaction of the key-value B-Tree. The current implementation never merges or deletes B-Tree blocks, so the B-Tree may become bloated over time. For now, you can use the vitastor-kv dumpjson & loadjson commands to recreate the index in such cases (a sketch follows this list).
  • Filesystem check tool. VitastorFS doesn't have a journal because it would impose a severe performance hit; optimistic CAS-based transactions are used instead. So, in theory, an abnormal shutdown of the FS server may leave some garbage in the DB. The FS is implemented in such a way that this garbage doesn't affect its function, but having a tool to clean it up still seems the right thing to do.
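
A sketch of such a recreation, assuming the metadata image is named testfs and that dumpjson/loadjson take the DB image name as the first argument (an assumption; check vitastor-kv --help for the exact syntax):

# dump the whole DB to JSON, then load it into a fresh metadata image
vitastor-kv testfs dumpjson > testfs-dump.json
vitastor-kv testfs-new loadjson < testfs-dump.json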

Linux NFS write size

The Linux NFS client (the nfs/nfsv3/nfsv4 kernel modules) has a hard-coded maximum I/O size, currently set to 1 MB - see rsize and wsize in man 5 nfs.

This means that when you write to a file in an FS mounted over NFS, the maximum write request size is 1 MB, even in O_DIRECT mode and even if the original write request is larger.

However, for optimal linear write performance in Vitastor EC (erasure-coded) pools, the size of write requests should be a multiple of block_size multiplied by the data chunk count of the pool (pg_size - parity_chunks). When write requests are smaller than or not a multiple of this number, Vitastor has to first read the paired data blocks from disk, calculate new parity blocks and only then write them back. Obviously, this is 2-3 times slower than a simple disk write.

Vitastor HDD setups use 1 MB block_size by default. So, for optimal performance, if you use EC 2+1 and HDD, you need your NFS client to send 2 MB write requests, if you use EC 4+1 - 4 MB and so on.

But the Linux NFS client only writes in 1 MB chunks. 😢

The good news is that you can fix it by rebuilding the Linux NFS kernel modules 😉 🤩! You need to change NFS_MAX_FILE_IO_SIZE in nfs_xdr.h and then rebuild and reload the modules.

The instructions, using Debian as an example (commands should be run as root):

# download current Linux kernel headers required to build modules
apt-get install linux-headers-`uname -r`

# replace NFS_MAX_FILE_IO_SIZE with a desired number (here it's 4194304 - 4 MB)
sed -i 's/NFS_MAX_FILE_IO_SIZE\s*.*/NFS_MAX_FILE_IO_SIZE\t(4194304U)/' /lib/modules/`uname -r`/source/include/linux/nfs_xdr.h

# download current Linux kernel source
mkdir linux_src
cd linux_src
apt-get source linux-image-`uname -r`-unsigned

# build NFS modules
cd linux-*/fs/nfs
make -C /lib/modules/`uname -r`/build M=$PWD -j8 modules
make -C /lib/modules/`uname -r`/build M=$PWD modules_install

# move default NFS modules away
mv /lib/modules/`uname -r`/kernel/fs/nfs ~/nfs_orig_`uname -r`
depmod -a

# unload old modules and load the new ones
rmmod nfsv3 nfs
modprobe nfsv3

After these (not overly complicated 🙂) manipulations, NFS will be mounted with the new wsize and rsize by default, which fixes Vitastor NFS linear write performance.
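
To verify that the rebuilt modules are actually in use, remount the FS and check the negotiated mount options on the client (nfsstat is part of the nfs-common package on Debian):

# rsize/wsize in the output should now show 4194304
nfsstat -m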

Horizontal scaling

The Linux NFS 3.0 client doesn't support built-in scaling or failover, i.e. you can't specify multiple server addresses when mounting the FS.

However, you can use any regular TCP load balancing over multiple NFS servers. It's absolutely safe with the immediate_commit=all and client_enable_writeback=false settings, because with them the Vitastor NFS proxy doesn't keep uncommitted data in memory. It may even work without immediate_commit=all, because the Linux NFS client repeats all uncommitted writes if it loses the connection.
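
For example, a minimal HAProxy fragment balancing NFS TCP traffic over two vitastor-nfs servers could look like this (a sketch; addresses and names are hypothetical, and balance source pins each client to one server so its connection isn't re-shuffled):

frontend nfs_in
    bind *:2049
    mode tcp
    default_backend nfs_servers

backend nfs_servers
    mode tcp
    balance source
    server nfs1 192.168.5.11:2049 check
    server nfs2 192.168.5.12:2049 check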

RDMA

vitastor-nfs supports NFS over RDMA, which, in theory, should also make it possible to use VitastorFS with GPUDirect.

You can test NFS-RDMA even if you don’t have an RDMA NIC using SoftROCE:

  1. First, add a SoftROCE device on both servers: rdma link add rxe0 type rxe netdev eth0. Here, the rdma utility is part of the iproute2 package, and eth0 should be replaced with the name of your Ethernet NIC.

  2. Start vitastor-nfs with RDMA: vitastor-nfs start (--fs <NAME> | --block) --pool <POOL> --port 20049 --nfs_rdma 20049 --portmap 0

  3. Mount the FS: mount 192.168.0.10:/mnt/test/ /mnt/vita/ -o port=20049,mountport=20049,nfsvers=3,soft,nolock,rdma
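
Before starting the server, you can verify that the SoftROCE link came up (the exact output format may vary):

# expect something like: link rxe0/1 state ACTIVE physical_state LINK_UP netdev eth0
rdma link show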

Commands

mount

vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>

Start a local filesystem server and mount the file system to <MOUNTPOINT>.

Use regular umount <MOUNTPOINT> to unmount the FS.

The server will be automatically stopped when the FS is unmounted.

  • -o|--options <OPT> - Pass additional NFS mount options (ex.: -o async).
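
For example (testfs and the mountpoint are illustrative names):

vitastor-nfs --fs testfs -o async mount /mnt/vita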

start

vitastor-nfs (--fs <NAME> | --block) start

Start network NFS server. Options:

--bind <IP> bind service to <IP> address (default 0.0.0.0)
--port <PORT> use port <PORT> for NFS services (default is 2049). Specify “auto” to auto-select and print port
--portmap 0 do not listen on port 111 (portmap/rpcbind, requires root)
--nfs_rdma <PORT> enable NFS-RDMA at RDMA-CM port <PORT> (you can try 20049). If RDMA is enabled and --port is set to 0, TCP will be disabled
--nfs_rdma_credit 16 maximum operation credit for RDMA clients (max iodepth)
--nfs_rdma_send 1024 maximum RDMA send operation count (should be larger than iodepth)
--nfs_rdma_alloc 1M RDMA memory allocation rounding
--nfs_rdma_gc 64M maximum unused RDMA buffers
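
For example, to serve the testfs file system from the earlier examples on a specific address, with an auto-selected port and without portmap:

vitastor-nfs start --fs testfs --pool data-pool --bind 192.168.5.10 --port auto --portmap 0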

upgrade

vitastor-nfs --fs <NAME> upgrade

Upgrade FS metadata. Can be run online, but server(s) should be restarted after upgrade.

defrag

vitastor-nfs --fs <NAME> defrag [OPTIONS] [--dry-run]

Defragment volumes used for small-file storage that have more than <defrag_percent>% of their data removed. Can be run online.

In VitastorFS, small files are stored in large “volumes” / “shared inodes” one after another. When you delete or extend such files, they are moved and garbage is left behind. Defragmentation removes garbage and moves data still in use to new volumes.

Options:

--volume_untouched 86400 Defragment volumes last appended to at least this number of seconds ago
--defrag_percent 50 Defragment volumes with at least this % of removed data
--defrag_block_count 16 Read this number of pool blocks at once during defrag
--defrag_iodepth 16 Move up to this number of files in parallel during defrag
--trace Print verbose defragmentation status
--dry-run Skip modifications, only print status
--recalc-stats Recalculate all volume statistics
--include-empty Include old and empty volumes; make sure to restart NFS servers before using it
--no-rm Move, but do not delete data
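
For example, to preview what defragmentation would do without modifying any data:

vitastor-nfs --fs testfs defrag --dry-run --trace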

Common options

--fs <NAME> use VitastorFS with metadata in image <NAME>
--block use pseudo-FS presenting images as files
--pool <POOL> use <POOL> as default pool for new files
--subdir <DIR> export <DIR> instead of root directory (pseudo-FS only)
--nfspath <PATH> set NFS export path to <PATH> (default is /)
--pidfile <FILE> write process ID to the specified file
--logfile <FILE> log to the specified file
--foreground 1 stay in foreground, do not daemonize