Vitastor has two file system implementations. Both can be used via vitastor-nfs
.
Commands:
Pseudo-FS
Simplified pseudo-FS proxy is used for file-based image access emulation. It’s not suitable as a full-featured file system: it lacks a lot of FS features, it stores all file/image metadata in memory and in etcd. So it’s fine for hundreds or thousands of large files/images, but not for millions.
Pseudo-FS proxy is intended for environments where other block volume access methods can’t be used or impose additional restrictions - for example, VMWare. NFS is better for VMWare than, for example, iSCSI, because with iSCSI, VMWare puts all VM images into one large shared block image in its own VMFS file system, and with NFS, VMWare doesn’t use VMFS and puts each VM disk in a regular file which is equal to one Vitastor block image, just as originally intended.
To use Vitastor pseudo-FS locally, run vitastor-nfs mount --block /mnt/vita
.
Also you can start the network server:
vitastor-nfs start --block --etcd_address 192.168.5.10:2379 --portmap 0 --port 2050 --pool testpool
To mount the FS exported by this server, run:
mount server:/ /mnt/ -o port=2050,mountport=2050,nfsvers=3,soft,nolock,tcp
VitastorFS
VitastorFS is a full-featured clustered (Read-Write-Many) file system. It supports most POSIX features like hierarchical organization, symbolic links, hard links, quick renames and so on.
VitastorFS metadata is stored in a Parallel Optimistic B-Tree key-value database,
implemented over a regular Vitastor block volume. Directory entries and inodes
are stored in a simple human-readable JSON format in the B-Tree. vitastor-kv
tool
can be used to inspect the database.
To use VitastorFS:
- Create a pool or choose an existing empty pool for FS data
- Create an image for FS metadata, preferably in a faster (SSD or replica-HDD) pool,
but you can create it in the data pool too if you want (image size doesn’t matter):
vitastor-cli create -s 10G -p fastpool testfs
- Mark data pool as an FS pool:
vitastor-cli modify-pool --used-for-fs testfs data-pool
- Either mount the FS:
vitastor-nfs mount --fs testfs --pool data-pool /mnt/vita
- Or start the NFS server:
vitastor-nfs start --fs testfs --pool data-pool
Supported POSIX features
- Read-after-write semantics (read returns new data immediately after write)
- Linear and random read and write
- Writing outside current file size
- Hierarchical structure, immediate rename of files and directories
- File size change support (truncate)
- Permissions (chmod/chown)
- Flushing data to stable storage (if required) (fsync)
- Symbolic links
- Hard links
- Special files (devices, sockets, named pipes)
- File modification and attribute change time tracking (mtime and ctime)
- Modification time (mtime) and last access time (atime) change support (utimes)
- Correct handling of directory listing during file creation/deletion
Limitations
POSIX features currently not implemented in VitastorFS:
- File locking is not supported
- Actually used space is not counted, so
du
always reports apparent file sizes instead of actually allocated space - Access times (
atime
) are not tracked (like-o noatime
) - Modification time (
mtime
) is updated lazily every second (like-o lazytime
)
Other notable missing features which should be addressed in the future:
- Inode ID reuse. Currently inode IDs always grow, the limit is 2^48 inodes, so in theory you may hit it if you create and delete a very large number of files
- Compaction of the key-value B-Tree. Current implementation never merges or deletes
B-Tree blocks, so B-Tree may become bloated over time. Currently you can
use
vitastor-kv dumpjson
&loadjson
commands to recreate the index in such situations. - Filesystem check tool. VitastorFS doesn’t have journal because it would impose a severe performance hit, optimistic CAS-based transactions are used instead of it. So, again, in theory an abnormal shutdown of the FS server may leave some garbage in the DB. The FS is implemented is such way that this garbage doesn’t affect its function, but having a tool to clean it up still seems a right thing to do.
Horizontal scaling
Linux NFS 3.0 client doesn’t support built-in scaling or failover, i.e. you can’t specify multiple server addresses when mounting the FS.
However, you can use any regular TCP load balancing over multiple NFS servers.
It’s absolutely safe with immediate_commit=all
and client_enable_writeback=false
settings, because Vitastor NFS proxy doesn’t keep uncommitted data in memory
with these settings. But it may even work without immediate_commit=all
because
the Linux NFS client repeats all uncommitted writes if it loses the connection.
RDMA
vitastor-nfs supports NFS over RDMA, which, in theory, should also allow to use VitastorFS from GPUDirect.
You can test NFS-RDMA even if you don’t have an RDMA NIC using SoftROCE:
-
First, add SoftROCE device on both servers:
rdma link add rxe0 type rxe netdev eth0
. Here,rdma
utility is a part the iproute2 package, andeth0
should be replaced with the name of your Ethernet NIC. -
Start vitastor-nfs with RDMA:
vitastor-nfs start (--fs <NAME> | --block) --pool <POOL> --port 20049 --nfs_rdma 20049 --portmap 0
-
Mount the FS:
mount 192.168.0.10:/mnt/test/ /mnt/vita/ -o port=20049,mountport=20049,nfsvers=3,soft,nolock,rdma
Commands
mount
vitastor-nfs (--fs <NAME> | --block) [-o <OPT>] mount <MOUNTPOINT>
Start local filesystem server and mount file system to
Use regular umount <MOUNTPOINT>
to unmount the FS.
The server will be automatically stopped when the FS is unmounted.
-o|--options <OPT>
- Pass additional NFS mount options (ex.: -o async).
start
vitastor-nfs (--fs <NAME> | --block) start
Start network NFS server. Options:
--bind <IP> |
bind service to <IP> address (default 0.0.0.0) |
--port <PORT> |
use port <PORT> for NFS services (default is 2049). Specify “auto” to auto-select and print port |
--portmap 0 |
do not listen on port 111 (portmap/rpcbind, requires root) |
--nfs_rdma <PORT> |
enable NFS-RDMA at RDMA-CM port <PORT> (you can try 20049). If RDMA is enabled and --port is set to 0, TCP will be disabled |
--nfs_rdma_credit 16 |
maximum operation credit for RDMA clients (max iodepth) |
--nfs_rdma_send 1024 |
maximum RDMA send operation count (should be larger than iodepth) |
--nfs_rdma_alloc 1M |
RDMA memory allocation rounding |
--nfs_rdma_gc 64M |
maximum unused RDMA buffers |
upgrade
vitastor-nfs --fs <NAME> upgrade
Upgrade FS metadata. Can be run online, but server(s) should be restarted after upgrade.
defrag
vitastor-nfs --fs <NAME> defrag [OPTIONS] [--dry-run]
Defragment volumes used for small file storage having more than <defrag_percent> % of data removed. Can be run online.
In VitastorFS, small files are stored in large “volumes” / “shared inodes” one after another. When you delete or extend such files, they are moved and garbage is left behind. Defragmentation removes garbage and moves data still in use to new volumes.
Options:
--volume_untouched 86400 |
Defragment volumes last appended to at least this number of seconds ago |
--defrag_percent 50 |
Defragment volumes with at least this % of removed data |
--defrag_block_count 16 |
Read this number of pool blocks at once during defrag |
--defrag_iodepth 16 |
Move up to this number of files in parallel during defrag |
--trace |
Print verbose defragmentation status |
--dry-run |
Skip modifications, only print status |
--recalc-stats |
Recalculate all volume statistics |
--include-empty |
Include old and empty volumes; make sure to restart NFS servers before using it |
--no-rm |
Move, but do not delete data |
Common options
--fs <NAME> |
use VitastorFS with metadata in image <NAME> |
--block |
use pseudo-FS presenting images as files |
--pool <POOL> |
use <POOL> as default pool for new files |
--subdir <DIR> |
export <DIR> instead of root directory (pseudo-FS only) |
--nfspath <PATH> |
set NFS export path to <PATH> (default is /) |
--pidfile <FILE> |
write process ID to the specified file |
--logfile <FILE> |
log to the specified file |
--foreground 1 |
stay in foreground, do not daemonize |