...

Our requirements from the storage are pretty much the same as those of any other cloud provider:

  • Full disk encryption. In more traditional cloud providers this is done to guard users from each other and from cloud provider employees, guaranteeing that data never leaks. In our case we additionally have to consider the device being stolen, in which case unencrypted data could be read directly

...

  • Thin provisioning. Efficient usage of the storage: blocks which are not used by a user should not be occupied

...

  • Snapshotting. Snapshot the state of a guest and easily roll back or forward between snapshots

...

  • Read more about snapshots on the Snapshot page.
  • Compression also adds to the efficiency of storage usage and may increase read speeds on slower volumes (e.g. eMMC); a sketch of how these requirements map onto ZFS follows this list
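
As a rough illustration, the requirements above map almost directly onto ZFS dataset operations. The sketch below (Go, shelling out to the zfs CLI; the pool and volume names such as "persist/vols/app1" are made up for the example) creates a sparse, thin-provisioned zvol, enables compression, and takes/rolls back a snapshot. Native encryption would additionally be requested with the encryption=/keyformat= options at dataset creation time.

    package main

    import (
        "fmt"
        "os/exec"
    )

    // zfs runs a single zfs CLI command and reports any failure.
    func zfs(args ...string) error {
        out, err := exec.Command("zfs", args...).CombinedOutput()
        if err != nil {
            return fmt.Errorf("zfs %v: %v: %s", args, err, out)
        }
        return nil
    }

    func main() {
        steps := [][]string{
            // Thin provisioning: "-s" makes the zvol sparse, so blocks are
            // only allocated when the guest actually writes them.
            {"create", "-s", "-V", "8G", "persist/vols/app1"},
            // Compression: transparent; can also help read speed on slow media.
            {"set", "compression=zstd", "persist/vols/app1"},
            // Snapshotting: a cheap point-in-time state of the guest volume.
            {"snapshot", "persist/vols/app1@before-upgrade"},
            // Roll back if the guest ends up in a bad state.
            {"rollback", "persist/vols/app1@before-upgrade"},
        }
        for _, s := range steps {
            if err := zfs(s...); err != nil {
                fmt.Println(err)
            }
        }
    }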

Problem with current architecture

...

One way to do the thin provisioning would be to direct the data flow to LVM, but a file system with OCI features would give us much richer functionality:

  • LVM does support compression and thin provisioning, but the performance penalty is very high, which kills the major benefit of an LVM-based solution
  • A file system has much more context about what is happening with its blocks: LVM cannot know which blocks are free and simply keeps the junk around. Therefore snapshotting and thin provisioning are much more space-efficient when implemented at the file system layer
  • Growing the disk space takes many more steps in LVM (add a disk, grow the volume group, grow the logical volume, grow the file system sitting on the virtual media), which generally cannot be done online
  • LVM lacks quota support: once a Logical Volume has been allocated to a container, you cannot easily change the size of that volume, while in the filesystem-based approach you would only need to change the quota of a dataset
  • The FS-based approach allows allocating "Project IDs" and counting quotas across multiple datasets/files. For example, the space allocated for all the containers belonging to Eve could be counted against one quota (see the sketch after this list)
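
To make the last two points concrete, here is a minimal sketch (same Go-plus-zfs-CLI style as above; dataset names and sizes are illustrative only) of what resizing and capping a guest's space budget look like in the filesystem-based approach: each is a single online property change rather than a multi-step LVM procedure.

    package main

    import (
        "fmt"
        "os/exec"
    )

    // zfs runs a single zfs CLI command and reports any failure.
    func zfs(args ...string) error {
        out, err := exec.Command("zfs", args...).CombinedOutput()
        if err != nil {
            return fmt.Errorf("zfs %v: %v: %s", args, err, out)
        }
        return nil
    }

    func main() {
        // Growing a guest volume is a single online property change, instead
        // of the LVM "add disk / grow VG / grow LV / grow fs" sequence.
        if err := zfs("set", "volsize=16G", "persist/vols/app1"); err != nil {
            fmt.Println(err)
        }
        // The space budget for everything one application owns can be capped
        // by a single quota on its parent dataset.
        if err := zfs("set", "quota=32G", "persist/apps/app1"); err != nil {
            fmt.Println(err)
        }
    }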


The file system of choice is ZFS (unfortunately there are not many file systems satisfying these requirements; currently the list is limited to Btrfs and bcachefs, where the former is still not stable enough, especially when it comes to software RAID support, and the latter is very promising but has a long way to go to become mature).

...

Here another advantage of the Intermediate step appears: if a production-ready ZFS-based solution arrives before the next wave of customers, we will end up with fewer nodes based on the legacy storage format. And transitioning from SCSI-vhost to NVMe-vhost is significantly easier than reformatting the whole system disk.

...

  • /dev/vda for virtio-scsi (current Eve implementation)
  • /dev/sda for vhost-scsi-pci (implementation described in the chapter Intermediate step)
  • /dev/nvme0n1 for nvme emulation
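
Once the PoC exposes all three device paths, the first numbers can come from something as simple as the sketch below: a rough sequential-read probe in Go over the devices listed above. Block size and duration are arbitrary, and a serious comparison would use a dedicated tool such as fio with direct I/O so the page cache does not skew the results.

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    // measure reads a device sequentially for a few seconds and prints MiB/s.
    func measure(dev string) {
        f, err := os.Open(dev)
        if err != nil {
            fmt.Printf("%s: %v\n", dev, err)
            return
        }
        defer f.Close()

        buf := make([]byte, 1<<20) // 1 MiB reads
        var total int64
        start := time.Now()
        for time.Since(start) < 5*time.Second {
            n, err := f.Read(buf)
            total += int64(n)
            if err != nil {
                break
            }
        }
        secs := time.Since(start).Seconds()
        fmt.Printf("%s: %.1f MiB/s\n", dev, float64(total)/(1<<20)/secs)
    }

    func main() {
        for _, dev := range []string{"/dev/vda", "/dev/sda", "/dev/nvme0n1"} {
            measure(dev)
        }
    }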

...

We will get the first numbers pretty soon, when the Intermediate step PoC is ready. It will also be interesting to see what the bottleneck of such an architecture turns out to be, and potentially to adjust the development plan according to those findings.

...

References

  1. Project repository
  2. Snapshots in EVE