
Storage requirements

Our storage requirements are pretty much the same as those of any other cloud provider:

  • Full disk encryption. In more traditional cloud providers this is done to guard users from each other and from cloud provider employees, guaranteeing that data will never leak. In our conditions we additionally have to consider the device being stolen and its unencrypted data being accessed.

...

  • Thin provisioning. Efficient usage of the storage: blocks which are not used by a user should not be occupied.

...

  • Snapshotting. Snapshot the state of a guest and easily roll back or forward between snapshots. Read more about snapshots on the Snapshot page.

...

  • Compression. Also adds to the efficiency of storage usage, and may increase read speeds on slower media (e.g. eMMC).

Problem with current architecture

...

Our current approach relies on qcow2 to satisfy the Storage requirements above. While it does tick all the boxes, its major problems are very inefficient use of host memory and CPU, the impossibility of parallelizing requests, and the fact that it is notoriously difficult to optimize.

...

While this is possible by engaging Vhost/Virtio and dedicating a disk partition to a virtual machine, we would lose the features mentioned in the Storage requirements chapter (such as thin provisioning). So we are looking for a compromise.

...

One way to do the thin provisioning would be to direct the data flow through LVM. But a file system with CoW (copy-on-write) features would give us much richer functionality:

  • LVM does support compression and thin provisioning, but the performance penalty is very high, which kills the major benefit of an LVM-based solution.
  • A file system has much more context about what is happening with its blocks: LVM cannot know which blocks are actually free and keeps holding on to junk. Therefore snapshotting and thin provisioning are much more space efficient when implemented in the file system layer.
  • Growing the disk space takes many more steps in LVM (add a disk, grow the volume group, grow the logical volume, grow the file system sitting on the virtual media), which in general cannot be done online (see the sketch after this list).
  • LVM lacks quota support. Once a logical volume has been allocated to a container, you cannot easily change the size of that volume, while in a filesystem-based approach you would only need to change the quota of a dataset.
  • The FS approach allows allocating "project IDs" and counting quotas for multiple datasets/files. For example, the space allocated for all the containers belonging to Eve could be counted against one quota.
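
To make the contrast concrete, below is a rough sketch of the two growth paths, assuming a hypothetical volume group vg0, a logical volume guest1 (with ext4 on top) and a ZFS dataset persist/guest1. This only illustrates the steps involved; it is not EVE code.

    import subprocess

    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def grow_guest_storage_lvm(new_disk="/dev/sdb", extra="10G"):
        # LVM path: every layer has to be grown separately.
        run(["pvcreate", new_disk])                              # add the new disk
        run(["vgextend", "vg0", new_disk])                       # grow the volume group
        run(["lvextend", "-L", f"+{extra}", "/dev/vg0/guest1"])  # grow the logical volume
        run(["resize2fs", "/dev/vg0/guest1"])                    # grow the ext4 fs on top

    def grow_guest_storage_zfs(new_quota="30G"):
        # Filesystem path: the visible size is just a dataset property.
        run(["zfs", "set", f"quota={new_quota}", "persist/guest1"])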


The file system of choice is ZFS. Unfortunately there are not many file systems satisfying the requirements; currently the list of alternatives is limited to btrfs and bcachefs, where the former is still not stable enough, especially when it comes to software RAID support, and the latter is very promising but has a long way to go to become mature.
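
For reference, here is a minimal sketch of how the requirements from the Storage requirements chapter map onto ZFS dataset properties. The pool and dataset names are hypothetical, and EVE's actual layout may differ.

    import subprocess

    def zfs(*args):
        subprocess.run(["zfs", *args], check=True)

    # Encryption, compression and a quota (instead of a fixed-size allocation,
    # i.e. thin provisioning) are all just properties of a single dataset.
    # Note: keyformat=passphrase prompts for a passphrase interactively.
    zfs("create",
        "-o", "encryption=aes-256-gcm",
        "-o", "keyformat=passphrase",
        "-o", "compression=lz4",
        "-o", "quota=20G",
        "persist/guest1")

    # Snapshotting: capture the state of the guest and roll back to it later.
    zfs("snapshot", "persist/guest1@before-update")
    zfs("rollback", "persist/guest1@before-update")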


Intermediate step

While the efforts on NVMe vhost have already started and the first prototype is approaching, we still need time to upstream this work and bring the implementation up to production level (extensive testing and performance tuning).

...

This approach has additional advantages:

  1. Our partner company recently started parallel efforts on improving the performance of ZFS zvols. They will require a playground for their development and benchmarking activities. More importantly, in the bundle QEMU / virtio-scsi / vhost-scsi / file_io / ZFS zvols, the zvol is the least tested component, since it is not used as widely as other ZFS features. Also, Linux has only recently become an important target for ZFS to run on, and may require specific kernel patches to improve stability. Even for the greater final goal we still need to perform this testing, and it is better to start it earlier, while we can reach out to our partner Klara Inc. for help.
  2. We have important customers who are very interested in storage with redundancy (a.k.a. RAID, Redundant Array of Inexpensive Disks). As of now we do not support software RAID. If zvol stress testing demonstrates a level of maturity suitable for production, we can provide this anticipated feature as early as September 2021.

These efforts have already been started by Petr Fedchenkov. Zvol support has been merged into master in https://github.com/lf-edge/eve/pull/2134. The next step is to add vhost-scsi backed by a zvol. As of the current state, EVE already supports a ZFS-formatted /persist and joining multiple disks into redundant arrays.
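
As a rough illustration of the moving parts (not EVE's actual integration code), the host-side setup could look like the sketch below. The pool name, the WWN and the targetcli-based LIO configuration are assumptions; EVE would perform the equivalent steps programmatically.

    import subprocess

    def run(cmd):
        subprocess.run(cmd, check=True)

    # 1. Thin-provisioned (sparse) zvol that backs the guest disk.
    run(["zfs", "create", "-s", "-V", "20G", "persist/vols/guest1"])

    # 2. Expose the zvol through the kernel LIO target with the vhost fabric
    #    (shown here via targetcli; the exact target configuration is an assumption).
    run(["targetcli", "/backstores/block", "create",
         "name=guest1", "dev=/dev/zvol/persist/vols/guest1"])
    run(["targetcli", "/vhost", "create", "naa.5001405000000001"])
    run(["targetcli", "/vhost/naa.5001405000000001/tpg1/luns", "create",
         "/backstores/block/guest1"])

    # 3. QEMU then attaches to the vhost target instead of doing the I/O itself,
    #    e.g. with: -device vhost-scsi-pci,wwpn=naa.5001405000000001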

NVMe Vhost - The greater goal

...

Here another advantage of the Intermediate step appears: if a production-ready ZFS-based solution arrives before the next wave of customers, we will end up with fewer nodes based on the legacy storage format. And transitioning from vhost-scsi to NVMe-vhost is significantly easier than reformatting the whole system disk.

...

  • /dev/vda for virtio-scsi (current Eve implementation)
  • /dev/sda for vhost-scsi-pci (implementation described in the chapter Intermediate step)
  • /dev/nvme0n1 for nvme emulation
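
For illustration, a minimal guest-side check of which transport the disk arrived through, based purely on the device node names listed above (a real guest may expose additional nodes, and the exact mapping may differ):

    import os

    # Device nodes as seen inside the guest for each transport listed above.
    TRANSPORTS = {
        "/dev/vda": "virtio-scsi (current Eve implementation)",
        "/dev/sda": "vhost-scsi-pci (Intermediate step)",
        "/dev/nvme0n1": "NVMe emulation",
    }

    for node, transport in TRANSPORTS.items():
        if os.path.exists(node):
            print(f"root disk {node}: {transport}")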

...

So this transition should also happen only with customer awareness (e.g. the customer should explicitly press a button in order to update).


Testing and benchmarking

Fio (Flexible I/O tester) is the gold standard for benchmarking storage and file system performance. It is going to be our go-to tool for monitoring performance improvements and regressions.

We will get the first numbers fairly soon, when the Intermediate step PoC is ready. It will also be interesting to see what the bottleneck of such an architecture is, and potentially to adjust the development plan according to the findings.
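
As a starting point, the kind of fio run we are likely to begin with could look like the sketch below; the target path, sizes and runtime are placeholders rather than the final benchmark matrix.

    import subprocess

    fio_cmd = [
        "fio",
        "--name=randrw",                 # job name
        "--filename=/persist/fio.test",  # file on the storage under test
        "--size=1G",
        "--rw=randrw", "--rwmixread=70", # mixed workload, 70% reads
        "--bs=4k",
        "--ioengine=libaio", "--iodepth=32", "--direct=1",
        "--runtime=60", "--time_based",
        "--numjobs=4", "--group_reporting",
        "--output-format=json",          # machine-readable, easy to track over time
    ]
    subprocess.run(fio_cmd, check=True)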

...

Storage is a fairly critical component, so we need at least initial testing before rolling out the updated architecture to customers. Therefore, as the first step of the integration, we have decided to invest some effort into tests, as described in the Testing and benchmarking chapter.

Milestones / Next steps

  • End of August 2021: initial storage stress tests
  • End of September 2021
    • /persist partition formatted as ZFS, transport vhost-scsi or emulated NVMe in the guest
    • Prototype implementation of NVMe-Vhost emulation
  • End of October 2021: Submission of the first version of the patches implementing NVMe-Vhost in Linux and Qemu; ZFS integrated into the prototype
  • End of November 2021: Upstreamed patches; Performance tuning, bug fixes
  • End of December 2021: Extensive testing; first production-ready version, without transitioning existing instances to the new storage format (not implemented/tested yet)
  • Never ending: Continuous performance optimisation and bug fixing

References

  1. Project repository
  2. Snapshots in EVE