...

  1. Creating a snapshot immediately after creating a logical volume and before the first launch of the application.
    This will allow us to reset the virtual machine to its initial state without recreating the logical volume (at present,
    we recreate the logical volume). Basically, it just simplifies the current process in EVE for clearing the logical volume
    of the VM without adding anything new.
  2. Creating or rolling back a snapshot while the application is turned off. This is considered a basic capability that does not
    need to be consistent with applications. It can be used as a checkpoint whose rollback does not cause conflicts, because
    the VM is powered off. The user story for this case can be anything, such as rolling back a logical volume to state N after
    an unsuccessful update.

...

  1. I need to be able to create a snapshot of the logical volume that stores database N and is used by application N,
    for example before important updates once a week.
  2. I need to be able to roll back to snapshot N if something went wrong and, for example, data was corrupted in
    the database after the actions of user X.
  3. I would like to be able to get information about snapshots on the controller, for example, to understand the space
    occupied by snapshots and their status.
  4. I need to be able to manage snapshots through the controller (create/delete/rollback). For example, EVE has run out
    of space, or there is not enough space to create a new VM. As an administrator, I can then delete an old snapshot
    that I no longer need via the controller, thereby freeing up space for the new VM.

As a regular VM user without controller access:

  1. After I have paused the I/O of application N and flushed its cache, manually or through a script (if the application has such
    functionality), I need to be able to send a command to create or roll back a snapshot and, whether the outcome is positive or
    negative, receive information about this event from EVE (for example, lack of space when creating a snapshot,
    a successful rollback operation, and other information or problems).
  2. I need to be able to get/view, on the VM side, a list of the snapshots of a specific disk that are available for rollback or deletion.

...

At the moment we see bottom-up (Host/Controller-initiated) and top-down (VM-initiated) models to address the problem.

For the Host-Initiated snapshot, we have to have an agent running in the virtual machine as a daemon. There is no way
around it. Fortunately, qemu comes with the qemu-guest-agent software, which is available on Linux and Windows.
The agent runs as a daemon and communicates with the host via a virtio-serial channel. From the host side, commands can
be sent to the guest agent via the virtio-serial channel or an AF_VSOCK socket.

Qemu-guest-agent fsfreeze command

Once such a command is received by the daemon, it flushes the guest file system and temporarily freezes it. That
ensures that the snapshot is consistent and that, once we roll back to it, there will be no fsck problems.

From the guest side, this is achieved with the FIFREEZE ioctl. A subsequent FITHAW allows writes again.
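
As a minimal sketch of what the host side might send, the snippet below builds the JSON messages for the guest agent's freeze, status, and thaw commands (guest-fsfreeze-freeze, guest-fsfreeze-status, guest-fsfreeze-thaw) and decodes a reply. The helper names qga_request and qga_result are hypothetical, and the sketch assumes an already opened channel to the agent; it is not EVE code.

```python
import json

# qemu-guest-agent command names for the freeze/thaw cycle.
FREEZE = "guest-fsfreeze-freeze"
STATUS = "guest-fsfreeze-status"
THAW = "guest-fsfreeze-thaw"


def qga_request(command):
    """Encode a guest-agent command as one line of JSON (hypothetical helper)."""
    return (json.dumps({"execute": command}) + "\n").encode()


def qga_result(raw):
    """Decode a reply and return its "return" value, or raise on an error reply."""
    reply = json.loads(raw)
    if "error" in reply:
        raise RuntimeError(reply["error"].get("desc", "guest agent error"))
    return reply["return"]
```

The host would write qga_request(FREEZE) to the channel, confirm with STATUS that the guest reports a frozen state, take the snapshot, and then send THAW.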

It is possible to specify a hook that runs each time fsfreeze/thaw happens. Such a script can be useful to flush the state
of a user application (e.g. a database), to make sure that not only is the filesystem consistent, but it also holds the latest
(and equally consistent) data from the application.

The guest-agent waits for the hook to terminate, as can be seen from the code.
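
A hook of this kind could look roughly like the sketch below. It assumes that the guest agent invokes the hook with a single argument, "freeze" before the freeze and "thaw" after the thaw; flush_application and resume_application are hypothetical placeholders for whatever the application offers.

```python
def flush_application():
    """Placeholder: ask the application to flush its state and pause writes."""
    return "flushed"


def resume_application():
    """Placeholder: let the application resume normal writes."""
    return "resumed"


def run_hook(argv):
    """Dispatch on the action passed by the guest agent; return an exit code."""
    if len(argv) != 2 or argv[1] not in ("freeze", "thaw"):
        return 1  # unknown or missing action
    if argv[1] == "freeze":
        flush_application()
    else:
        resume_application()
    return 0
```

A real hook script would pass sys.argv to run_hook and exit with the returned code. Because the agent waits for the hook to terminate, the flush should be bounded in time.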

...

Both approaches are important, and each has its place depending on the user stories.
In fact, based on the user stories, and regardless of whether we implement one approach, the other, or both at once,
we need a guest agent, either our own or an existing one, that will work on the VM side (the reasons will become
clearer a little later).

VM-initiated snapshot

In the VM-initiated snapshot, an application running in the VM would ensure that it has flushed its latest state to the disk,
and then handshake with the Local Profile Server to initiate the snapshot. This is the most straightforward approach for us, but
it has a drawback: once the flush operation is completed, the system is allowed to continue write operations.
This means that once we roll back to one of the snapshots, the filesystem would likely have to use its fault-tolerance
mechanism, as if a sudden power-off had happened. But at least it is guaranteed to have the latest data from
the application.

Let's take a closer look at the list of steps for executing a command to create or roll back a snapshot in this approach.
It is also worth noting that a guest agent (for example, some future eve-guest-agent) must be running on the VM in order
for the user to initiate the creation or rollback of a snapshot from the VM. Why eve-guest-agent? Because implementing
this approach critically depends on receiving feedback from EVE. Thus, it will be either a completely new guest agent tailored
to the needs of EVE OS, or a modified existing one (for example, based on qemu-ga, google-ga, or others).

Steps:

  1. The user must use the functionality of the application that runs on the VM and execute a command that will flush
    the cache of this application to the “physical” disk and completely end or suspend I/O in the application
    (when creating a snapshot, it is enough to suspend I/O; when rolling back to a snapshot, all current
    write operations can simply be completed). Here everything depends on the functionality of application N that runs on the VM.
    As a last resort, if the application does not support such functionality for creating snapshots/backups,
    the user can always exit the application. The user performs this step manually (or through a script), since
    we cannot adapt to every specific application.
  2. After the successful completion of step 1, the user, through the agent application, let it be eve-guest-agent for example,
    sends a command to create a snapshot, something like: eve-guest-agent snapshot --create /dev/sdb
  3. The command is sent to EVE, and all the necessary conditions for creating/rolling back a snapshot
    are checked on EVE:
    1. If, for example, there is not enough free space, EVE returns an error to eve-guest-agent on the VM about
      not having enough space to create a snapshot on EVE. Another error may also occur
      while preparing the requested operation. In either case the command is aborted.
    2. Otherwise, proceed to the next step.
  4. EVE sends a command to the VM to do a sync and freeze the available filesystems.
  5. EVE checks that the filesystems on the VM have been frozen.
  6. EVE runs the command to create/roll back the snapshot.
    1. If something went wrong, EVE sends an FS thaw command to the VM, and eve-guest-agent returns an error to the user.
    2. Otherwise, proceed to the next step.
  7. EVE thaws the FS on the guest VM and checks that it has been thawed.
  8. EVE notifies the user via eve-guest-agent that the operation was successful.
  9. The guest VM returns to a normal state (the user returns the application to its normal state).
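
The EVE-side part of these steps (3 to 8) can be sketched as follows. This is only an illustration under stated assumptions: guest and storage are hypothetical objects, and the method names freeze, is_frozen, thaw, has_space_for, and snapshot are invented for the example rather than taken from EVE.

```python
class SnapshotError(Exception):
    """Raised when a snapshot operation cannot be completed."""


def run_snapshot_operation(guest, storage, volume, action="create"):
    """Run a create/rollback request and return the status reported to the user.

    Hypothetical interfaces:
      guest.freeze(), guest.is_frozen(), guest.thaw()
      storage.has_space_for(volume), storage.snapshot(volume, action)
    """
    # Step 3: check preconditions before touching the guest.
    if action == "create" and not storage.has_space_for(volume):
        raise SnapshotError("not enough free space to create a snapshot")

    # Steps 4-5: freeze the guest filesystems and verify the result.
    guest.freeze()
    if not guest.is_frozen():
        guest.thaw()
        raise SnapshotError("guest filesystems could not be frozen")

    try:
        # Step 6: create or roll back the snapshot.
        storage.snapshot(volume, action)
    except Exception as exc:
        # Step 6.1: thaw the guest and report the error.
        guest.thaw()
        raise SnapshotError(f"{action} failed: {exc}") from exc

    # Step 7: thaw the guest filesystems and verify the result.
    guest.thaw()
    if guest.is_frozen():
        raise SnapshotError("guest filesystems are still frozen")

    # Step 8: success message for the user.
    return f"{action} of {volume} succeeded"
```

Every failure path thaws the guest before reporting the error, so the VM is never left frozen by a failed request.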

Controller-Initiated snapshot

For a Controller-Initiated snapshot, as in the case of a VM-initiated snapshot, we must have an agent running as a daemon on the VM.
There is no way around it. The only difference is that, along with the create or rollback command, the controller will also have to send
a list/script of commands for the guest VM. This list of commands is also set by the user, and it is assumed
that these are ordinary commands to run on the VM, performed before and after snapshot operations to ensure
consistent snapshots.

Qemu-guest-agent fsfreeze command

Also, qemu comes with the qemu-guest-agent software, which is available for Linux and Windows. The agent runs as a daemon and
communicates with the host via a virtio-serial port or an AF_VSOCK socket. Qemu-guest-agent allows the VM side to execute sync/fsfreeze/fs-thaw
or any other command sent by the host. This can be considered a potential solution for the Controller-Initiated snapshot approach,
or a basis for the development of eve-guest-agent.

First, EVE will send the user's commands to the VM to put the application into a state suitable for creating a snapshot/backup. After the commands
have executed successfully, EVE will begin the snapshot creation or rollback procedure. Next, EVE sends
the fsfreeze command.

The steps for this are roughly as follows:

  1. The host sends a command to qemu-guest-agent to put the application into a state suitable for creating a snapshot/backup;
  2. Receive a command from the controller to create or roll back a snapshot;
  3. Check the state of the file system in the VM;
  4. Make sure the file system is working normally;
  5. Flush cached data from RAM to disk and freeze the file system in the VM;
  6. Make sure the file system in the VM is frozen;
  7. Create a new snapshot or roll back to an existing snapshot;
  8. After the snapshot command completes, send the command to unfreeze the file system in the VM;
  9. Make sure that the file system is unfrozen and works normally;
  10. The host sends a command to qemu-guest-agent to put the application into a normal state;
  11. Send up-to-date information to the controller;
  12. The VM guest goes back to normal updating of its volume.

When fsfreeze/thaw occurs, a hook is called to execute the app-specific data flush (this requires a code change in the app).
The hook returns, and qemu-guest-agent completes the command processing. Once the snapshot has been taken, the host notifies the guest.

*Between step 5 and step 9 the guest cannot do updates.
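
One way to keep the frozen window short and to guarantee that the guest is always thawed is to wrap the snapshot operation in a context manager, as in the sketch below. The guest object and its run, freeze, and thaw methods are hypothetical, as are the pre_commands and post_commands lists that stand for the user-supplied commands sent from the controller.

```python
from contextlib import contextmanager


@contextmanager
def quiesced_guest(guest, pre_commands, post_commands):
    """Keep the guest quiesced for the duration of the `with` block."""
    for cmd in pre_commands:
        guest.run(cmd)  # put the application into its snapshot/backup state
    try:
        guest.freeze()  # flush and freeze the file system
        try:
            yield  # the snapshot create/rollback runs here
        finally:
            guest.thaw()  # always unfreeze, even if the operation failed
    finally:
        for cmd in post_commands:
            guest.run(cmd)  # return the application to its normal state
```

The caller would place only the create or rollback call inside the with block, so that the guest is blocked from writing for as short a time as possible.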

...