
Currently, snapshot functionality in EVE is only available with a ZFS storage configuration.

Requirements and motivation

After we began to actively develop ZFS storage in EVE, our users asked for snapshot functionality. Specifically, we
want to support the following:

  1. EVE should be able to snapshot a storage volume and store the snapshot locally;
  2. EVE should be able to create multiple snapshots of each volume;
  3. EVE should be able to roll back any volume to a given snapshot (without causing data corruption).

EVE does not need to transfer snapshots out of local storage to another device.
All external backup and restore should be done as part of application data backup by a third party or by cluster storage.

Snapshots are not backups

It is important to clarify that while snapshots provide time-machine capabilities, they are not a backup alternative.
Similarly, RAID is not a backup. If a disk fails, or the whole node is lost to fire/flood/tornado/etc., all the data,
including every snapshot, is gone.

Furthermore, an application (e.g. a database) knows best how to back itself up efficiently. Such a backup usually
takes much less disk space, as it does not contain the application itself. And generally, uploading these backups
to a cloud (e.g. S3) is a better disaster recovery strategy.

Snapshots, on the other hand, are a perfect first line of defense. If something goes wrong in the application, it is very
easy to roll back the whole system to an earlier state.

To summarize, we want to communicate to users that snapshots do not qualify as a disaster recovery strategy,
but they are a good addition to an existing one.

Crash-consistent snapshots vs. application-consistent snapshots

A crash-consistent snapshot captures the immediate state of the disk, as if power had been cut at the moment of
snapshot creation. Whatever was in memory is lost. For modern applications, crash-consistent snapshots are enough
because they are designed to tolerate power cut-offs.

An application-consistent snapshot is created with the collaboration of the application running in the VM. Naturally,
the application needs to be aware that the snapshot is being taken so it can flush all its data before it happens.

Application concerns with EVE-OS volume snapshots

Problem statement

The main goal is for EVE-OS to take ZFS snapshots of the volumes of an application instance when the controller
requests it (using some new EVE API), and also to add an EVE API to request that the application instance restart from
a particular snapshot. This is very efficient, and the ZFS snapshots are atomic.

Problem

However, the concern is around applications and guest VMs which have their own internal buffering and/or ordering
of writes to their virtual disks. Even though the ZFS snapshots are perfectly consistent, such an application instance
might never be able to successfully run from a snapshot.

Typically a guest VM (kernel) will not have much of an issue. It might have some kernel buffers which have not been
written to disk, in which case fsck might run when booting the guest VM from the snapshot.

But if there is some (home-grown or otherwise) database-like application running in the guest VM, then all bets are
off, since we do not know what assumptions it makes. We know that such applications exist because we have been
asked to add support for graceful shutdown of applications, since they cannot handle the box simply being
powered off.

Thus, if we promise that the snapshots are usable, we might end up with very unhappy users.

Possible Solution

At a minimum, we need to make it clear that a custom application must make sure it can be restarted from an
arbitrary snapshot, but we do not have much confidence that this will be sufficient. A better option would be an API
through which EVE-OS tells the app instance "flush all of your application buffers now; a snapshot will be taken",
followed by EVE-OS taking the snapshot and then calling an API to tell the app instance "the snapshot is done".

Approaches we can take to implement application-consistent snapshots

At the moment we see bottom-up (host-initiated) and top-down (VM-initiated) models to address the problem.

In a VM-initiated snapshot, an application running in the VM would ensure it has flushed its latest state to disk,
and then talk to the Local Profile Server to initiate the snapshot. This is the most straightforward approach for us, but
it has its drawbacks: once the flush operation is completed, the system is allowed to continue write operations.
This means that once we roll back to one of the snapshots, the filesystem would likely have to use its fault-tolerance
mechanisms, as if a sudden power-off had happened. But at least it is guaranteed that the snapshot has the latest data
from the application. A rough in-guest sketch of this flow is shown below.
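
The following Go sketch illustrates the VM-initiated flow from inside a Linux guest. The Local Profile Server URL and
request payload are hypothetical placeholders, not an existing API:

package main

import (
	"bytes"
	"log"
	"net/http"
	"syscall"
)

func main() {
	// Flush dirty pages so the on-disk state is as fresh as possible.
	syscall.Sync()

	// Ask the Local Profile Server to trigger the snapshot.
	// The URL and the JSON body are hypothetical placeholders.
	body := bytes.NewBufferString(`{"volume_uuid": "vol-1"}`)
	resp, err := http.Post("http://local-profile-server/api/v1/snapshot",
		"application/json", body)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("snapshot requested:", resp.Status)
}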

For a host-initiated snapshot, we have to have an agent running in the virtual machine as a daemon. There is no way
around it. Fortunately, qemu comes with the qemu-guest-agent software, which is available on Linux and Windows.
The agent runs as a daemon and communicates with the host via a virtio serial port. From the host side, commands
can be sent to the guest agent via the QMP socket.

Qemu-guest-agent fsfreeze command

Once such a command is received by the daemon, it flushes the guest file system and temporarily freezes it. That
ensures that the snapshot is consistent and that, once we roll back to it, there will be no
fsck problems.

From the guest side, this is achieved with the FIFREEZE ioctl. A subsequent FITHAW allows writes again.
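
A minimal Go sketch of these ioctls, assuming a Linux guest; the mount point is an example, and the ioctl request
numbers are copied from linux/fs.h:

package main

import (
	"log"
	"os"
	"syscall"
)

// ioctl request numbers from linux/fs.h
const (
	FIFREEZE = 0xC0045877 // _IOWR('X', 119, int)
	FITHAW   = 0xC0045878 // _IOWR('X', 120, int)
)

// ioctl issues a single ioctl with a zero argument (the freeze/thaw
// ioctls ignore their argument).
func ioctl(fd uintptr, req uintptr) error {
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, fd, req, 0); errno != 0 {
		return errno
	}
	return nil
}

func main() {
	// Any open descriptor on the target filesystem works;
	// /data is an example mount point.
	f, err := os.Open("/data")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := ioctl(f.Fd(), FIFREEZE); err != nil {
		log.Fatal("freeze failed: ", err)
	}
	// ... the snapshot would be taken here, while all writes are blocked ...
	if err := ioctl(f.Fd(), FITHAW); err != nil {
		log.Fatal("thaw failed: ", err)
	}
}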

It is possible to specify a hook to run each time an fsfreeze/thaw happens. Such a script can be useful to flush the
state of a user application (e.g. a database), to make sure not only that the filesystem is consistent, but also that it
has the latest (and likewise consistent) data from the application.

The guest agent waits for the hook to terminate, as can be seen from its source code.

The steps for this are roughly as follows:

  1. The host sends a command to qemu-guest-agent;
  2. fsfreeze/thaw occurs, and the hook is called to execute an app-specific data flush (this requires a code change in the app);
  3. The hook returns and qemu-guest-agent completes the command processing;
  4. The host notifies the guest that a snapshot has been taken;
  5. The guest goes back to normally updating its volume.

* Between step 2 and step 5 the guest cannot perform updates. A rough host-side sketch of this sequence follows.
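
The Go sketch below drives this sequence from the host, assuming the guest agent's virtio-serial channel is exposed
to the host as a UNIX socket; the socket path and the dataset/snapshot names are assumptions (guest-fsfreeze-freeze
and guest-fsfreeze-thaw are real qemu-guest-agent commands):

package main

import (
	"bufio"
	"encoding/json"
	"log"
	"net"
	"os/exec"
)

// guestAgentCmd sends one guest-agent command as line-delimited JSON
// over the agent socket and reads a single JSON reply.
func guestAgentCmd(conn net.Conn, rd *bufio.Reader, execute string) error {
	req, err := json.Marshal(map[string]string{"execute": execute})
	if err != nil {
		return err
	}
	if _, err := conn.Write(append(req, '\n')); err != nil {
		return err
	}
	reply, err := rd.ReadBytes('\n')
	if err != nil {
		return err
	}
	log.Printf("%s -> %s", execute, reply)
	return nil
}

func main() {
	// The socket path is an assumption; EVE would wire up one
	// virtio-serial channel per app instance.
	conn, err := net.Dial("unix", "/run/qga-app1.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	rd := bufio.NewReader(conn)

	// Steps 1-3: freeze the guest filesystems (the fsfreeze hook,
	// if configured, runs inside the guest as part of this command).
	if err := guestAgentCmd(conn, rd, "guest-fsfreeze-freeze"); err != nil {
		log.Fatal("freeze: ", err)
	}
	// Take the atomic ZFS snapshot while the guest is frozen
	// (dataset and snapshot names are made up).
	snapErr := exec.Command("zfs", "snapshot", "persist/volumes/vol-uuid@snap-uuid").Run()
	// Steps 4-5: thaw regardless of the snapshot result, so the guest
	// can resume writing.
	if err := guestAgentCmd(conn, rd, "guest-fsfreeze-thaw"); err != nil {
		log.Fatal("thaw: ", err)
	}
	if snapErr != nil {
		log.Fatal("snapshot: ", snapErr)
	}
}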

Implementation plan for snapshot functionality in EVE

The implementation of snapshots in EVE is divided into two parts.

In the first part, the API and the main functionality for processing snapshot commands are implemented, namely:

  • Create
    To create a snapshot, it is enough for EVE to receive from the controller a configuration for a snapshot that did
    not exist before. The configuration for this command must include the UUID of the logical volume for which the
    snapshot will be created; the UUID of the snapshot itself (generated on the controller side, though it can also be
    generated in EVE if it is absent); and the DisplayName, the user-friendly name of the snapshot on the controller
    (it can be obtained from the controller on creation, or generated automatically in EVE). This command also
    requires actions on the application side (flushing data to disk, as described above, or completely suspending the
    application in a simple implementation to avoid problems).
  • Remove
    The remove snapshot command is executed when the controller stops sending a snapshot configuration that exists in EVE.
  • Rename
    This command has no effect on ZFS. It changes the display name for a snapshot in EVE. It is executed when the
    display_name field is changed in the configuration received from the controller.
  • Rollback
    This command is executed when the counter is changed in the incoming configuration for a specific snapshot.
    It also requires preparatory actions on the EVE side (for example, a complete stop of the application, or
    interaction with the application to flush its data to the logical volume, as described in the previous subchapter).
    Another thing to consider is how snapshot rollback works in ZFS: a rollback to snapshot k will implicitly and
    automatically delete all snapshots from k+1 to N if their deletion was not initiated by the controller prior to the
    rollback command (this behavior is illustrated in the sketch below). For this reason, before rolling back to
    snapshot k, it is recommended to delete snapshots from k+1 to N (automatically) on the controller.

To implement these commands on the EVE side, the libzfs library is used.
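
For illustration only, the Go sketch below shows the equivalent semantics using the zfs(8) CLI; EVE itself goes
through the libzfs bindings rather than the CLI, and the dataset and snapshot names are made up:

package main

import (
	"fmt"
	"os/exec"
)

// zfsCmd runs one zfs(8) command and returns its combined output.
func zfsCmd(args ...string) (string, error) {
	out, err := exec.Command("zfs", args...).CombinedOutput()
	return string(out), err
}

func main() {
	vol := "persist/volumes/vol-uuid" // hypothetical zvol dataset name

	// Create: a ZFS snapshot is addressed as <dataset>@<snapshot-name>;
	// here the snapshot UUID doubles as the ZFS snapshot name.
	if out, err := zfsCmd("snapshot", vol+"@snap-uuid-1"); err != nil {
		fmt.Println("create failed:", err, out)
	}

	// Remove: destroy exactly one named snapshot.
	if out, err := zfsCmd("destroy", vol+"@snap-uuid-2"); err != nil {
		fmt.Println("remove failed:", err, out)
	}

	// Rollback: plain "zfs rollback" refuses to run if snapshots newer
	// than the target exist; "-r" destroys them (snapshots k+1..N),
	// which is exactly the implicit deletion described above.
	if out, err := zfsCmd("rollback", "-r", vol+"@snap-uuid-1"); err != nil {
		fmt.Println("rollback failed:", err, out)
	}
}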

In the second part of the implementation, it is planned to implement a process for EVE to interact with applications
to flush data and suspend I/O to disks inside the application before executing the Create or Rollback snapshot
commands. Proposals for implementing this functionality are described above. At the moment, before executing
these two commands, it is necessary to turn off the application from the controller side in order to avoid data
corruption. The task of the second part of the implementation is to change this, simplifying the snapshot workflow
so that applications do not need to be suspended.

Snapshot information we expect from the controller

To manage snapshots in EVE through a controller, a new configuration message SnapshotConfig is planned to be added:

// RollbackCmd - snapshot rollback command
message RollbackCmd {
	string snapshot_uuid = 1;
	string volume_uuid = 2;
	DeviceOpsCmd rollback = 3;
}

// SnapshotConfig describes a snapshot for a specific logical
// volume that must exist on the device. It has a required
// volume_uuid field identifying the volume it was created for,
// and a UUID used as the snapshot name in ZFS.
message SnapshotConfig {
	// The real name of the snapshot in ZFS. It is the link between
	// the command and the response to the command. It is assumed
	// that the field will always be filled with a unique value
	// that the controller will generate.
	string uuid = 1;
	// Display name (User-friendly name). This name does not affect
	// the snapshot properties in ZFS in any way.
	// Can be filled on the controller side.
	// If a snapshot has already been created and this field has changed,
	// it can be assumed that the friendly name for this snapshot has
	// been renamed on the controller side.
	string display_name = 2;
	// Volume ID of the volume this snapshot belongs to. The field is
	// required for all messages. Must always be filled in on the
	// controller side before sending to create snapshot command.
	string volume_uuid = 3;
	// The command for the snapshot rollback operation.
	// It should also be taken into account that a rollback
	// to snapshot k will implicitly and automatically delete
	// all snapshots from k+1 to N, if the deletion of snapshots
	// from k+1 to N was not initiated by the controller before
	// the rollback cmd.
	RollbackCmd rollback = 4;
}
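
As described above, Create is triggered by a previously unseen snapshot UUID, Remove by a configuration
disappearing, Rename by a display_name change, and Rollback by a counter change. A minimal Go sketch of that
reconciliation logic follows; the struct is an illustrative stand-in for the generated protobuf types, not actual EVE code:

package main

import "fmt"

// snapCfg is an illustrative mirror of SnapshotConfig; the real type
// would be generated from the .proto definition above.
type snapCfg struct {
	uuid            string
	displayName     string
	volumeUUID      string
	rollbackCounter uint32
}

// reconcile derives snapshot commands by comparing the last applied
// configuration with the newly received one.
func reconcile(oldCfg, newCfg map[string]snapCfg) {
	for uuid, n := range newCfg {
		o, known := oldCfg[uuid]
		switch {
		case !known:
			fmt.Println("Create snapshot", uuid, "of volume", n.volumeUUID)
		case o.displayName != n.displayName:
			fmt.Println("Rename snapshot", uuid, "to", n.displayName)
		case o.rollbackCounter != n.rollbackCounter:
			fmt.Println("Rollback volume", n.volumeUUID, "to snapshot", uuid)
		}
	}
	for uuid := range oldCfg {
		if _, still := newCfg[uuid]; !still {
			fmt.Println("Remove snapshot", uuid)
		}
	}
}

func main() {
	oldCfg := map[string]snapCfg{
		"snap-1": {uuid: "snap-1", displayName: "Tuesday", volumeUUID: "vol-1"},
		"snap-2": {uuid: "snap-2", displayName: "Friday", volumeUUID: "vol-1"},
	}
	newCfg := map[string]snapCfg{
		// Counter bumped: triggers a rollback to snap-1.
		"snap-1": {uuid: "snap-1", displayName: "Tuesday", volumeUUID: "vol-1", rollbackCounter: 1},
		// snap-2 is gone from the config: triggers a remove.
	}
	reconcile(oldCfg, newCfg)
}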

Information about snapshots to send to the controller

In addition to commands, it is also planned to send information about snapshots to the controller, divided into two
parts: information and metrics. The need to send metrics comes from the fact that over time, as data on the disk
changes or earlier snapshots are deleted, snapshots tend to grow in size, so it is necessary to keep the controller up
to date about the space occupied by snapshots in ZFS.

The structure of the informational message ZInfoSnapshot will consist of the following fields:

// Snapshot states
enum ZSnapshotState {
	Z_SNAPSHOT_STATE_UNSPECIFIED = 0;
	// This state is used when a snapshot is in the process of being
	// created or an error occurred during the first attempt to create it.
	// (For example, the operation was delayed)
	Z_SNAPSHOT_STATE_CREATING = 1;
	// This state is used when the snapshot has been successfully created.
	Z_SNAPSHOT_STATE_CREATED = 2;
	// This state is used when the snapshot is pending deletion or
	// the first deletion attempt was not successful.
	Z_SNAPSHOT_STATE_DELETING = 3;
	// This state is used to send information to the controller about a
	// snapshot that was implicitly deleted after a rollback snapshot
	// or volume delete command.
	Z_SNAPSHOT_STATE_IMPLICITLY_DELETED = 4;
}

// ZInfoSnapshot - Information about snapshot in zfs for zvol
message ZInfoSnapshot {
	uint64 creation_time = 1; // In seconds
	string uuid = 2; // Links a command and a response; the real name of the snapshot in ZFS
	string volume_uuid = 3; // Volume ID of the volume this snapshot belongs to.
	string display_name = 4; // Ex: "Tuesday" or creation time (User-friendly name)
	bool encryption = 5;
	ZSnapshotState current_state = 6; // Displays the current state of the snapshot
	string error_msg = 7; // Ops error
	uint32 rollback_cmd_counter = 8; // Counter for rollback cmd
	uint64 rollback_time_last_op = 9; // The time when the last rollback operation was performed for this snapshot
}

The structure of the message ZMetricSnapshot with metrics will consist of the following fields:

// Metrics for a snapshot
// When a snapshot is created, its disk space is initially shared between
// the snapshot and the file system, and possibly with previous snapshots.
// As the file system changes, disk space that was previously shared becomes
// unique to the snapshot, and thus is counted in the snapshot's used property.
// Additionally, deleting snapshots can increase the amount of disk space
// unique to (and thus used by) other snapshots.
message ZMetricSnapshot {
	// Snapshot UUID
	string uuid = 1;
	// User-friendly name on controller
	string display_name = 2;
	// Identifies the amount of space consumed by the dataset and all its
	// descendants. (in bytes)
	uint64 used_space = 3;
	// Identifies the amount of data accessible by this snapshot, which might
	// or might not be shared with other datasets in the pool. When a snapshot or
	// clone is created, it initially references the same amount of space as the
	// file system or snapshot it was created from, because
	// its contents are identical. (in bytes)
	uint64 referenced = 4;
	// Identifies the compression ratio achieved for this snapshot,
	// expressed as a multiplier.
	double compressratio = 5;
	// Specifies the logical size of the volume. (in bytes)
	uint64 vol_size = 6;
	// The amount of space that is "logically" accessible by this dataset.
	// See the referenced property. The logical space ignores the effect of
	// the compression and copies properties, giving a quantity closer to
	// the amount of data that applications see.
	// However, it does include space consumed by metadata.
	// (in bytes)
	uint64 logicalreferenced = 7;
}
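
These fields map onto native ZFS properties (used, referenced, compressratio, logicalreferenced; vol_size presumably
comes from the zvol's volsize). A rough Go sketch of reading a few of them with the zfs CLI; EVE would obtain them
through libzfs instead, and the dataset name is made up:

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// "-Hp" asks for script-friendly output: no headers, exact values.
	// The dataset and snapshot names are hypothetical.
	out, err := exec.Command("zfs", "get", "-Hp", "-o", "value",
		"used,referenced,compressratio,logicalreferenced",
		"persist/volumes/vol-uuid@snap-uuid").Output()
	if err != nil {
		fmt.Println("zfs get failed:", err)
		return
	}
	// One value per line: used, referenced, compressratio,
	// logicalreferenced -- ready to be filled into ZMetricSnapshot.
	fmt.Println(strings.Fields(string(out)))
}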

Comments on the API messages in these blocks can be left directly in PR #2633.

Comments on the implementation for processing these commands can be left in PR #2607.

Discussion

It is necessary to define a clear course of action for implementing the second part, namely adding an implementation
for EVE to interact with applications to perform certain actions within the application before executing commands to
create or roll back snapshots in EVE. Implementations have been suggested in this document, but this is not the final
design and is still under discussion.

References

Storage in EVE

Feature Roadmap

PR: Add an API to support snapshots in ZFS #2633

PR: Add implementation of functionality for working with snapshots #2607

