Exported from Confluence on Fri, 29 Mar 2024
Volume (and Content) device API
We currently create volumes implicitly when deploying application instances, and we are adding support for explicitly creating volumes from the controller/UI. Such volumes need to be configured by the controller and reported by EVE back to the controller.
For most volumes there is some immutable content (formerly known as images; now called "content trees") which is used to create the volume. We'd also like to report those in info messages, but the details are less settled since they depend on what containerd can make available when it comes to layers etc.
However, other volumes will be created from blank space, or merely serve as an adapter to some external storage, in which case there is no associated local content tree.
We are going to see one extra top-level EVE config object (to be described in storage.proto) called Volume, and our good old object Image (message Image) will be renamed to ContentTree. Volume will be very similar in structure to what used to be known as Drive (buried inside the app config part of the config). Putting it all together (new/updated parts were marked in red in the original document):
message ContentTree {
  string contentTreeID = 1;   // UUID
  string dsId = 2;            // effectively pointer/key into dsConfigs
  string URLsuffix = 3;
  Format iformat = 4;         // RAW, QCOW2, CONTAINER, OCI, BLOB_TREE, ...
  // The following is only used for individual blobs; if this message references
  // a group of blobs (e.g. in the case of OCI) this information is expected to
  // be provided by a top-level blob that this message points to via its URL.
  string sha256 = 5;
  int64 sizeBytes = 6;        // used for capping resource consumption, e.g. for OCI & BLOB_TREE
  SignatureInfo siginfo = 7;
}
message Volume {
  string volumeID = 1;                      // UUID
  volumeContentOrigin origin = 2;
  repeated VolumeAccessProtocols protocols = 3;  // describes all the different ways this Volume
                                                 // can be offered to Tasks (9p, nfs, scsi, ata, virtio, etc.)
  int64 [re]generationCount = 4;
  // miscellaneous attributes of the Volume
  int64 maxsizebytes = 5;
  bool readonly = 6;
  bool preserve = 7;
}
The top-level AppInstanceConfig will change as shown below (for the transition phases see the section at the end of the document):
message AppInstanceConfig {
  UUIDandVersion uuidandversion = 1;
  string displayname = 2;
  VmConfig fixedresources = 3;
  repeated Volume drives = 4;       // To be deprecated in phase 3; replaced by volumeRef
  bool activate = 5;
  ...
  // contains the encrypted userdata
  CipherBlock cipherData = 13;
  repeated string volumeRef = 14;   // UUIDs of the volumes
}
An explicitly created volume will have a VolumeID, which is a UUID allocated by the controller. A device may rely on the volume ID being unique across all volumes on that device, but may not rely on it being unique across devices: depending on how the controller allocates IDs, they might be unique across the infrastructure, or only across volumes on a single device. However, this API is unaffected by whether or not the controller combines the volumeID with the device UUID.
There is a desire to be able to re-generate the volume from the immutable content. This can be done by creating a new volume with a new ID, but there are use cases where this is cumbersome. Hence it seems useful to add a generationID integer [a][b][c][d][e][f][g][h]; this might be called purgeCounter elsewhere (because the operation is commonly referred to as purging the local modifications).
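One way to see why a generation count is convenient: the device can derive a stable per-generation key for a volume without the controller allocating a fresh UUID for every purge, so the old and new generations can coexist and both be reported while the purge is in progress. A minimal sketch; the "#" separator and the helper itself are illustrative assumptions, not part of the EVE API:

```go
package main

import "fmt"

// volumeKey derives a per-generation key for a volume so that, during a
// purge, the old and new generations can coexist on the device and both
// be reported to the controller. The "#" separator is an assumption for
// illustration only.
func volumeKey(volumeID string, generationCount int64) string {
	return fmt.Sprintf("%s#%d", volumeID, generationCount)
}

func main() {
	// While re-generating, both generations of the same volume UUID exist.
	fmt.Println(volumeKey("volume-uuid", 0))
	fmt.Println(volumeKey("volume-uuid", 1))
}
```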
As we transition to volumes, the controller will explicitly allocate UUIDs for the volumes and include those in the configuration APIs. Thus even the volumes which used to be created implicitly through the Drive in the AppInstanceConfig message will have a volume UUID.
Thus for the purpose of identification we have:
// The volume is identified by a volume UUID and a generation number
message volumeName {
  string volumeID = 1;        // UUID string
  int32 generationCount = 2;
  string displayName = 3;     // Some user-friendly name carried from the UI for debugging?
}
We will reuse the current states which are used for app instances (ZSwState), but states past INSTALLED do not apply to volumes (note that in particular purging does not apply; a new volume is reported using the new generationCount while a purge operation is in progress).
Volumes created from blank space will transition directly from INIT to INSTALLED since there is no download or verification associated with them.
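The distinction can be sketched as a mapping from content origin to the sequence of ZSwState-style states a volume walks through. The state names below follow the EVE API's ZSwState, but the exact intermediate download states are simplified assumptions for illustration:

```go
package main

import "fmt"

// originType distinguishes how a volume's content comes to exist.
type originType int

const (
	originBlank originType = iota
	originDownload
)

// stateSequence sketches the states a volume passes through depending on
// its content origin. Blank volumes skip download and verification, so
// they go straight from INIT to INSTALLED.
func stateSequence(o originType) []string {
	switch o {
	case originBlank:
		return []string{"INIT", "INSTALLED"}
	case originDownload:
		return []string{"INIT", "DOWNLOADING", "DOWNLOADED", "VERIFIED", "INSTALLED"}
	}
	return nil
}

func main() {
	fmt.Println(stateSequence(originBlank))
	fmt.Println(stateSequence(originDownload))
}
```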
message volumeStatus {
  ZSwState state = 1;           // State of Software Image download/install
  uint32 downloadProgress = 2;  // Download progress; 0-100 percent
}
The info should at least include maxSizeBytes, which comes from the configuration. It might also make sense to include curSizeBytes, the storage currently used by the volume.
TBD: We might want to define a separate metrics message with information about read/write bytes/ops.
message volumeResources {
  int64 maxSizeBytes = 1;   // From config
  int64 curSizeBytes = 2;   // Current disk usage
}
We also want to know when the volume was created, and a reference count (which could be more than one if the volume is shared):
message volumeUsage {
  google.protobuf.Timestamp createTime = 1;
  uint32 refCount = 2;
  google.protobuf.Timestamp lastRefcountChangeTime = 3;  // When refCount last changed
}
Presumably we will have this to parallel the configuration. Initially we will need two types: Downloaded Content and Blank Content. This allows us to add more types for network storage access later without having to pretend that everything is backed by a (local) content tree.
enum volumeContentOriginType {
  UNKNOWN = 0;
  BLANK = 1;
  DOWNLOAD = 2;
}

message volumeContentOrigin {
  volumeContentOriginType type = 1;
  VolumeType voltype = 2;             // describes the type of the constructed volume
                                      // (note that "EMPTY" is not used; that is the "BLANK" type)
  string downloadContentStoreID = 3;  // where we get DOWNLOAD types from
  // TBD: more optional fields for other origin types
}
message volumeDownloadOrigin {
  string datastoreID = 1;   // UUID string
  string URLsuffix = 2;     // what to append to the datastore URL
  string sha = 3;           // Either specified in config or determined from registry
}
message ZInfoVolume {
  volumeName name = 1;
  volumeStatus status = 2;
  volumeResources resources = 3;
  volumeUsage usage = 4;
  volumeContentOrigin origin = 5;
}
For the image/content we should extract what we can get from containerd for the layers. But it is keyed by a hash (and I don't know if we should have a reference to the registry we got it from). The notion of a createTime, refCount, and lastUseTime might make sense for the content. The biggest TBD is the extent to which we want to represent (and how) the tree of content. The current placeholder is the componentShaList below.
message contentName {
  string sha = 1;           // hash
  string datastoreID = 2;   // UUID string - useful?
}
message contentResources {
  int64 curSizeBytes = 1;   // Current disk usage
}
message ZInfoContentTree {
  contentName name = 1;
  volumeStatus status = 2;         // Same info as for volumes
  contentResources resources = 3;
  volumeUsage usage = 4;           // Same info as for volumes
  repeated string componentShaList = 5;
}
Following the current scheme we add the new info types:
enum ZInfoTypes {
  ZiNop = 0;
  ZiDevice = 1;
  // deprecated = 2;
  ZiApp = 3;
  // deprecated = 4;
  // deprecated = 5;
  ZiNetworkInstance = 6;
  ZiVolume = 7;
  ZiContentTree = 8;
}
message ZInfoMsg {
  ZInfoTypes ztype = 1;
  string devId = 2;
  oneof InfoContent {
    ZInfoDevice dinfo = 3;
    ZInfoApp ainfo = 5;
    // deprecated = 10;
    // deprecated = 11;
    ZInfoNetworkInstance niinfo = 12;
    ZInfoVolume vinfo = 13;
    ZInfoContentTree ctinfo = 14;
  }
  google.protobuf.Timestamp atTimeStamp = 6;
}
TBD, but a rough sketch is based on the current diskMetrics with some tweaks to use volumeName. Note that the used/free semantics depend on the type of volume. For a directory we can report file system usage. For a qcow2 image we can only report how full the qcow2 is relative to its max size.
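The qcow2 case can be made concrete: inside the image there is no independent file-system-level notion of "free", so the only meaningful figure is the allocated size relative to the configured maximum. A sketch; the helper is an illustrative assumption, not part of the metrics API:

```go
package main

import "fmt"

// usedPercent reports how full a qcow2-backed volume is: the currently
// allocated size relative to the configured maximum. Unlike a directory-
// backed volume, we cannot ask a file system inside the image for usage.
func usedPercent(curSizeBytes, maxSizeBytes uint64) uint64 {
	if maxSizeBytes == 0 {
		return 0 // avoid division by zero for unconfigured volumes
	}
	return curSizeBytes * 100 / maxSizeBytes
}

func main() {
	// A 10 GiB (10240 MB) qcow2 with 2.5 GiB (2560 MB) allocated is 25% full.
	fmt.Println(usedPercent(2560, 10240))
}
```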
// For Volume; counts since boot
message volumeMetric {
  volumeName name = 1;
  uint64 readBytes = 3;     // In MB
  uint64 writeBytes = 4;    // In MB
  uint64 readCount = 5;     // Number of ops
  uint64 writeCount = 6;    // Number of ops
  uint64 total = 7;         // in MBytes
  uint64 used = 8;          // in MBytes
  uint64 free = 9;          // in MBytes
}
As we add support to the controller and EVE we will go through the following steps:
The new EVE will upgrade the schema for /persist/img on first boot by using the checkpointed protobuf message from before the reboot.
NOTE: If there is a downgrade of EVE during phase 2 to an old EVE (which does not support the new schema for /persist/img), the volumes in /persist/img will not be used, which can be disruptive for deployed applications.
Currently the controller implicitly asks EVE to create volumes via the Drive in the API. There are different ways the controller might transition to using volumes for existing, deployed application instances:
If we need to support the second approach in EVE, then volumes which are created implicitly as part of deploying an app instance do not have a volumeID, but can be identified by a combination of the App Instance UUID and the Image UUID (which we might want to rename to "Content Tree UUID"). The content tree in turn might refer to a datastore, have some relative URL/name in that datastore, and any given use of that content tree will have a hash which uniquely identifies it.
Thus for the purpose of identification we have:
// If the volume is explicitly created it has a volume UUID
// Otherwise it has an app instance UUID plus an image UUID
// In all cases there is a generation number
message volumeName {
  string volumeID = 1;        // UUID string
  string appInstID = 2;       // UUID string
  string imageID = 3;         // UUID string = ContentTreeID [q][r]
  int32 generationCount = 4;
  string displayName = 5;     // Some user-friendly name?
}
Note that the appInstID and imageID are only needed if EVE needs to support implicitly created volumes (case 2 above).
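The two identification schemes can be expressed as a single lookup key: explicit volumes key on volumeID, implicit ones on appInstID plus imageID, and both carry the generation count. The key layout below is an illustration, not an encoding the API mandates:

```go
package main

import "fmt"

// nameKey derives a unique key from a volumeName-style tuple. Explicitly
// created volumes have a volumeID; implicitly created ones are identified
// by the app instance UUID plus the image (content tree) UUID. The "#"
// and "+" separators are assumptions for illustration.
func nameKey(volumeID, appInstID, imageID string, generation int32) string {
	if volumeID != "" {
		return fmt.Sprintf("%s#%d", volumeID, generation)
	}
	return fmt.Sprintf("%s+%s#%d", appInstID, imageID, generation)
}

func main() {
	// Implicitly created volume: keyed by app instance and image UUIDs.
	fmt.Println(nameKey("", "app-uuid", "img-uuid", 0))
	// Explicitly created volume: keyed by its own UUID.
	fmt.Println(nameKey("volume-uuid", "", "", 1))
}
```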
[a] I don't understand this. I have some immutable content (image). I generate a volume from it. At that point, an app might or might not change it. If I need a volume that is a fresh, clean version of that volume, I need to generate a new one with a new UUID. How would the generation ID help? I need a new volume.
[b] Well, you have a series of immutable content blobs -- but aside from that you can ask the volume manager to basically reset the Volume to its original state right after the creation.
Think of it this way -- this is getting the Volume to the state of the snapshot at the beginning of the Volume's life +avi@zededa.com
[c] Understood. But if I am getting it to "the state of the snapshot at the beginning of the Volume's life", then it is identical to the state of a new snapshot, or to discarding all changes since then.
What purpose does the generation ID serve here?
[d] While we are running using generation 0 and in the process of downloading, verifying, and creating generation 1, we want to be able to report the existence of both volumes. Note that we try to minimize the outage for the application to just a reboot using the new generation of the volume.
[e] So the case is:
1. I create an ECO, using volume 111 based on image A, version 1 (A:1)
2. The image for the ECO is updated to A:2. I want to start a new version of the ECO, based on A:2, but I want to keep the ECO around until everything is ready for a near-zero downtime switch
3. I download A:2
4. I create a new volume (111 gen1)
5. I stop the ECO, start it on the new volume, and I am good to go.
If that is the case, why make it confusing with gen0, gen1, etc.? Just call it a new volume. The volume UUID is generated by the controller (or by the device, doesn't matter for this scenario). It isn't generated or seen by the end-user.
1. Create an ECO using volume 111 based on A:1
2. Download A:2
3. Create volume 6a4 based on A:2
4. Swap the ECO to run off of 6a4 instead of 111
Both 6a4 and 111 were based on A, different versions, which might just as well be different images; A:1->A:2 vs A:1->B:5 is just an ease-of-use thing. Why confuse it with "generation IDs"?
[f] What part of the API would tell EVE to swap in step 4? The API we have is a purgeCmd counter. We don't have an API to say "replace volume X1 with volume X2 for this app instance".
[g] Actually, I am thinking more "replace appA complete spec abcd123 with appA complete spec 543ddf6, and do it rolling".
[h] Well, that isn't what we have in the API today. And I think the notion of updating an app is more natural than replacing. Also, whatever we do I think we need the flexibility to say "update the app container with the new version, but keep the data or empty volume unchanged", as opposed to recreating the empty volume.
[q] Hmm... should this be the image _ID_, or its _hash_?
[r] Currently we refer to all images using a UUID; this is the Image in the Drive in the API. Inside the image there will be a sha.