Exported from Confluence on Fri, 29 Mar 2024
Volume (and Content) device API
We currently create volumes implicitly when deploying application instances, and we are adding support for explicitly creating volumes from the controller/UI. Such volumes need to be configured by the controller and reported by EVE back to the controller.
For most volumes there is some immutable content (formerly known as images; now called "content trees") which is used to create the volume. We'd also like to report those in info messages, but the details are less settled since they depend on what containerd can make available when it comes to layers etc.
However, other volumes will be created from blank space, or merely serve as an adapter to some external storage, in which case there is no associated local content tree.
We are going to see one extra top-level EVE config object (to be described in storage.proto) called Volume, and our good old object Image (message Image) will be renamed to ContentTree. Volume will be very similar in structure to what used to be known as Drive (buried inside the app config part of the config). Putting it all together (new/updated parts were marked in red in the original document):
message ContentTree {
  string contentTreeID = 1;   // UUID
  string dsId = 2;            // effectively pointer/key into dsConfigs
  string URLsuffix = 3;
  Format iformat = 4;         // RAW, QCOW2, CONTAINER, OCI, BLOB_TREE, ...
  // The following is only used for individual blobs; if this message references
  // a group of blobs (e.g. in the case of OCI) this information is expected to
  // be provided by a top-level blob that this message points to via its URL.
  string sha256 = 5;
  int64 sizeBytes = 6;        // used for capping resource consumption, e.g. for OCI & BLOB_TREE
  SignatureInfo siginfo = 7;
}
message Volume {
  string volumeID = 1;                      // UUID
  volumeContentOrigin origin = 2;
  repeated VolumeAccessProtocols protocols = 3;  // describes all the different ways this Volume
                                                 // can be offered to Tasks (9p, nfs, scsi, ata, virtio, etc.)
  int64 [re]generationCount = 4;
  // miscellaneous attributes of the Volume
  int64 maxsizebytes = 5;
  bool readonly = 6;
  bool preserve = 7;
}
The top-level AppInstanceConfig will change as shown below (for the transition phases see the section at the end of the document):
message AppInstanceConfig {
  UUIDandVersion uuidandversion = 1;
  string displayname = 2;
  VmConfig fixedresources = 3;
  repeated Volume drives = 4;       // To be deprecated in phase 3; replaced by volumeRef
  bool activate = 5;
  ...
  // contains the encrypted userdata
  CipherBlock cipherData = 13;
  repeated string volumeRef = 14;   // UUIDs of the volumes
}
An explicitly created volume will have a VolumeID, which is a UUID allocated by the controller. A device may rely on the volume ID being unique across all volumes on that device, but may not rely on it being unique across devices: depending on how the controller allocates IDs, they might be unique across the infrastructure, or only across volumes on a single device. However, this API is unaffected by whether or not the controller combines the volumeID with the device UUID.
There is a desire to be able to re-generate the volume from the immutable content. This can be done by creating a new volume with a new ID, but there are use cases where this is cumbersome. Hence it seems useful to add a generationID integer [a][b][c][d][e][f][g][h]; this might be called purgeCounter elsewhere (because the operation is commonly referred to as purging the local modifications).
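One way to see why a generation count is convenient: the device can derive a stable per-generation key for a volume without the controller allocating a fresh UUID for every purge, so the old and new generations can coexist and both be reported while the purge is in progress. A minimal sketch; the "#" separator and the helper itself are illustrative assumptions, not part of the EVE API:

```go
package main

import "fmt"

// volumeKey derives a per-generation key for a volume so that, during a
// purge, the old and new generations can coexist on the device and both
// be reported to the controller. The "#" separator is an assumption for
// illustration only.
func volumeKey(volumeID string, generationCount int64) string {
	return fmt.Sprintf("%s#%d", volumeID, generationCount)
}

func main() {
	// While re-generating, both generations of the same volume UUID exist.
	fmt.Println(volumeKey("volume-uuid", 0))
	fmt.Println(volumeKey("volume-uuid", 1))
}
```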
As we transition to volumes, the controller will explicitly allocate UUIDs for the volumes and include those in the configuration APIs. Thus even the volumes which used to be created implicitly through the Drive in the AppInstanceConfig message will have a volume UUID.
Thus for the purpose of identification we have:
// The volume is identified by a volume UUID and a generation number
message volumeName {
  string volumeID = 1;        // UUID string
  int32 generationCount = 2;
  string displayName = 3;     // Some user-friendly name carried from the UI for debugging?
}
We will reuse the current states which are used for app instances (ZSwState), but states past INSTALLED do not apply to volumes (note that in particular purging does not apply; a new volume is reported using the new generationCount while a purge operation is in progress).
Volumes created from blank space will transition directly from INIT to INSTALLED since there is no download or verification associated with them.
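The distinction can be sketched as a mapping from content origin to the sequence of ZSwState-style states a volume walks through. The state names below follow the EVE API's ZSwState, but the exact intermediate download states are simplified assumptions for illustration:

```go
package main

import "fmt"

// originType distinguishes how a volume's content comes to exist.
type originType int

const (
	originBlank originType = iota
	originDownload
)

// stateSequence sketches the states a volume passes through depending on
// its content origin. Blank volumes skip download and verification, so
// they go straight from INIT to INSTALLED.
func stateSequence(o originType) []string {
	switch o {
	case originBlank:
		return []string{"INIT", "INSTALLED"}
	case originDownload:
		return []string{"INIT", "DOWNLOADING", "DOWNLOADED", "VERIFIED", "INSTALLED"}
	}
	return nil
}

func main() {
	fmt.Println(stateSequence(originBlank))
	fmt.Println(stateSequence(originDownload))
}
```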
message volumeStatus {
  ZSwState state = 1;           // State of Software Image download/install
  uint32 downloadProgress = 2;  // Download progress; 0-100 percent
}
The info should at least include maxSizeBytes, which comes from the configuration. It might also make sense to include curSizeBytes, the storage currently used by the volume.
TBD: We might want to define a separate metrics message with information about read/write bytes/ops.
message volumeResources {
  int64 maxSizeBytes = 1;   // From config
  int64 curSizeBytes = 2;   // Current disk usage
}
We also want to know when the volume was created, and a reference count (which could be more than one if the volume is shared):
message volumeUsage {
  google.protobuf.Timestamp createTime = 1;
  uint32 refCount = 2;
  google.protobuf.Timestamp lastRefcountChangeTime = 3;  // When refCount last changed
}
Presumably we will have this to parallel the configuration. Initially we will need two types: Downloaded Content and Blank Content. This allows us to add more types for network storage access later without having to pretend that everything is backed by a (local) content tree.
enum volumeContentOriginType {
  UNKNOWN = 0;
  BLANK = 1;
  DOWNLOAD = 2;
}

message volumeContentOrigin {
  volumeContentOriginType type = 1;
  VolumeType voltype = 2;             // describes the type of the constructed volume
                                      // (note that "EMPTY" is not used; that is the "BLANK" type)
  string downloadContentStoreID = 3;  // where we get DOWNLOAD types from
  // TBD: more optional fields for other origin types
}
message volumeDownloadOrigin {
  string datastoreID = 1;   // UUID string
  string URLsuffix = 2;     // what to append to the datastore URL
  string sha = 3;           // Either specified in config or determined from registry
}
message ZInfoVolume {
  volumeName name = 1;
  volumeStatus status = 2;
  volumeResources resources = 3;
  volumeUsage usage = 4;
  volumeContentOrigin origin = 5;
}
For the image/content we should extract what we can get from containerd for the layers. But it is keyed by a hash (and I don't know if we should have a reference to the registry we got it from). The notion of a createTime, refCount, and lastUseTime might make sense for the content. The biggest TBD is the extent to which we want to represent (and how) the tree of content. The current placeholder is the componentShaList below.
message contentName {
  string sha = 1;           // hash
  string datastoreID = 2;   // UUID string - useful?
}
message contentResources {
  int64 curSizeBytes = 1;   // Current disk usage
}
message ZInfoContentTree {
  contentName name = 1;
  volumeStatus status = 2;         // Same info as for volumes
  contentResources resources = 3;
  volumeUsage usage = 4;           // Same info as for volumes
  repeated string componentShaList = 5;
}
Following the current scheme we add the new info types:
enum ZInfoTypes {
  ZiNop = 0;
  ZiDevice = 1;
  // deprecated = 2;
  ZiApp = 3;
  // deprecated = 4;
  // deprecated = 5;
  ZiNetworkInstance = 6;
  ZiVolume = 7;
  ZiContentTree = 8;
}
message ZInfoMsg {
  ZInfoTypes ztype = 1;
  string devId = 2;
  oneof InfoContent {
    ZInfoDevice dinfo = 3;
    ZInfoApp ainfo = 5;
    // deprecated = 10;
    // deprecated = 11;
    ZInfoNetworkInstance niinfo = 12;
    ZInfoVolume vinfo = 13;
    ZInfoContentTree ctinfo = 14;
  }
  google.protobuf.Timestamp atTimeStamp = 6;
}
TBD, but a rough sketch is based on the current diskMetrics with some tweaks to use volumeName. Note that the used/free semantics depend on the type of volume. For a directory we can report file system usage. For a qcow2 image we can only report how full the qcow2 is relative to its max size.
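The qcow2 case can be made concrete: inside the image there is no independent file-system-level notion of "free", so the only meaningful figure is the allocated size relative to the configured maximum. A sketch; the helper is an illustrative assumption, not part of the metrics API:

```go
package main

import "fmt"

// usedPercent reports how full a qcow2-backed volume is: the currently
// allocated size relative to the configured maximum. Unlike a directory-
// backed volume, we cannot ask a file system inside the image for usage.
func usedPercent(curSizeBytes, maxSizeBytes uint64) uint64 {
	if maxSizeBytes == 0 {
		return 0 // avoid division by zero for unconfigured volumes
	}
	return curSizeBytes * 100 / maxSizeBytes
}

func main() {
	// A 10 GiB (10240 MB) qcow2 with 2.5 GiB (2560 MB) allocated is 25% full.
	fmt.Println(usedPercent(2560, 10240))
}
```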
// For Volume; counts since boot
message volumeMetric {
  volumeName name = 1;
  uint64 readBytes = 3;     // In MB
  uint64 writeBytes = 4;    // In MB
  uint64 readCount = 5;     // Number of ops
  uint64 writeCount = 6;    // Number of ops
  uint64 total = 7;         // in MBytes
  uint64 used = 8;          // in MBytes
  uint64 free = 9;          // in MBytes
}
As we add support to the controller and EVE we will go through the following steps:
The new EVE will upgrade the schema for /persist/img on first boot by using the checkpointed protobuf message from before the reboot.
NOTE: If there is a downgrade of EVE during phase 2 to an old EVE (which does not support the new schema for /persist/img), the volumes in /persist/img will not be used, which can be disruptive for deployed applications.
Currently the controller implicitly asks EVE to create volumes via the Drive in the API. There are different ways the controller might transition to using volumes for existing, deployed application instances:
If we need to support the second approach in EVE, then volumes which are created implicitly as part of deploying an app instance do not have a volumeID, but can be identified by a combination of the App Instance UUID and the Image UUID (which we might want to rename to "Content Tree UUID"). The content tree in turn might refer to a datastore, have some relative URL/name in that datastore, and any given use of that content tree will have a hash which uniquely identifies it.
Thus for the purpose of identification we have:
// If the volume is explicitly created it has a volume UUID
// Otherwise it has an app instance UUID plus an image UUID
// In all cases there is a generation number
message volumeName {
  string volumeID = 1;        // UUID string
  string appInstID = 2;       // UUID string
  string imageID = 3;         // UUID string = ContentTreeID [q][r]
  int32 generationCount = 4;
  string displayName = 5;     // Some user-friendly name?
}
Note that the appInstID and imageID are only needed if EVE needs to support implicitly created volumes (case 2 above).
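The two identification schemes can be expressed as a single lookup key: explicit volumes key on volumeID, implicit ones on appInstID plus imageID, and both carry the generation count. The key layout below is an illustration, not an encoding the API mandates:

```go
package main

import "fmt"

// nameKey derives a unique key from a volumeName-style tuple. Explicitly
// created volumes have a volumeID; implicitly created ones are identified
// by the app instance UUID plus the image (content tree) UUID. The "#"
// and "+" separators are assumptions for illustration.
func nameKey(volumeID, appInstID, imageID string, generation int32) string {
	if volumeID != "" {
		return fmt.Sprintf("%s#%d", volumeID, generation)
	}
	return fmt.Sprintf("%s+%s#%d", appInstID, imageID, generation)
}

func main() {
	// Implicitly created volume: keyed by app instance and image UUIDs.
	fmt.Println(nameKey("", "app-uuid", "img-uuid", 0))
	// Explicitly created volume: keyed by its own UUID.
	fmt.Println(nameKey("volume-uuid", "", "", 1))
}
```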
[a] I don't understand this. I have some immutable content (image). I generate a volume from it. At that point, an app might or might not change it. If I need a volume that is a fresh, clean version of that volume, I need to generate a new one with a new UUID. How would the generation ID help? I need a new volume.
[b] Well, you have a series of immutable content blobs -- but aside from that you can ask the volume manager to basically reset the Volume to its original state right after the creation.
Think of it this way -- this is getting the Volume to the state of the snapshot at the beginning of the Volume's life +avi@zededa.com
[c] Understood. But if I am getting it to "the state of the snapshot at the beginning of the Volume's life", then it is identical to the state of a new snapshot, or to discarding all changes since then.
What purpose does the generation ID serve here?
[d] While we are running using generation 0 and in the process of downloading, verifying, and creating generation 1, we want to be able to report the existence of both volumes. Note that we try to minimize the outage for the application to just a reboot using the new generation of the volume.
[e] So the case is:
1. I create an ECO, using volume 111 based on image A, version 1 (A:1)
2. The image for the ECO is updated to A:2. I want to start a new version of the ECO, based on A:2, but I want to keep the ECO around until everything is ready for a near-zero downtime switch
3. I download A:2
4. I create a new volume (111 gen1)
5. I stop the ECO, start it on the new volume, and I am good to go.
If that is the case, why make it confusing with gen0, gen1, etc.? Just call it a new volume. The volume UUID is generated by the controller (or by the device, doesn't matter for this scenario). It isn't generated or seen by the end-user.
1. Create an ECO using volume 111 based on A:1
2. Download A:2
3. Create volume 6a4 based on A:2
4. Swap the ECO to run off of 6a4 instead of 111
Both 6a4 and 111 were based on A, different versions, which might just as well be different images; A:1->A:2 vs A:1->B:5 is just an ease-of-use thing. Why confuse it with "generation IDs"?
[f] What part of the API would tell EVE to swap in step 4? The API we have is a purgeCmd counter. We don't have an API to say "replace volume X1 with volume X2 for this app instance".
[g] Actually, I am thinking more "replace appA complete spec abcd123 with appA complete spec 543ddf6, and do it rolling".
[h] Well, that isn't what we have in the API today. And I think the notion of updating an app is more natural than replacing. Also, whatever we do I think we need the flexibility to say "update the app container with the new version, but keep the data or empty volume unchanged", as opposed to recreating the empty volume.
[q] Hmm... should this be the image _ID_, or its _hash_?
[r] Currently we refer to all images using a UUID; this is the Image in the Drive in the API. Inside the image there will be a sha.