Introduction

This document covers the Edge-Node Cluster design considerations and detailed procedures for Cluster Creation, Network Instance Deployment, Volume Instance Deployment, Application Deployment, Application Migration, and Status reporting, for both the Controller and the on-site edge-nodes. It also describes the Edge-Node API changes for the clustering operation.

EVE development has been extended to build a kubevirt image in which kubernetes and kubevirt manage the user VMs; see the document ‘Cluster compute and storage support in EVE’. This Edge-Node Clustering work is built on top of that development.

Terminology


Cluster Workflow


We first walk through the workflow of edge-node clustering:

  1. Create a cluster
  2. Create Network Instance(s) and Volume Instance(s), if any
  3. Deploy an Application on the edge-node cluster
  4. Move an Application manually or automatically among cluster nodes
  5. Change the Seed-Server of the cluster
  6. Add a new node to an existing cluster
  7. Delete a node from the cluster
  8. Replace a node in the cluster


Create Edge-Node Cluster


Cluster creation is a two-step process: the first step is to create the cluster in the Controller; in the second step, on site, the devices form a kubernetes cluster among themselves upon receiving the configuration from the Controller.

Controller Edge-Node Cluster Creation

  1. Assign unique cluster name and UUID
  2. Select 3 edge-nodes, by some mechanism
  3. Designate an edge-node as ‘seed-server’, either manually or automatically
  4. For each edge-node, designate one physical port as the ‘Cluster-Interface’. This port can be either of Mgmt type or ‘App Shared’, and the choice should be consistent across the cluster.
  5. Cluster can define ‘ResourceLabels’ if needed. Each edge-node can be attached with zero or multiple ResourceLabels. For instance, ResourceLabels can be ‘GPU:2’ or ‘DirectConnect:eth3’. Those can be used when a user application needs to be satisfied with certain hardware resources during Kubernetes dynamic scheduling. If the cluster comprises identical devices, then there is no need to do the ResourceLabel for it.
  6. Generate a 32-byte bearer token for the cluster (see the k3s documentation). This token is created by the controller and passed to the seed node, which uses it to admit nodes that join later, and to the other nodes, which use it to join the cluster. It needs to be encrypted with the device’s public certificate when downloaded with the device configuration.
  7. In addition to the ‘Cluster-Interface’ designation above, the interface needs to be specified with a real physical port name.
  8. Regardless of whether the above ‘Cluster-Interface’ uses DHCP or static IP assignment, we always create a new IP prefix on that ‘Cluster-Interface’; let's call it the ‘Cluster-Prefix’. We can assign it automatically, e.g. ‘10.244.244.0/28’, to allow say 16 devices to form a cluster on site, but we should give users the choice of defining their own prefix if this one conflicts or for some other reason. The controller automatically assigns each edge-node an IP address within this ‘Cluster-Prefix’ range (a sketch of the token generation and address assignment follows this list).
  9. The Controller is responsible for creating the cluster by producing the set of device configurations for each of the edge-nodes in the cluster, but forming the cluster on site depends on the collective edge-nodes’ bootstrap processing of the EVE pillar and kubernetes software.
  10. We may need to add an edge-node to the cluster later: either as a server (in the case of a 5-server HA cluster), as a replacement for an original edge-node, or as an agent node for the cluster.
  11. Controller deployment policy may be used for some of the above cluster wide configurations.
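
A minimal controller-side sketch of the token generation and address assignment steps above, with hypothetical helper names: generate a 32-byte cluster bearer token and assign each edge-node an address from the ‘Cluster-Prefix’. The token shown here would still be encrypted per device before being included in the device configuration.

package main

import (
    "crypto/rand"
    "encoding/hex"
    "fmt"
    "net"
)

// newClusterToken returns a random 32-byte token, hex-encoded, usable as the
// k3s bearer token for the cluster.
func newClusterToken() (string, error) {
    b := make([]byte, 32)
    if _, err := rand.Read(b); err != nil {
        return "", err
    }
    return hex.EncodeToString(b), nil
}

// assignClusterIPs hands out host addresses from the Cluster-Prefix, e.g.
// 10.244.244.0/28, skipping the network address and reserving the first host
// address (e.g. 10.244.244.1) for the seed/join server.
func assignClusterIPs(prefix string, nodes []string) (map[string]string, error) {
    _, ipnet, err := net.ParseCIDR(prefix)
    if err != nil {
        return nil, err
    }
    ones, _ := ipnet.Mask.Size()
    next := incIP(incIP(ipnet.IP)) // first assignable address, e.g. .2
    addrs := make(map[string]string)
    for _, n := range nodes {
        if !ipnet.Contains(next) {
            return nil, fmt.Errorf("cluster prefix %s exhausted", prefix)
        }
        addrs[n] = fmt.Sprintf("%s/%d", next, ones)
        next = incIP(next)
    }
    return addrs, nil
}

// incIP returns a copy of ip incremented by one.
func incIP(ip net.IP) net.IP {
    out := make(net.IP, len(ip))
    copy(out, ip)
    for i := len(out) - 1; i >= 0; i-- {
        out[i]++
        if out[i] != 0 {
            break
        }
    }
    return out
}

func main() {
    token, _ := newClusterToken()
    addrs, _ := assignClusterIPs("10.244.244.0/28", []string{"node-a", "node-b", "node-c"})
    fmt.Println(token, addrs)
}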


On-Site Edge-Node Cluster Bootstrap



On-Site Edge-Node Networking in Clustering

  1. For the Cluster-Interface and Cluster-Prefix, an extra IP prefix has to be configured on the device port. The ‘nim’ service is probably the right place to perform this.
  2. For the new Cluster-Prefix, the IP rules and ACLs need to be evaluated and installed in addition to those of the current single-node mode.
  3. Unlike the single-node case, the cluster is a distributed kubernetes system whose modules communicate with each other over various TCP/UDP ports. We need to open more ports to allow inbound kubernetes service IP packets (the options are either to simply open those ports, or to additionally check that packets to those ports have a source IP address on the same subnet; see the sketch after this list):
    1. 53 - DNS (CoreDNS) port; this is also open in single-node mode
    2. 6443 - basic cluster api-server endpoint port
    3. 2379-2381 - etcd endpoint ports
    4. 8472 - Flannel VXLAN overlay port (UDP)
    5. 9500-9503 - Longhorn system ports
    6. 8000-8002 - Longhorn webhook ports
    7. 3260 - iSCSI port
    8. 2049 - NFS port
  4. In the kubevirt single-node case, one of the per-port IP rules is added with the ‘cni0’ subnet route to allow kubernetes-internal traffic. In cluster mode, this needs to be extended to handle inter-device kubernetes traffic.
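
A minimal sketch of the ‘check the source address’ option from item 3 above: accept inbound traffic to the kubernetes/Longhorn ports only when the source lies inside the ‘Cluster-Prefix’. The port numbers come from the list above; the function and how it would hook into EVE’s ACL handling are assumptions.

package main

import (
    "fmt"
    "net"
)

// clusterPorts lists the inbound ports the cluster nodes must be able to reach.
var clusterPorts = map[uint16]string{
    53:   "DNS/CoreDNS",
    6443: "kube-apiserver",
    2379: "etcd client", 2380: "etcd peer", 2381: "etcd metrics",
    8472: "flannel VXLAN (UDP)",
    9500: "longhorn", 9501: "longhorn", 9502: "longhorn", 9503: "longhorn",
    8000: "longhorn webhook", 8001: "longhorn webhook", 8002: "longhorn webhook",
    3260: "iSCSI",
    2049: "NFS",
}

// allowClusterInbound returns true when dstPort is a cluster port and the
// source address is within the Cluster-Prefix.
func allowClusterInbound(clusterPrefix *net.IPNet, src net.IP, dstPort uint16) bool {
    if _, ok := clusterPorts[dstPort]; !ok {
        return false
    }
    return clusterPrefix.Contains(src)
}

func main() {
    _, prefix, _ := net.ParseCIDR("10.244.244.0/28")
    fmt.Println(allowClusterInbound(prefix, net.ParseIP("10.244.244.3"), 6443)) // true
    fmt.Println(allowClusterInbound(prefix, net.ParseIP("192.0.2.10"), 6443))   // false
}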


Cluster Creation Diagram For Devices


Network Instance (NI) Deployment in Cluster


Controller NI Provisioning

  1. A new cluster-wide NI is introduced, in contrast to the current per-edge-node NI.
  2. The same NI name and UUID are configured onto each of the edge-nodes in the cluster.
  3. For the ‘Local’ type with ‘Auto’ or ‘Manual’ IP configuration, we may need to add a choice of ‘Same Prefix’ or ‘Different Prefix’ for the NI on each of the edge-nodes in the cluster. The ‘Different Prefix’ scheme may be useful in the future if we want to allow one App on edge-node ‘A’ to communicate with another App on edge-node ‘B’. Some CNIs, such as ‘Calico’, allow multiple IP prefixes on the same kubernetes node to be redistributed in the cluster. In ‘Different Prefix’ mode, the Controller assigns e.g. 10.1.10.0/24 to the first edge-node for the NI (with the same name/UUID), 10.1.11.0/24 to the 2nd edge-node, and 10.1.12.0/24 to the 3rd node (a sketch of this assignment follows this list).
  4. We may want to let users pick which edge-nodes need this NI deployment, or the deployment can be bound by the ‘ResourceLabels’ of the edge-nodes. If an App can only run on edge-nodes with the ResourceLabel ‘foo’, then it makes sense to deploy the NI used by this App only onto the same set of edge-nodes, in particular since the cluster may have more than 3 nodes in the future.
  5. The ResourceLabel list of the NI does not need to be downloaded onto the edge-node, it can be used for the controller side to filter out certain edge-nodes. Thus there is no new addition to the NetworkInstanceConfig API.
  6. Once each edge-node NI configuration is built, it will be used in the device configuration for each edge-node selected
  7. Should we reuse the Library’s ‘Network Instance’ object? Probably yes.
  8. Should we have a default NI for a cluster? Probably yes, just as in the single-node case. Should the ‘default’ NI default to ‘Same Prefix’ or to ‘Different Prefix’?
  9. Normally, once a NI is configured, it is downloaded onto the edge-node. We may want to hold it in the controller until an App is attached to it. In the ‘Manual-Migration’ case, only the DNId edge-node will get this NI, while in the ‘Auto-Migration’ case all the edge-nodes get this NI configuration. Multiple App Instances may use the same NI, so as long as at least one App is using it, it needs to be downloaded. The NI does not need to have the ‘DNId’ string attached (unlike the VI case, which consumes valuable resources).
  10. Normally a NI specifies one port or ‘uplink’ for external connectivity; in the cluster case, a ‘port’ can be on different devices. We need to let users pick device:port pairs for all the devices (3 in this case) involved in Applications using the NI.
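
A sketch of the ‘Different Prefix’ scheme from item 3 above: the controller keeps one NI (same name/UUID) but derives a distinct /24 per edge-node by bumping the third octet of a base prefix. The helper name and derivation rule are illustrative assumptions.

package main

import (
    "fmt"
    "net"
)

// niPrefixForNode returns the NI prefix for the nodeIndex-th edge-node
// (0-based); base 10.1.10.0/24 yields 10.1.10.0/24, 10.1.11.0/24, 10.1.12.0/24.
func niPrefixForNode(base string, nodeIndex int) (string, error) {
    ip, ipnet, err := net.ParseCIDR(base)
    if err != nil {
        return "", err
    }
    v4 := ip.To4()
    if v4 == nil {
        return "", fmt.Errorf("only IPv4 bases handled in this sketch")
    }
    v4 = append(net.IP(nil), v4...) // copy before modifying
    v4[2] += byte(nodeIndex)        // bump the third octet per node
    ones, _ := ipnet.Mask.Size()
    return fmt.Sprintf("%s/%d", v4, ones), nil
}

func main() {
    for i := 0; i < 3; i++ {
        p, _ := niPrefixForNode("10.1.10.0/24", i)
        fmt.Println(p)
    }
}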


On-Site Edge-Node NI Handling


Volume Instance (VI) Deployment in Cluster


Controller Volume Instance Provisioning

  1. The VI types need to be extended with ‘Filesystem Storage’; currently only ‘Block Storage’ is allowed for non-image usage.
  2. Same as in the NI case, the same VI name and UUID can be deployed onto each of the edge-nodes in the cluster.
  3. Same as in the NI case, may want to let users pick which edge-nodes need this VI or assign the ResourceLabel requirement.
  4. Same as in the NI case, if the ResourceLabel list exists, it does not need to be downloaded to the edge-nodes.
  5. Once each edge-node VI configuration is built, it will be used in the device configuration for each edge-node selected
  6. If the VI does not need to be replicated by the cluster, then the user should only pick one edge-node
  7. Same question as for the NI: should we reuse the same ‘VI’ object or create a new one for the Edge-Node Cluster VI?
  8. Normally, once a VI is configured independent of the Application, it is downloaded onto the edge-node. In the cluster case, we should allow users to pick up to three devices in the cluster for this VI. When the App is attached to this VI and the App device selection does not match the VI device selection, we generate an error and ask the user to modify it (a sketch of this check follows this list). In the ‘Manual Migration’ case, only the DNId node gets this VI configuration; in the ‘Auto-Migration’ case, the other nodes may also get the VI configuration. A new DNId string is added to the ‘Volume’ API.
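
A sketch of the validation described in item 8 above: when an App is attached to a VI, the controller checks that every node selected for the App is covered by the VI device selection and reports an error otherwise. The function name and types are stand-ins, not existing controller code.

package main

import "fmt"

// validateAppVolumeNodes returns an error if the App references an edge-node
// that the Volume Instance was not provisioned on.
func validateAppVolumeNodes(appNodes, viNodes []string) error {
    viSet := make(map[string]bool, len(viNodes))
    for _, n := range viNodes {
        viSet[n] = true
    }
    for _, n := range appNodes {
        if !viSet[n] {
            return fmt.Errorf("edge-node %s selected for the App is not in the VI device selection; modify the App or the VI", n)
        }
    }
    return nil
}

func main() {
    fmt.Println(validateAppVolumeNodes([]string{"node-a", "node-b"}, []string{"node-a"}))
}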


On-Site Edge-Node VI Handling

Cluster Application Deployment


Controller Application Deployment

  1. An Application (App Bundle or App manifest) is deployed on a cluster, rather than on a device.
  2. The ResourceLabels may not be implemented in the first phases of the clustering.
  3. Users can manually pick an edge-node in the cluster for the initial placement; if the user does not pick one, the system automatically assigns one, in particular based on the ResourceLabels required by the App. Picking a node is optional, just as ResourceLabels are optional, so the system may need to assign the App to a node automatically. Users may not know which device to pick, but they may know which resources the App needs.
  4. Users should specify whether this placement is allowed to Auto-Migrate onto a different edge-node in the cluster or is pinned to the initial node; when Auto-Migration is allowed, the controller needs to decide whether the App’s resource requirements match the candidate nodes.
  5. If Auto-Migration is allowed, users can pick a list of ‘ResourceLabels’ as requirements for the migration; otherwise, the App can be migrated onto any edge-node with available resources if the initial edge-node goes down. If the App has ResourceLabels configured, the controller may only show the eligible nodes for migration purposes. Since ResourceLabels are not in the first phases, users have no ResourceLabels to pick yet.
  6. The UUID of the Designated Node for the App (DNId) identifies the edge-node responsible for launching the application through kubernetes. For instance, when the initial placement of this App Instance is edge-node ‘A’, the DNId is included in the AppInstanceConfig API when the configuration is downloaded to edge-node ‘A’ (a sketch of building the per-node configurations follows this list).
  7. DNId is added to the AppInstanceConfig API; see section ‘API Changes for Clustering’.
  8. ‘AutoMigration’ and ‘NodeSelectLabels’ (not in first phases) also need to be downloaded to edge-nodes
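
A sketch of items 6-8 above, with assumed struct and function names: the controller builds a per-edge-node App configuration and marks the chosen node via designated_node_id (DNId). The distribution policy shown (all nodes get the App when ‘AutoMigration’ is set, only the designated node otherwise) mirrors the NI/VI handling described earlier and is an assumption for the App case.

package main

import "fmt"

// appConfig is a stand-in for the relevant parts of AppInstanceConfig.
type appConfig struct {
    AppUUID          string
    DesignatedNodeID string
    AutoMigration    bool
}

// buildPerNodeAppConfigs returns the App configuration to include in each
// edge-node's device configuration.
func buildPerNodeAppConfigs(appUUID, designated string, nodes []string, autoMigrate bool) map[string]appConfig {
    out := make(map[string]appConfig)
    for _, n := range nodes {
        if !autoMigrate && n != designated {
            continue // pinned App: only the designated node receives it
        }
        out[n] = appConfig{AppUUID: appUUID, DesignatedNodeID: designated, AutoMigration: autoMigrate}
    }
    return out
}

func main() {
    cfgs := buildPerNodeAppConfigs("app-1", "node-a", []string{"node-a", "node-b", "node-c"}, true)
    fmt.Println(cfgs)
}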


On-Site App Instance Deployment


Cluster Application Deployment Diagram


Cluster Application Migrate (Manual)


Controller Application Manual Migrate

  1. Manual migration can be applied to the Apps with ‘AutoMigration’ set or unset.
  2. Users can decide to migrate the Application from one edge-node to another. Similar to the initial placement of the App, users may manually pick one edge-node, or let the system pick automatically based on ResourceLabels of the App.
  3. The ‘DNId’ string is changed from the previous edge-node to the newly selected edge-node
  4. Download the modified configuration to all the edge-nodes in the cluster

On-Site Application Manual Migrate

Cluster Application Migrate (Automatic)

Controller Application Automatic Migrate

  1. Controller monitors the info messages from all the edge-nodes in the cluster
  2. When the Application has ‘AutoMigration’ set and the Controller gets a report from an edge-node in the cluster that it is now handling the running of the application, it needs to change the ‘DNId’ from the previous edge-node to this new edge-node. The Controller may need to verify that the previous edge-node is unreachable, or receive an info message from the previous edge-node reporting that the application is no longer running on it (e.g. the edge-node is running in a low-memory condition). A sketch of this handling follows this list.
  3. The ZInfoApp message adds a ‘ClusterAppRunning’ field, or a new info message can be defined for this purpose.
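
A sketch of item 2 above, using illustrative types for the controller’s internal state: when a node other than the current DNId reports that it is running the App, the controller only moves the DNId after confirming the previous node is unreachable or has reported the App stopped.

package main

import "fmt"

// appState is a stand-in for the controller's record of a deployed App.
type appState struct {
    AppUUID          string
    DesignatedNodeID string
}

// onClusterAppRunning handles a ZInfoApp report with cluster_app_running=true
// from reportingNode. previousNodeGone should be true when the current DNId
// node is unreachable or has reported that it no longer runs the App.
func onClusterAppRunning(app *appState, reportingNode string, previousNodeGone bool) bool {
    if reportingNode == app.DesignatedNodeID {
        return false // the designated node is running the App, nothing to do
    }
    if !previousNodeGone {
        return false // do not move the DNId while the old node still claims the App
    }
    app.DesignatedNodeID = reportingNode
    return true
}

func main() {
    app := &appState{AppUUID: "app-1", DesignatedNodeID: "node-a"}
    fmt.Println(onClusterAppRunning(app, "node-b", true), app.DesignatedNodeID)
}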


On-Site Application Automatic Migrate


Cluster Application Migration Diagram


Adding new node to existing cluster

The following process is used to add a new node to an existing cluster. This is true for any node type:


Controller add or delete node to existing cluster


We need to plan for nodes to be added or deleted, in particular for the case where one of the nodes in the cluster needs to be replaced.

On-Site add or delete node to existing cluster

Cluster Metrics and Status

Every edge-node in the cluster will report its own App Metrics for the VMIs or Pods running on the edge-node. The Controller can aggregate some of the metrics into the cluster metrics display.
If there is a need for new kubernetes-cluster-specific metrics for the cluster display, we can add those and have the ‘Seed-Server’ report them. Each edge-node can also report its own view of the cluster status (this is useful if the Seed-Server is down or the local network is partitioned).
The UI and ZCli need to be able to query the cluster status, which probably should include this list:


Volume Instance Metrics and Status


Longhorn storage tracks the state of a volume with the following values:


-    1 creating
-    2 attached
-    3 detached
-    4 attaching
-    5 detaching
-    6 deleting

and the “robustness” of a volume with one of these values:

-    0 unknown 
-    1 healthy
-    2 degraded ( one or more replicas are offline )
-    3 faulted ( volume is not available )


These two values should be read from the Longhorn metrics and possibly used to calculate the runstate of a volume instance. The last transition time of the volume robustness should be shown to convey the urgency of a degraded volume: for example, a volume that has been replicating towards healthy for 1 hour is not as large a worry as one that has been degraded for a week. A sketch of such a mapping follows.
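
A sketch of folding the two Longhorn values above into a volume-instance runstate for display; the runstate names and the mapping itself are illustrative, while the numeric values mirror the lists above.

package main

import (
    "fmt"
    "time"
)

// Longhorn volume state values (from the list above).
const (
    stateCreating  = 1
    stateAttached  = 2
    stateDetached  = 3
    stateAttaching = 4
    stateDetaching = 5
    stateDeleting  = 6
)

// Longhorn volume robustness values (from the list above).
const (
    robustnessUnknown  = 0
    robustnessHealthy  = 1
    robustnessDegraded = 2
    robustnessFaulted  = 3
)

// volumeRunstate folds state and robustness into one display string and keeps
// the last robustness transition time so the UI can show how long a volume has
// been degraded.
func volumeRunstate(state, robustness int, lastTransition time.Time) string {
    switch {
    case robustness == robustnessFaulted:
        return "ERROR (volume not available)"
    case robustness == robustnessDegraded:
        return fmt.Sprintf("DEGRADED since %s (one or more replicas offline)", lastTransition.Format(time.RFC3339))
    case state == stateAttached && robustness == robustnessHealthy:
        return "ONLINE"
    case state == stateCreating || state == stateAttaching:
        return "INITIALIZING"
    default:
        return "OFFLINE"
    }
}

func main() {
    fmt.Println(volumeRunstate(stateAttached, robustnessDegraded, time.Now().Add(-time.Hour)))
}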

API Changes for Clustering

Edge-Node API

This is an example of the new ‘cluster’ API, which is part of the EVE device configuration API.


message EdgeDevConfig {
...
    // cluster configuration
    EdgeNodeCluster cluster;
}

message EdgeNodeCluster {
    // cluster name, in case there are multiple clusters on the same site
    string cluster_name;
    // cluster UUID
    string cluster_id;
    // Cluster-Interface, for example "eth1"
    string cluster_interface;
    // The ‘cluster-prefix’ IP address of the ‘Cluster-Interface’, e.g. 10.244.244.2/28
    string cluster_ip_prefix;
    // This device is an ‘Agent’ node
    bool Is_agent;
    // Server IP address to join the cluster. E.g. 10.244.244.1
    string join_server_ip;
    // encrypted token string, use edge-node TPM to decrypt
    org.lfedge.eve.common.CipherBlock encrypted_cluster_token;
}
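
For illustration, a hedged sketch of how an edge-node might translate this configuration into k3s startup arguments after decrypting the token with the TPM. The k3s flags (--server, --token, --node-ip, --cluster-init) are real, but how EVE actually invokes k3s, and the assumption that the seed-server recognizes itself by its cluster IP matching join_server_ip, are illustrative only.

package main

import (
    "fmt"
    "strings"
)

// edgeNodeCluster mirrors the fields of the EdgeNodeCluster message above,
// with the token already decrypted.
type edgeNodeCluster struct {
    ClusterIPPrefix string // e.g. "10.244.244.2/28"
    IsAgent         bool
    JoinServerIP    string // e.g. "10.244.244.1"
    Token           string
}

// k3sArgs builds the k3s command-line arguments for this node.
func k3sArgs(c edgeNodeCluster) []string {
    nodeIP := strings.Split(c.ClusterIPPrefix, "/")[0]
    joinURL := fmt.Sprintf("https://%s:6443", c.JoinServerIP)
    switch {
    case c.IsAgent:
        return []string{"agent", "--server", joinURL, "--token", c.Token, "--node-ip", nodeIP}
    case nodeIP == c.JoinServerIP:
        // Seed-server: initialize the cluster (embedded etcd) instead of joining.
        return []string{"server", "--cluster-init", "--token", c.Token, "--node-ip", nodeIP}
    default:
        return []string{"server", "--server", joinURL, "--token", c.Token, "--node-ip", nodeIP}
    }
}

func main() {
    fmt.Println(k3sArgs(edgeNodeCluster{ClusterIPPrefix: "10.244.244.2/28", JoinServerIP: "10.244.244.1", Token: "<token>"}))
}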

App Instance API

This is an example of the change of the ‘AppInstanceConfig’ API


message AppInstanceConfig {

    // The UUID of the edge-node that is the Designated Node for the Application
    string designated_node_id;

}

Volume API

This is an example of the change of the ‘Volume’ API


message Volume {

    // Informs the edge-node receiving this Volume whether it is
    // responsible for creating the volume and converting the PVC, or not
    string designated_node_id;
}

App Info Message API

This is an example of the change of the ‘Info’ API


enum ZInfoClusterNodeStatus {
   Z_INFO_CLUSTER_NODE_STATUS_UNSPECIFIED;
   Z_INFO_CLUSTER_NODE_STATUS_READY;          // cluster reports our node is ready
   Z_INFO_CLUSTER_NODE_STATUS_NOTREADY;  // cluster reports our node is not ready
   Z_INFO_CLUSTER_NODE_STATUS_DOWN;         // cluster API server can not be reached
}

message ZInfoClusterNode {
    ZInfoClusterNodeStatus node_status;
}

message ZInfoMsg {
    oneof InfoContent {
       ...
       ZInfoClusterNode cluster_node;
    }
}

message ZInfoApp {

    // The App in cluster mode is currently running on this edge-node
    bool cluster_app_running;
}
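
A sketch of when an edge-node might set the new cluster_app_running flag: only when the App’s workload is currently scheduled onto this node. getScheduledNode is a hypothetical stand-in for querying the local kubernetes API for the node running the App’s pod or VMI.

package main

import "fmt"

// getScheduledNode is a placeholder; in EVE it would ask kubernetes which node
// currently runs the pod or VMI backing the App.
func getScheduledNode(appUUID string) (string, error) {
    return "node-b", nil
}

// clusterAppRunning reports whether this edge-node should set
// cluster_app_running=true in its ZInfoApp message for the App.
func clusterAppRunning(appUUID, myNodeName string) (bool, error) {
    node, err := getScheduledNode(appUUID)
    if err != nil {
        return false, err
    }
    return node == myNodeName, nil
}

func main() {
    running, _ := clusterAppRunning("app-1", "node-b")
    fmt.Println(running)
}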