Status: Complete

Sponsor User: IBM

Date of Submission:  

Submitted by: David Booz (booz@us.ibm.com)

Affiliation(s): IBM

<Please fill out the above fields, and the Overview, Design and User Experience sections below for an initial review of the proposed feature.>

Scope and Signoff: (to be filled out by Chair)

Overview

<Briefly describe the problem being solved, not how the problem is solved, just focus on the problem. Think about why the feature is needed, and what is the relevant context to understand the problem.>

...

Code Block
{
  "owner":"",                // The user that created this resource
  ''"label": "",               // (Optional) A short description of the policy
  "description": "",         // A(Optional) A much longer description of the policy
  "constraints": [""],       // (Optional) Typical constraint expression used to select nodes
  "properties": [{}],        // (Optional) Typical property expressions used by nodes to select policy 
  "patterns": [""],          // (Optional) This policy applies to nodes using one of these patterns
  "enabled": boolean,        // Is this policy enabled or disabled, default false == disabled
  "agentUpgradePolicy": {},  // (Optional) Assertions on how the agent should update itself
  "lastUpdated":"<time-stamp>"
}

...


The "pattern" field is NOT mutually exclusive with "properties" and "constraints". This allows a single NMP to apply to nodes which use patterns or policies. The patterns list may contain the string "*", indicating that the NMP applies to any node that is using a pattern.

Because patterns can be public, the pattern array can contain org qualified pattern names. That is, the format of the strings in the patterns field is org/pattern-name, where the character '/' is used as a separator. If org is omitted, the org of the NMP is the default org. (Ling: the nodes with patterns can also have node policies, so the patterns attribute feels redundant here; we could put "pattern == blah" in the constraints as a special constraint, with pattern as a special property for a node.)

The constraints field contains the same text based constraint language found in deployment, service, model and node policy constraints. The same language parser and therefore the same syntax is supported. The properties field contains policy properties, using the same syntax as deployment, service, model and node policies. The properties and constraints are used to determine which policy based nodes are subject to the intent of the NMP. This determination includes the same bi-directional policy compatibility calculations used for deployment, service, model and node policy.
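
To make the schema and the selection fields concrete, here is a hypothetical NMP; the org names, property name, constraint expression and timestamps below are illustrative assumptions only, not values defined by this design:

Code Block
{
  "owner": "myorg/admin1",
  "label": "Upgrade test nodes",
  "description": "Upgrade the agent on policy-based test nodes and on nodes using the sample pattern",
  "constraints": ["purpose == agent-testing AND location == us-east"],
  "properties": [{"name": "nmp.team", "value": "edge-ops"}],
  "patterns": ["e2edev/pat-sample"],
  "enabled": true,
  "agentUpgradePolicy": {},  // see the Agent Upgrade Policy section for the contents of this object
  "lastUpdated": "2021-05-01T12:00:00Z"
}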

An important new concept with NMP is the ability to administratively enable and disable the policy. This allows an NMP to be published but not acted on by the agent until the policy is enabled. Retrofitting this concept into deployment policy is not within the scope of this design.


??? Question: Is there a flavor of NMP that deactivates itself after it is complete? How do we know when an NMP is complete? Do we need to know? ???

...

In a large scale environment with thousands of nodes, an enabled NMP with empty constraints and/or patterns set to "*" could be quite disruptive to the management hub or even the entire system of nodes by attempting to upgrade all agents at approximately the same time. Publishing a NMP that meets these criteria requires the user to confirm that they understand the potentially disruptive nature of the policy before the NMP is published in the exchange.

Note that a default NMP is equivalent to having no NMP at all because the default for the enabled field is false.

NMP Constraint Collisions

Given the flexibility of policy within OH, it is possible that NMPs with conflicting actions could be compatible with the same node(s). There are a myriad of circumstances in which this could occur. Within this design, the conflict could occur in relation to the atLeastVersion setting of the agent update policy. Features (in the future) which add additional policy to a NMP could create additional conflicting situations. Two or more NMPs cannot be in conflict if they are enacting different kinds of policy. This is important for the future when additional management policy kinds are added.

The agent is responsible for interpreting each NMP and acting on it accordingly, therefore an agent is only concerned with the NMPs that apply to it. If there are any conflicts, the agent will not enact the conflicting policy action. For example, NMP1 enables policy intent1 at time t1, NMP2 enables policy intent2 at time t2, and NMP3 enables policy intent1 at time t3 with a policy setting that conflicts with NMP1. For the purposes of conflict resolution, the only potential conflicts are between NMP1 and NMP3. Agents will implement the policy intents in NMP1 and 2. NMP3 is ignored because NMP3 conflicts over intent1 and was added after NMP1 was already enacted. 

It is possible that conflicting actions could be introduced in close temporal proximity. This could result in actions that are immediately reversed. The agent should be prepared to handle these cases. For example, NMP1 and NMP2 are introduced/enabled a few seconds apart and both adjust the same policy setting (e.g. autoUpgrade). Some of the nodes that match NMP1 and NMP2 will be notified of both NMP1 and NMP2 at the same time. These nodes will enact the actions in NMP2, completely ignoring NMP1. Other nodes might only be notified of NMP1, and will enact the policy in NMP1. Very soon after that, these same nodes are notified of NMP2, and recognizing that NMP2 is newer than NMP1, will enact the action in NMP2, reversing what was done when NMP1 was enacted.

Here are the changes that might result in a management policy collision:

  1. new/changed node policy - this can occur at node registration (which is essentially creation of a node policy) or when there are changes to an existing node policy
  2. enablement of an existing NMP - an NMP can be published in the disabled state
  3. new/changed NMP that is also enabled

Agent Upgrade Policy

There are several discrete lifecycle steps associated with upgrading the agent software and/or configuration to a new version:

  • a stimulus to begin the process
  • download the new version
  • verify a hash of the downloaded package
  • install the new package (if there is one)
  • update any new or changed configuration (if any)
  • restart the agent process
  • report the new (or failed) version info to the node/status resource in the exchange

This is the implied lifecycle for a new kind of policy called "agentUpgradePolicy" defined within a NMP. 

For any given node, the stimulus to initiate the "agentUpgradePolicy" occurs when a NMP is successfully enabled (or created already enabled) that matches the node (policy or pattern). The node self-discovers this stimulus (through the exchange notification framework, i.e. the /changes API) and initiates the policy actions on its own. Packages are downloaded from the hub (CSS), the package hash is verified, and the package is installed and/or the config is updated. The agent is restarted (see below) and updates its new package version in its exchange node/status resource.

The "agentUpgradePolicy" section in a NMP contains the specifics of the agent upgrade intent being expressed by the administrator, and conforms to the following JSON snippet:
"agentUpgradePolicy": {
      "autoUpgrade":  boolean, // Do the selected nodes auto upgrade or not, default false == no auto upgrade
      "atLeastVersion": "<version> | current", // specify the minimum agent version these nodes should have, default "current"
      "start": "<RFC3339 timestamp> | now", // when to start an upgrade, default "now"
      "duration": seconds // enable agents to randomize upgrade start time within start + duration, default 0
},

The autoUpgrade flag instructs the agent whether or not to actively (periodically) check for new updates. Checking for new updates is performed based on the node's heartbeat configuration.

The atLeastVersion field indicates the minimum version the agents should upgrade to when instructed. Enabling autoUpgrade implies that atLeastVersion is set to current, therefore any value other than current is invalid; the atLeastVersion field may be omitted when autoUpgrade == true. When atLeastVersion is set to current, the agent will periodically check for new versions and automatically upgrade to the newest version.

The start field contains an RFC3339 timestamp indicating when the upgrade should start. The timestamp is in UTC format so that affected nodes anywhere in the world are acting at approximately the same time. When autoUpgrade == true, the start field is ignored.

 The duration field is specified in seconds, and is added to the start time to form a window of time within which the agent will initiate an update. Each agent will randomly select a point within that window to begin the upgrade process. This prevents all affected agents from upgrading at the same time, possibly overwhelming the management hub. The combination of selecting nodes based on properties and constraints plus the start time and duration is intended to enable org administrators sufficient control over mass upgrades to prevent overwhelming the management hub. The duration field is ignored when autoUpgrade == true.
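
As an illustration of how these fields combine, the following hypothetical snippet (the version and timestamp are invented for this sketch) asks matching agents to upgrade to at least version 2.28.0, starting no earlier than 03:00 UTC, with each agent choosing a random start point within the following hour:

Code Block
"agentUpgradePolicy": {
      "autoUpgrade": false,              // do not keep polling for newer versions after this upgrade
      "atLeastVersion": "2.28.0",        // invented version, for illustration only
      "start": "2021-06-01T03:00:00Z",   // RFC3339 timestamp in UTC
      "duration": 3600                   // each agent starts at a random offset within the 1 hour window
},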

If an agent is instructed to upgrade to a version which it is already running, it will not perform the upgrade.

If an agent detects a policy conflict, it will log the conflict in its event log.

If an agent upgrade fails, the agent will be rolled back to the previous version. The failure status will be recorded in the node's /status resource in the exchange, and the upgrade will not be attempted again until a newer version of the package is made available. 
(Ling: rolling back is not supported yet in agent-install)

??? Can agents be instructed to downgrade ???
(Ling: downgrade is not supported yet in agent-install)

In the absence of any enabled NMPs, an agent never checks for an update.

Agent Version Status

The agent will maintain its version in the exchange, in its /status resource. Following is the JSON encoded pseudo-schema for the new version status:

"agentVersion": {
      "package":{}, // The version(s) of the most recent package installed on the agent
      "failed":"<version>", // The last version upgrade failed on this version, or empty
      "upgradeStart":"<RFC3339 timestamp>", // The time (in UTC) when the last upgrade started
      "upgradeEnd":"<RFC3339 timestamp>", // The time (in UTC) when the last upgrade ended
}

The package field contains the JSON object from the package manifest of the most recently successful agent install. See the "Agent package" section.
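
For illustration only (all versions and timestamps below are invented), a node that successfully upgraded might report:

Code Block
"agentVersion": {
      "package": {
            "version": "2.28.0",
            "packagesVersion": "2.28.0",
            "sslVersion": "1.0.0",
            "configVersion": "1.0.0"
      },
      "failed": "",                            // empty because the last upgrade succeeded
      "upgradeStart": "2021-06-01T03:12:45Z",
      "upgradeEnd": "2021-06-01T03:14:02Z"
}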

Agent package:

BP: I think it is important for this and agent-install.sh to work from the same files, because they are doing essentially the same thing.

An updated package downloaded by the agent from the management hub may contain any or all of the following:

  1. Package manager specific packages for updating agent binaries
  2. Management hub URLs
  3. SSL certificate for the management hub

The package will be structured as follows. It is stored as a gzipped tar file in the management hub, where the tar file has the following directory structure:
/packages - Contains the binary packages for the node's package manager, e.g. debs, rpms, etc.
/ssl - Contains a PEM encoded certificate, similar to the agent-install.crt file.
/config - Contains the env var settings to bootstrap the node to the hub, similar to the agent-install.cfg file.

Any of these directories may be omitted or empty, in which case the agent will simply skip the directory and move on to the next one.

At the root of the package is a metadata file called the package manifest; it is named package.json. It contains a JSON encoded object defining the contents of the package:
{
  "version: "<semantic-version>",
  "packagesVersion": "<semantic-version>",
  "sslVersion":"<semantic-version>",
  "configVersion":"<semantic-version>"
}

The version field corresponds to the version of the overall package. The other version fields correspond to the version of the contents of the similarly named directory in the package. The version field is required, the other version fields are optional and are used strictly for documentation. The agent makes use of ONLY the version field to determine if the package is different from the agent's current version.

The version field is a semantic version, based on the standard 3-part dotted decimal syntax, e.g. 1.0.2.
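
A hypothetical filled-in manifest (versions invented) for a package that carries new binaries and a refreshed certificate, but no new config, might look like the following; note that the optional per-directory version fields are documentation only:

Code Block
{
  "version": "2.28.0",          // the only field the agent compares against its current version
  "packagesVersion": "2.28.0",  // documents the contents of /packages
  "sslVersion": "1.1.0"         // documents the contents of /ssl; configVersion omitted because /config is absent
}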

Agent packages can be added to the management hub at any time, but will not be acted upon by any registered agent until an NMP is enabled that matches the agent.

Agent packages are stored in the MMS, specifically in the CSS in the management hub. The MMS object which defines the package has a version field which MUST be the same as the version of the package.

Agent packages all reside within a single organization within the management hub. It does not matter which org because the packages are declared to be public objects in the MMS, meaning that they can be downloaded by an agent in any org. A NMP (which is org specific) initiates the deployment of agent packages; it is not the act of uploading a new package that causes existing agents to update themselves.

Agent packages have an object type of "openhorizon.agent-package", to discriminate them from other objects in the MMS.
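
As a sketch only, publishing such a package might use MMS object metadata along the following lines; the exact metadata field set shown here is an assumption for illustration and is not defined by this design:

Code Block
{
  "objectID": "agent-packages-2.28.0",       // invented object name
  "objectType": "openhorizon.agent-package",
  "version": "2.28.0",                       // MUST match the version field in package.json
  "description": "Agent upgrade package",
  "public": true                             // downloadable by agents in every org
}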

Agent packages can be created using any tools the user chooses, as long as the package conforms to the design described above. The hzn CLI is extended to provide some simple commands to help the user get started.

hzn [agentpackage | ap] [new | list | publish]

...describe details...

Edge Clusters

Agents running in an edge cluster are running within a container. The container's guest OS and package manager will be used to upgrade the agent binaries. The agent will upgrade itself within the container, without terminating the container.

??? Still deciding on this. The alternative is to have the agent installed via an operator (which we don't have today) and allow the operator to do the upgrade. Perhaps this is a better way? The agent process would have to communicate with the operator (through a CRD) in order to trigger an upgrade. ???

??? Is there a case where we would need to update the container in some way that is different from updating the agent binary or the agent config ???

Agent in Linux Container

The existing upgrade script depends on pulling the agent image from the Open Horizon (OH) Docker Hub organization.
...tbd...

Additional management policy:
Thinking forward a bit, NMP could be extended with the following schema to support additional scenarios:

  1. The ability to scale to hundreds of millions of nodes will require multiple management hubs working together in the horizontal scaling sense (there will be limits to the vertical scale of a single management hub). This will require that an agent communicates with multiple management hubs.
  2. The ability to configure a failover hub.
  3. The ability for an agent to upgrade a top level service's dependencies without a new agreement.
  4. The ability for nodes to self align into HA groups, ensuring that at least one of them is always running a stable version of edge services, even when some of those services are being upgraded to new versions.

Here are some schema examples:

Code Block
{
  "service-dependency-upgrade": { // enable service dependencies to upgrade without a new agreement
    "enable": boolean
  },
  "management-hub": {             // management hub configuration, if the primary doesn't work, use an alternate
    "primary": {
      "exchange": "<url>",
      "mms": "<url>",
      "tls-cert": "<url>"
    },
    "alternate": [                // optional
      {
        "exchange": "<url>",
        "mms": "<url>",
        "tls-cert": "<url>"
      }
    ]
  },
  "ha-group": {                            // establish a group of nodes that are working together, service upgrades happen one at a time
    "partners": ["<node-id>|<node-name>"], // An explicit list of partner nodes, XOR
    "constraint": [""]
  }
}

User Experience

<Describe which user roles are related to the problem AND the solution, e.g. admin, deployer, node owner, etc. If you need to define a new role in your design, make that very clear. Remember this is about what a user is thinking when interacting with the system before and after this design change. This section is not about a UI, it's more abstract than that. This section should explain all the aspects of the proposed feature that will surface to users.>

As a hub admin, I want to install updated agent packages into the management hub so that nodes in any org can receive agent updates.

As an org admin, I want to declare that some or all nodes in my org will automatically upgrade the agent when a new version becomes available in the management hub.

As an org admin, I want to declare that some or all nodes in my org will never automatically upgrade the agent.

As an org admin, I want to declare that some or all nodes in my org should start upgrading at a given time.

As an org admin, I want to declare that some or all nodes in my org should upgrade to a specific version.

As an org admin, I want to know the agent version of any given node or group of nodes in my org.

As a hub admin, I want to change the self signed SSL certificate used for TLS between agents and the hub.

As a hub admin, I want to declare the default agent upgrade behavior for all nodes in the org, when creating a new org.

As an org admin, I want to change the default agent upgrade behavior for all nodes in the org, after the org has been created.

As a node owner, I want to override the org's agent upgrade behavior so that I can control when to upgrade the agent on my node.

As an org admin, I want to get a list of nodes that my NMP applies to.


Some additional, but related use cases to ponder. These are NOT covered in detail by this design document:

As an org admin, I want some or all of the nodes in my org to upgrade dependent services.

As a cluster admin, I want to change one of the URLs that agents use to communicate with the hub.

As a node owner, I want to declare which nodes are part of an HA group.

Command Line Interface

<Describe any changes to the hzn CLI, including before and after command examples for clarity. Include which users will use the changed CLI. This section should flow very naturally from the User Experience section.>

hzn exchange [ nm | nodemanagement ] [ list | new | publish | remove ]

NOTE: need a verb to ask whether all nodes have implemented a given NMP policy.

For the hzn exchange nm publish command, when the user is trying to publish an NMP with empty constraints and/or patterns == "*", they are required to confirm that they understand the potentially disruptive nature of the policy.

External Components

<Describe any new or changed interactions with components that are not the agent or the management hub.>

None.

Affected Components

<List all of the internal components (agent, MMS, Exchange, etc) which need to be updated to support the proposed feature. Include a link to the github epic for this feature (and the epic should contain the github issues for each component).>

Agent

Exchange (new resource and updates of that resource surfaced through /changes API)

Security

<Describe any related security aspects of the solution. Think about security of components interacting with each other, users interacting with the system, components interacting with external systems, permissions of users or components>

...