<Please fill out the Overview, Design and User Experience sections for an initial review of the proposed feature.>

Overview

Open Horizon generally treats nodes as entities with an independent lifecycle, apart from all other nodes. But there are use cases, such as using sensors to monitor critical systems, where redundant monitoring must be in place so that at least one monitoring agent is always operating. This is similar to the principles of high availability (HA) and continuous availability (CA) commonly found in IT systems. Open Horizon already contains a little-known feature, called HA Groups, that enables nodes to be associated as a group such that at least one copy of a service deployed to the group is always running. Further, Open Horizon ensures that when services are upgraded, the upgrade is rolled across the members of a node group such that at least one copy of the service is always running.

One of the problems with the existing HA Group support is that it is not dynamic. Nodes must be added to a group as part of registering them with the management hub. Nodes cannot be removed from a group. Nodes cannot be added to an existing group without unregistering the entire group and registering again with the new group members. Node registration should happen once for the lifetime of the node; a node should never need to be unregistered unless it is being decommissioned.

Design

The design proposes to enhance the current concept of HA node groups by enabling organization administrators to create HA/CA node groups at any time in the lifecycle of a node. Further, the design proposes to loosen the current restriction that all services deployed to a node in an HA/CA group are deployed on all nodes in the group, enabling the use of heterogeneous node equipment within a group. The following key design constraints define the new HA/CA node group concept:

Nodes in an HA group MAY have different node policies.
Adding a node to an HA Group MUST NOT terminate/restart running services.
Nodes MAY be placed into an HA Group after node registration.
A node MUST be in 0 or 1 HA Groups. A node MUST NOT be in more than 1 HA Group.
A node specifies the other nodes in its HA Group by Id. All nodes in an HA Group MUST specify all the other nodes in the group. Open question: can we get the Exchange to enforce this part of the model?
A user MUST have permission to modify all the nodes (resources) in an HA Group in order to form the group.
A service that is deployed to all the nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the service.
The node agent on nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the agent itself.
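
The membership constraints above lend themselves to a mechanical check. The following is a minimal sketch in Python, assuming a hypothetical in-memory mapping from each node Id to the set of peers that node declares; the function name and data shape are illustrative and are not part of the Exchange. Note that requiring every member to list all other members also implies that a node can be in at most one group.

```python
from typing import Dict, List, Set


def validate_ha_membership(declared: Dict[str, Set[str]]) -> List[str]:
    """Return a list of constraint violations (empty if the model is valid).

    `declared` maps a node Id to the set of OTHER node Ids it lists as group
    peers. An empty set means the node is in no HA Group. Hypothetical model.
    """
    problems: List[str] = []
    for node, peers in declared.items():
        if node in peers:
            problems.append(f"{node} lists itself as a peer")
        for peer in peers:
            if peer not in declared:
                problems.append(f"{node} lists unknown node {peer}")
                continue
            # Both nodes must describe the same group: each node's view is
            # itself plus the peers it declares.
            if declared[peer] | {peer} != peers | {node}:
                problems.append(f"{node} and {peer} disagree on group membership")
    return problems
```

A check of this kind could run wherever the model is enforced; whether that is the Exchange is the open question raised in the list above.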

Agbot

The Agbot is responsible for ensuring that all deployment policies have been checked for compatibility against all nodes in the system, and making adjustments to the deployed state of services as node policy, deployment policy and service policy change over time. The Agbot already has support for ensuring that service upgrades are performed in a rolling fashion across an HA Group. The existing support will have to be augmented in the following ways:

HA Group membership will be obtained from the node's hagroup resource in the Exchange. The existing HA support obtains this info from an internal representation of the node (in the code it's called the producer policy, and the info is also saved in the Agbot's agreement object in the DB). The HA Group membership should be removed from this internal representation and obtained from the node's hagroup resource instead. The hagroup resources will also need to be added to the resource cache in the Agbot.
The existing HA support assumes that ALL services running on a node in an HA Group are supposed to be running on ALL nodes in the group. This assumption is no longer true with this design. The Agbot needs to perform some additional checking (before attempting a rolling upgrade) that the service being upgraded is intended to be running on all nodes in the group. A service is intended to be running on a node if the node policy is compatible with the service's policy and all deployment policies that reference the service.
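
The additional check described above can be sketched as follows. This is an illustration only: `is_compatible` is a hypothetical callback standing in for the Agbot's real policy compatibility logic, and policies are reduced to opaque strings.

```python
from typing import Callable, Iterable, List


def intended_on_node(
    node_id: str,
    service_policy: str,
    deployment_policies: List[str],
    is_compatible: Callable[[str, str], bool],
) -> bool:
    """A service is intended to run on a node if the node's policy is
    compatible with the service policy and with every deployment policy
    that references the service."""
    return is_compatible(node_id, service_policy) and all(
        is_compatible(node_id, dp) for dp in deployment_policies
    )


def eligible_for_rolling_upgrade(
    group: Iterable[str],
    service_policy: str,
    deployment_policies: List[str],
    is_compatible: Callable[[str, str], bool],
) -> bool:
    """True only if the service is intended to run on ALL nodes in the group."""
    members = list(group)
    # An empty group has nothing to roll across, so treat it as not eligible.
    return bool(members) and all(
        intended_on_node(n, service_policy, deployment_policies, is_compatible)
        for n in members
    )
```

When the function returns False, the service runs on only a subset of the group, so in this sketch the group-wide rolling upgrade handling would not apply to it.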

Exchange Changes API

When new resources are added to the system, the scope of change notification of those resources needs to be defined. The Agbot needs to be made aware of hagroup resource creation/update/deletion. Nodes do not need to be aware.

Agent Upgrades

In addition to HA/CA support for service software upgrades, agent upgrades also need to be performed in a rolling fashion across all the nodes in an HA/CA node group. Agents are responsible for upgrading themselves based on the node management policy (NMP) defined by the administrator, so no central entity coordinates agents within a group. The only entity in the system capable of assisting with the coordination is the Agbot. An agent in an HA Group will ask the Agbot whether it can start the agent upgrade process. If the Agbot agrees, it will record (in the database) that the calling node is performing an upgrade, including which NMP the node is processing. With multiple Agbot instances, the database is needed to ensure that concurrent calls from different agents receive the correct response (i.e. only one agent is allowed to proceed with the upgrade). Subsequent calls from other nodes in the group will result in those agents being told to pause the upgrade. It is the agent's responsibility to poll the Agbot until the Agbot agrees that the upgrade may proceed. This ensures that only one agent in an HA Group is upgrading at any point in time. The Agbot will use NMP status to know when a node upgrade has completed, allowing another node in the group to proceed.
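
The coordination described above amounts to an atomic "claim the upgrade slot" operation backed by the database. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in for the Agbot database. The table layout, the `ha_group` key, and the function names are assumptions made for illustration; the design itself does not define them.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the Agbot database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE upgrading ("
        " ha_group TEXT PRIMARY KEY,"  # one row per group: at most one upgrader
        " node_id TEXT NOT NULL,"
        " nmp TEXT NOT NULL)"
    )
    return conn


def request_upgrade(conn: sqlite3.Connection, ha_group: str, node_id: str, nmp: str) -> bool:
    """Return True if the node may proceed, False if it must pause and poll again."""
    with conn:  # commits on success, rolls back on error
        # The primary key makes the claim atomic: only the first caller's row
        # is stored, later callers are ignored until the row is deleted.
        conn.execute(
            "INSERT OR IGNORE INTO upgrading (ha_group, node_id, nmp) VALUES (?, ?, ?)",
            (ha_group, node_id, nmp),
        )
        row = conn.execute(
            "SELECT node_id, nmp FROM upgrading WHERE ha_group = ?", (ha_group,)
        ).fetchone()
    return row == (node_id, nmp)


def upgrade_completed(conn: sqlite3.Connection, ha_group: str, node_id: str) -> None:
    """Release the slot, e.g. once NMP status shows the node has finished."""
    with conn:
        conn.execute(
            "DELETE FROM upgrading WHERE ha_group = ? AND node_id = ?",
            (ha_group, node_id),
        )
```

In this sketch a repeated call from the node that already holds the slot keeps returning True, so an agent that polls again after a restart is not blocked by its own claim.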

User Experience

As an org admin, I want to place two or more nodes into an HA Group with rolling service upgrades so that my services will be continuously available.
As an org admin, I want to add a node to an HA Group without affecting currently deployed services.
As an org user, I want to place two or more nodes into an HA Group with rolling service upgrades so that my services will be continuously available.
As an org user, I want to add a node to an HA Group without affecting currently deployed services.
As a device owner, I want to place two or more nodes into an HA Group with rolling service upgrades so that my services will be continuously available.
As a service deployer, I want to deploy non-HA services to a subset of members in an HA Group to avoid compute resource consumption for services that don't need to be continuously available.
As a service deployer, I want to deploy a service ONLY to nodes in an HA Group.
As a device owner, I want rolling agent upgrades within my HA Group, so that my services (and nodes) will be continuously available.
As an org administrator, I want rolling agent upgrades within my HA Group, so that my services (and nodes) will be continuously available.

Command Line Interface

<Describe any changes to the hzn CLI, including before and after command examples for clarity. Include which users will use the changed CLI. This section should flow very naturally from the User Experience section.>

hzn exchange node hagroup create --nodeId node1 --nodeId node2 --nodeId node3 --force

hzn exchange node hagroup remove --nodeId node1

hzn exchange node hagroup list

hzn deploycheck --checkHA

External Components

<Describe any new or changed interactions with components that are not the agent or the management hub.>

None

Affected Components

<List all of the internal components (agent, MMS, Exchange, etc) which need to be updated to support the proposed feature. Include a link to the github epic for this feature (and the epic should contain the github issues for each component).>

Agent - Awareness of HAGroup membership for agent upgrade procedure

Agbot - Awareness of a node's HAGroup membership for making agreements

CLI - To list, add, remove nodes from an HA Group

Exchange - To hold the new HAGroup membership APIs

Security

None

APIs

<Describe any new/changed/deprecated APIs, including before and after snippets for clarity. Include which components or users will use the APIs.>

The following new APIs are introduced in this design. Any user in an org can use these APIs (or the corresponding CLI). Org users can only create, modify, or delete HA Groups containing nodes that the user has permission to modify.

The HAGroup object schema

{
  "members": [
    {
      "node": "node1234",
      "agent-upgrade": {
        .....coordination state....
      }
    }
  ]
}

Exchange APIs:

Create a new node group. The caller must have permission to modify all the nodes listed in the body (shown above). The Exchange will set this same object onto all the nodes listed in the body. The Exchange will return an error (409) if one of the nodes is already in an hagroup. If force=true is specified, the Exchange will set this membership onto all listed nodes and will remove listed nodes from any group they are already in.

POST /org/<org>/node/<node-id>/hagroup?force=true
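
The create semantics (409 on conflict, or reassignment when force=true) can be modeled in a few lines. This sketch assumes a hypothetical in-memory `membership` map in place of the Exchange's storage, and the `Conflict` exception stands in for the HTTP 409 response. What happens to an old group that is left with a single member is not specified by the design and is left untouched here.

```python
from typing import Dict, FrozenSet, Iterable


class Conflict(Exception):
    """Stands in for the HTTP 409 response."""


def create_group(
    membership: Dict[str, FrozenSet[str]],
    nodes: Iterable[str],
    force: bool = False,
) -> None:
    """Create a group from `nodes`.

    `membership` maps a node Id to the full member set of its group; a node
    that is absent from the map is in no group.
    """
    new_group = frozenset(nodes)
    already = sorted(n for n in new_group if n in membership)
    if already and not force:
        raise Conflict(f"already in an hagroup: {', '.join(already)}")
    # force=true: detach each listed node from its old group and update the
    # remaining members of that old group.
    for n in already:
        old_group = membership.pop(n)
        for peer in old_group - {n}:
            if peer in membership:
                membership[peer] = membership[peer] - {n}
    # Set the same membership onto every listed node.
    for n in new_group:
        membership[n] = new_group
```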

Modify the group membership of an existing group. All the desired members of the group MUST be listed in the body. This API behaves like a full replace. The force=true parameter has the same behavior as on POST.

PUT /org/<org>/node/<node-id>/hagroup?force=true
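
Because PUT is a full replace, any current member omitted from the body is removed from the group. A small sketch of that difference computation, with a hypothetical helper name, shows what a caller should expect:

```python
from typing import FrozenSet, Iterable, Tuple


def plan_replace(
    current: Iterable[str], desired: Iterable[str]
) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """Return (added, removed) node Ids for a full-replace membership update."""
    cur, want = frozenset(current), frozenset(desired)
    return want - cur, cur - want
```

For example, replacing a group of node1, node2 and node3 with a body listing node1, node3 and node4 adds node4 and removes node2.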

List all the members in an hagroup. This API returns exactly the same result when called on any member of the HA Group.

GET /org/<org>/node/<node-id>/hagroup

Build, Install, Packaging

None

Documentation Notes

Need an Overview of how HA Groups work on the OH doc site.

Need a new article describing how to use HA Groups, aimed at administrators and node owners. It could be the same article for both roles (but we might change our mind on this AFTER we have tried to write it).

Update or remove the HA documentation in the anax repo for the HA /attribute API, which is being removed.

Document the new CLI commands.

Test

<Summarize new automated tests that need to be added in support of this feature, and describe any special test requirements that you can foresee.>

Edge clusters are out of scope for this support. Edge clusters already natively support HA and CA, and therefore don't need any special assistance from Open Horizon.
Test service lifecycles with services deployed to HA groups that are homogeneous (all members have the same services) and heterogeneous (there is at least 1 service common to all members, but some services are only running on a subset of members).
Performance test of service upgrades with and without HA groups in the system.
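
To build fixtures for the homogeneous and heterogeneous cases above, a test helper along the following lines could label a group from its per-node service sets. The function name and return labels are hypothetical.

```python
from typing import Dict, Set


def classify_group(services_by_node: Dict[str, Set[str]]) -> str:
    """Label an HA Group by how its services are spread across members."""
    service_sets = list(services_by_node.values())
    if not service_sets:
        raise ValueError("group has no members")
    common = set.intersection(*service_sets)  # services on every member
    union = set.union(*service_sets)          # services on any member
    if common == union:
        return "homogeneous"      # all members run the same services
    if common:
        return "heterogeneous"    # some shared, some on a subset only
    return "disjoint"             # no service is common to all members
```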
