<Please fill out the Overview, Design and User Experience sections for an initial review of the proposed feature.>

Overview

Open Horizon generally treats nodes as entities with an independent lifecycle, apart from all other nodes. But there are use cases, such as using sensors to monitor critical systems, where it is important to have redundant monitoring in place so that there is always at least one monitoring agent operating. This is similar to the principles of high availability and continuous availability that are commonly found within IT systems. Open Horizon already contains a little-known feature, called HA Groups, that enables nodes to be associated with each other in a group such that at least 1 copy of a service deployed to the group is always running. Further, Open Horizon ensures that when services are upgraded, the upgrade is rolled across all the members of a node group such that at least 1 copy of the service is always running.

The problem with the existing HA Group support is that it is not dynamic. Nodes must be added to a group as part of registering them with the management hub. Nodes cannot be removed from a group. Nodes cannot be added to an existing group without unregistering the entire group and registering again with the new group member. Node registration is something that happens once for the lifetime of the node. A node should not need to be unregistered unless it is being decommissioned.
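To make the rolling-upgrade guarantee described in this overview concrete, the following is a minimal sketch of upgrading group members one at a time so that at least 1 copy of the service stays up. It is illustrative only and is not the agent's actual implementation: the function name, the `running` map, and the `upgrade_node` callback are all hypothetical.

```python
from typing import Callable, Dict, List


def rolling_upgrade(
    running: Dict[str, bool],
    upgrade_node: Callable[[str], bool],
) -> List[str]:
    """Upgrade HA group members one at a time (hypothetical sketch).

    running maps node id -> whether the service is currently running there.
    upgrade_node is an assumed callback that upgrades a single node and
    returns True if the service came back up afterwards.
    Returns the ids of the nodes that were upgraded successfully.
    """
    upgraded: List[str] = []
    for node_id in sorted(running):
        # Only take a node down if another member is still running the service.
        others_up = any(up for other, up in running.items() if other != node_id)
        if not others_up:
            break
        running[node_id] = False  # the service is down while this node upgrades
        running[node_id] = upgrade_node(node_id)
        if not running[node_id]:
            # Stop the roll; the remaining nodes keep running the old version.
            break
        upgraded.append(node_id)
    return upgraded
```

With three healthy nodes, each upgrade step in this sketch leaves two copies of the service running. If an upgrade fails, the roll stops and the nodes not yet touched continue to provide the service.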

Design

A few design principles to get started:

Nodes in an HA group MAY have different node policies.
Adding a node to an HA Group MUST NOT terminate/restart running services.
Nodes MAY be placed into an HA Group after node registration.
A node MUST be in 0 or 1 HA Groups. A node MUST NOT be in more than 1 HA Group.
A node specifies the other nodes in its HA Group by Id. All nodes in an HA Group MUST specify all the other nodes in the group. Open question: can the Exchange enforce this part of the model?
A user MUST have permission to modify all the nodes (resources) in an HA Group in order to form the group.
A service that is deployed to all the nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the service.
The node agent on nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the service and model deployment capability.
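
The membership rules above (a node is in 0 or 1 HA Groups, and every member lists every other member) could be checked along the following lines. This is a sketch under stated assumptions, not the Exchange's actual data model or API: the `peers` mapping from node Id to the list of partner node Ids, and the function name, are hypothetical.

```python
from typing import Dict, List, Set


def validate_ha_groups(peers: Dict[str, List[str]]) -> List[str]:
    """Return a list of rule violations; an empty list means consistent.

    peers (assumed shape) maps a node id to the ids of the other nodes it
    names as HA partners. An empty list means the node is in no HA Group.
    """
    errors: List[str] = []
    for node, listed in sorted(peers.items()):
        # The group as this node sees it: its partners plus itself.
        group: Set[str] = set(listed) | {node}
        if node in listed:
            errors.append(f"{node}: lists itself as a peer")
        for peer in sorted(set(listed) - {node}):
            if peer not in peers:
                errors.append(f"{node}: peer {peer} is unknown")
                continue
            # Every member must see exactly the same group. Because each node
            # has a single partner list, agreement also implies that no node
            # belongs to more than 1 HA Group.
            peer_group = set(peers[peer]) | {peer}
            if peer_group != group:
                errors.append(f"{node}: peer {peer} sees a different group")
    return errors
```

For example, three nodes that each list the other two pass the check, while a node `a` that lists only `b`, where `b` lists both `a` and `c`, is reported as inconsistent.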

User Experience

As an org admin, I want to place two or more nodes into an HA Group so that my services will be continuously available.
As an org admin, I want to add a node to an HA Group without affecting currently deployed services.
As an org user, I want to place two or more nodes into an HA Group so that my services will be continuously available.
As an org user, I want to add a node to an HA Group without affecting currently deployed services.
As a device owner, I want to place two or more nodes into an HA Group so that my services will be continuously available.
As a service deployer, I want to deploy non-HA services to a subset of members in an HA Group to avoid compute resource consumption for services that don't need to be continuously available.
As a service deployer, I want to deploy a service ONLY to nodes in an HA Group.

Command Line Interface

<Describe any changes to the hzn CLI, including before and after command examples for clarity. Include which users will use the changed CLI. This section should flow very naturally from the User Experience section.>

External Components

<Describe any new or changed interactions with components that are not the agent or the management hub.>

Affected Components

<List all of the internal components (agent, MMS, Exchange, etc) which need to be updated to support the proposed feature. Include a link to the github epic for this feature (and the epic should contain the github issues for each component).>

Security

APIs

<Describe any new/changed/deprecated APIs, including before and after snippets for clarity. Include which components or users will use the APIs.>

Build, Install, Packaging

Documentation Notes

Test

<Summarize new automated tests that need to be added in support of this feature, and describe any special test requirements that you can foresee.>
