...

Open Horizon generally treats nodes as entities with an independent lifecycle, apart from all other nodes. But there are use cases, such as using sensors to monitor critical systems, where it is important to have redundant monitoring in place so that there is always at least one monitoring agent operating. This is similar to the principles of high availability and continuous availability that are commonly found within IT systems. Open Horizon already contains a little-known feature, called HA Groups, that enables nodes to be associated as a group such that at least 1 copy of a service deployed to the group is always running. Further, Open Horizon ensures that when services are upgraded, the upgrade is rolled across all the members of a node group such that at least 1 copy of the service is always running.

The problem

One of the problems with the existing HA Group support is that it is not dynamic. For example, nodes must be added to a group as part of registering them with the management hub. Nodes cannot be removed from a group. Nodes cannot be added to an existing group without unregistering the entire group and registering again with the new group members. Node registration is something that happens once for the lifetime of the node. A node should never need to be unregistered unless it is being decommissioned.

...

<Describe how the problem is fixed. Include all affected components. Include diagrams for clarity. This should be the longest section in the document. Use the sections below to call out specifics related to each aspect of the overall system, and refer back to this section for context. Provide links to any relevant external information.>


The design proposes to enhance the current concept of HA node groups by enabling organization administrators to create HA/CA node groups at any time in the lifecycle of a node. Further, the design proposes to loosen the current restriction that all services deployed to a node in an HA/CA group are deployed on all nodes in the group, enabling the use of heterogeneous node equipment within a group. The following key design constraints define the new HA/CA node group concept:

  1. Nodes in an HA group MAY have different node policies.
  2. Adding a node to an HA Group MUST NOT terminate/restart running services.
  3. Nodes MAY be placed into an HA Group after node registration.
  4. A node MUST be in 0 or 1 HA Groups. A node MUST NOT be in more than 1 HA Group.
  5. A node specifies the other nodes in its HA Group by Id. All nodes in an HA Group MUST specify all the other nodes in the group. Can we get the exchange to enforce this part of the model?
  6. A user MUST have permission to modify all the nodes (resources) in an HA Group in order to form the group.
  7. A service that is deployed to all the nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the service.
  8. The node agent on nodes in an HA Group MUST be upgraded in a rolling restart in order to avoid a complete outage of the agent itself.
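
As an illustration of constraints 4 and 5, the following minimal Python sketch checks that every node's declared peer list agrees with the peer lists of the nodes it names. The function name and the dictionary layout (node Id mapped to the Ids of the other nodes in its group) are assumptions made for this example and are not part of the Exchange data model.

```python
def find_inconsistent_nodes(peers_by_node):
    """Return the sorted Ids of nodes whose group membership is not mutual.

    peers_by_node is a hypothetical mapping: node Id -> list of the other
    node Ids that the node names as members of its HA Group.
    """
    bad = set()
    for node, peers in peers_by_node.items():
        # The group as seen by this node: itself plus the peers it names.
        group = set(peers) | {node}
        for peer in peers:
            # The group as seen by the peer; an unknown peer sees only itself.
            peer_group = set(peers_by_node.get(peer, ())) | {peer}
            if peer_group != group:
                bad.add(node)
                bad.add(peer)
    return sorted(bad)


# n1..n3 name each other consistently; n4 is in no group (0 groups is allowed).
consistent = {
    "n1": ["n2", "n3"],
    "n2": ["n1", "n3"],
    "n3": ["n1", "n2"],
    "n4": [],
}
# n1 names n2, but n2 does not name n1 back.
broken = {"n1": ["n2"], "n2": []}
```

Because each node carries a single peer list in this model, a node that two different groups both try to claim shows up as an inconsistency, which also covers the "0 or 1 HA Groups" rule.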

Agbot

The Agbot is responsible for ensuring that all deployment policies have been checked for compatibility against all nodes in the system, and making adjustments to the deployed state of services as node policy, deployment policy and service policy change over time. The Agbot already has support for ensuring that service upgrades are performed in a rolling fashion across an HA Group. The existing support will have to be augmented in the following ways:

...

When new resources are added to the system, the scope of change notification of those resources needs to be defined. The Agbot needs to be made aware of hagroup resource creation/update/deletion. Nodes do not need to be aware.

Agent Upgrades

In addition to HA/CA support for service software upgrades, agent upgrades need to be performed in a rolling fashion across all the nodes in an HA/CA node group. Agents are responsible for upgrading themselves based on the node management policy defined by the administrator, so there is no central entity that is able to coordinate the agents within a group.
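
One way to picture a rolling agent upgrade without a central coordinator is for each agent to consult shared per-group state before it begins. The Python sketch below is only an illustration under assumptions: the state values ("idle", "upgrading", "done") and both function names are hypothetical, and a real implementation would need the check-and-set step to be atomic in whatever shared store holds the state.

```python
def try_start_upgrade(group_state, node_id):
    """Claim the upgrade turn for node_id if no other member is upgrading.

    group_state is a hypothetical mapping: node Id -> "idle", "upgrading"
    or "done". Returns True if node_id may begin its upgrade now.
    """
    if group_state.get(node_id) != "idle":
        return False
    # Only one agent in the group may be upgrading at any time.
    if any(state == "upgrading"
           for other, state in group_state.items() if other != node_id):
        return False
    group_state[node_id] = "upgrading"
    return True


def finish_upgrade(group_state, node_id):
    """Record that node_id has completed its upgrade, freeing the turn."""
    group_state[node_id] = "done"
```

With two idle nodes, the first call succeeds, a call for the second node is refused until the first finishes, and then the second node can proceed, so at least one agent in the group stays available throughout.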

User Experience

<Describe which user roles are related to the problem AND the solution, e.g. admin, deployer, node owner, etc. If you need to define a new role in your design, make that very clear. Remember this is about what a user is thinking when interacting with the system before and after this design change. This section is not about a UI, it's more abstract than that. This section should explain all the aspects of the proposed feature that will surface to users.>

...

        "node":"node1234",
        "agent-upgrade": {
            .....coordination state....
        }
    },
]


Exchange APIs:

Create a new node group. The caller must have permission to modify all the nodes listed in the body (shown above). The Exchange will set this same object onto all the nodes listed in the body. The Exchange will return an error (409) if one of the nodes is already in an hagroup. If force=true is specified, the Exchange will set this membership onto all listed nodes and will remove listed nodes from any group they are already in.
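
The conflict and force=true behavior described above can be modeled with a small in-memory Python sketch. It is not the Exchange implementation: the `Conflict` class, the `create_node_group` function, and the group-name-to-node-list dictionary are assumptions made for illustration, and the permission check is omitted.

```python
class Conflict(Exception):
    """Stands in for the HTTP 409 response (hypothetical class)."""
    status = 409


def create_node_group(groups, name, nodes, force=False):
    """Create group `name` containing `nodes` in the `groups` mapping.

    groups is a hypothetical mapping: group name -> list of node Ids.
    Raises Conflict if a listed node is already in a group and force is
    False. With force=True, such nodes are moved into the new group.
    """
    current = {n: g for g, members in groups.items() for n in members}
    taken = sorted(n for n in nodes if n in current)
    if taken and not force:
        # Nothing has been modified yet, so the caller sees no partial change.
        raise Conflict(f"nodes already in a group: {taken}")
    for n in taken:
        groups[current[n]].remove(n)
    groups[name] = list(nodes)
    return groups
```

Checking for conflicts before touching any group keeps the failure case free of side effects, which matches the expectation that a 409 leaves existing memberships untouched.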

...