...

EVE node health check functionality consists of the following:

...

Watchdog Time: For Pillar Agent(s) health and responsiveness

Each agent's health is monitored through the hardware watchdog timer. If an agent does not re-touch its pid file within a specified interval, the device is rebooted.

Controller connectivity

The controller connectivity for the EVE node is evaluated as follows:

Reset Time: Cloud connectivity health in normal operation

In normal operation, if controller connectivity is lost, the EVE node is rebooted after the reset time interval.

Fallback Time: Controller connectivity health during baseos upgrade validation phase

During the baseos upgrade validation phase, if controller connectivity is lost, the EVE node falls back to the fallback image after the fallback time interval.
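
The two connectivity rules above can be sketched as a single decision function. This is only an illustration in Go: the type and function names (Timers, Evaluate, Action) and the interval values are hypothetical, and it assumes that only the fallback time is considered while a baseos upgrade is being validated.

```go
package main

import (
	"fmt"
	"time"
)

// Action is what the node does after evaluating controller connectivity.
type Action int

const (
	NoAction Action = iota
	Reboot          // reset time expired in normal operation
	Fallback        // fallback time expired during baseos upgrade validation
)

func (a Action) String() string {
	switch a {
	case Reboot:
		return "reboot"
	case Fallback:
		return "fallback"
	default:
		return "none"
	}
}

// Timers holds the two intervals. The names are hypothetical.
type Timers struct {
	ResetTime    time.Duration
	FallbackTime time.Duration
}

// Evaluate maps the length of the current controller outage to an action.
// validating is true while a baseos upgrade is in its validation phase.
// Assumption: during validation only the fallback time applies.
func (t Timers) Evaluate(outage time.Duration, validating bool) Action {
	if validating {
		if outage >= t.FallbackTime {
			return Fallback
		}
		return NoAction
	}
	if outage >= t.ResetTime {
		return Reboot
	}
	return NoAction
}

func main() {
	// Illustrative values only; the document does not specify the intervals.
	timers := Timers{ResetTime: 24 * time.Hour, FallbackTime: 5 * time.Minute}
	fmt.Println(timers.Evaluate(10*time.Minute, true))
	fmt.Println(timers.Evaluate(10*time.Minute, false))
}
```

Keeping the decision a pure function of the outage length and the validation flag makes it easy to test independently of whichever module ends up owning the timers.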

Current Implementation

The watchdog time handler functionality is based on the hardware watchdog (wdctl), and it is part of device-steps.sh.

The EVE node reset and fallback time functionalities are currently part of the ZedAgent module.

Proposal for Refactoring

The watchdog time functionality will remain as is. The reset and fallback time functionality will be moved into a new agent called devagent. The whole baseos upgrade validation functionality will also be moved into the devagent module. Devagent will be spawned along with ledmanager. Devagent will listen to the ledmanager ledblinker config messages to determine the controller connectivity status and will use it, together with the time stamps of successful configuration pulls, to orchestrate the baseos upgrade validation functionality. Devagent will be the owner of the zboot config and will publish it for use by baseosmanager. Also, on successful baseos installation and on reset/fallback timer expiry, the device reboot operations will be triggered through the "devagent status" pubsub topic.

The zedagent module will only be responsible for controller connectivity related functionality, such as pulling the latest configuration blob from the controller and publishing status/info/metrics messages to the controller. It will publish this information through the "zedagent status" pubsub topic. Zedagent will subscribe to the "devagent status" pubsub topic to execute the device reboot command.

Baseosmanager will listen to the zboot config messages from the devagent module, handle them, and update the zboot status during baseos validation orchestration.

In a nutshell, the overall change in event handling per module is as follows.

Baseosmgr Module

 Baseosmgr will subscribe to the following,

      - zbootconfig topic from devagent

ZedAgent Module

 Zedagent will subscribe to the following topics,

     - devagent status topic from devagent

            - for executing device reboot command 

            - to publish the remaining test time, for baseos upgrade validation

 Zedagent will publish the following,

      - zedagent status

            - time stamp for last successful configuration pull

DevAgent Module

DevAgent module will subscribe to the following topics,

   - ledBlinker Config, generated by zedclient/zedagent

          - for EVE node registration and controller connectivity change events

   - Zboot Status, generated by baseosmgr

           - for baseos installation and upgrade validation orchestration

   - Zedagent Status

            - to learn the time of the last successful config fetch from the controller, for baseos upgrade validation and testing orchestration

DevAgent will publish to the following topics,

    - Zboot Config

              - zboot partition information

    - DevAgent Status

ZedAgent additionally will listen to the following,

    - DevAgent Status

              - for reboot commands, on baseos installation or reset/fallback timer expiry

              - remaining test time, for publishing to the controller (by zedagent)
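
The devagent to zedagent part of the topic layout above can be sketched with a toy in-memory bus. This is a minimal Go illustration under stated assumptions, not the EVE pubsub API: the Bus type, the DevAgentStatus fields, and the topic string are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"time"
)

// TopicDevAgentStatus is an illustrative key for the "devagent status" topic.
const TopicDevAgentStatus = "devagent status"

// DevAgentStatus is a hypothetical payload for the "devagent status" topic.
type DevAgentStatus struct {
	RebootCmd         bool          // set on baseos installation or timer expiry
	RebootReason      string        // free-form reason, for logging
	RemainingTestTime time.Duration // baseos upgrade validation time left
}

// Bus is a toy synchronous publish/subscribe hub standing in for pubsub.
type Bus struct {
	subs map[string][]func(msg any)
}

func NewBus() *Bus {
	return &Bus{subs: make(map[string][]func(msg any))}
}

// Subscribe registers a handler that runs on every Publish to topic.
func (b *Bus) Subscribe(topic string, handler func(msg any)) {
	b.subs[topic] = append(b.subs[topic], handler)
}

// Publish delivers msg to all handlers of topic, in subscription order.
func (b *Bus) Publish(topic string, msg any) {
	for _, handler := range b.subs[topic] {
		handler(msg)
	}
}

// ZedAgent records what it would do for each devagent status update.
type ZedAgent struct {
	Rebooted     bool
	ReportedTime time.Duration
}

// Attach subscribes the agent to the devagent status topic.
func (z *ZedAgent) Attach(b *Bus) {
	b.Subscribe(TopicDevAgentStatus, func(msg any) {
		status, ok := msg.(DevAgentStatus)
		if !ok {
			return // ignore payloads of an unexpected type
		}
		// A real agent would report this value to the controller.
		z.ReportedTime = status.RemainingTestTime
		if status.RebootCmd {
			// A real agent would execute the device reboot here.
			z.Rebooted = true
		}
	})
}

func main() {
	bus := NewBus()
	zed := &ZedAgent{}
	zed.Attach(bus)

	// devagent side: validation in progress, then a reboot request.
	bus.Publish(TopicDevAgentStatus, DevAgentStatus{RemainingTestTime: 90 * time.Second})
	fmt.Println("remaining test time:", zed.ReportedTime)

	bus.Publish(TopicDevAgentStatus, DevAgentStatus{RebootCmd: true, RebootReason: "reset timer expired"})
	fmt.Println("reboot requested:", zed.Rebooted)
}
```

The same pattern would apply to the other topics listed above (zboot config, zboot status, zedagent status), each with its own payload type.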


PS

For completeness and future work scope, the following items are noted for EVE node health. This list is not exhaustive, and the necessary actions for them need to be defined. Currently, the scope of device health, as defined above, does not include the following,

            - cpu usage health

...