Since Fit-n-Finish takes priority over new features, the following are some of the areas that need to be improved:

  1. Logging / Debuggability
    1. The main thrust is to improve the content of logging, so that 95% of issues can be debugged without reproducing them or accessing the device via console / ssh.
      1. Linux state (iptables rules, disk contents, ifconfig -a, ip route, etc.)
        1. Need kernel information to debug broken hardware
      2. PubSub state
      3. Ability to run some pre-defined diagnostic commands
      4. Better visibility into each agent's state
        1. Internal agent state
        2. Internal agent counters
        3. etc.
    2. Ability to prioritize messages by severity level (CRIT / ERROR / INFO / DEBUG, in that order) and to drop messages starting from the lowest priority
    3. Proper Device Events - In INFO - provide the various events, AS SEEN FROM THE DEVICE, for each trigger:
      1. App Instance Create / Delete / Refresh / Purge / Restart etc. For example:
        1. App Instance Create RCVD (name, time, etc.)
        2. Downloading / Verifying Images
          1. Verifying?
        3. Downloading / Verifying Images Done
        4. Error
          1. Auto-retriable error
          2. User intervention needed
        5. Copying for RW image
        6. Reserving Resources
        7. Starting Instance
        8. Instance Create Done
      2. Have a structured format to log such events as INFO events, which the user can then consult to get the actual details from the device
        1. These are also very useful from the developer perspective.
    4. Kernel coredump in case of kernel crashes
    5. Information to debug broken hardware
      1. Kernel state / Disk state / Temp etc.
  2. Device Reboot reason
    1. The current reboot reason is aimed more at the developer.
    2. For the user, the reboot reason should be presented as follows:
      1. First boot after install
      2. User Triggered Reboot - Time
      3. Upgrade - Time
      4. Upgrade Failure - Rollback
      5. Unexpected Reboot
        1. Agent Crash - Details
        2. Agent Watchdog Timeout - Details
        3. Hardware Watchdog - Details
        4. Kernel Crash
        5. Power Failure
        6. In these cases, log the details with log.Errorf() so that they go into Kibana
        7. The UI should present this from the user's perspective only and hide the details of the crashes
  3. Device Events
    1. These don’t always correlate with what is going on in the system, especially after a reboot
    2. We see events toggling between Init / unknown / init / downloading etc.
    3. These are more bug fixes than improvements
  4. More visibility into the state of bigger triggers
    1. Include more granular information, like Reboot Started etc.
      1. Currently, there is no visibility into when the device received the reboot request, when it finished shutting down all app instances, and when it actually started rebooting. All of this is useful to an admin waiting for systems to come up.
      2. Some states are:
      3. Shutdown applications
      4. Starting Reboot ( Still from Older image )
      5. Bootup ( As soon as possible )
      6. Or maybe the existing message is good enough
    2. Upgrade - We can provide more visibility to the user:
      1. Image Downloaded
      2. Verification
      3. Install
      4. Shutting down applications
      5. Reboot
      6. Booted up as part of Upgrade
      7. Testing in progress ( Update time remaining )
      8. Testing Done / Upgrade Successful
    3. Upgrade FAILURE Fallback - cases:
      1. Shutdown applications
      2. Reboot to Fallback image Started
      3. Bootup of Fallback image
      4. ... This part is common to bootup
    1. Reboot Device
  5. Advertise capabilities to the Cloud to enable smooth upgrades and to let the Cloud easily deal with multiple versions of devices.
    1. For example, the Cloud can send either encrypted secrets or clear text; it doesn’t need to send both
    2. Are these used by the Cloud to change behavior dynamically, or just for inventory analysis?
  6. Eve images on Docker Hub.
  7. Fallback Interface configuration ( Lower priority )
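
The severity-based dropping asked for under Logging / Debuggability could look roughly like the Go sketch below. It is only an illustration under stated assumptions: `Severity`, `Message`, and `Buffer` are hypothetical names, not existing device code, and the policy shown (when a bounded buffer overflows, evict the oldest message of the lowest priority held) is one possible reading of "drop the messages starting from lowest priority ones".

```go
package main

import "fmt"

// Severity orders messages from most to least important,
// mirroring CRIT / ERROR / INFO / DEBUG from the list above.
type Severity int

const (
	Crit Severity = iota
	Error
	Info
	Debug
)

// String returns the label used in the outline.
func (s Severity) String() string {
	return [...]string{"CRIT", "ERROR", "INFO", "DEBUG"}[s]
}

// Message is a hypothetical log record.
type Message struct {
	Sev  Severity
	Text string
}

// Buffer is a hypothetical bounded queue of pending messages.
type Buffer struct {
	Cap  int
	msgs []Message
}

// Add appends m. When the buffer exceeds Cap it evicts the oldest
// message of the lowest priority currently held (possibly m itself).
func (b *Buffer) Add(m Message) {
	b.msgs = append(b.msgs, m)
	if len(b.msgs) <= b.Cap {
		return
	}
	victim := 0
	for i, cur := range b.msgs {
		if cur.Sev > b.msgs[victim].Sev {
			victim = i
		}
	}
	b.msgs = append(b.msgs[:victim], b.msgs[victim+1:]...)
}

func main() {
	b := &Buffer{Cap: 3}
	b.Add(Message{Debug, "agent counters dumped"})
	b.Add(Message{Info, "app instance create received"})
	b.Add(Message{Error, "image verification failed"})
	b.Add(Message{Crit, "kernel crash detected"}) // evicts the DEBUG message
	for _, m := range b.msgs {
		fmt.Printf("%s: %s\n", m.Sev, m.Text)
	}
}
```

A real implementation would also have to decide when dropping kicks in (disk pressure, upload backlog, rate limits), which the outline leaves open.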
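
For the Device Reboot reason item, the split between a user-facing reason and developer-facing details can be sketched as follows. This is a minimal sketch, not the existing reboot-reason code: `RebootCause`, `RebootRecord`, `UserReason`, and `DeveloperDetail` are names invented for illustration, and the exact wording of the user-facing strings is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// RebootCause enumerates the cases from the outline (hypothetical type).
type RebootCause int

const (
	FirstBoot RebootCause = iota
	UserTriggered
	Upgrade
	UpgradeRollback
	// Everything from here on counts as an unexpected reboot.
	AgentCrash
	AgentWatchdog
	HardwareWatchdog
	KernelCrash
	PowerFailure
)

// RebootRecord carries both the cause and the developer-level details.
type RebootRecord struct {
	Cause   RebootCause
	Time    time.Time
	Details string // stack trace, watchdog name, etc.; not shown in the UI
}

// Unexpected reports whether the reboot falls under "Unexpected Reboot".
func (r RebootRecord) Unexpected() bool { return r.Cause >= AgentCrash }

// UserReason returns the text a UI could show, hiding crash details.
func (r RebootRecord) UserReason() string {
	at := r.Time.UTC().Format(time.RFC3339)
	switch r.Cause {
	case FirstBoot:
		return "First boot after install"
	case UserTriggered:
		return "User triggered reboot at " + at
	case Upgrade:
		return "Upgrade at " + at
	case UpgradeRollback:
		return "Upgrade failure, rolled back"
	default:
		return "Unexpected reboot"
	}
}

// DeveloperDetail returns the line that would be passed to the error logger.
func (r RebootRecord) DeveloperDetail() string {
	return fmt.Sprintf("reboot cause=%d unexpected=%t details=%q",
		r.Cause, r.Unexpected(), r.Details)
}

func main() {
	r := RebootRecord{
		Cause:   AgentWatchdog,
		Time:    time.Now(),
		Details: "hypothetical agent missed its watchdog deadline",
	}
	fmt.Println("UI: ", r.UserReason())
	fmt.Println("log:", r.DeveloperDetail())
}
```

Keeping both views on one record means the UI and the error log cannot disagree about what happened; only the level of detail differs.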
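
The "structured format to log such events as INFO events" could, for example, be one JSON object per step. The sketch below assumes a hypothetical `Event` type and field names; none of them come from the existing code, and JSON is just one possible encoding.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is a hypothetical structured device event. The field names are
// illustrative only.
type Event struct {
	Time      time.Time `json:"time"`
	Severity  string    `json:"severity"`
	Object    string    `json:"object"`  // e.g. app instance name
	Trigger   string    `json:"trigger"` // e.g. "create", "delete", "purge"
	Step      string    `json:"step"`    // e.g. "Reserving Resources"
	Error     string    `json:"error,omitempty"`
	Retriable bool      `json:"retriable,omitempty"` // auto-retriable vs. user intervention
}

// NewStep builds an INFO event for one step of a trigger.
func NewStep(unixSec int64, object, trigger, step string) Event {
	return Event{
		Time:     time.Unix(unixSec, 0).UTC(),
		Severity: "INFO",
		Object:   object,
		Trigger:  trigger,
		Step:     step,
	}
}

// Encode renders the event as a single JSON line.
func (e Event) Encode() (string, error) {
	b, err := json.Marshal(e)
	return string(b), err
}

// Decode parses a line produced by Encode.
func Decode(s string) (Event, error) {
	var e Event
	err := json.Unmarshal([]byte(s), &e)
	return e, err
}

func main() {
	steps := []string{
		"App Instance Create RCVD",
		"Downloading / Verifying Images",
		"Reserving Resources",
		"Starting Instance",
		"Instance Create Done",
	}
	now := time.Now().Unix()
	for i, s := range steps {
		line, err := NewStep(now+int64(i), "app1", "create", s).Encode()
		if err != nil {
			panic(err)
		}
		fmt.Println(line)
	}
}
```

Because every event names its object, trigger, and step, the same lines serve both the user-facing event view and developer debugging.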
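
The capability advertisement item raises the question of whether the Cloud changes behavior dynamically. If it does, the encrypted-versus-clear-text example could work roughly as sketched here. `Capabilities`, `CapEncryptedSecrets`, and `ChooseSecret` are hypothetical names, and the capability string is made up for illustration.

```go
package main

import "fmt"

// Capabilities is a hypothetical set of feature names a device advertises.
type Capabilities map[string]bool

// CapEncryptedSecrets is an illustrative capability name, not a real one.
const CapEncryptedSecrets = "encrypted-secrets"

// SecretPayload is what the Cloud would send: exactly one form of the secret.
type SecretPayload struct {
	Encrypted bool
	Data      string
}

// ChooseSecret picks the single form to send based on what the device
// advertised. Older devices that advertise nothing get clear text.
func ChooseSecret(caps Capabilities, clearText, cipherText string) SecretPayload {
	if caps[CapEncryptedSecrets] {
		return SecretPayload{Encrypted: true, Data: cipherText}
	}
	return SecretPayload{Encrypted: false, Data: clearText}
}

func main() {
	oldDevice := Capabilities{}
	newDevice := Capabilities{CapEncryptedSecrets: true}
	fmt.Printf("%+v\n", ChooseSecret(oldDevice, "clear-secret", "<ciphertext>"))
	fmt.Printf("%+v\n", ChooseSecret(newDevice, "clear-secret", "<ciphertext>"))
}
```

If the capabilities are only used for inventory analysis, the same advertised set can simply be stored and queried without any branch like the one above.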