EVE device logging redesign goals

Motivation

Logs generated by EVE services and other components are currently written to files on the file system. A separate EVE service, logmanager, reads these files, batches the individual logs, packs them into protobufs, and exports them to the cloud through APIs. Because different EVE services write to different log files, logs cannot be guaranteed to reach the cloud in the order in which they were written. When verbose logging is enabled on an EVE device, the log files grow (even with log rotation in force) until no free disk space is left for other uses. We have seen several devices bricked by disk space exhaustion. We have also observed that the logs visible in the cloud can be hours behind the logs on the device at a given point in time.

The idea is to move away from file-based logging and use a purpose-built logging service such as rsyslogd or Fluent Bit, with the following goals.

No more logging to files, unless a component cannot be made to use standard syslog (e.g. hypervisor logs, lisp logs). Even the containers launched by EVE (e.g. wlan, wwan) should use standard syslog.
Have a disk-backed queueing mechanism that keeps logs from being lost on unexpected power failures or reboots. This covers both the main message queues and the more specific action queues.
Have a mechanism to save debug logs on the device disk in addition to sending them to the cloud. This lets engineers access debug logs on the device when the remote log level is not set to accept debug logs. Or should we ignore the remote log level? If we decide to persist debug logs on the device, the logging infrastructure should limit the space they occupy and rotate them itself, without additional tools such as Linux logrotate.
In the event of an upgrade failure, the queueing mechanism should not lose the logs of the other partition (the failed partition, which holds the failure messages). These logs should be preserved and sent to the cloud after the device comes back online.
Have a transformer that adds the partition attribute (partition name IMGA/IMGB), the EVE service name, and the EVE version to log messages exported to the cloud. This makes it possible, while debugging, to grep for logs specific to a particular release, partition, and service.
To avoid making too many API calls to the cloud, logs should be exported in batches.
The logging in /opt/zededa/bin/watchdog-report.sh should be preserved so that we still get the reboot-reason. (This can be done by having agentlog append to the reboot-reason file, which avoids having to grep the log files in watchdog-report.sh.)
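
The batching goal can be illustrated with a minimal Go sketch. It is not the logmanager implementation: the names LogEntry, Batcher, NewBatcher, Add and Flush are hypothetical, and the batch size is an arbitrary parameter. A real exporter would presumably also flush on a timer and read from the disk-backed queue, both of which are left out here.

```go
package main

// LogEntry is a hypothetical stand-in for one log record; the real
// record layout is not defined by this sketch.
type LogEntry struct {
	Source  string
	Message string
}

// Batcher collects entries and releases them in groups of maxSize,
// so that a single cloud API call can carry many logs.
type Batcher struct {
	maxSize int
	pending []LogEntry
}

// NewBatcher returns a Batcher that emits a batch every maxSize entries.
// Values below 1 are treated as 1.
func NewBatcher(maxSize int) *Batcher {
	if maxSize < 1 {
		maxSize = 1
	}
	return &Batcher{maxSize: maxSize}
}

// Add queues one entry. It returns a full batch once maxSize entries
// are pending, and nil otherwise.
func (b *Batcher) Add(e LogEntry) []LogEntry {
	b.pending = append(b.pending, e)
	if len(b.pending) < b.maxSize {
		return nil
	}
	return b.Flush()
}

// Flush returns all pending entries in arrival order (nil if there are
// none) and resets the batcher. Call it before shutdown so that a
// partial batch is not dropped.
func (b *Batcher) Flush() []LogEntry {
	out := b.pending
	b.pending = nil
	return out
}
```

A caller would pass each non-nil result of Add to whatever function performs the cloud export, and call Flush once more when the service stops.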

The schema of a log exported to the cloud will have the following fields:

Timestamp of log generation
EVE version
Device image partition IMG[AB]
Severity
Priority
Tag or name of the service that generated the log
Any free-form JSON specific to the source that generated the log
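
As a rough illustration, the fields above could map onto a Go struct like the one below, with a small function standing in for the transformer that stamps the partition and EVE version. This is only a sketch: the type and function names (CloudLog, Enrich, Encode), the JSON key names, and the field types (a string severity, an integer priority) are all assumptions, since the list above does not specify them. The actual export format is protobuf, and JSON is used here only to keep the example short.

```go
package main

import (
	"encoding/json"
	"time"
)

// CloudLog mirrors the schema fields listed above. Key names and
// field types are assumptions made for this sketch.
type CloudLog struct {
	Timestamp  time.Time       `json:"timestamp"`         // time of log generation
	EVEVersion string          `json:"eveVersion"`        // EVE version
	Partition  string          `json:"partition"`         // IMGA or IMGB
	Severity   string          `json:"severity"`          // e.g. "info", "error"
	Priority   int             `json:"priority"`          // numeric priority
	Source     string          `json:"source"`            // tag or service name
	Content    json.RawMessage `json:"content,omitempty"` // free-form JSON from the source
}

// Enrich plays the role of the transformer: it returns a copy of the
// log with the partition and EVE version filled in.
func Enrich(l CloudLog, partition, eveVersion string) CloudLog {
	l.Partition = partition
	l.EVEVersion = eveVersion
	return l
}

// Encode renders one log as JSON. It returns an error if Content is
// not valid JSON.
func Encode(l CloudLog) ([]byte, error) {
	return json.Marshal(l)
}
```

Keeping the partition, version, and source as separate top-level fields, rather than folding them into the message text, is what allows filtering by release, partition, and service on the cloud side.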
