We currently have the device send log information to the controller using the log API, and this can be useful when debugging issues in EVE.
However, in some cases it is also useful to be able to inspect the current state. That state could be the state maintained by the EVE microservices (e.g., the AppInstanceStatus maintained by zedmanager), or it could be external state such as the iptables rules or the ps process output.
This proposal specifies how a well-defined set of such information can be retrieved by the controller.
We currently deliver the logs from the EVE microservices to the controller, plus specific information relating to the device and instance status and metrics. However, two issues make those logs harder to use than the ones on the device: first, they are consolidated from all the agents; second, they cover the lifetime of the device (split into IMGA and IMGB logs), whereas in most cases one cares about what happened after the last reboot.
In addition, the current state of the device is easier to determine by examining /var/run on the device and looking at things like the output of ps or xl list.
Finally, there are implementation-internal aspects (such as iptables -L, ip rule show, ip route show) which are useful when debugging issues.
We already have the logging API as a flexible and scalable way to deliver information from the device to the controller, with the appropriate retry/retransmission logic in EVE. Its only constraint is that a single log item must be smaller than the maximum size configured in the web server running on the controller.
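The size limit on a single log item matters for commands with long output. One way such output could be kept under the limit is to split it into several log items. The following is a minimal sketch under assumptions: the function name splitForLog and the maxBytes parameter are hypothetical, and the actual limit depends on the controller's web server configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitForLog breaks output into pieces of at most maxBytes bytes each.
// It prefers to cut just after a newline so that lines stay intact; a single
// line longer than maxBytes is cut at the byte limit (which may split a
// multi-byte UTF-8 character). Hypothetical helper, not part of EVE.
func splitForLog(output string, maxBytes int) []string {
	if maxBytes <= 0 {
		return nil
	}
	var chunks []string
	for len(output) > maxBytes {
		cut := maxBytes
		// Look for the last newline that still fits in this chunk.
		if i := strings.LastIndexByte(output[:maxBytes], '\n'); i >= 0 {
			cut = i + 1
		}
		chunks = append(chunks, output[:cut])
		output = output[cut:]
	}
	if len(output) > 0 {
		chunks = append(chunks, output)
	}
	return chunks
}

func main() {
	for i, c := range splitForLog("line one\nline two\nline three\n", 12) {
		fmt.Printf("item %d: %q\n", i, c)
	}
}
```

Concatenating the returned pieces in order reproduces the original output, so the controller side could reassemble it if each log item carries a sequence number.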
We also have a flexible way to extend the configuration using the ConfigItem message in the configuration: a string key plus a string value, which is used for timer and policy settings.
Last but not least, we have a way to send commands such as the RebootCmd using eventual consistency, with a counter ensuring that a command is executed at least once.
Combining those, we can add support for additional debug commands by defining a ConfigItem key string for each, where the value is a number. When the device receives such a ConfigItem, it checks whether the number is different from what it last processed for that particular key; if it is, the device performs the operation and sends the output to the log API.
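The counter check described above can be sketched as follows. This is an illustration only: the DebugCounters type and its ShouldRun method are hypothetical names, and treating a key that has never been seen as "changed" is an assumption the proposal does not spell out.

```go
package main

import (
	"fmt"
	"strconv"
)

// DebugCounters remembers the last counter value processed for each debug
// command key. Hypothetical type, not part of EVE.
type DebugCounters struct {
	last map[string]uint64
}

func NewDebugCounters() *DebugCounters {
	return &DebugCounters{last: make(map[string]uint64)}
}

// ShouldRun parses the ConfigItem value as a number and reports whether it
// differs from the value last processed for key. On a change it records the
// new value, so receiving the same configuration again is a no-op.
// Assumption: a key with no recorded value counts as changed.
func (d *DebugCounters) ShouldRun(key, value string) (bool, error) {
	n, err := strconv.ParseUint(value, 10, 64)
	if err != nil {
		return false, fmt.Errorf("config item %q: value %q is not a number: %w", key, value, err)
	}
	prev, seen := d.last[key]
	if seen && prev == n {
		return false, nil
	}
	d.last[key] = n
	return true, nil
}

func main() {
	c := NewDebugCounters()
	// The controller delivers the same config twice, then bumps the counter.
	for _, v := range []string{"1", "1", "2"} {
		run, err := c.ShouldRun("ps", v)
		fmt.Println(v, run, err)
	}
}
```

Comparing for inequality rather than "greater than" matches the wording of the proposal and means the controller only has to change the number to trigger the command again.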
| Command | Reported information | Potential use |
| --- | --- | --- |
| ps | ps output | Look for hung processes |
| du | du -a /persist | Track down disk usage |
| du.<subdir> | du -a /persist/<subdir> | E.g., du.log, du.IMGA |
| state | All of /var/run content | Snapshot for all agents and objects |
| state.<agent> | /var/run/<agent> | Snapshot for one agent |
| state.<agent>.<type> | /var/run/<agent>/<type> | For agent and type |
| state.obj.<key> | /var/run/*/*/<key> | E.g., look for an instance UUID |
| config | /config except any *.key.pem | Looking for stale files |
| persist.<subdir> | ls /persist/<subdir> | Looking for stale files or missing certs |
| lspci | Alpine lspci output | Check if PCI controllers match model |
| lsusb | Alpine lsusb output | Check if any USB devices connected |
| iptables | iptables -t filter; iptables -t raw; iptables -t nat, all with -L -nv | Check if iptables are wrong + counters |
| route | ip route show | |
| route.X | ip route show table X | |
| rule | ip rule show | |
For security reasons, every command should have a fixed function; no command should ever allow arbitrary execution of, e.g., shell commands. Furthermore, when defining new commands one needs to take care not to expose any secret information from the device, such as the content of running edge container objects or credentials for datastore access.
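The fixed-function requirement could be enforced by resolving each command key to a hard-coded argument vector rather than passing anything to a shell. The sketch below covers only a few of the commands from the table. The resolve function, the safeName pattern, and the choice to reject any variable part containing characters other than letters, digits, underscore, and hyphen are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// safeName restricts the variable part of a command key (e.g. the <subdir>
// in du.<subdir>) so that it cannot contain path separators or "..".
// The exact character set is an assumption.
var safeName = regexp.MustCompile(`^[A-Za-z0-9_-]+$`)

// resolve maps a debug command key to a fixed argument vector. Unknown keys
// and unsafe arguments are rejected; nothing is ever handed to a shell.
// Hypothetical helper covering only a subset of the proposed commands.
func resolve(key string) ([]string, error) {
	base, arg, hasArg := strings.Cut(key, ".")
	if hasArg && !safeName.MatchString(arg) {
		return nil, fmt.Errorf("command %q: invalid argument %q", key, arg)
	}
	switch {
	case base == "ps" && !hasArg:
		return []string{"ps"}, nil
	case base == "du" && !hasArg:
		return []string{"du", "-a", "/persist"}, nil
	case base == "du" && hasArg:
		return []string{"du", "-a", "/persist/" + arg}, nil
	case base == "route" && !hasArg:
		return []string{"ip", "route", "show"}, nil
	case base == "route" && hasArg:
		return []string{"ip", "route", "show", "table", arg}, nil
	case base == "rule" && !hasArg:
		return []string{"ip", "rule", "show"}, nil
	}
	return nil, fmt.Errorf("unknown debug command %q", key)
}

func main() {
	for _, key := range []string{"du.log", "route.500", "du.../etc", "sh"} {
		argv, err := resolve(key)
		fmt.Printf("%-10s argv=%q err=%v\n", key, argv, err)
	}
}
```

Because the argument vector is built from constants plus one validated token, a key such as du.../etc is rejected before any process is started.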
Currently none of the defined commands alter the state of the device. If there is a desire to alter the state (e.g., purge certain directories to recover from low disk space), it would make sense to explore alternatives to this basic fire-and-forget approach.
The device will retain the counter value Y for command string X, in a similar way to how it retains rebootCount and uuidtonum persistently across reboots.
This could be in /persist/status/zedagent/KeyToNum/X.json
When zedagent receives config items from the controller, it will compare the counter Y with what is recorded and, if it is different, send the requested output to the log API. It makes sense for the log output to include the command string and counter value.
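Persisting the counter per command string and tagging the log output could look roughly like the sketch below. The JSON field names, the function names (loadCounter, saveCounter, logHeader), and the header format are hypothetical; the proposal only suggests the directory and the X.json file name. The key is assumed to have been validated already so that it contains no path separators.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// keyToNum is a hypothetical on-disk record for one command string.
type keyToNum struct {
	Key     string `json:"key"`
	Counter uint64 `json:"counter"`
}

// loadCounter reads <dir>/<key>.json. The boolean is false when no value has
// been recorded yet for key.
func loadCounter(dir, key string) (uint64, bool, error) {
	data, err := os.ReadFile(filepath.Join(dir, key+".json"))
	if errors.Is(err, fs.ErrNotExist) {
		return 0, false, nil
	}
	if err != nil {
		return 0, false, err
	}
	var rec keyToNum
	if err := json.Unmarshal(data, &rec); err != nil {
		return 0, false, err
	}
	return rec.Counter, true, nil
}

// saveCounter writes the record to a temporary file and renames it into
// place, so a crash mid-write does not leave a truncated X.json behind.
func saveCounter(dir, key string, counter uint64) error {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	data, err := json.Marshal(keyToNum{Key: key, Counter: counter})
	if err != nil {
		return err
	}
	tmp := filepath.Join(dir, key+".json.tmp")
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, filepath.Join(dir, key+".json"))
}

// logHeader builds a prefix for the log item so that the controller can tell
// which command and counter value produced the output.
func logHeader(key string, counter uint64) string {
	return fmt.Sprintf("debug command %s counter %d", key, counter)
}

func main() {
	// A temporary directory stands in for /persist/status/zedagent/KeyToNum.
	dir, err := os.MkdirTemp("", "KeyToNum")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	if err := saveCounter(dir, "ps", 3); err != nil {
		panic(err)
	}
	n, found, err := loadCounter(dir, "ps")
	fmt.Println(logHeader("ps", n), found, err)
}
```

Recording the counter only after the output has been handed to the log API would preserve the at-least-once behaviour across a reboot, at the cost of a possible duplicate run.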