...
In order to improve observability of the HTTP request processing (in EVE and possibly is other projects written in Go as well), a new Golang package nettrace was developed. The idea was to hook into Golang's HTTP client and record a summary of all important operations performed at different network layers of the HTTP request processing, which, when combined, provide a clear picture of what was happening and in what order behind the scenes. When a request fails, it should be much easier to backtrace to its root cause. We decided to call this process "network tracing" and the output "network trace" (taking some inspiration from the trace package).
...
- Wrap http.Transport (implementing RoundTripper interface) and observe HTTP requests/responses at the application layer. Note that some code in EVE and in 3rd party libs used by zedUpload assumes that RoundTripper’s concrete type is http.Transport. Inside EVE we simply changed such code to remove this assumption and in the 3rd party libraries this is not reachable from EVE, because we pass a ready to use
http.Client
so that a 3rd party library does not need to do any further customization. In any case, there is an option to disable this hook (at the cost of losing tracing info on the application and the session (TLS) layers - see point 2.). On the other hand, it will not cause any issues if the caller is adding some additional Transport wrapping on top of it (like we do here). - With Transport wrapped, we can add ClientTrace into the request context to call our callbacks and inform us about: TCP connection selected, TLS handshake(s) done/failed (incl. server certs and negotiated parameters) and some more (less useful info). Note that it will work even if the caller also adds its own ClientTrace (like EVE does here), because httptrace.WithClientTrace is smart enough to be able to merge multiple traces.
- Put a custom DialContext (into the underlying http.Transport) as well as a custom Dial for Resolver, both of which still call the original methods, but also wrap the returned net.Conn to observe socket read/write operations.
- In our custom Dial methods, use Dialer.Control to get access to information about every TCP connection attempt (including failed ones).
...
|
...
Note that nim is evaluating connectivity status by running a /ping
request towards the controller. If the request fails, the error returned from http.Client
is reported inside DevicePortStatus. With tracedhttp.Client
, we are able to obtain and publish more information when the connectivity checks are failing. Additionally, during a traced connectivity check, a failed /ping
request is followed by GET requests for http://www.google.com
and https://www.google.com
(just like diag is performing). From trace output we are then therefore able to tell if, at the time of the test, the device had any Internet connectivity at all.
As opposed to the downloader, in nim it makes sense to include all tracing information, including packet tracing capture so that we can narrow down the root cause of a failed check as much as possible. However, we should then perform tracing much less frequently - not with each connectivity check performed by nim, which is at least once every 5 minutes. Multiple traces obtained inside a duration of the same network issue would likely not add any additional information. We decided to run full HTTP tracing only at most once per hour before onboarding and at most once per day after onboarding (the second interval is configurable, learn more here) and only when the latest DPC is being tested. It does not make sense to troubleshoot obsolete network configurations.
...
With nim sporadically tracing /ping
and google.com
requests, it still makes sense to utilize network tracing in zedagent as well. This microservice is running the most important requests: /config
to get the latest device configuration, aka the intended state, and /info
to publish the actual device state. In fact, both of these must succeed for the device to be considered as Online and not as Suspect by zedcloud. As it was pointed out above, a failing latest DPC is applied only temporarily - until nim performs one connectivity check and fallbacks to a previous DPC. This means that as long as the latest DPC is marked as not working, it does not make sense for zedagent to trace its requests, because they would likely be using an obsolete DPC anyway. However, if nim evaluates the latest DPC as working yet zedagent is failing to get config or publish information (specifically ZInfoDevice), then zedagent is eligible to run tracing and publish the output. However, the same tracing interval (at most once per hour/day before/after onboarding by default) applies here as well.
Both successful and failed config/info network traces are published.