Motivation
In EVE, HTTP is the main application protocol, used to carry management traffic as well as to download images
Used for all communication between device and controller
Used to verify network connectivity (consequently, network connectivity errors often originate from Golang’s HTTP client)
Used to download application and EVE images (including from AWS, Azure and GCP object storage) - the only exception is the SFTP datastore (it is unclear how frequently customers use it)
Troubleshooting customer-reported network issues is quite challenging:
EVE only reports the final error as returned by the HTTP client - typically wrapping multiple errors as it moves up the stack, but often losing important information from the lower layers
This makes it difficult to retrospectively trace back to the root cause
Single HTTP request hides a complicated sequence of operations behind the scenes, possibly consisting of:
multiple DNS requests (A + AAAA, multiple DNS servers)
additional TCP connection attempts (if multiple resolved IPs are available and some fail)
reused TCP connections (previously opened)
TLS handshakes (some possibly with reused sessions), server cert verification
HTTP/HTTPS proxying
HTTP redirects
When we receive an error message, it is often difficult to determine which underlying operation triggered it and what led to it - for example, each of these operations has its own timeout, and when we receive “Context deadline exceeded” it is hard or impossible to tell which operation failed to finish in time
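A small, self-contained sketch of this ambiguity (the slow local test server is a stand-in for any stalled phase of a real request - DNS, TCP connect, TLS handshake, or the response itself; all of them collapse into the same error string):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

func main() {
	// Local server that stalls before answering, standing in for any
	// slow phase of a real request.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(2 * time.Second)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	req, _ := http.NewRequestWithContext(ctx, http.MethodGet, srv.URL, nil)
	_, err := http.DefaultClient.Do(req)

	// The error says only that the deadline was exceeded, not which
	// underlying operation was still in progress when it expired.
	fmt.Println("error:", err)
	fmt.Println("deadline exceeded:", errors.Is(err, context.DeadlineExceeded))
}
```

Here the stall happens while waiting for response headers, but the printed error would look the same if DNS resolution or the TCP handshake had been the culprit.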
Image downloading is even more difficult because we do not know (without very close code inspection) what a 3rd party library (like github.com/aws/aws-sdk-go) is doing behind the scenes (potentially running multiple HTTP requests) - we often get errors like “<very-very-long-url>: connection timeout” - how did we get there and what is this particular request doing?
When we cannot make progress with a customer issue, we often ask the customer to make observations and collect information for us, such as interface packet counters, conntrack entries, packet traces, etc., while invoking curl/ping/dig commands
It would be more efficient (and more professional) if we could collect such information automatically and attach it to a logged/published network error
The same error can have different causes - for example, “Context deadline exceeded” may be returned when a TCP SYN packet gets no response (possibly blocked by a firewall), but also when larger packets do not get through (e.g. an MTU issue). Without some connection statistics (such as bytes sent and received) it is difficult to tell these cases apart.
Nettrace
In order to improve observability of HTTP request processing (in EVE and elsewhere), a new Golang package nettrace was developed. The idea is to hook into Golang's HTTP client and capture all kinds of information available from the different network layers involved in HTTP request processing, which, when combined, provide a clear picture of what was happening behind the scenes. When a request fails, it is then much easier to trace back to the root cause. We decided to call this process "network tracing" and its output a "network trace" (taking some inspiration from the trace package).
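The nettrace API itself is not shown here; as an illustration of the hooking mechanism it builds on, Go's standard net/http/httptrace package already exposes callbacks for many of the hidden per-request events (DNS lookup, dialing, TLS handshake, connection reuse). A minimal sketch, using a local TLS test server so it is self-contained:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

func main() {
	// Local TLS server so the example does not depend on external connectivity.
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	// Callbacks fire as the HTTP client works through the hidden
	// phases of the request (DNS, dial, TLS handshake, connection reuse).
	trace := &httptrace.ClientTrace{
		DNSStart:     func(i httptrace.DNSStartInfo) { fmt.Println("DNS lookup:", i.Host) },
		ConnectStart: func(network, addr string) { fmt.Println("dialing:", network, addr) },
		TLSHandshakeDone: func(s tls.ConnectionState, err error) {
			fmt.Println("TLS handshake done, session resumed:", s.DidResume)
		},
		GotConn: func(i httptrace.GotConnInfo) { fmt.Println("connection reused:", i.Reused) },
	}

	req, _ := http.NewRequest(http.MethodGet, srv.URL, nil)
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	// srv.Client() is pre-configured to trust the test server's certificate.
	resp, err := srv.Client().Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.StatusCode)
}
```

(Since the test server URL is an IP literal, the DNS callback does not fire in this particular run; against a real hostname it would.) nettrace goes further than httptrace alone - e.g. by also capturing connection statistics and packet captures - but the callback-based hooking idea is the same.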
...
When we collect and combine all this information, we are able to report:
|
Here “optional” means that the particular feature is configurable and can be disabled.
...
Packet captures, if enabled, are returned separately and can be stored in .pcap files.
HTTP tracing inside EVE
We decided to collect and publish HTTP traces from three microservices: downloader, nim and zedagent.
...