...

  • In EVE, HTTP is the main application protocol, used both to carry management traffic and to download images:

    • Used for all communication between device and controller

    • Used to verify network connectivity (which is why network connectivity errors often come from Golang’s HTTP client)

    • Used to download application and EVE images (including from AWS, Azure and GCP storage); the only exception is the SFTP datastore (it is unclear how frequently customers use it)

  • Troubleshooting of customer-reported network issues is quite challenging …

  • EVE only reports the final error returned by the HTTP client. That error typically wraps several errors collected on the way back up the stack, but important information from the lower layers is often lost

  • It is difficult to trace an error back to its root cause retrospectively

  • A single HTTP request hides a complicated sequence of operations behind the scenes, possibly consisting of:

    • multiple DNS requests (A + AAAA, multiple DNS servers)

    • additional TCP connection attempts (if multiple resolved IPs are available and some fail)

    • reused TCP connections (previously opened)

    • TLS handshakes (some possibly with reused sessions), server cert verification

    • HTTP/HTTPS proxying

    • HTTP redirects

  • When we receive an error message, it is often difficult to determine which underlying operation triggered it and what led to it. For example, each of the operations has its own timeout, and when we receive “Context deadline exceeded” it is hard or impossible to tell which operation failed to finish in time or consumed an unexpectedly large share of the time budget

  • Image downloading is even harder to troubleshoot because, without very close code inspection, we do not know what a 3rd-party library (like github.com/aws/aws-sdk-go) is doing behind the scenes (potentially running multiple HTTP requests). We often get errors like “<very-very-long-url>: connection timeout”, with no indication of how we got there or what that particular request was doing. The download process gets even more convoluted if the customer has put a load balancer, e.g. Azure Traffic Manager, in front of the datastore.

  • When we cannot make progress on a customer issue, we often ask the customer to make observations and collect information for us, such as interface packet counters, conntrack entries or packet traces, and to run curl/ping/dig commands on our behalf

    • it would be more efficient (and more professional) if we could get such information automatically alongside a logged/published network error

  • The same error can have different causes. For example, “Context deadline exceeded” may be returned when a TCP SYN packet gets no response (possibly blocked by a firewall) as well as when larger packets are not getting through (e.g. an MTU issue). Without connection statistics (such as bytes sent and received) it is difficult to tell these cases apart.

...