...

  • All Dial attempts. Each record contains:
    • reference to trace record of established TCP connection (undefined if failed)
    • dial begin + end time, context close time
    • destination address
    • proxy config
    • static source IP (if set, otherwise undefined)
    • dial error (nil if OK)
  • All DNS queries. Each record contains:
    • reference to trace record of Dial where this originated from
    • reference to trace record of the underlying UDP or TCP connection (TCP is used as a fallback when the UDP DNS response is truncated)
    • (optional) sent DNS questions and received DNS message header + answers (we are able to parse DNS messages from sent/received data)
  • All TCP connections (attempts + established). Each record contains:
    • reference to trace record of Dial where this originated from
    • handshake start + done time, conn close time
    • 4-tuple (src IP, src port, dst IP, dst port)
    • was it reused?
    • total sent + received bytes (L4 payload)
    • (optional) conntrack (captured-at time, 5-tuple after NAT, mark, flags, packet/byte counters)
    • (optional) socket trace - array of:
      • operation type (read or write), op begin+end time, transferred data length, error (nil if OK)
  • All UDP "connections" (or rather exchanges of messages). Each record contains:
    • reference to trace record of Dial where this originated from
    • time when the socket was created and when it was closed
    • 4-tuple (src IP, src port, dst IP, dst port)
    • total sent + received bytes (L4 payload)
    • (optional) conntrack (captured-at time, 5-tuple after NAT, mark, flags, packet/byte counters)
    • (optional) socket trace - array of:
      • operation type (read or write), op begin+end time, transferred data length, error (nil if OK)
  • All TLS tunnels (attempted + established). Each record contains:
    • reference to trace record of the underlying TCP connection
    • was resumed from a previous session?
    • handshake start + done time, error (nil if OK)
    • negotiated cipher and application proto
    • SNI value
    • for every peer cert in the chain:
      • subject, issuer, validity time range (NotBefore, NotAfter)
  • All HTTP requests made. Each record contains info for both the request and the response:
    • reference to trace record of the underlying TCP connection
    • reference to trace record(s) of the underlying TLS tunnel(s) (two tunnels are created when the proxy listens on HTTPS)
    • time when the request was sent
    • method, URL, HTTP version
    • (optional) request headers
    • request message content length (not transport length which can differ)
    • time when response was received, error (nil if OK)
    • response status code, HTTP version
    • (optional) response headers
    • response message content length (not transport length which can differ)
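
The cross-references between records listed above could be modeled roughly as follows. This is only a minimal sketch covering the Dial and TCP connection records, with conntrack left out. All type, field and method names (TraceID, DialTrace, TCPConnTrace, SocketOp, NetTrace, TCPConnsOfDial) are assumptions made for illustration, not the actual implementation, and the proxy config is simplified to a string.

```go
package main

import "time"

// TraceID is a hypothetical identifier used to cross-reference trace records.
type TraceID string

// DialTrace describes one Dial attempt.
type DialTrace struct {
	ID              TraceID
	EstablishedConn TraceID // TCP connection record; empty if the dial failed
	DialBeginAt     time.Time
	DialEndAt       time.Time
	CtxCloseAt      time.Time
	DstAddress      string
	ProxyConfig     string // simplified to a proxy URL; empty if no proxy
	StaticSrcIP     string // empty if not set
	DialErr         string // empty if OK
}

// SocketOp is one entry of the optional socket trace.
type SocketOp struct {
	IsWrite bool // false = read, true = write
	BeginAt time.Time
	EndAt   time.Time
	DataLen uint32
	OpErr   string // empty if OK
}

// TCPConnTrace describes one attempted or established TCP connection.
type TCPConnTrace struct {
	ID               TraceID
	FromDial         TraceID // Dial record this connection originated from
	HandshakeBeginAt time.Time
	HandshakeDoneAt  time.Time
	ConnCloseAt      time.Time
	SrcIP, DstIP     string
	SrcPort, DstPort uint16
	Reused           bool
	SentBytes        uint64     // L4 payload
	RecvBytes        uint64     // L4 payload
	SocketTrace      []SocketOp // optional; nil if not captured
}

// NetTrace groups the records collected during one traced operation.
type NetTrace struct {
	Dials    []DialTrace
	TCPConns []TCPConnTrace
}

// TCPConnsOfDial returns every TCP connection (attempted or established)
// that originated from the given Dial.
func (nt NetTrace) TCPConnsOfDial(dial TraceID) []TCPConnTrace {
	var conns []TCPConnTrace
	for _, conn := range nt.TCPConns {
		if conn.FromDial == dial {
			conns = append(conns, conn)
		}
	}
	return conns
}
```

Storing references as IDs rather than pointers keeps the records easy to serialize. The DNS, UDP, TLS and HTTP records would follow the same pattern, each carrying the ID of the record it refers to.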

...

Unlike in the downloader, in nim it makes sense to include all tracing information, including packet capture, so that we can narrow down the root cause of a failed check as much as possible. However, tracing should then be performed much less frequently, not with every connectivity check performed by nim, which happens at least once every 5 minutes. Multiple traces obtained during the same network issue would likely not add any new information. We decided to run full HTTP tracing at most once per hour before onboarding and at most once per day after onboarding (the second interval is configurable, learn more here), and only when the latest DPC is being tested. It does not make sense to troubleshoot obsolete network configurations.
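
The rate-limiting rule described above can be sketched as a small decision function. This is an illustrative sketch only: the names traceInput, shouldTrace and the interval constants are hypothetical, and the choice to trace immediately when no trace has been taken yet is an assumption not stated in the text.

```go
package main

import "time"

const (
	// Before onboarding: at most one full trace per hour.
	intervalBeforeOnboard = time.Hour
	// After onboarding: at most one full trace per day by default.
	defaultIntervalAfterOnboard = 24 * time.Hour
)

// traceInput collects the inputs of the decision (hypothetical type).
type traceInput struct {
	Onboarded            bool
	TestingLatestDPC     bool
	TracedBefore         bool          // false if no trace has been taken yet
	SinceLastTrace       time.Duration // ignored when TracedBefore is false
	AfterOnboardInterval time.Duration // configured value; 0 means use the default
}

// shouldTrace reports whether the next connectivity check should run
// with full tracing enabled.
func shouldTrace(in traceInput) bool {
	// Obsolete network configurations are never traced.
	if !in.TestingLatestDPC {
		return false
	}
	// Assumption: the very first check of the latest DPC may be traced.
	if !in.TracedBefore {
		return true
	}
	interval := intervalBeforeOnboard
	if in.Onboarded {
		interval = defaultIntervalAfterOnboard
		if in.AfterOnboardInterval > 0 {
			interval = in.AfterOnboardInterval
		}
	}
	return in.SinceLastTrace >= interval
}
```

For example, with an onboarded device and the default interval, a check performed two hours after the last trace would run without tracing, while the same check before onboarding would be traced.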

...

With nim sporadically tracing /ping and google.com requests, it still makes sense to utilize network tracing in zedagent as well. This microservice runs the most important requests: /config to get the latest device configuration, aka the intended state, and /info to publish the actual device state. In fact, both of these must succeed for the device to be considered Online and not Suspect by zedcloud. As pointed out above, a failing latest DPC is applied only temporarily, until nim performs one connectivity check and falls back to a previous DPC. This means that as long as the latest DPC is marked as not working, it does not make sense for zedagent to trace its requests, because they would likely be using an obsolete DPC anyway. However, if nim evaluates the latest DPC as working yet zedagent is failing to get config or publish information (specifically ZInfoDevice), then zedagent is eligible to run tracing and publish the output. The same tracing interval (by default at most once per hour before onboarding and at most once per day after onboarding) applies here as well.
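
The eligibility condition for zedagent described above boils down to a simple predicate. The sketch below is illustrative only: zedagentState and eligibleForTracing are hypothetical names, and the tracing interval check is reduced to a single boolean input.

```go
package main

// zedagentState is a hypothetical summary of the inputs to the decision.
type zedagentState struct {
	LatestDPCWorking bool // latest DPC evaluated as working by nim
	ConfigFailing    bool // /config requests are failing
	InfoFailing      bool // publishing of /info (ZInfoDevice) is failing
	IntervalElapsed  bool // tracing interval since the last trace has passed
}

// eligibleForTracing reports whether zedagent may trace its next
// /config or /info request.
func eligibleForTracing(s zedagentState) bool {
	// With a non-working latest DPC, requests likely use an obsolete DPC.
	if !s.LatestDPCWorking {
		return false
	}
	// Tracing is triggered only by failing config or info requests.
	if !s.ConfigFailing && !s.InfoFailing {
		return false
	}
	return s.IntervalElapsed
}
```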

Both successful and failed config/info network traces are published.