...

The required semantics are to have network instances separated from each other as much as possible. For example, applications deployed inside different network instances should not be able to communicate with each other directly. Only through so-called "port maps", which are effectively port-forwarding ACL rules, can network communication be established between applications on different networks. But even in this case, traffic does not get forwarded between networks directly; instead, it is hairpinned via the uplink interface on which the port mapping is configured. For networks using different uplink interfaces, the traffic even has to be hairpinned outside the box, even if the communicating applications are deployed on the same edge device.

...

We can get better separation, including IP address isolation, if we split network instances using either VRFs or network namespaces. Furthermore, if we use a containerd task to run network instance networking (especially the external processes like dnsmasq, radvd, etc.), we can even isolate resource usage and apply limits. We will now describe VRFs and network namespaces separately, with a bit more focus on VRFs, which, after some internal discussion, are now the preferred choice.

...

VRF is implemented as just another type of network device that can be created and managed using the ip command. For every VRF, a separate routing table is automatically created together with a special IP rule matching packets with the corresponding VRF routing table. For an interface to enter a VRF domain, it has to be enslaved under the VRF device (just like interfaces are enslaved under a bridge). The main drawback of VRFs is that processes have to explicitly bind their sockets to the VRF in which they want to operate. This, however, can be solved outside of those processes by hooking into the socket() function call using LD_PRELOAD. Alternatively, the ip vrf exec command can be used to bind a process to a VRF using some eBPF magic. We can also make upstream contributions and add native VRF support to the applications that we need to run inside VRFs, i.e. dnsmasq and radvd (currently neither of them is VRF-aware).
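As a minimal sketch, using the github.com/vishvananda/netlink Go library, a per-NI VRF could be created and the NI bridge enslaved under it roughly as follows (the device names and the routing table number are illustrative placeholders, not part of the proposal):

package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create a VRF device for a hypothetical network instance "ni1".
	// The kernel associates routing table 1001 with this VRF and, when the
	// first VRF is created, installs the l3mdev IP rule that directs route
	// lookups for enslaved interfaces to the VRF's table.
	vrf := &netlink.Vrf{
		LinkAttrs: netlink.LinkAttrs{Name: "vrf-ni1"},
		Table:     1001,
	}
	if err := netlink.LinkAdd(vrf); err != nil {
		log.Fatalf("failed to add VRF: %v", err)
	}
	if err := netlink.LinkSetUp(vrf); err != nil {
		log.Fatalf("failed to bring VRF up: %v", err)
	}

	// Enslave the (pre-existing) NI bridge under the VRF, just like an
	// interface is enslaved under a bridge.
	bridge, err := netlink.LinkByName("bn1")
	if err != nil {
		log.Fatalf("bridge lookup failed: %v", err)
	}
	if err := netlink.LinkSetMaster(bridge, vrf); err != nil {
		log.Fatalf("failed to enslave bridge under VRF: %v", err)
	}
}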

The diagram below shows how the separation of network instances using VRFs would look. For every local/vpn network instance, zedrouter would create a separate VRF (automatically getting its own routing table) and put the NI bridge under this VRF device. External processes providing network services separately for each NI, i.e. dnsmasq and radvd (and the HTTP server running as a goroutine of zedbox), would have their sockets bound to the VRF (most likely using 'ip vrf exec'). No VRFs would be created for switch network instances; from EVE's point of view, these are L2-only networks.
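For VRF-unaware services like dnsmasq, the following sketch shows how they could be launched with their sockets bound to the NI VRF via 'ip vrf exec' (the VRF name and the dnsmasq configuration path are hypothetical):

package main

import (
	"log"
	"os/exec"
)

func main() {
	// 'ip vrf exec <vrf> <command>' uses a cgroup/eBPF hook to bind every
	// socket created by <command> to the given VRF device, without the
	// process itself being VRF-aware.
	cmd := exec.Command("ip", "vrf", "exec", "vrf-ni1",
		"dnsmasq", "--no-daemon", "--conf-file=/run/zedrouter/dnsmasq.bn1.conf")
	if err := cmd.Start(); err != nil {
		log.Fatalf("failed to start dnsmasq inside VRF: %v", err)
	}
	log.Printf("dnsmasq running in vrf-ni1 with PID %d", cmd.Process.Pid)
}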

Uplink interfaces would remain in the default (i.e. not configured) VRF domain, and a VETH pair per NI would be used to interconnect the downlink (left) and the uplink (right) side (i.e. the left side of the VETH is enslaved under the NI VRF, while the right side remains in the default VRF domain). The VETH interface will operate in L3 mode - it will have IP addresses assigned on both ends from a /30 subnet. What supernet to allocate these subnets from is still up for discussion. It should be selected such that the risk of collision with other routed subnets is minimal. Already considered and tested are these special-purpose or reserved subnets: 0.0.0.0/8, 127.0.0.0/8, 169.254.0.0/16 and 240.0.0.0/4. However, only the last two turned out to be routable by the Linux network stack without issues.
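A sketch of the per-NI VETH setup, again using github.com/vishvananda/netlink; the interface names and the particular /30 carved out of 240.0.0.0/4 are illustrative assumptions:

package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// VETH pair interconnecting the NI VRF (downlink side) with the default
	// VRF domain (uplink side).
	veth := &netlink.Veth{
		LinkAttrs: netlink.LinkAttrs{Name: "veth-ni1"},
		PeerName:  "veth-ni1-peer",
	}
	if err := netlink.LinkAdd(veth); err != nil {
		log.Fatalf("failed to add VETH pair: %v", err)
	}
	left, err := netlink.LinkByName("veth-ni1")
	if err != nil {
		log.Fatalf("VETH lookup failed: %v", err)
	}
	right, err := netlink.LinkByName("veth-ni1-peer")
	if err != nil {
		log.Fatalf("VETH peer lookup failed: %v", err)
	}

	// The left end enters the NI VRF; the right end stays in the default
	// VRF domain where the uplink interfaces live.
	vrf, err := netlink.LinkByName("vrf-ni1")
	if err != nil {
		log.Fatalf("VRF lookup failed: %v", err)
	}
	if err := netlink.LinkSetMaster(left, vrf); err != nil {
		log.Fatalf("failed to enslave VETH end under VRF: %v", err)
	}

	// L3 mode: both ends get an address from an (assumed) /30 carved out of
	// the reserved 240.0.0.0/4 range.
	for _, ep := range []struct {
		link netlink.Link
		addr string
	}{{left, "240.0.0.1/30"}, {right, "240.0.0.2/30"}} {
		addr, err := netlink.ParseAddr(ep.addr)
		if err != nil {
			log.Fatalf("failed to parse address: %v", err)
		}
		if err := netlink.AddrAdd(ep.link, addr); err != nil {
			log.Fatalf("failed to assign %s: %v", ep.addr, err)
		}
		if err := netlink.LinkSetUp(ep.link); err != nil {
			log.Fatalf("failed to bring %s up: %v", ep.link.Attrs().Name, err)
		}
	}
}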

...

There will be no VETH links between VRFs of network instances. The current behavior of applications from different networks not being able to talk to each other directly will be preserved (and enforced with stronger measures). Hairpinning through portmaps will remain the only option for communication between network-separated applications. In the default VRF domain there will be one routing table per uplink interface. Using IP rules, each network instance will be matched with the RT of the uplink that was selected for that network by the configuration/probing. Network instances that use different uplinks at a given moment will be completely isolated from each other, not even sharing any RT along the routing path. Consequently, connections between uplink-separated NIs can only be established by hairpinning outside the edge device (through portmaps).
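A sketch of the default-VRF side: a dedicated routing table per uplink and an IP rule steering traffic arriving from a given NI's VETH into the table of the uplink currently selected for that NI (the table number, interface names and gateway address are assumptions):

package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
)

// Hypothetical dedicated routing table number for uplink eth0.
const eth0Table = 501

func main() {
	eth0, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatalf("uplink lookup failed: %v", err)
	}

	// Default route of uplink eth0, installed into its own routing table.
	defRoute := &netlink.Route{
		LinkIndex: eth0.Attrs().Index,
		Table:     eth0Table,
		Gw:        net.ParseIP("192.168.1.1"),
	}
	if err := netlink.RouteAdd(defRoute); err != nil {
		log.Fatalf("failed to add default route: %v", err)
	}

	// IP rule: traffic entering the default VRF domain from ni1's VETH is
	// resolved using eth0's routing table, i.e. ni1 is "matched" with the
	// uplink currently selected for it.
	rule := netlink.NewRule()
	rule.IifName = "veth-ni1-peer"
	rule.Table = eth0Table
	if err := netlink.RuleAdd(rule); err != nil {
		log.Fatalf("failed to add IP rule: %v", err)
	}
}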

The implementation of ACLs is not going to undergo any significant changes; the existing iptables rules will remain pretty much the same (aside from the two DNAT rules for each portmap). However, using only VRFs is not enough for isolation when NAT and connection marking are being used (for ACLs and flowlog). It is necessary to also separate conntrack entries between different VRF domains to avoid collisions with overlapping IP addressing. This can be easily accomplished using conntrack zones (conntrack entries split using integer zone IDs). A NI-specific conntrack (CT) zone is used in-between the bridge and the NI-side of the VETH (the same scope as that of the NI VRF). For the default routing table (between uplinks and the uplink-side of VETHs), we could leave the default CT zone 0. However, experiments showed that when VRF devices are being used, the default CT zone stops working correctly, resulting in skipped iptables rules and some strange behaviour. Using any non-zero CT zone seems to fix the issue (see the PoC section below).
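For illustration, the CT zones could be assigned with CT targets in the raw table, for example as follows (zone numbers and interface names are assumptions; the real rules would be generated alongside the existing iptables configuration):

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// addCTZoneRule appends a raw-table PREROUTING rule assigning conntrack
// zone `zone` to packets received on interface `iif`.
func addCTZoneRule(iif string, zone int) {
	out, err := exec.Command("iptables",
		"-t", "raw", "-A", "PREROUTING",
		"-i", iif, "-j", "CT", "--zone", fmt.Sprint(zone)).CombinedOutput()
	if err != nil {
		log.Fatalf("iptables failed: %v (%s)", err, out)
	}
}

func main() {
	// NI-specific CT zone between the bridge and the NI side of the VETH
	// (the same scope as the NI VRF).
	addCTZoneRule("bn1", 1)
	addCTZoneRule("veth-ni1", 1)

	// A non-zero zone also on the uplink side, since zone 0 was observed to
	// misbehave when VRF devices are in use.
	addCTZoneRule("veth-ni1-peer", 65000)
	addCTZoneRule("eth0", 65000)
}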

...

Special attention should be given to VPN networks. For the most part, these networks would be extended with VRFs just like the local network instances. However, for strongSwan (the IKE daemon) to operate in multi-VRF mode, we have to switch to route-based VPN mode. Using a special XFRM device (an enhanced alternative to the VTI device), it is possible to bind an IPsec tunnel to a VRF domain as shown in the diagram below.

A single strongSwan process will continue operating for all VPN network instances. For every VPN NI there will be a separate XFRM device created inside the NI VRF, linked with the corresponding IPsec connection configuration using the XFRM interface ID. Packets sent from applications will be routed by the VRF routing table via the associated XFRM device, which then determines which SAs to use for encryption. An encrypted (and encapsulated) packet then continues through the VETH pair into the default VRF domain, where it is routed out by the uplink routing table. In the opposite direction, the SPI field will link to the XFRM device and thus the VRF into which the decrypted packet should be inserted for forwarding (i.e. the VETH is skipped in this direction).
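A minimal sketch of creating a per-NI XFRM interface and enslaving it under the NI VRF; the interface ID has to match the if_id configured on the corresponding strongSwan connection (names, parent device and ID values are assumptions):

package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Parent device for the XFRM interface, e.g. the uplink.
	eth0, err := netlink.LinkByName("eth0")
	if err != nil {
		log.Fatalf("uplink lookup failed: %v", err)
	}

	// XFRM interface with interface ID 100; the same if_id would be set on
	// the corresponding strongSwan connection so that its SAs and policies
	// are bound to this device.
	xfrmi := &netlink.Xfrmi{
		LinkAttrs: netlink.LinkAttrs{
			Name:        "xfrm-ni1",
			ParentIndex: eth0.Attrs().Index,
		},
		Ifid: 100,
	}
	if err := netlink.LinkAdd(xfrmi); err != nil {
		log.Fatalf("failed to add XFRM interface: %v", err)
	}

	// Enslave the XFRM interface under the NI VRF and bring it up, so that
	// the VRF routing table can route application traffic into the tunnel.
	vrf, err := netlink.LinkByName("vrf-ni1")
	if err != nil {
		log.Fatalf("VRF lookup failed: %v", err)
	}
	if err := netlink.LinkSetMaster(xfrmi, vrf); err != nil {
		log.Fatalf("failed to enslave XFRM interface: %v", err)
	}
	if err := netlink.LinkSetUp(xfrmi); err != nil {
		log.Fatalf("failed to bring XFRM interface up: %v", err)
	}
}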

With this, it will be possible to deploy multiple VPN network instances with overlapping traffic selectors and still route/encrypt/decrypt unambiguously.

...

An alternative solution to VRFs is to isolate at all network levels: instead of using a per-NI VRF and CT zone, we could run each Linux bridge and the associated network interfaces plus external processes (dnsmasq, radvd, ...) in a separate network namespace. For the most part, this is very similar to the VRF proposal, in that both solutions use VETHs to route and NAT packets from/to apps twice. Also, the PBR routes/rules and iptables rules are very much the same, just spread across multiple namespaces.

The advantage of having multiple namespaces is stronger isolation and not having all routes and iptables rules crammed into one network stack. Also, this solution is completely transparent to processes (like dnsmasq, radvd, etc.). The major downside of this solution is a higher overhead (memory footprint in particular). Also, debugging will be somewhat more difficult. For example, for packet tracing one has to first switch to the proper network namespace, or trace packets across multiple namespaces at once.

However, from the management-plane point of view, this proposal is considerably more difficult to implement than VRFs. Working with multiple namespaces from the same process (e.g. zedbox) is possible but quite challenging. While each process has its own "default" namespace in which it was started, individual threads can be switched between namespaces as needed. However, frequent switching between namespaces adds some overhead and makes development and debugging even harder than they already are. For this reason, most network-related software products, including strongSwan for example, are intentionally not able to manage multiple network namespaces from a single process instance.
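To illustrate the thread-switching gymnastics in Go specifically: a goroutine must be pinned to its OS thread before that thread's namespace can be changed, and the original namespace must be restored afterwards. A sketch using the github.com/vishvananda/netns library (the namespace name is hypothetical):

package main

import (
	"log"
	"runtime"

	"github.com/vishvananda/netns"
)

// inNamespace runs fn with the current OS thread switched into the named
// network namespace, restoring the original namespace before returning.
func inNamespace(name string, fn func() error) error {
	// The goroutine must be pinned, otherwise the Go scheduler could move it
	// to a thread that is still in the default namespace.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	orig, err := netns.Get()
	if err != nil {
		return err
	}
	defer orig.Close()

	target, err := netns.GetFromName(name)
	if err != nil {
		return err
	}
	defer target.Close()

	if err := netns.Set(target); err != nil {
		return err
	}
	defer netns.Set(orig) // switch the thread back

	return fn()
}

func main() {
	err := inNamespace("ni1-ns", func() error {
		// Any netlink/socket operations here apply to the "ni1-ns" namespace.
		log.Println("configuring network instance inside its namespace")
		return nil
	})
	if err != nil {
		log.Fatalf("namespace operation failed: %v", err)
	}
}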

...

In general, it is recommended to spawn a new child process for every network namespace that needs to be operated in. For this reason, this proposal will follow up on the "Bridge Manager" described here.

...

The following diagram shows how network instances can be isolated from each other using network namespaces. As can be seen, not only is the network configuration spread across namespaces, but the management plane is also split into multiple processes (all of which increases complexity and overhead, thus making this proposal less appealing).

...

For VETHs, the subnets 127.0.0.0/8 and 0.0.0.0/8 sadly failed validation - routing does not work as expected/desired (even if the local table is tweaked in various ways). On the other hand, 169.254.0.0/16 and 240.0.0.0/4 can be routed between network namespaces and VRFs without issues. But for 169.254.0.0/16, we would need to select a subnet that does not contain 169.254.169.254, which is already used for the HTTP server with cloud-init metadata. After some internal discussion, we are more inclined to allocate VETH IPs from the (most likely) forever-reserved Class E subnet 240.0.0.0/4.
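Purely as an illustration of the address arithmetic, a hypothetical helper could deterministically carve a /30 out of 240.0.0.0/4 for each NI (the allocation scheme below is an assumption, not part of the proposal):

package main

import (
	"fmt"
	"net"
)

// vethSubnetForNI returns the /30 subnet for the NI with the given index,
// carved sequentially out of 240.0.0.0/4. Each /30 provides two usable host
// addresses: one for each end of the NI's VETH pair.
func vethSubnetForNI(niIndex uint32) (*net.IPNet, net.IP, net.IP) {
	base := uint32(240) << 24   // 240.0.0.0
	network := base + niIndex*4 // each /30 spans 4 addresses
	toIP := func(v uint32) net.IP {
		return net.IPv4(byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
	}
	subnet := &net.IPNet{IP: toIP(network), Mask: net.CIDRMask(30, 32)}
	return subnet, toIP(network + 1), toIP(network + 2)
}

func main() {
	subnet, left, right := vethSubnetForNI(0)
	fmt.Printf("NI 0: subnet %s, VRF-side %s, uplink-side %s\n", subnet, left, right)
	// Prints: NI 0: subnet 240.0.0.0/30, VRF-side 240.0.0.1, uplink-side 240.0.0.2
}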

...

No Format
PR 1:
* Build Linux kernel with VRF support
	= https://www.pivotaltracker.com/story/show/178785411

PR 2:
* LD-PRELOAD library for VRF-unaware processes
	= https://www.pivotaltracker.com/story/show/178785483
    - or test 'ip vrf exec' as an alternative

PR 3:
* Eden test for local networks with overlapping IP subnets
	= https://www.pivotaltracker.com/story/show/178785541
	- without VRFs this test will be failing (i.e. would not be merged until PR 4 is done)

PR 4:
* Local & Switch Network instance (Create/Modify/Delete)
	= https://www.pivotaltracker.com/story/show/178785641
* ACLs
	= https://www.pivotaltracker.com/story/show/178785656
* Flow collection
	= https://www.pivotaltracker.com/story/show/178785689
* Network instance metrics
	= https://www.pivotaltracker.com/story/show/178785716

PR 5:
* Eden test for VPN networks with overlapping traffic selectors
	= https://www.pivotaltracker.com/story/show/178785745
	- without VRFs this test will be failing (i.e. would not be merged until PR 6 is done)

PR 6:
* VPN Network instance
	= https://www.pivotaltracker.com/story/show/178785793

...