My streak was broken over Golden Week, but things settled down in the latter half of May, so I’m resuming. Continuing the lower-layer tracing theme, I’ll introduce papers on container overlay networks that use eBPF.

The Japanese version of this article is available here.

vNetTracer: Efficient and Programmable Packet Tracing in Virtualized Networks

K. Suo, Y. Zhao, W. Chen and J. Rao, “vNetTracer: Efficient and Programmable Packet Tracing in Virtualized Networks,” 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS), Vienna, Austria, 2018, pp. 165-175, doi: 10.1109/ICDCS.2018.00026.

Overview
  • Proposes using eBPF for dynamic, non-intrusive tracing of packet delivery across virtualized networks and their boundaries
  • A group that primarily researches container overlay networks and virtualized networks
Thoughts
  • PacketID propagation doesn’t seem to be done system-wide; apparently it is only used for matching between sender and receiver – if so, couldn’t this be done without a PacketID?
    • Can individual packets really not be identified using only TCP/UDP header information?
    • Since analysis seems to be done offline anyway, matching on headers at that point should eliminate the communication overhead
  • They load a kernel module to reduce overhead, but how much reduction does it actually achieve?
    • Reducing disk operations during log storage is important, but how often do disk operations actually occur?
    • If it can still function without the kernel module, I’d want to consider the tradeoff between the effort of loading the kernel module and the overhead reduction
  • There are several eBPF-based tracing tools today, but was there really nothing back then (2018)?
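The header-matching alternative I raised above can be sketched in a few lines. This is my own illustration, not anything from the paper: packets logged at the sender and receiver are paired offline by their 5-tuple plus TCP sequence number instead of a propagated PacketID (all field and function names here are hypothetical).

```python
# Offline matching of packets captured at two points, keyed by
# (5-tuple, TCP sequence number) instead of a propagated PacketID.
# Hypothetical sketch; field names are my own, not vNetTracer's.

def flow_key(pkt):
    """Identify a packet by its 5-tuple plus TCP sequence number."""
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
            pkt["dst_port"], pkt["proto"], pkt["seq"])

def match_traces(sender_log, receiver_log):
    """Pair sender- and receiver-side observations; return one-way delays."""
    received = {flow_key(p): p["ts"] for p in receiver_log}
    pairs = []
    for p in sender_log:
        key = flow_key(p)
        if key in received:
            pairs.append((key, received[key] - p["ts"]))
    return pairs
```

One caveat that may justify a PacketID after all: with retransmissions the same (5-tuple, sequence number) can appear more than once, and header fields alone cannot tell those occurrences apart.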

Efficient Network Monitoring Applications in the Kernel with eBPF and XDP

M. Abranches, O. Michel, E. Keller and S. Schmid, “Efficient Network Monitoring Applications in the Kernel with eBPF and XDP,” 2021 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Heraklion, Greece, 2021, pp. 28-34, doi: 10.1109/NFV-SDN53031.2021.9665095.

Overview
  • Proposes a network monitoring framework that consolidates the packet-processing tasks common to all network analysis applications and executes them in kernel space using eBPF and XDP, launching the applications themselves only when necessary to reduce resource usage and overhead
  • Many of the authors research SDN and NFV
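The in-kernel eBPF/XDP part can’t be reproduced in a short user-space example, but the control pattern described above – one shared pre-processing pass feeding many monitoring applications, each invoked only when its trigger condition fires – can be sketched as follows. All names are hypothetical, not from the paper:

```python
# User-space simulation of the pattern: one common packet-processing pass
# (in the paper this runs in kernel space via eBPF/XDP) shared by all
# monitoring applications, which are invoked only on demand.
from collections import defaultdict

class Monitor:
    def __init__(self):
        self.flow_bytes = defaultdict(int)  # shared state, computed once
        self.apps = []                      # (trigger, handler) pairs

    def register(self, trigger, handler):
        self.apps.append((trigger, handler))

    def process(self, pkt):
        # Common processing: every application shares this single pass.
        key = (pkt["src_ip"], pkt["dst_ip"])
        self.flow_bytes[key] += pkt["len"]
        # Invoke an application only when its trigger condition holds.
        fired = []
        for trigger, handler in self.apps:
            if trigger(key, self.flow_bytes[key]):
                fired.append(handler(key, self.flow_bytes[key]))
        return fired
```

The point of the design, as I read it, is that the per-packet cost stays constant no matter how many applications are registered; only the cheap trigger checks scale with the application count.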
Thoughts
  • The novelty seems to lie in extracting common processing from multiple network monitoring applications and orchestrating those applications, but is the use of eBPF and XDP for that common processing really new? Weren’t individual monitoring apps already using eBPF and XDP?
  • I’d like to revisit this after building more knowledge about SDN and NFV

Bypass Container Overlay Networks with Transparent BPF-driven Socket Replacement

S. Choochotkaew, T. Chiba, S. Trent and M. Amaral, “Bypass Container Overlay Networks with Transparent BPF-driven Socket Replacement,” 2022 IEEE 15th International Conference on Cloud Computing (CLOUD), Barcelona, Spain, 2022, pp. 134-143, doi: 10.1109/CLOUD55607.2022.00033.

Overview
  • Proposes reducing overhead by bypassing the container overlay network: a host-side agent uses eBPF and ptrace to transparently replace communication over the Pod network namespace with communication over the Host (default) network namespace
  • High usability because it doesn’t modify user processes; safe because it doesn’t require privilege escalation of user processes, preventing container escape attacks
  • A group at IBM Research Tokyo. The first author has several publications on containers, Kubernetes, and BPF.
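For the replacement described above, the host agent must hand an already-open host-namespace socket to a process in another namespace. The standard Unix mechanism for moving an open file descriptor between processes is SCM_RIGHTS ancillary data over a Unix-domain socket, which Python 3.9+ exposes as socket.send_fds/recv_fds. The sketch below demonstrates only that mechanism locally; it is my assumption about the plumbing, not the paper’s actual implementation:

```python
# Minimal demonstration of fd passing via SCM_RIGHTS (Linux).
# An "agent" sends an open socket fd to another process, which adopts it.
import socket

def pass_socket(agent_sock, fd_to_send):
    """Agent side: send an open fd over a Unix-domain socket."""
    socket.send_fds(agent_sock, [b"replace"], [fd_to_send])

def receive_socket(proc_sock):
    """Process side: receive the fd and wrap it in a socket object."""
    msg, fds, _flags, _addr = socket.recv_fds(proc_sock, 1024, 1)
    return socket.socket(fileno=fds[0])
```

The received descriptor is a duplicate referring to the same open socket, so data written to the peer end is readable through it immediately; the harder part, which the paper addresses with eBPF and ptrace, is making the application use the new descriptor without noticing.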
Thoughts
  • They seem to replace sockets with mirrored ones at connection establishment time and bypass subsequent communication, but how would this work for connectionless communication?
    • They probably give up and go through the Pod’s network namespace as usual
  • Wouldn’t having the client/server O2H (proposed) daemons communicate with each other every time a connection is established introduce significant overhead?
    • Socket replacement apparently happens after the connection is established and communication begins, so the daemon-to-daemon communication might happen during that window
  • Couldn’t packet loss occur from replacing sockets mid-communication?
  • Having read the Slim paper, I wondered how this proposal differs from Slim
    • One possibility is that by using the Pod’s network namespace until the connection is established on the Host’s network namespace, connection establishment overhead is reduced
    • Using eBPF is just an implementation-level difference

Slim: OS Kernel Support for a Low-Overhead Container Overlay Network

D. Zhuo, K. Zhang, Y. Zhu, H. H. Liu, M. Rockett, A. Krishnamurthy, and T. Anderson, “Slim: OS kernel support for a Low-Overhead container overlay network,” in 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19). Boston, MA: USENIX Association, Feb. 2019, pp. 331–344. [Online]. Available: https://www.usenix.org/conference/nsdi19/presentation/zhuo

Overview
  • Proposes Slim, a container overlay network that replaces Container Namespace sockets with Host Namespace sockets at connection establishment time
  • Reduces container overlay network overhead by having packets pass through the OS kernel’s network stack only once
Thoughts
  • Communicating from SlimSocket to SlimRouter for every connection establishment seems to introduce significant overhead
    • The overhead of connection establishment is discussed in the paper, and it’s apparently still faster overall than going through a conventional container overlay network
    • Short-lived connections would be considerably slow, so connection reuse becomes important
  • They use LD_PRELOAD for interposition. The same thing could probably be done with eBPF – how would the two approaches differ?
    • At the very least, it’s probably less secure than eBPF
    • The mention that statically linked binaries (such as many applications written in Go) need binary patching is concerning: LD_PRELOAD can only interpose on dynamically linked library calls, so statically linked binaries can’t be intercepted that way
      • It’s mentioned briefly but seems like a significant drawback
      • If the same thing could be achieved with eBPF, this drawback might be eliminated, but I’m not sure
  • The implementation repository was published, so I’d like to read it (https://github.com/danyangz/slim)

XMasq: Low-Overhead Container Overlay Network Based on eBPF

S. Lin, P. Cao, T. Huang, S. Zhao, Q. Tian, Q. Wu, D. Han, X. Wang, and C. Zhou, “Xmasq: Low-overhead container overlay network based on ebpf,” 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.05455

Overview
  • Proposes bypassing the container overlay network by using eBPF to cache information for each container pair during the first round trip, then rewriting packet headers for subsequent packets
  • The cached information primarily includes Host/Container MAC/IP addresses for source and destination, the container’s NIC index, and a key that uniquely identifies the container pair within a host pair
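The caching scheme described above can be sketched conceptually: after the first round trip, a map keyed by the container pair holds the addresses needed to rewrite headers and skip the overlay path. Here a plain dict stands in for the eBPF map, and all field names are my own, not XMasq’s:

```python
# Conceptual sketch of the fast path: a dict standing in for an eBPF map
# keyed by the container pair, used to rewrite outer headers and bypass
# the overlay. Field names are hypothetical.

masq_cache = {}  # key: (src_container_ip, dst_container_ip)

def cache_pair(src_cip, dst_cip, entry):
    """Populate the cache during the first round trip."""
    masq_cache[(src_cip, dst_cip)] = entry

def rewrite(pkt):
    """Fast path: rewrite headers from the cache; None means cache miss."""
    entry = masq_cache.get((pkt["src_ip"], pkt["dst_ip"]))
    if entry is None:
        return None  # first packet of a pair still takes the overlay path
    pkt = dict(pkt)
    pkt["eth_src"] = entry["host_src_mac"]
    pkt["eth_dst"] = entry["host_dst_mac"]
    pkt["outer_src_ip"] = entry["host_src_ip"]
    pkt["outer_dst_ip"] = entry["host_dst_ip"]
    return pkt
```

A cache miss falls back to the normal overlay path, which is what populates the cache in the first place; subsequent packets between the same pair take only the rewrite.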
Thoughts
  • This seemed like a very innovative and interesting proposal
    • It was just submitted to arXiv at the beginning of the month, so I’m curious about future developments
  • Aren’t there side effects from writing the Restore Key into the IP ID, DSCP, or Options fields?
  • If Container Live Migration is achieved by deleting the cache when a container’s placement host changes, is there any point in setting a cache refresh interval?
  • Each host needs to cache information about related container pairs, but wouldn’t the cache size become quite large as the number of containers increases?
    • Maybe it’s fine because any given container only communicates with a limited number of other containers?
  • Since it uses round trips to populate source and destination container information on each other’s hosts, this wouldn’t be applicable for UDP communication that unilaterally sends packets (with no response)
    • Though such communication patterns are probably quite rare
  • If access control rule matching results were cached in the Masq Map, there would be no need to check the Rule Map every time
  • Apparently, running traceroute from within a container would reveal underlay network information, so this might be difficult to use when tenants aren’t trusted
    • Slim was quite careful about this
  • The GitHub repository URL was listed but returned a 404
    • I wanted to see the implementation, so that’s unfortunate – I wonder if it will eventually be published