Paper Reviews: Distributed Tracing
Continuing with distributed tracing papers. The first two are classics from before Dapper. The last one is a recent paper that uses eBPF for lower-layer tracing.
X-Trace: A Pervasive Network Tracing Framework
R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, “X-trace: A pervasive network tracing framework,” in Proceedings of the 4th USENIX Conference on Networked Systems Design & Implementation, ser. NSDI'07. USA: USENIX Association, 2007, p. 20.
Overview
- Proposes comprehensive tracing by propagating unified metadata across applications, network layers, and administrative domains, then constructing a tree that shows the flow of each request
- The authors were networking researchers at UC Berkeley at the time. Between them they are known for remarkable work in various fields (OpenFlow, RAID, Mesos, etc.)
Thoughts
- As one of the earliest papers on annotation-based tracing, its problem setting is simpler and easier to understand than that of papers on recent large-scale, complex systems
- Rather than implementing tracing within a specific service infrastructure under a single administrator, they aim to support it across the entire internet and all protocols, crossing administrative boundaries, which is quite an ambitious vision
- Tracing across multiple applications (pushNext) is still mainstream and makes sense, but is there really a need for tracing across multiple layers (pushDown)?
  - Embedding metadata separately in application-layer protocols, TCP, and IP would duplicate data and add significant overhead without much practical benefit
- Tracing is useful for diagnosing performance under normal conditions, but the anomaly diagnosis shown in the Usage Scenarios (particularly the DNS story) seems achievable through other means
  - Anomalies in individual components should be discovered and reported by their respective administrators; there shouldn’t be a need for another administrator to learn about them first through traces
- While the desire to trace across administrative boundaries is understandable, in practice, disclosing even partial trace data to others seems difficult (from a security risk and confidential information leakage perspective)
  - If every party runs comprehensive tracing but collects and analyzes its trace data independently, it defeats the purpose
  - I wonder where this stands today. OpenTelemetry exists, but I haven’t heard of administrators sharing trace data with each other, so it’s probably still difficult
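To make the pushNext/pushDown distinction above concrete, here is a minimal Python sketch of how I understand the propagation model. It is heavily simplified and is not the paper's implementation: the field names, the sequential operation IDs, and the in-memory `reports` list (standing in for out-of-band report collection) are all my own assumptions.

```python
import itertools
from dataclasses import dataclass

_op_ids = itertools.count(1)  # assumption: sequential IDs instead of random ones
reports = []                  # assumption: in-memory stand-in for a report collector


@dataclass(frozen=True)
class Metadata:
    task_id: str  # shared by every operation that belongs to one request
    op_id: int    # identifies the operation that carries this metadata


def _propagate(parent: Metadata, edge: str, label: str) -> Metadata:
    """Create metadata for a new operation and report the edge from its parent."""
    child = Metadata(parent.task_id, next(_op_ids))
    reports.append((parent.task_id, parent.op_id, child.op_id, edge, label))
    return child


def push_next(md: Metadata, label: str) -> Metadata:
    """Propagate to the next hop in the same layer (e.g. proxy -> server)."""
    return _propagate(md, "next", label)


def push_down(md: Metadata, label: str) -> Metadata:
    """Propagate to the layer below (e.g. HTTP -> TCP)."""
    return _propagate(md, "down", label)


def build_tree(task_id: str) -> dict:
    """Rebuild parent -> [(child, edge, label)] from the collected reports."""
    tree = {}
    for tid, parent, child, edge, label in reports:
        if tid == task_id:
            tree.setdefault(parent, []).append((child, edge, label))
    return tree


# One request: an HTTP client talks to a server through a proxy.
root = Metadata("task-1", next(_op_ids))
proxy = push_next(root, "HTTP proxy")        # same layer, next hop
tcp = push_down(root, "TCP client->proxy")   # lower layer under the first hop
server = push_next(proxy, "HTTP server")
tree = build_tree("task-1")
```

Even in this toy version, every pushDown adds another copy of the metadata and another report per layer, which is the duplication cost questioned above.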
Causeway: Operating System Support For Controlling And Analyzing The Execution Of Distributed Programs
A. Chanda, K. Elmeleegy, A. L. Cox, and W. Zwaenepoel, “Causeway: Operating system support for controlling and analyzing the execution of distributed programs,” in Proceedings of the 10th Conference on Hot Topics in Operating Systems - Volume 10, ser. HOTOS'05. USA: USENIX Association, 2005, p. 18.
Overview
- Causeway provides kernel-level support for metadata propagation in distributed programs, which makes it easier to develop meta-applications that depend on such propagation
Thoughts
- Can this really eliminate application-level instrumentation?
  - The metadata injection and access interface seems designed for meta-applications (actors) to call, but wouldn’t it be necessary to pass that metadata to the application?
- Is communication overhead not considered?
- The demand for protocol-independent metadata propagation with less application instrumentation has existed for this long, yet there still doesn’t seem to be a fundamental solution, which suggests it’s an extremely difficult problem
- The idea of propagation at lower layers could achieve more with today’s technology
  - Kernel extensions could potentially be done with eBPF
- The paper is old enough that the terminology it takes for granted is hard to follow, and it’s hard to gauge how novel it was at the time
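As a way to reason about the questions above, here is a small Python simulation of the general idea of propagating metadata below the application. It is only a sketch under my own assumptions and not Causeway's actual interface: `inject`, `current`, and `Channel` are hypothetical names, a thread-local stands in for per-thread state kept by the kernel, and a queue stands in for a socket or pipe.

```python
import queue
import threading

_current = threading.local()  # stand-in for per-thread metadata held by the kernel


def inject(metadata):
    """Hypothetical meta-application call: tag the calling thread."""
    _current.metadata = metadata


def current():
    """Hypothetical meta-application call: read the calling thread's tag."""
    return getattr(_current, "metadata", None)


class Channel:
    """Stand-in for a socket or pipe whose send/recv path the kernel controls."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, payload):
        # The "kernel" attaches the sender's metadata; the caller passes only payload.
        self._q.put((current(), payload))

    def recv(self):
        metadata, payload = self._q.get()
        _current.metadata = metadata  # the receiving thread inherits the sender's tag
        return payload


seen = {}


def worker(ch):
    msg = ch.recv()        # application code sees only the payload
    seen[msg] = current()  # a meta-application can still read the propagated tag


ch = Channel()
inject({"request": 42})
ch.send("hello")
t = threading.Thread(target=worker, args=(ch,))
t.start()
t.join()
```

The application code in `worker` never handles the tag itself, yet the tag travels with every message, so the overhead question remains, and anything the application needs to know about the tag still has to be exposed to it somehow.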
Enhancing Packet Tracing of Microservices in Container Overlay Networks using eBPF
C. Lee, R. Yoshitani, and T. Hirotsu, “Enhancing packet tracing of microservices in container overlay networks using ebpf,” in Proceedings of the 17th Asian Internet Engineering Conference, ser. AINTEC '22. New York, NY, USA: Association for Computing Machinery, 2022, p. 53–61. [Online]. Available: https://doi.org/10.1145/3570748.3570756
Overview
- Proposes using eBPF to extend annotation-based tracing with latency measurement in container overlay networks
- A group at Hosei University researching distributed systems and tracing. The first author is from Toyota’s research lab and works on distributed systems, SDN, and NFV.
Thoughts
- When it comes to container overlay networks and eBPF, Cilium comes to mind, but it’s curious that there’s no mention of it at all
  - It’s good that they support Flannel and Calico with VXLAN and IPIP analysis, but wouldn’t it be easier to extend Cilium?
- The problem that annotation-based tracing only considers the application layer is valid, and the idea of extending it to the infrastructure layer is interesting
  - Since we’re already incurring the cost of propagating tracing context, it would be nice to find more uses for it
- I didn’t know there was a group in Japan doing this kind of research
- I’d like to read or re-implement their code as a way to study eBPF
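As a first step toward that, here is a minimal userspace sketch in Python of the two encapsulation formats mentioned above. It is not the authors' code, which runs as eBPF programs in the kernel; it only shows, under the standard header layouts, where the inner packet sits inside a VXLAN or IPIP packet. The function names are my own.

```python
import struct

VXLAN_PORT = 4789            # IANA-assigned UDP destination port for VXLAN
VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid
IPPROTO_IPIP = 4             # IPv4 protocol number for IP-in-IP


def parse_vxlan(udp_payload: bytes):
    """Return (vni, inner_ethernet_frame) from a VXLAN UDP payload."""
    if len(udp_payload) < 8:
        raise ValueError("VXLAN header is 8 bytes")
    # Layout: flags (1 byte), reserved (3 bytes), VNI (3 bytes), reserved (1 byte).
    flags, vni_and_reserved = struct.unpack("!B3xI", udp_payload[:8])
    if not flags & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    vni = vni_and_reserved >> 8  # upper 24 bits are the VNI
    return vni, udp_payload[8:]


def parse_ipip(ip_packet: bytes) -> bytes:
    """Return the inner IP packet from an IP-in-IP encapsulated IPv4 packet."""
    if len(ip_packet) < 20:
        raise ValueError("IPv4 header is at least 20 bytes")
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if ip_packet[9] != IPPROTO_IPIP:
        raise ValueError("outer packet is not IP-in-IP")
    return ip_packet[header_len:]
```

An eBPF version would do the same offset arithmetic on the packet buffer, with explicit bounds checks to satisfy the verifier.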
