Bridges as a First-Class Citizen for Distributed Tracing
Enhanced Trace Models for More Effective Observability in Distributed Systems
Observability in distributed systems is often treated as a second-class citizen, which leads to ad hoc data collection that is frequently unstructured and hard to correlate across sources. Additionally, modern-day observability platforms (like OpenTelemetry) use data models that prioritize ease of use for developers but lack the expressiveness needed for complex use cases. OpenTelemetry uses the span-based model, which captures caller-callee relationships between units of execution. Research has highlighted limitations of the span-based tracing model (e.g., it cannot support slack analysis) and compared it with the enhanced event-based trace model, which uses happens-before relationships to capture true dependencies between events. The event-based model requires more effort from developers, along with deep knowledge of how their system works, which is not practical in an industry setting.
We argue that any enhancements made to the trace model or collection process should preserve the practicality of the span-based trace model. To make tracing data more principled, we propose the following research directions:
Enhance span-based traces with happens-before relationships. This will support more complex analyses including slack analysis.
Harness the power of ‘holes’ and hole coverings in tracing data. Tracing infrastructures currently suffer from data loss, which leaves tracing data incomplete. The resulting holes often go undetected because they are not explicitly marked. We are exploring ways to put holes to deliberate use: downsampling data that is redundant while reducing unintentional data loss in regions where behavior is unpredictable.
Adapt automated analysis tools to consume these enhanced data models, which include holes, hole coverings, and additional happens-before relationships.
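To make the first direction concrete, the following is a minimal sketch of what happens-before edges add on top of parent-child spans. It is not the project's data model: the `Span` fields, the `edge_slack` and `span_slack` helpers, the service names, and the timestamps are all hypothetical, and slack is simplified here to the gap between one span finishing and its dependent span starting.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # caller-callee edge, as in span-based models
    start: int                # illustrative timestamps in milliseconds
    end: int


def edge_slack(
    spans: List[Span], happens_before: List[Tuple[str, str]]
) -> Dict[Tuple[str, str], int]:
    """Slack on edge (a, b): how long after a finished did b start."""
    by_id = {s.span_id: s for s in spans}
    return {(a, b): by_id[b].start - by_id[a].end for a, b in happens_before}


def span_slack(
    spans: List[Span], happens_before: List[Tuple[str, str]]
) -> Dict[str, int]:
    """Smallest slack over each span's outgoing happens-before edges."""
    slack: Dict[str, int] = {}
    for (a, _b), value in edge_slack(spans, happens_before).items():
        slack[a] = min(value, slack.get(a, value))
    return slack


# Parent-child edges alone say only that all three spans are children of root.
spans = [
    Span("root", None, 0, 100),
    Span("auth", "root", 10, 40),
    Span("inventory", "root", 10, 70),
    Span("checkout", "root", 70, 95),
]
# The added happens-before edges record that checkout waits on both calls.
happens_before = [("auth", "checkout"), ("inventory", "checkout")]

slack = span_slack(spans, happens_before)
print(slack)  # {'auth': 30, 'inventory': 0}
```

In this toy trace, `auth` could run 30 ms longer without delaying `checkout`, while `inventory` has no slack. That distinction is not recoverable from the caller-callee edges alone.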
👤 Members
Tomislav Žabčić-Matić
Darby Huye
Max Liu
Ha Nguyen
Raja Sambasivan
📄 Related Publications
2024
ICPE
Systemizing and mitigating topological inconsistencies in Alibaba’s microservice call-graph datasets
Darby Huye, Lan Liu, and Raja R. Sambasivan
In ACM/SPEC International Conference on Performance Engineering, May 2024
Alibaba’s 2021 and 2022 microservice datasets are the only publicly available sources of request-workflow traces from a large-scale microservice deployment. They have the potential to strongly influence future research as they provide much-needed visibility into industrial microservices’ characteristics. We conduct the first systematic analyses of both datasets to help facilitate their use by the community. We find that the 2021 dataset contains numerous inconsistencies preventing accurate reconstruction of full trace topologies. The 2022 dataset also suffers from inconsistencies, but at a much lower rate. Tools that strictly follow Alibaba’s specs for constructing traces from these datasets will silently ignore these inconsistencies, misinforming researchers by creating traces of the wrong sizes and shapes. Tools that discard traces with inconsistencies will discard many traces. We present Casper, a construction method that uses redundancies in the datasets to sidestep the inconsistencies. Compared to an approach that discards traces with inconsistencies, Casper accurately reconstructs an additional 25.5% of traces in the 2021 dataset (going from 58.32% to 83.82%) and an additional 12.18% in the 2022 dataset (going from 86.42% to 98.6%).
@inproceedings{Huye2024,
  author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},
  title={Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets},
  booktitle={ACM/SPEC International Conference on Performance Engineering},
  publisher={ACM/SPEC},
  month=may,
  year={2024},
  doi={https://www.doi.org/10.1145/3629526.3645043},
  isbn={979-8-4007-0444-4/24/05},
}
⚙️ Code and Datasets
2024
Code
Bridges: Trace Reconstruction with Data Loss
Darby Huye, Zhaoqi Zhang, Lan Liu, and Raja R. Sambasivan
Tools for reconstructing distributed traces in the presence of data loss. Uses ancestry data stored within span objects and supports Bloom filter, hash array, and hybrid ancestry modes for intelligent trace reconnection.
@software{Bridges2024,
  author={Huye, Darby and Zhang, Zhaoqi and Liu, Lan and Sambasivan, Raja R.},
  title={Bridges: Trace Reconstruction with Data Loss},
  year={2024},
  url={https://github.com/docc-lab/bridges},
  note={Code for the Bridges project}
}
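As a rough illustration of the Bloom filter ancestry mode described above, the sketch below shows how ancestry data carried in each span could let an orphaned span be reattached to its nearest surviving ancestor. It is not taken from the Bridges codebase: the `BloomFilter` class, the `Span` fields (including `depth`), and the `child_of`, `find_orphans`, and `nearest_surviving_ancestor` helpers are hypothetical, and the filter size and hash count are arbitrary. The hash array and hybrid modes are not shown.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterator, List, Optional


class BloomFilter:
    """Tiny immutable Bloom filter stored as a Python int bitmask."""

    def __init__(self, num_bits: int = 256, num_hashes: int = 3, bits: int = 0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits

    def _positions(self, item: str) -> Iterator[int]:
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def with_item(self, item: str) -> "BloomFilter":
        bits = self.bits
        for pos in self._positions(item):
            bits |= 1 << pos
        return BloomFilter(self.num_bits, self.num_hashes, bits)

    def might_contain(self, item: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    depth: int               # hypothetical field: distance from the root
    ancestors: BloomFilter   # IDs of every ancestor of this span


def child_of(parent: Span, span_id: str) -> Span:
    """Create a child that inherits the parent's ancestry plus the parent."""
    ancestry = parent.ancestors.with_item(parent.span_id)
    return Span(span_id, parent.span_id, parent.depth + 1, ancestry)


def find_orphans(spans: List[Span]) -> List[Span]:
    """Spans whose recorded parent is missing mark the edge of a hole."""
    ids = {s.span_id for s in spans}
    return [s for s in spans if s.parent_id is not None and s.parent_id not in ids]


def nearest_surviving_ancestor(orphan: Span, spans: List[Span]) -> Optional[Span]:
    """Deepest surviving span that the orphan's filter reports as an ancestor."""
    candidates = [
        s for s in spans
        if s.depth < orphan.depth and orphan.ancestors.might_contain(s.span_id)
    ]
    return max(candidates, key=lambda s: s.depth, default=None)


root = Span("a", None, 0, BloomFilter())
b = child_of(root, "b")
c = child_of(b, "c")
d = child_of(c, "d")

# Suppose span "c" was lost during collection, leaving "d" orphaned.
survivors = [root, b, d]
orphans = find_orphans(survivors)
ancestor = nearest_surviving_ancestor(d, survivors)
print([s.span_id for s in orphans], ancestor.span_id)  # ['d'] b
```

Bloom filters never produce false negatives, so a true ancestor is always found if it survived, but they can produce false positives. A real implementation would need to size the filter for the expected trace depth and decide how to handle spurious matches.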