Principled Yet Practical Observability
Enhanced Trace Models for More Effective Observability in Distributed Systems
Observability in distributed systems is treated as a second-class citizen: data collection is ad hoc, often unstructured, and challenging to correlate across sources. Additionally, modern observability platforms (such as OpenTelemetry) use data models that prioritize ease of use for developers but lack the expressiveness needed for complex use cases. OpenTelemetry uses the span-based model, which captures caller-callee relationships between units of execution. Research has highlighted limitations of the span-based tracing model (e.g., for slack analyses) and compared it with the enhanced event-based trace model, which uses happens-before relationships to capture true dependencies between events. However, the event-based model requires more effort from developers, along with deep knowledge of how their system works, which is not practical in an industry setting.
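To make the distinction concrete, here is a minimal Python sketch of what a span-based trace encodes. The `Span` class, its field names, and the sample values are illustrative assumptions, not OpenTelemetry's actual data model or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span: a unit of execution with one parent."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start: float
    end: float


def caller_callee_edges(spans):
    """Return the (caller, callee) pairs implied by parent pointers."""
    ids = {s.span_id for s in spans}
    return sorted((s.parent_id, s.span_id) for s in spans if s.parent_id in ids)


# Illustrative trace: A calls B and C.
spans = [
    Span("A", None, 0.0, 10.0),
    Span("B", "A", 1.0, 4.0),
    Span("C", "A", 5.0, 9.0),
]
edges = caller_callee_edges(spans)
print(edges)
```

The edges say only that A called B and C. Whether C had to wait for B to finish, or simply happened to start later, is not recorded; that missing information is what a happens-before relationship would supply.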
We argue that any enhancements made to the trace model or collection process should preserve the practicality of using the span-based trace model. To have more principled tracing data, we propose the following research directions:
- Enhance span-based traces with happens-before relationships. This will support more complex analyses including slack analysis.
- Harness the power of ‘holes’ and hole coverings in tracing data. Currently, tracing infrastructures are plagued by data loss, which renders the tracing data incomplete. These holes often go undetected because they are not explicitly marked. We are looking into ways to harness these holes: downsampling data that is redundant while reducing unintentional data loss in regions that are unpredictable.
- Change automated tools to use these enhanced data models that have holes, hole coverings, and additional happens-before relationships.
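As an illustration of the first direction, the sketch below shows how explicit happens-before edges make a slack analysis possible. It is a standard critical-path computation over a dependency graph, not the project's actual method; the function name `slack`, the operation names, and the durations are hypothetical.

```python
from collections import defaultdict


def slack(durations, happens_before):
    """Compute per-operation slack in a dependency graph.

    durations: {operation: duration}
    happens_before: list of (u, v) pairs meaning u must finish before v starts.
    Slack is how long an operation can be delayed without delaying the
    end of the whole trace. Assumes the edges form a DAG.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in happens_before:
        succ[u].append(v)
        pred[v].append(u)

    # Topological order (Kahn's algorithm); the list grows as we iterate.
    indegree = {n: len(pred[n]) for n in durations}
    order = [n for n in durations if indegree[n] == 0]
    for n in order:
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                order.append(m)

    # Forward pass: earliest finish time of each operation.
    earliest_finish = {}
    for n in order:
        earliest_start = max((earliest_finish[p] for p in pred[n]), default=0.0)
        earliest_finish[n] = earliest_start + durations[n]
    total = max(earliest_finish.values())

    # Backward pass: latest finish time that does not extend the trace.
    latest_finish = {}
    for n in reversed(order):
        latest_finish[n] = min(
            (latest_finish[s] - durations[s] for s in succ[n]), default=total
        )

    return {n: latest_finish[n] - earliest_finish[n] for n in durations}


# Illustrative request: auth, then two parallel fetches, then render.
durations = {"auth": 2, "fetch_a": 5, "fetch_b": 3, "render": 1}
edges = [
    ("auth", "fetch_a"),
    ("auth", "fetch_b"),
    ("fetch_a", "render"),
    ("fetch_b", "render"),
]
result = slack(durations, edges)
print(result)
```

Here `fetch_b` has 2 units of slack because `render` also waits on the slower `fetch_a`; every other operation lies on the critical path. A span tree with only caller-callee edges would not record that `render` depends on both fetches, so this computation could not be carried out on it directly.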
📄 Related Publications
2024
- ICPE: Systemizing and mitigating topological inconsistencies in Alibaba’s microservice call-graph datasets. In ACM/SPEC International Conference on Performance Engineering, May 2024
👤 Members
Darby Huye
PhD, Tufts University
Max Liu
PhD, Tufts University
Raja Sambasivan
Assistant Professor, Tufts University
Zhaoqi (Roy) Zhang
PhD Student, Tufts University