Locating Regions of Uncertainty in Distributed Systems Using Aggregate Trace Data
Jiayi Dong and Anshul Rastogi
2022
Distributed systems underpin countless modern applications. A single application can comprise tens to thousands of interacting components, which makes it difficult to identify the source of performance problems. Distributed tracing is widely used to elucidate the interactions within a distributed system; however, instrumenting a codebase can be tedious, and collecting trace data incurs overhead. Ideally, a minimal amount of instrumentation is added to the regions of the codebase that explain the majority of the system’s performance variation. We present a prototype application that highlights regions of performance uncertainty in a system, guiding developers to where additional instrumentation would most increase predictability. Using aggregate trace data, the tool ranks spans by uncertainty metrics, primarily the standard deviation and coefficient of variation of an operation’s exclusive latency across multiple traces. We developed the prototype in Python and applied it to trace data extracted from HotROD. We evaluated the tool on four test scenarios in which we injected latency into HotROD services; in all four, the tool highlights the service(s) with injected latency.
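The ranking step can be illustrated with a minimal sketch. It is not the prototype's actual code. It assumes that each trace is a list of span records with hypothetical fields `id`, `parent`, `op`, and `duration`, that a span's exclusive latency is its duration minus the summed durations of its direct children (which holds only when children do not overlap), and that operations are ranked by population standard deviation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def rank_operations(traces):
    """Rank operations by the spread of their exclusive latency.

    Assumed input shape (hypothetical): each trace is a list of dicts with
    keys "id", "parent" (None for the root), "op", and "duration".
    Returns (op, std_dev, coeff_of_variation) tuples, highest std_dev first.
    """
    samples = defaultdict(list)
    for trace in traces:
        # Sum the time each span spends in its direct children.
        child_time = defaultdict(float)
        for span in trace:
            if span["parent"] is not None:
                child_time[span["parent"]] += span["duration"]
        # Exclusive latency = own duration minus time spent in children
        # (assumes children run sequentially, without overlap).
        for span in trace:
            exclusive = span["duration"] - child_time[span["id"]]
            samples[span["op"]].append(exclusive)

    rows = []
    for op, values in samples.items():
        mu = mean(values)
        sd = pstdev(values)
        cv = sd / mu if mu > 0 else 0.0  # guard against a zero mean
        rows.append((op, sd, cv))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Two toy traces: the "driver" operation varies, "frontend" itself does not.
traces = [
    [
        {"id": "a", "parent": None, "op": "frontend", "duration": 100.0},
        {"id": "b", "parent": "a", "op": "driver", "duration": 40.0},
    ],
    [
        {"id": "c", "parent": None, "op": "frontend", "duration": 160.0},
        {"id": "d", "parent": "c", "op": "driver", "duration": 100.0},
    ],
]
ranking = rank_operations(traces)
```

In this toy input the total latency of `frontend` varies between traces, but all of that variation comes from its child, so `driver` ranks first. Using exclusive rather than total latency is what attributes the variation to the operation responsible for it.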