Automating Instrumentation Choices for Distributed Systems
Automating the choice of where to place instrumentation to diagnose performance problems in distributed applications.
Diagnosing performance problems in distributed applications is extremely challenging. A key reason is that it is difficult to know a priori where to place instrumentation — such as logs and performance counters — to help diagnose problems that may occur in the future. Enabling instrumentation everywhere at all times is impractical due to overhead, yet leaving too little in place forces engineers into long, manual trial-and-error cycles when problems arise.
This project develops automated frameworks that run alongside deployed distributed applications and dynamically choose which instrumentation to enable in response to newly observed performance problems. Our key insight is that requests following the same execution path through a distributed system should perform similarly. When they do not, that is, when high response-time variance is observed among structurally identical requests, we decompose that variance along the edges of the requests' critical-path traces to localize where additional instrumentation is needed. This narrows the search space compared to exhaustive strategies: in our evaluation on OpenStack and HDFS, VAIF localized problems while enabling only 3-34% of the total tracing instrumentation.
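To make the variance-decomposition idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the VAIF or Pythia implementation: it assumes each request in a group of structurally identical requests has already been reduced to a mapping from critical-path edge to latency, and it attributes the group's response-time variance to each edge using the identity Var(total) = sum over edges of Cov(edge, total). The function names and the edge names in the example data are hypothetical.

```python
from statistics import mean


def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def edge_variance_contributions(requests):
    """Attribute response-time variance to critical-path edges.

    `requests` is a list of dicts, one per request, each mapping the same
    set of edge names to that edge's latency (assumed input format).
    Returns (total_variance, {edge: Cov(edge latency, total latency)}).
    The per-edge contributions sum to the total variance.
    """
    edges = sorted(requests[0])
    totals = [sum(r[e] for e in edges) for r in requests]
    total_variance = covariance(totals, totals)
    contributions = {
        e: covariance([r[e] for r in requests], totals) for e in edges
    }
    return total_variance, contributions


def edge_to_instrument(requests):
    """Pick the edge that contributes most to the observed variance."""
    _, contributions = edge_variance_contributions(requests)
    return max(contributions, key=contributions.get)


# Hypothetical group of four requests that followed the same path.
example_requests = [
    {"api->db": 10.0, "db->cache": 5.0, "cache->api": 2.0},
    {"api->db": 10.0, "db->cache": 25.0, "cache->api": 2.0},
    {"api->db": 12.0, "db->cache": 5.0, "cache->api": 2.0},
    {"api->db": 10.0, "db->cache": 45.0, "cache->api": 2.0},
]

if __name__ == "__main__":
    total, parts = edge_variance_contributions(example_requests)
    print("total variance:", total)
    for edge, part in parts.items():
        print(f"  {edge}: {part}")
    print("enable instrumentation around:", edge_to_instrument(example_requests))
```

In this toy data nearly all of the variance comes from the `db->cache` edge, so a framework following this heuristic would enable additional instrumentation inside that edge and leave the rest disabled. A real system would also need to extract critical paths from traces and group requests by workflow, which this sketch takes as given.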
This work was supported by the National Science Foundation under award 2016178 (previously issued as award 1815323 at Boston University).
Developers use logs to diagnose performance problems in distributed applications. But it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We summarize our work on the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests’ traces.
@article{Toslali2022,author={Toslali, Mert and Ates, Emre and Huye, Darby and Zhang, Zhaoqi and Liu, Lan and Puterman, Samantha and Coskun, Ayse K. and Sambasivan, Raja R.},title={VAIF: Variance-driven Automated Instrumentation Framework},journal={ACM SIGOPS Operating Systems Review},volume={56},number={1},pages={42--50},year={2022}}
2021
SoCC
Automating instrumentation choices for performance problems in distributed applications with VAIF
Mert Toslali, Emre Ates, Alex Ellis, and 6 more authors
Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests’ traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
@inproceedings{Toslali2021,author={Toslali, Mert and Ates, Emre and Ellis, Alex and Zhang, Zhaoqi and Huye, Darby and Liu, Lan and Puterman, Samantha and Coskun, Ayse K. and Sambasivan, Raja R.},title={Automating instrumentation choices for performance problems in distributed applications with VAIF},booktitle={ACM Symposium on Cloud Computing (SoCC)},publisher={ACM},year={2021},month=nov}
2019
SoCC
An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications
Emre Ates, Lily Sturmann, Mert Toslali, and 4 more authors
In ACM Symposium on Cloud Computing (SoCC), Nov 2019
Diagnosing performance problems in distributed applications is extremely challenging. A significant reason is that it is hard to know where to place instrumentation a priori to help diagnose problems that may occur in the future. We present the vision of an automated instrumentation framework, Pythia, that runs alongside deployed distributed applications. In response to a newly-observed performance problem, Pythia searches the space of possible instrumentation choices to enable the instrumentation needed to help diagnose it. Our vision for Pythia builds on workflow-centric tracing, which records the order and timing of how requests are processed within and among a distributed application’s nodes (i.e., records their workflows). It uses the key insight that localizing the sources of high performance variation within the workflows of requests that are expected to perform similarly gives insight into where additional instrumentation is needed.
@inproceedings{Ates:2019th,author={Ates, Emre and Sturmann, Lily and Toslali, Mert and Krieger, Orran and Megginson, Richard and Coskun, Ayse K. and Sambasivan, Raja R.},title={An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications},booktitle={ACM Symposium on Cloud Computing (SoCC)},publisher={ACM},pages={165--170},year={2019},month=nov}
Open-source implementation of VAIF (Variance-driven Automated Instrumentation Framework) and its predecessor Pythia. Automatically enables the instrumentation needed to diagnose performance problems in distributed applications.
@software{Pythia2024,author={Toslali, Mert and Ates, Emre and Ellis, Alex and Zhang, Zhaoqi and Huye, Darby and Liu, Lan and Puterman, Samantha and Coskun, Ayse K. and Sambasivan, Raja R.},title={{Pythia/VAIF}: Variance-driven Automated Instrumentation Framework},year={2024},url={https://github.com/docc-lab/pythia},note={Code for: Automating instrumentation choices for performance problems in distributed applications with VAIF (SoCC'21); VAIF: Variance-driven Automated Instrumentation Framework (OSR'22); An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications (SoCC'19)}}