Automating Instrumentation Choices for Distributed Systems

Automating the choice of where to place instrumentation to diagnose performance problems in distributed applications.

Diagnosing performance problems in distributed applications is extremely challenging. A key reason is that it is difficult to know a priori where to place instrumentation — such as logs and performance counters — to help diagnose problems that may occur in the future. Enabling instrumentation everywhere at all times is impractical due to overhead, yet leaving too little in place forces engineers into long, manual trial-and-error cycles when problems arise.

This project develops automated frameworks that run alongside deployed distributed applications and dynamically choose which instrumentation to enable in response to newly observed performance problems. Our key insight is that requests following the same execution path through a distributed system should perform similarly. When they do not, that is, when response times vary widely among structurally identical requests, the variance can be decomposed statistically along the edges of the critical path in each trace to localize where additional instrumentation is needed. This narrows the search space relative to exhaustive strategies: problems can be localized while only a small fraction of the available instrumentation points are enabled.
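The idea can be illustrated with a minimal Python sketch. It is not the Pythia/VAIF implementation: the trace format (a list of `(edge_name, latency_ms)` pairs per request), the function names, and the coefficient-of-variation threshold of 0.5 are all assumptions made for illustration. The sketch also ranks edges by their individual latency variance and ignores covariance between edges, which a fuller decomposition would account for.

```python
from collections import defaultdict
from statistics import mean, pstdev, pvariance

# Hypothetical trace format (an assumption, not VAIF's): each trace is the
# ordered list of (edge_name, latency_ms) pairs on one request's critical path.
example_traces = [
    [("api->auth", 5.0), ("auth->db", 10.0), ("db->api", 5.0)],
    [("api->auth", 5.0), ("auth->db", 50.0), ("db->api", 5.0)],
    [("api->auth", 5.0), ("auth->db", 90.0), ("db->api", 5.0)],
    [("api->cache", 2.0), ("cache->api", 2.0)],
    [("api->cache", 2.0), ("cache->api", 2.1)],
]


def group_by_path(traces):
    """Group traces that followed the same ordered sequence of edges."""
    groups = defaultdict(list)
    for trace in traces:
        path = tuple(edge for edge, _ in trace)
        groups[path].append(trace)
    return groups


def localize_variance(traces, cv_threshold=0.5):
    """For each high-variance group, return (index, name) of its noisiest edge.

    cv_threshold is an illustrative cutoff on the coefficient of variation
    (standard deviation / mean) of end-to-end response time.
    """
    findings = {}
    for path, group in group_by_path(traces).items():
        if len(group) < 2:
            continue  # variance is meaningless for a single request
        totals = [sum(latency for _, latency in trace) for trace in group]
        avg = mean(totals)
        if avg == 0 or pstdev(totals) / avg < cv_threshold:
            continue  # structurally identical requests behave alike
        # Per-edge latency variance, indexed by position on the path so that
        # repeated edge names are kept apart.
        edge_variances = [
            pvariance([trace[i][1] for trace in group])
            for i in range(len(path))
        ]
        worst = max(range(len(path)), key=lambda i: edge_variances[i])
        findings[path] = (worst, path[worst])
    return findings


if __name__ == "__main__":
    for path, (index, edge) in localize_variance(example_traces).items():
        print(f"enable more instrumentation around edge {index}: {edge}")
```

In this example the three-edge group has widely varying response times, all attributable to the `auth->db` edge, so that edge is the candidate for additional instrumentation. The cache group is consistent and is left alone.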

This work was supported by the National Science Foundation under award 2016178 (previously issued as award 1815323 at Boston University).

👤 Members

Mert Toslali
Emre Ates
Zhaoqi (Roy) Zhang
Darby Huye
Alex Ellis
Max Liu
Ayse Coskun
Raja Sambasivan

📄 Related Publications

2022

  1. OSR
    VAIF: Variance-driven Automated Instrumentation Framework
    Mert Toslali, Emre Ates, Darby Huye, and 5 more authors
    ACM SIGOPS Operating Systems Review, 2022

2021

  1. SoCC
    Automating instrumentation choices for performance problems in distributed applications with VAIF
    Mert Toslali, Emre Ates, Alex Ellis, and 6 more authors
    In ACM Symposium on Cloud Computing (SoCC), Nov 2021

2019

  1. SoCC
    An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications
    Emre Ates, Lily Sturmann, Mert Toslali, and 4 more authors
    In ACM Symposium on Cloud Computing (SoCC), Nov 2019

⚙️ Code and Datasets

2024

  1. Code
    Pythia/VAIF: Variance-driven Automated Instrumentation Framework
    Mert Toslali, Emre Ates, Alex Ellis, and 6 more authors
    2024
    Code for: Automating instrumentation choices for performance problems in distributed applications with VAIF (SoCC’21); VAIF: Variance-driven Automated Instrumentation Framework (OSR’22); An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications (SoCC’19)