The D.O.C.C. Lab is a research group at Tufts University that focuses on the diagnosis, observability, and configuration of cloud systems. Our research spans automated debugging tools, engineer-assisting tools and visualizations, user-friendly problem-solving tools, security and privacy issues in diagnosing complex system problems, the usability of diagnosis tools, and principled distributed-system designs for enhanced debugging.
Darby successfully defended her dissertation today! She is the lab’s first PhD student to graduate.
Jan 17, 2026
Tony’s work on a dynamically tunable key-value storage engine (TurtleKV) was accepted to VLDB 2026! This paper explores how to utilize the write-memory dimensions of the RUM space to dynamically improve read or write performance. You can read the arXiv version here.
Mar 3, 2024
Raja received an NSF CAREER Award! Thank you to NSF for supporting our lab’s work!
Jan 22, 2024
Darby and Max’s paper, “Systemizing and mitigating topological inconsistencies in Alibaba’s microservice call-graph datasets,” was accepted to ICPE’24!
May 20, 2023
Sarah Abowitz will be spending the summer interning at Dynatrace working on observability and privacy. Congrats Sarah!
@inproceedings{Astolfi2026,author={Astolfi, Anthony and Silai, Vidya and Huye, Darby and Liu, Lan and Sambasivan, Raja R. and Bater, Johes},title={Dynamic read \& write optimization with Turtle{KV}},booktitle={International Conference on Very Large Data Bases},publisher={VLDB Endowment},month=aug,year={2026}}
2024
ICPE
Systemizing and mitigating topological inconsistencies in Alibaba’s microservice call-graph datasets
Darby Huye, Lan Liu, and Raja R. Sambasivan
In ACM/SPEC International Conference on Performance Engineering, May 2024
Alibaba’s 2021 and 2022 microservice datasets are the only publicly available sources of request-workflow traces from a large-scale microservice deployment. They have the potential to strongly influence future research as they provide much-needed visibility into industrial microservices’ characteristics. We conduct the first systematic analyses of both datasets to help facilitate their use by the community. We find that the 2021 dataset contains numerous inconsistencies preventing accurate reconstruction of full trace topologies. The 2022 dataset also suffers from inconsistencies, but at a much lower rate. Tools that strictly follow Alibaba’s specs for constructing traces from these datasets will silently ignore these inconsistencies, misinforming researchers by creating traces of the wrong sizes and shapes. Tools that discard traces with inconsistencies will discard many traces. We present Casper, a construction method that uses redundancies in the datasets to sidestep the inconsistencies. Compared to an approach that discards traces with inconsistencies, Casper accurately reconstructs an additional 25.5% of traces in the 2021 dataset (going from 58.32% to 83.82%) and an additional 12.18% in the 2022 dataset (going from 86.42% to 98.6%).
@inproceedings{Huye2024,author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},title={Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets},booktitle={ACM/SPEC International Conference on Performance Engineering},publisher={ACM/SPEC},month=may,year={2024},doi={10.1145/3629526.3645043},isbn={979-8-4007-0444-4}}
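The redundancy that Casper exploits can be illustrated with a toy sketch (this is not the paper’s implementation): in the Alibaba datasets, an rpc_id string such as “0.1.2” encodes its own position in the call tree, so a parent edge can often be inferred from the rpc_id hierarchy even when the parent row is missing or inconsistent. The record shape below is a simplified assumption for illustration.

```python
def build_call_tree(records):
    """Group call records into parent -> children edges via rpc_id prefixes.

    records: iterable of dicts with at least an 'rpc_id' key, e.g.
    {'rpc_id': '0.1.2', 'um': 'caller', 'dm': 'callee'} (simplified).
    Returns (children, orphans): edges keyed by parent rpc_id, plus
    rpc_ids whose parent row is absent from the trace.
    """
    ids = {r["rpc_id"] for r in records}
    children, orphans = {}, []
    for rpc_id in sorted(ids):
        if "." not in rpc_id:
            continue  # a root (e.g. '0') has no parent to attach to
        parent = rpc_id.rsplit(".", 1)[0]
        # Redundancy: the edge exists in the rpc_id itself, so it can be
        # recovered even if the parent's own row is missing or its
        # um/dm names are inconsistent.
        children.setdefault(parent, []).append(rpc_id)
        if parent not in ids:
            orphans.append(rpc_id)  # inconsistency: parent row missing
    return children, orphans
```

A construction tool that strictly requires every parent row to exist would discard the orphaned subtrees this sketch detects; using the encoded hierarchy instead lets more traces be rebuilt in full.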
2023
ATC
Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows
Darby Huye, Yuri Shkuro, and Raja R. Sambasivan
The microservice architecture is a novel paradigm for building and operating distributed applications in many organizations. This paradigm changes many aspects of how distributed applications are built, managed, and operated in contrast to monolithic applications. It introduces new challenges to solve and requires changing assumptions about previously well-known ones. But, today, the characteristics of large-scale microservice architectures are invisible outside their organizations, depressing opportunities for research. Recent studies provide only partial glimpses and represent only single design points. This paper enriches our understanding of large-scale microservices by characterizing Meta’s microservice architecture. It focuses on previously unreported (or underreported) aspects important to developing and researching tools that use the microservice topology or traces of request workflows. We find that the topology is extremely heterogeneous, is in constant flux, and includes software entities that do not cleanly fit in the microservice architecture. Request workflows are highly dynamic, but local properties can be predicted using service and endpoint names. We quantify the impact of obfuscating factors in microservice measurement and conclude with implications for tools and future-work opportunities.
@inproceedings{Huye2023,author={Huye, Darby and Shkuro, Yuri and Sambasivan, Raja R.},title={Lifting the veil on {M}eta's microservice architecture: {A}nalyses of topology and request workflows},booktitle={USENIX Annual Technical Conference},publisher={USENIX},pages={419-432},year={2023},month=jul,day={10},isbn={978-1-939133-35-9},}
2021
SoCC
Automating instrumentation choices for performance problems in distributed applications with VAIF
Mert Toslali, Emre Ates, Alex Ellis, and 6 more authors
Developers use logs to diagnose performance problems in distributed applications. However, it is difficult to know a priori where logs are needed and what information in them is needed to help diagnose problems that may occur in the future. We present the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly-observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to help diagnose them. To work, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests’ traces. We evaluate VAIF by using it to localize performance problems in OpenStack and HDFS. We show that VAIF can localize problems related to slow code paths, resource contention, and problematic third-party code while enabling only 3-34% of the total tracing instrumentation.
@inproceedings{Toslali2021,author={Toslali, Mert and Ates, Emre and Ellis, Alex and Zhang, Zhaoqi and Huye, Darby and Liu, Lan and Puterman, Samantha and Coskun, Ayse K. and Sambasivan, Raja R.},title={Automating instrumentation choices for performance problems in distributed applications with {VAIF}},booktitle={ACM Symposium on Cloud Computing},publisher={ACM},year={2021},month=nov}
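The variance-driven idea at the core of VAIF can be sketched in a few lines (a simplified illustration, not the framework itself): group requests by the path of currently-enabled instrumentation points they traversed, then flag the group whose response times vary most, since high variance suggests an unexplained behavior that finer-grained instrumentation could localize. The `(path, latency)` input shape here is an assumption made for the example.

```python
import statistics

def highest_variance_group(requests):
    """Pick the request group whose latencies vary most.

    requests: list of (path, latency) pairs, where 'path' is a tuple of
    instrumentation points the request passed through. Returns the path
    with the largest latency variance -- a toy stand-in for VAIF's cue
    about where to enable finer-grained instrumentation next.
    """
    groups = {}
    for path, latency in requests:
        groups.setdefault(path, []).append(latency)
    # Ignore singleton groups: variance of one sample is uninformative.
    variances = {p: statistics.pvariance(ls)
                 for p, ls in groups.items() if len(ls) > 1}
    return max(variances, key=variances.get) if variances else None
```

In the real system this search iterates: after enabling more instrumentation along the flagged path, the variance is decomposed again at the finer granularity until the problem is localized.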
2016
SoCC
Principled workflow-centric tracing of distributed systems
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, and 3 more authors
Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and the design choices we suggest are based on our experiences developing several previous workflow-centric tracing infrastructures.
@inproceedings{Sambasivan:2016bo,author={Sambasivan, Raja R. and Shafer, Ilari and Mace, Jonathan and Sigelman, Benjamin H. and Fonseca, Rodrigo and Ganger, Gregory R.},title={Principled workflow-centric tracing of distributed systems},booktitle={ACM Symposium on Cloud Computing},publisher={ACM},pages={401-414},year={2016},month=oct}
2011
NSDI
Diagnosing performance changes by comparing request flows
Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, and 6 more authors
In USENIX Conference on Networked Systems Design and Implementation (NSDI), Mar 2011
The causes of performance changes in a distributed system often elude even its developers. This paper develops a new technique for gaining insight into such changes: comparing request flows from two executions (e.g., of two system versions or time periods). Building on end-to-end request-flow tracing within and across components, algorithms are described for identifying and ranking changes in the flow and/or timing of request processing. The implementation of these algorithms in a tool called Spectroscope is evaluated. Six case studies are presented of using Spectroscope to diagnose performance changes in a distributed storage service caused by code changes, configuration modifications, and component degradations, demonstrating the value and efficacy of comparing request flows. Preliminary experiences of using Spectroscope to diagnose performance changes within select Google services are also presented.
@inproceedings{Sambasivan2011vw,author={Sambasivan, Raja R. and Zheng, Alice X. and De Rosa, Michael and Krevat, Elie and Whitman, Spencer and Stroucken, Michael and Wang, William and Xu, Lianghong and Ganger, Gregory R.},title={Diagnosing performance changes by comparing request flows},booktitle={USENIX Conference on Networked Systems Design and Implementation (NSDI)},publisher={USENIX Association},pages={43-56},year={2011},month=mar}
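The comparison at the heart of this approach can be sketched as follows (a deliberately simplified illustration, not Spectroscope itself): group requests from each execution into flow categories, then rank the categories shared by both periods by how much their mean response time shifted. The dict-of-latency-lists input shape is an assumption made for this example; the real tool also detects structural changes between flows, which this sketch omits.

```python
from statistics import mean

def rank_response_time_changes(before, after):
    """Rank request-flow categories by shift in mean response time.

    before/after: dicts mapping a flow category (e.g. a tuple of
    components visited) to a list of observed latencies in that period.
    Returns categories present in both periods, most-changed first --
    a toy version of ranking 'response-time mutations'.
    """
    shared = set(before) & set(after)
    deltas = {c: mean(after[c]) - mean(before[c]) for c in shared}
    return sorted(deltas, key=lambda c: abs(deltas[c]), reverse=True)
```

Ranking by magnitude of change puts the flows most likely responsible for an observed regression at the top of the list a developer inspects first.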