Empirically characterizing the topology and request-workflow properties of large-scale microservice deployments.
The microservice architecture has become the dominant paradigm for building large-scale distributed applications, yet the characteristics of real industrial deployments remain largely invisible to the research community. Most tools and testbeds used in academic research are built on assumptions about microservice topologies and request workflows that have never been validated against production systems — creating a risk that research findings and tools may not apply in practice.
This project addresses that gap through empirical characterization of large-scale microservice architectures. Our work includes analyses of Meta’s production microservice architecture (ATC 2023), where we find that the topology is massive (18,500+ services, 12 million instances), highly dynamic, and heterogeneous in ways that violate assumptions common in research testbeds and topology generators. We also find that request workflows are wide and shallow, and that many traces are partially unrecoverable due to rate limiting and uninstrumented services.
We complement this with a systematization of knowledge study (JSys 2022) comparing popular open-source microservice testbeds against industry practitioners’ perceptions. Through analysis of seven testbeds and interviews with twelve practitioners, we identify key mismatches: real deployments feature non-hierarchical topologies, mixed communication protocols, cycles, and hundreds to thousands of services — none of which are captured by existing testbeds. Finally, our ICPE 2024 paper addresses the widespread use of Alibaba’s publicly available microservice trace datasets, finding pervasive inconsistencies in their structure and introducing Casper, an algorithm that exploits structural redundancies to correctly reconstruct trace topologies that would otherwise be discarded or distorted.
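The "wide and shallow" description of request workflows can be made concrete by measuring a trace's call tree. The sketch below is illustrative only: the `children` mapping, the `depth_and_width` helper, and the example service names are hypothetical, and the papers may define depth and width differently.

```python
def depth_and_width(children, root):
    """Return (depth, max_width) of a call tree, walking it level by level.

    depth counts levels (a lone root has depth 1); max_width is the largest
    number of calls found at any single level. Assumes the calls form a tree
    (no cycles), which is an assumption of this sketch.
    """
    depth, max_width = 0, 0
    level = [root]
    while level:
        depth += 1
        max_width = max(max_width, len(level))
        # Collect every callee of every call on the current level.
        level = [callee for call in level for callee in children.get(call, [])]
    return depth, max_width


# Hypothetical trace: one entry point fanning out to four calls, one nested call.
example = {"frontend": ["auth", "feed", "ads", "search"], "feed": ["cache"]}
print(depth_and_width(example, "frontend"))
```

A trace whose width is large relative to its depth, as in this example, is wide and shallow in the sense used above.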
👤 Members
Darby Huye
Vishwanath Seshagiri
Max Liu
Avani Wildani
Yuri Shkuro
Raja Sambasivan
📄 Related Publications
2024
ICPE
Systemizing and mitigating topological inconsistencies in Alibaba’s microservice call-graph datasets
Darby Huye, Lan Liu, and Raja R. Sambasivan
In ACM/SPEC International Conference on Performance Engineering, May 2024
Alibaba’s 2021 and 2022 microservice datasets are the only publicly available sources of request-workflow traces from a large-scale microservice deployment. They have the potential to strongly influence future research as they provide much-needed visibility into industrial microservices’ characteristics. We conduct the first systematic analyses of both datasets to help facilitate their use by the community. We find that the 2021 dataset contains numerous inconsistencies preventing accurate reconstruction of full trace topologies. The 2022 dataset also suffers from inconsistencies, but at a much lower rate. Tools that strictly follow Alibaba’s specs for constructing traces from these datasets will silently ignore these inconsistencies, misinforming researchers by creating traces of the wrong sizes and shapes. Tools that discard traces with inconsistencies will discard many traces. We present Casper, a construction method that uses redundancies in the datasets to sidestep the inconsistencies. Compared to an approach that discards traces with inconsistencies, Casper accurately reconstructs an additional 25.5% of traces in the 2021 dataset (going from 58.32% to 83.82%) and an additional 12.18% in the 2022 dataset (going from 86.42% to 98.6%).
@inproceedings{Huye2024,author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},title={Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets},booktitle={ACM/SPEC International Conference on Performance Engineering},publisher={ACM/SPEC},month=may,year={2024},doi={10.1145/3629526.3645043},isbn={979-8-4007-0444-4},}
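To illustrate what a "discard traces with inconsistencies" baseline has to detect, here is a minimal sketch. It assumes each row carries a trace ID, a hierarchical call ID such as `0.1.2`, and the calling and called service names. These field names and the two checks are assumptions made for illustration; they are neither the datasets' exact schema nor the Casper algorithm.

```python
from collections import defaultdict


def parent_rpc_id(rpc_id):
    """Return the parent's hierarchical ID ("0.1.2" -> "0.1"), or None for a root."""
    head, sep, _ = rpc_id.rpartition(".")
    return head if sep else None


def find_inconsistencies(rows):
    """Group rows by trace and list (trace_id, rpc_id, reason) problems.

    rows: iterable of (trace_id, rpc_id, upstream, downstream) tuples.
    Duplicate rpc_ids within a trace overwrite each other in this sketch.
    """
    traces = defaultdict(dict)
    for trace_id, rpc_id, upstream, downstream in rows:
        traces[trace_id][rpc_id] = (upstream, downstream)

    problems = []
    for trace_id, calls in traces.items():
        for rpc_id, (upstream, _) in calls.items():
            parent = parent_rpc_id(rpc_id)
            if parent is None:
                continue  # root call: nothing to compare against
            if parent not in calls:
                problems.append((trace_id, rpc_id, "missing parent"))
            elif calls[parent][1] != upstream:
                # The parent's callee should be this call's caller.
                problems.append((trace_id, rpc_id, "caller mismatch"))
    return problems


rows = [
    ("t1", "0", "user", "A"),
    ("t1", "0.1", "A", "B"),
    ("t2", "0", "user", "A"),
    ("t2", "0.1", "X", "B"),    # caller does not match the parent's callee
    ("t2", "0.2.1", "C", "D"),  # parent "0.2" is absent
]
for problem in find_inconsistencies(rows):
    print(problem)
```

A tool that discards every trace with at least one reported problem would drop `t2` here and keep `t1`.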
2023
ATC
Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows
Darby Huye, Yuri Shkuro, and Raja R. Sambasivan
The microservice architecture is a novel paradigm for building and operating distributed applications in many organizations. This paradigm changes many aspects of how distributed applications are built, managed, and operated in contrast to monolithic applications. It introduces new challenges to solve and requires changing assumptions about previously well-known ones. But, today, the characteristics of large-scale microservice architectures are invisible outside their organizations, depressing opportunities for research. Recent studies provide only partial glimpses and represent only single design points. This paper enriches our understanding of large-scale microservices by characterizing Meta’s microservice architecture. It focuses on previously unreported (or underreported) aspects important to developing and researching tools that use the microservice topology or traces of request workflows. We find that the topology is extremely heterogeneous, is in constant flux, and includes software entities that do not cleanly fit in the microservice architecture. Request workflows are highly dynamic, but local properties can be predicted using service and endpoint names. We quantify the impact of obfuscating factors in microservice measurement and conclude with implications for tools and future-work opportunities.
@inproceedings{Huye2023,author={Huye, Darby and Shkuro, Yuri and Sambasivan, Raja R.},title={Lifting the veil on {M}eta's microservice architecture: {A}nalyses of topology and request workflows},booktitle={USENIX Annual Technical Conference},publisher={USENIX},pages={419--432},year={2023},month=jul,day={10},isbn={978-1-939133-35-9},}
2022
JSys
[SoK] Identifying Mismatches Between Microservice Testbeds and Industrial Perceptions of Microservices
Vishwanath Seshagiri, Darby Huye, Lan Liu, Avani Wildani, and Raja R. Sambasivan
Industrial microservice architectures vary so wildly in their characteristics, such as size or communication method, that comparing systems is difficult and often leads to confusion and misinterpretation. In contrast, the academic testbeds used to conduct microservices research employ a very constrained set of design choices. This lack of systemization in these key design choices when developing microservice architectures has led to uncertainty over how to use experiments from testbeds to inform practical deployments and indeed whether this should be done at all. We conduct semi-structured interviews with industry participants to understand the representativeness of existing testbeds’ design choices. Surprising results included the presence of cycles in industry deployments, as well as a lack of clarity about the presence of hierarchies. We then systematize the possible design choices we learned about from the interviews, and identify important mismatches between our interview results and testbeds’ designs that will inform future, more representative testbeds.
@article{Seshagiri2022,author={Seshagiri, Vishwanath and Huye, Darby and Liu, Lan and Wildani, Avani and Sambasivan, Raja R.},title={[SoK] Identifying Mismatches Between Microservice Testbeds and Industrial Perceptions of Microservices},journal={Journal of Systems Research},volume={2},number={1},year={2022},}
Code for CASPER, a trace-construction technique that exploits redundancies in Alibaba’s microservice call-graph datasets to mitigate topological inconsistencies and reconstruct accurate request-workflow traces.
@software{CASPER2024,author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},title={{CASPER}: Alibaba Microservice Call-graph Reconstruction},year={2024},url={https://github.com/docc-lab/casper},note={Code for: Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets (ICPE'24)}}
Pre-shuffled version of Alibaba’s 2021 microservice call-graph dataset, reorganized so that all rows for a given trace ID reside in a single file. Enables direct use with CASPER and other trace-analysis tools.
@dataset{AlibabaTraces2021,author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},title={Alibaba 2021 Microservice Call-Graph Traces (Pre-shuffled)},year={2024},doi={10.7910/DVN/RXIC9Z},url={https://doi.org/10.7910/DVN/RXIC9Z},publisher={Harvard Dataverse},note={Data for: Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets (ICPE'24)}}
Pre-shuffled version of Alibaba’s 2022 microservice call-graph dataset, reorganized so that all rows for a given trace ID reside in a single file. Enables direct use with CASPER and other trace-analysis tools.
@dataset{AlibabaTraces2022,author={Huye, Darby and Liu, Lan and Sambasivan, Raja R.},title={Alibaba 2022 Microservice Call-Graph Traces (Pre-shuffled)},year={2024},doi={10.7910/DVN/T53HGF},url={https://doi.org/10.7910/DVN/T53HGF},publisher={Harvard Dataverse},note={Data for: Systemizing and mitigating topological inconsistencies in Alibaba's microservice call-graph datasets (ICPE'24)}}
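The property that makes the pre-shuffled datasets convenient, that every row for a given trace ID sits in one file, can be sketched with hash partitioning. This is one possible way to obtain such a layout and is not necessarily how the released files were produced; `bucket_for`, `shuffle_by_trace`, and the row shape are hypothetical.

```python
import zlib
from collections import defaultdict


def bucket_for(trace_id, num_buckets):
    """Map a trace ID to a stable bucket index (CRC32 is deterministic across runs)."""
    return zlib.crc32(trace_id.encode("utf-8")) % num_buckets


def shuffle_by_trace(rows, num_buckets):
    """Partition rows so that all rows sharing a trace ID land in the same bucket.

    rows: iterable of tuples whose first element is the trace ID.
    Returns {bucket_index: [rows...]}; each bucket would become one output file.
    """
    buckets = defaultdict(list)
    for row in rows:
        buckets[bucket_for(row[0], num_buckets)].append(row)
    return dict(buckets)


# Rows for the same trace arrive interleaved, as they might across raw input files.
rows = [("t1", "0"), ("t2", "0"), ("t1", "0.1"), ("t3", "0"), ("t2", "0.1")]
for index, bucket in sorted(shuffle_by_trace(rows, 4).items()):
    print(index, bucket)
```

With this layout, a trace-analysis tool can process one file at a time without searching other files for the remaining rows of a trace.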
2023
Dataset
Distributed Traces from Meta’s Microservices Architecture
Darby Huye, Yuri Shkuro, and Raja R. Sambasivan
2023
Licensed CC BY-NC 4.0. Data for: Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows (USENIX ATC’23)
Summary dataset of distributed request-workflow traces collected from Meta’s production microservices infrastructure. Released alongside the ATC’23 paper characterizing Meta’s microservice topology and request workflows.
@dataset{MetaTraces2023,author={Huye, Darby and Shkuro, Yuri and Sambasivan, Raja R.},title={Distributed Traces from {Meta}'s Microservices Architecture},year={2023},url={https://github.com/facebookresearch/distributed_traces},publisher={Meta},note={Licensed CC BY-NC 4.0. Data for: Lifting the veil on Meta's microservice architecture: Analyses of topology and request workflows (USENIX ATC'23)}}