Maha Alsayasneh - On the Identification of Performance Bottlenecks in Multi-tier Distributed Systems

Friday, May 2020, 14:00

Thesis defence

Organized by:

Maha Alsayasneh

Speaker:

Maha Alsayasneh

Teams:

ERODS

The jury members are:

Pierre Sens, Professor, Sorbonne Université, reviewer
Daniel Hagimont, Professor, INPT/ENSEEIHT, reviewer
Sihem Amer-Yahia, Research Director, CNRS, Université Grenoble Alpes, examiner
Noël De Palma, Professor, Université Grenoble Alpes, thesis director

Today’s distributed systems are made of various software components with complex interactions and a large number of configuration settings. Pinpointing performance bottlenecks is generally a non-trivial task that requires human expertise as well as trial and error. Moreover, the same software stack may exhibit very different bottlenecks depending on factors such as the underlying hardware, the application logic, the configuration settings, and the operating conditions. This work aims to (i) investigate whether it is possible to identify a set of key metrics that can be used as reliable and general indicators of performance bottlenecks, (ii) identify the characteristics of these indicators, and (iii) build a tool that can automatically and accurately determine whether the system has reached its maximum capacity in terms of throughput.
In this thesis, we present three contributions. First, we present an analytical study of a large number of realistic configuration setups of multi-tier distributed applications, focusing more specifically on data processing pipelines. By analyzing a large number of metrics at the hardware and software levels, we identify the ones that exhibit changes in their behavior at the point where the system reaches its maximum capacity. We consider these metrics to be reliable indicators of performance bottlenecks. Second, we leverage machine learning techniques to build a tool that can automatically identify performance bottlenecks in the data processing pipeline. We consider different machine learning methods, different selections of metrics, and different cases of generalization to new setups. Third, to assess the validity of the results obtained on the data processing pipeline, we apply both the analytical and the learning-based approaches to a Web stack.

From our research, we draw several conclusions. First, it is possible to identify key metrics that act as reliable indicators of performance bottlenecks for a multi-tier distributed system. More precisely, these metrics make it possible to identify when the server has reached its maximum capacity. Contrary to the approach adopted by many existing works, our results show that a combination of metrics of different types is required to ensure reliable identification of performance bottlenecks in a large number of setups. We also show that approaches that use machine learning techniques to analyze metrics can identify performance bottlenecks in a multi-tier distributed system. The comparison of different models shows that the ones based on the reliable metrics identified by our analytical study achieve the best accuracy. Furthermore, our extensive analysis shows that the obtained models are robust: they generalize to new setups, to new numbers of clients, and to both new setups and new numbers of clients. Extending the analysis to a Web stack confirms the main findings obtained through the study of the data processing pipeline. These results pave the way towards a general and accurate tool to identify performance bottlenecks in distributed systems.