Yanlei Diao - Big and Fast Data Analysis through Integrated Algorithm Design


Meeting room 306 requires badge access: tell the reception desk that you are attending the seminar and give the room number.

Bio: Yanlei Diao joined Ecole Polytechnique as Professor of Computer Science in 2015. She is also a tenured professor at the University of Massachusetts Amherst, USA. Her research interests lie in information architectures and data management systems, with a focus on big data analytics, data stream processing, interactive data analysis, uncertain data management, and sensor and scientific data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

Prof. Diao was a recipient of the 2016 ERC Consolidator Award, the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), the IBM Scalable Innovation Faculty Award, and the NSF Career Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. She is Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, a member of the SIGMOD and PVLDB Executive Committees, and a member of the SIGMOD Software Systems Award Committee. In the past, she was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

Big and fast data analytics aims to scale analytics to large datasets on many machines while returning timely insights and results with low latency. Many recent applications, such as the Internet of Things, data center monitoring, and system and application monitoring, demand perpetual, low-latency analytics to support time-critical tasks and decisions. Our ERC project aims to provide an algorithmic foundation for designing such big and fast data analysis systems.

In this talk, I discuss two research topics related to system design: parallelism and optimization. First, there is a fundamental tension between data parallelism (for scale) and pipeline parallelism (for low latency) when the data size exceeds the memory size. We propose a new approach that uses memory intelligently, based on newly designed stream algorithms for analytical problems. This approach has allowed us to maximize the degree of parallelism that can be achieved for data analytics. Second, today's big data systems, including cloud computing, run analytics on a best-effort basis only: they cannot guarantee user performance objectives in terms of latency, throughput, and cost. To enable strong performance guarantees, we propose to transform such systems into a principled optimization framework that adapts cluster and cloud computing to meet user performance requirements.