Eric Simon - Building high-quality analytics in enterprise big data landscapes: challenges and perspectives

Organized by:
The LIG Keynote Speeches team: Nicolas Peltier, Renaud Lachaize, Dominique Vaufreydaz
Eric Simon (SAP - Big Data division)

Eric Simon is currently Chief Scientist in SAP's Big Data division. He was previously chief architect in the Enterprise Information Management division and development manager of the advanced metadata management and semantic services for the SAP HANA platform. Prior to SAP, he led the development of Data Federator, a federated database system that enabled “multi-source universes” in the Business Objects BOE Enterprise XI 4.0 flagship product, the feature most acclaimed by customers when the product was released. Eric joined Business Objects in 2005 through the acquisition of Medience, a start-up he co-founded in 2001 that specialized in data federation technology with innovative distributed query processing and schema mapping capabilities. Before that, Eric was a director of research at INRIA, which he joined in 1985; there he started the research project on the mediation system Le Select, whose technology was transferred to Medience. Eric has published research results on topics including database integrity, concurrency control, object-oriented programming, deductive databases, query optimization, data integration, and data cleaning. He received two best paper awards, at the VLDB and ACM OOPSLA conferences, and is a co-author of several patents at Bell Labs, Medience, Business Objects, and SAP. Eric regularly serves on the program committees of international database research conferences and was the industrial PC chair of ACM SIGMOD in 2008. He received a PhD and an Habilitation in Computer Science from the University of Paris VI, France, in 1986 and 1992, respectively.


Increasingly large amounts of data are becoming available to enterprises through multiple channels and can be managed on premises or in the cloud at affordable cost. In this context, data-driven applications have rapidly emerged as a new paradigm: they react to real-time events or to perceived relevant changes in enterprise data, embed predefined best-practice business logic, and use predictive models to learn from big data what action to take. Their development relies mainly on big data analytics pipelines that process various kinds of data via rich compositions of data engineering tasks (e.g., data cleaning and feature generation) and machine learning tasks (e.g., model building and evaluation), executed on big data computing platforms that federate heterogeneous computing environments. Despite significant previous efforts to strengthen self-service analytics by automating data preparation or model building, the development of high-quality big data analytics pipelines remains a labor-intensive, time-consuming, and error-prone activity that requires highly skilled users. In this talk we shall review the following root causes that impede the development of high-quality analytics by “citizen data scientists”: (1) data engineering and machine learning tasks are executed as independent, sequential operations, although the overall process of creating high-quality analytics is highly iterative and critical decisions in the former affect the performance of the latter; (2) there is a lack of tools to assess data quality and to repair big data analytics pipelines when data quality issues are detected; (3) there is no support for continuously monitoring the quality of big data analytics pipelines deployed in production as data characteristics evolve; and (4) facilities for reusing existing big data pipelines are very limited, owing to a lack of understanding of data dependencies and of descriptions of model behavior.
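To make root cause (1) concrete, the kind of pipeline described above can be reduced to a minimal, stdlib-only Python sketch: data engineering steps (cleaning, feature generation) composed sequentially with a model-building step. The toy data, field names, and threshold "model" below are invented purely for illustration; they stand in for the much richer tasks a real pipeline would run on a big data platform.

```python
# A minimal sketch of a sequential analytics pipeline. All names and the
# toy data are invented for illustration only.

def clean(rows):
    """Data engineering: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def featurize(rows):
    """Feature generation: derive a ratio feature from two raw fields."""
    return [(r["spend"] / r["visits"], r["churned"]) for r in rows]

def fit_threshold_model(samples):
    """Model building: choose the midpoint between class means as a
    decision threshold -- a crude stand-in for real training."""
    pos = [x for x, y in samples if y]
    neg = [x for x, y in samples if not y]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > threshold

def run_pipeline(rows):
    """Sequential composition: each task consumes the previous task's
    output, with no feedback from model quality to earlier steps."""
    samples = featurize(clean(rows))
    model = fit_threshold_model(samples)
    accuracy = sum(model(x) == y for x, y in samples) / len(samples)
    return model, accuracy

raw = [
    {"spend": 100.0, "visits": 2, "churned": True},
    {"spend": 80.0, "visits": 10, "churned": False},
    {"spend": 120.0, "visits": 3, "churned": True},
    {"spend": 60.0, "visits": None, "churned": False},  # dropped by clean()
    {"spend": 90.0, "visits": 9, "churned": False},
]

model, acc = run_pipeline(raw)
```

The point of the sketch is structural: clean(), featurize(), and fit_threshold_model() run as independent, one-way steps, so a decision taken early (e.g., which records to drop, which ratio to compute) shapes model performance, yet nothing in the composition feeds that information back, which is precisely the iterativity problem raised in point (1).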
We shall illustrate these problems with examples drawn from real business use cases and motivate the underlying open research problems at the crossroads of data cleaning, data integration, machine learning, and scalable big data processing.
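The monitoring gap raised in point (3) can likewise be sketched in a few lines: compare the distribution a feature exhibits in production against the one observed at training time, and raise a flag when they diverge. The data, function name, and threshold below are invented for illustration; production systems would apply proper statistical tests (e.g., Kolmogorov-Smirnov) across many features rather than this crude check on a single mean.

```python
# A toy drift monitor: flag when a feature's live mean falls outside
# z_limit standard errors of its training mean. Illustration only.
from statistics import mean, stdev

def drifted(train_values, live_values, z_limit=3.0):
    """Crude z-test on the mean of a single feature."""
    m, s = mean(train_values), stdev(train_values)
    se = s / len(train_values) ** 0.5
    return abs(mean(live_values) - m) > z_limit * se

train = [10.0, 11.0, 9.5, 10.5, 10.0, 9.0, 11.5, 10.2]   # seen at training time
stable = [10.1, 9.9, 10.4, 10.0]                         # production, no drift
shifted = [15.0, 16.2, 14.8, 15.5]                       # production, drifted
```

Even this toy check goes beyond what most deployed pipelines do today; the talk's point (3) is that such monitoring is not supported continuously or systematically once pipelines reach production.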