Speaker: Andrey Balmin
Location: Bourns A125
Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can set up data transformation pipelines in an interactive, self-service, modern data prep environment. Thus, Workday Prism Analytics needs to run three types of scalable data processing applications: an “always on” query engine, an “always on” data prep application, and on-demand batch execution of transformation pipelines. We standardized on Apache Spark and Spark SQL for all three applications because of Spark's scalability and the flexibility and extensibility of its Catalyst compiler. All three applications share much of the compilation and execution code, except for sampling, caching, and result extraction.
In this talk we will first introduce Workday Prism Analytics and describe its Spark-based interactive and batch data processing components. We will then describe the data prep transformations and their compilation into Spark DataFrames, through Spark SQL Catalyst plans, in both interactive and batch modes. We will focus on some challenges we encountered while compiling and executing complex pipelines and queries. For example, Spark SQL compilation times exceeded execution times for some low-latency queries, and compiled plans grew dangerously large for data prep pipelines with multiple self-joins and self-unions. We will describe caching, sampling, and query compilation techniques that allow us to support an interactive user experience, including a join co-sampling component that improves system usability when joining large datasets. Finally, we will conclude with an overview of the open challenges that we plan to tackle in the future.
Dr. Andrey Balmin is a Sr. Principal Engineer at Workday, where he is building the self-service Prism Analytics platform, continuing the work he began at Platfora (which was acquired by Workday in 2016). Prior to this, he was a Research Staff Member at IBM Almaden Research Center where he focused on search and query processing of semi-structured and graph-structured data in Data Warehousing and, later, Big Data platforms. He holds a Ph.D. degree in Computer Science from UC San Diego.