Prelude
The field of advanced analytics -- the intersection of data management and machine learning (ML) -- is undoubtedly booming. It seems that not a month goes by without news of a new startup in this already crowded space: Alation, Dato, Palantir, and Scaled Inference, to name just a few. Established companies such as IBM, Microsoft, Oracle, and Pivotal are investing heavily in ML-related infrastructure software to better serve moneyed enterprise users, while Baidu, Google, Facebook, and other Web companies are building vast internal ML infrastructure. However, after a spurt of research in the early 2000s on scalable in-RDBMS data mining and ML, academic and research efforts in this space by the data management community seem to have stagnated and fragmented. As data management researchers, do we have anything new and non-trivial to contribute to this booming field beyond building faster or more scalable toolkits that implement individual ML algorithms? Or is all the interesting work in this space under the sole purview of engineers in industry?
Introduction
In our vision paper, we make the case for a new, cohesive, and potentially impactful direction of research in advanced analytics for the data management community. We start with the observation that building ML models in practice is seldom a one-shot "slam dunk". Based on numerous conversations with analysts at both enterprise and Web companies, we found that using ML is typically an iterative, exploratory process involving three challenging practical tasks -- feature engineering (FE), algorithm selection (AS), and parameter tuning (PT) -- collectively called model selection. Surprisingly, these three tasks have largely been overlooked by the data management community even though they are often the most time-consuming tasks for ML users. Consequently, there is little end-to-end systems support for this iterative process of model selection, which causes pain for analysts and wastes system resources. In our paper, we envision a unifying abstract framework for building systems that support the end-to-end process of model selection, and we identify the research challenges such systems pose.
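To make the three tasks concrete, here is a minimal sketch of a single iteration in Python using scikit-learn (an illustrative toolkit choice, not one the paper prescribes); the specific pipeline below is hypothetical:

```python
# One iteration of model selection: one combination of FE, AS, and PT choices.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scale", StandardScaler()),              # FE: standardize features
    ("feats", PolynomialFeatures(degree=2)),  # FE: add degree-2 interactions
    ("model", LogisticRegression(C=1.0, max_iter=1000)),  # AS + PT
])

print(cross_val_score(pipeline, X, y, cv=5).mean())
# An unsatisfactory score sends the analyst back to revise FE, AS, or PT --
# today, largely by hand, one combination per iteration.
```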
Summary
To make the process of model selection easier and faster, we envision a unifying abstract framework based on the idea of a Model Selection Triple (MST) -- a combination of choices for FE, AS, and PT. While a large body of work in ML focuses on various theoretical aspects of model selection, in practice, analysts typically use an iterative exploratory process that combines ML techniques with their domain-specific expertise. Nevertheless, this iterative process has structure. We divide it into three phases -- Steering, Execution, and Consumption. In the Steering phase, analysts specify the precise MST they want to explore. The Execution phase runs the MST and obtains the results. In the Consumption phase, analysts post-process the results to steer the next iteration; they then modify their MST and iterate. Overall, the model selection process is iterative and exploratory because the space of MSTs is usually infinite, and analysts have no way of telling a priori which MST will yield "satisfactory" accuracy and/or insights. Alas, most existing ML systems force analysts to explore only one MST per iteration, which overburdens the analysts and wastes system time.
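The loop structure itself is simple enough to state in a few lines. Below is a minimal Python sketch of the three phases; the names model_selection_loop, execute, consume, and steer are hypothetical, introduced here purely for illustration rather than taken from the paper:

```python
def model_selection_loop(initial_mst, execute, consume, steer):
    """Iterate over MSTs until the analyst deems a result satisfactory.

    execute(mst) -> result  : trains/evaluates one MST (Execution phase)
    consume(result) -> bool : post-processes the results (Consumption phase)
    steer(mst, result) -> mst' : specifies the next MST (Steering phase)
    """
    mst = initial_mst
    while True:
        result = execute(mst)
        if consume(result):
            return mst, result
        mst = steer(mst, result)
```

The catch is that the space of MSTs is infinite, so both the number of trips around this loop and the cost of each execute call are what a system must attack.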
We envision a new class of analytics systems that we call Model Selection Management Systems (MSMSs), which enable analysts to explore a logically related set of MSTs per iteration. The abstraction of sets of MSTs acts as a "narrow waist" that decouples the higher layers (how to specify them) from the lower layers (how to implement and optimize them), making it easier to build an MSMS. Moreover, we explain how some existing research systems, such as Columbus and MLBase, can be subsumed by our unified framework. However, realizing our vision immediately poses two challenges: how to enable analysts to easily specify sets of MSTs, and how to execute those sets efficiently. We discuss how repurposing three key ideas from database research -- declarativity, optimization, and provenance -- could help improve the respective phases of an iteration. This can reduce both the number of iterations and the time per iteration of the model selection process, thus making it easier and faster. However, as we identify in the paper, applying these three ideas to model selection raises several non-obvious research challenges:
- What should declarative interfaces that enable analysts to intuitively specify a set of MSTs look like? We discuss the tradeoffs in the design decisions involved (a strawman grid-style specification is sketched in the first example after this list).
- What new system optimization opportunities can exploit the set-oriented nature of specifying MSTs to reduce the runtime of an iteration? We discuss a slew of new optimization ideas that combine data management and ML techniques; the same sketch below hints at one such work-sharing opportunity.
- How do we perform provenance management for ML so that analysts can easily consume results and improve subsequent iterations? We explain the need for a notion of "ML provenance" and discuss its applications (one possible record shape appears in the second example after this list).
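As a toy illustration of the first two challenges, a declarative specification of a set of MSTs could be as simple as a cross-product over FE, AS, and PT choices. The grid syntax below is our own strawman, not the interface proposed in the paper:

```python
from itertools import product

# Declare 3 x 2 x 3 = 18 MSTs in one shot instead of one per iteration.
mst_grid = {
    "fe":     ["raw", "scaled", "scaled+poly2"],
    "algo":   ["logreg", "svm"],
    "params": [{"C": 0.1}, {"C": 1.0}, {"C": 10.0}],
}
msts = [dict(zip(mst_grid, combo)) for combo in product(*mst_grid.values())]

# A set-oriented optimizer could, e.g., materialize each of the three FE
# variants once and share it across the six (algo, params) pairs that use
# it, rather than recomputing features 18 times.
```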
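And for the third challenge, here is one possible shape for an "ML provenance" record; the fields are our guesses at what a system might log per executed MST, not a schema from the paper:

```python
from dataclasses import dataclass, field
import time

@dataclass(frozen=True)
class MLProvenance:
    mst: dict             # the exact FE/AS/PT combination that was run
    dataset_version: str  # which snapshot of the data was used
    metrics: dict         # e.g., {"cv_accuracy": 0.91}
    wall_time_s: float    # resources spent; useful for cost-aware steering
    logged_at: float = field(default_factory=time.time)

# Querying such a log ("which FE choices ever exceeded 0.9 accuracy?")
# helps analysts consume results and avoid re-running known-bad MSTs.
```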
I invite you to read our 6-page vision paper. Questions, suggestions, and comments -- critical or otherwise -- are welcome!
Links
Vision paper.
Project webpage.
Associated survey that explains the gaps in the existing landscape of ML systems that motivated our vision.