The Data Dossier: 2015

This is a post about my vision paper that appears in ACM SIGMOD Record, Dec 2015. It is titled Model Selection Management Systems: The Next Frontier of Advanced Analytics. It is co-authored with my advisors, Jeff Naughton and Jignesh Patel, and our collaborator from Microsoft, Robert McCann, who is an applied researcher and practitioner of ML.

Prelude

The field of advanced analytics -- the intersection of data management and machine learning (ML) -- is undoubtedly booming. It seems that not a month goes by without news of a new startup in this already crowded space: Alation, Dato, Palantir, and Scaled Inference, to name only a few. Established companies such as IBM, Microsoft, Oracle, and Pivotal are also investing heavily in building ML-related infrastructure software to improve support for moneyed enterprise users, while Baidu, Google, Facebook, and other Web companies are building vast internal ML infrastructure. However, after a spurt of research in the early 2000s on scalable in-RDBMS data mining and ML, academic and research efforts in this space by the data management community seem to have stagnated and fragmented. As data management researchers, do we have anything new and non-trivial to contribute to this booming field beyond building faster or more scalable toolkits that implement individual ML algorithms? Or is all the interesting work in this space under the sole purview of engineers in industry?

Introduction

In our vision paper, we make the case for a new, cohesive, and potentially impactful direction of research in advanced analytics for the data management community. We start with the observation that building ML models in practice is seldom a one-shot "slam dunk". Based on numerous conversations with analysts at both enterprise and Web companies, we found that using ML is often an iterative exploratory process that involves three challenging practical tasks -- feature engineering (FE), algorithm selection (AS), and parameter tuning (PT) -- collectively called model selection. Surprisingly, these three tasks have largely been overlooked by the data management community even though they are often the most time-consuming tasks for ML users. Consequently, there is little end-to-end systems support for this iterative process of model selection, which causes pain to analysts and wastes system resources. In our paper, we envision a unifying abstract framework for building systems that can support the end-to-end process of model selection, and we identify the research challenges posed by such systems.

Summary

To make the process of model selection easier and faster, we envision a unifying abstract framework based on the idea of a Model Selection Triple (MST) -- a combination of choices for FE, AS, and PT. While a large body of work in ML focuses on various theoretical aspects of model selection, in practice, analysts typically use an iterative exploratory process that combines ML techniques with their domain-specific expertise. Nevertheless, this iterative process has structure. We divide each iteration into three phases -- Steering, Execution, and Consumption. In the Steering phase, analysts specify the precise MST they want to explore. The Execution phase runs the MST and obtains the results. In the Consumption phase, analysts post-process the results to steer the next iteration. They then modify their MST and iterate. Overall, the model selection process is iterative and exploratory because the space of MSTs is usually infinite and analysts have no way of telling a priori which MST will yield "satisfactory" accuracy and/or insights. Alas, most existing ML systems force analysts to explore only one MST per iteration, which overburdens the analysts and wastes system time.
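
To make the MST abstraction concrete, here is a minimal Python sketch; it is not taken from the paper. It models an MST as a triple of choices and walks through one Steering/Execution/Consumption iteration per MST, which is how most existing systems force analysts to work. The names MST, explore_one_at_a_time, and toy_execute, as well as the toy scoring rule and feature names, are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MST:
    """A Model Selection Triple: one choice each for FE, AS, and PT."""
    features: Tuple[str, ...]              # feature engineering (FE) choice
    algorithm: str                         # algorithm selection (AS) choice
    params: Tuple[Tuple[str, float], ...]  # parameter tuning (PT) choice


def explore_one_at_a_time(candidates, execute):
    """Explore one MST per iteration and return the best (MST, score) pair."""
    history = []
    for mst in candidates:            # Steering: pick the next MST to try
        score = execute(mst)          # Execution: run it and obtain a result
        history.append((mst, score))  # Consumption: record the result
    return max(history, key=lambda pair: pair[1])


def toy_execute(mst):
    # Hypothetical stand-in for training and evaluating a model:
    # reward more features, penalize stronger regularization.
    reg = dict(mst.params)["reg"]
    return len(mst.features) - reg


candidates = [
    MST(("age",), "logistic_regression", (("reg", 0.1),)),
    MST(("age", "income"), "logistic_regression", (("reg", 0.5),)),
]
best_mst, best_score = explore_one_at_a_time(candidates, toy_execute)
print(best_mst, best_score)
```

In a real workflow the analyst would choose the next MST after looking at earlier results, so the list of candidates is not known up front; the fixed list here only keeps the sketch short.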

We envision a new class of analytics systems, which we call Model Selection Management Systems (MSMS), that enable analysts to explore a logically related set of MSTs per iteration. The abstraction of sets of MSTs acts as a "narrow waist" that helps us decouple the higher layers (how to specify them) from the lower layers (how to implement and optimize them), which makes it easier to build an MSMS. Moreover, we explain how some existing research systems, such as Columbus and MLBase, can be subsumed by our unified framework. However, realizing our vision immediately poses the challenge of how to enable analysts to easily specify sets of MSTs and how to handle them efficiently. We discuss how repurposing three key ideas from database research -- declarativity, optimization, and provenance -- could help us improve the Steering, Execution, and Consumption phases of an iteration, respectively. This can help reduce both the number of iterations and the time per iteration of the model selection process, thus making it easier and faster. However, as we identify in the paper, applying these three ideas to the model selection process raises several non-obvious research challenges:

What should the declarative interfaces to enable analysts to intuitively specify a set of MSTs look like? We discuss the tradeoffs in the design decisions involved.
What are the new system optimization opportunities that can exploit the set-oriented nature of specifying MSTs to reduce the runtime of an iteration? We discuss a slew of new optimization ideas that combine data management and ML techniques.
How should provenance management for ML be performed to enable analysts to consume results easily and improve the iterations? We explain the need for a notion of "ML provenance" and discuss its applications.
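
As a rough illustration of what specifying a set of MSTs could look like, the following Python sketch expands per-component choices into their Cartesian product. This is only one possible shape for such an interface, not the design proposed in the paper; the function mst_set, the feature names, the algorithm names, and the parameter grids are all hypothetical.

```python
from itertools import product


def mst_set(feature_sets, algorithms, param_grids):
    """Expand FE, AS, and PT choices into a list of (features, algorithm, params) triples."""
    msts = []
    for features, algorithm in product(feature_sets, algorithms):
        grid = param_grids[algorithm]   # parameters are specific to each algorithm
        names = sorted(grid)            # fixed order so the output is deterministic
        for values in product(*(grid[name] for name in names)):
            msts.append((tuple(features), algorithm, dict(zip(names, values))))
    return msts


spec = mst_set(
    feature_sets=[["age"], ["age", "income"]],
    algorithms=["logistic_regression", "decision_tree"],
    param_grids={
        "logistic_regression": {"reg": [0.01, 0.1, 1.0]},
        "decision_tree": {"max_depth": [3, 5]},
    },
)
print(len(spec))  # 2 feature sets x (3 + 2) parameter settings = 10 MSTs
```

Because the system sees all ten MSTs at once rather than one per iteration, it has room to optimize across them, for example by reusing work shared among MSTs with the same feature set.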

Solving the above challenges requires new research in data processing algorithms, systems, and theory at the intersection of data management, ML, and HCI. In our recent research, we have already started tackling some of the technical questions. I hope the data management community joins us in our quest to make the end-to-end process of using machine learning for data-powered applications easier and faster!

I invite you to read our 6-page vision paper. Questions, suggestions, and comments -- critical or otherwise -- are welcome!

Links

Vision paper.
Project webpage.
Associated survey to explain the gaps in the existing landscape of ML systems that motivated our vision.

So, I am finally joining the bandwagon of maintaining a research blog! Since my research interests revolve around data, this blog will carry articles and other posts that have something to do with data -- how we store, query, analyze, understand, and, more generally, use (and abuse!) data. From a "traditional" perspective, this would span the topics of data management (including databases), statistical machine learning, optimization, systems, and HCI. For those interested in tech blogs about data, there are many other excellent blogs as well (for example, this, this, and this). My posts are probably going to be a bit more eclectic. I expect to have five major kinds of posts (in no particular order):

1. Articles about my research in a language that is simpler/less technical than my papers.

2. Articles about other people's research that I find interesting/important.

3. Articles about general data applications and software that I find interesting/important.

4. Rants and raves about research events or my general experience as a researcher.

5. Poems about research in this field! :)

Better late than never to blog about data in this new golden age of data management and analysis! Feedback and discussions on my posts, online and offline, are always welcome. Thanks!


Sunday, December 6, 2015

Vision Paper: Model Selection Management Systems

Thursday, August 13, 2015

A poem on the Relational Data Model

Hello World!
