Tuesday, April 9, 2019

Conferences: Strata Data and SysML 2019

I returned last week from two awesome conferences in the Bay Area: Strata Data 2019 and SysML 2019. This post is about what I learned from these trips about the world of data+ML+systems, both in research and in practice. I will also give shout outs to talks/papers I found most interesting.

Strata Data 2019


This was my first major practitioners' conference. It featured 100s of talks with 1000s of attendees from 100s of companies, spanning enterprise companies, small and large Web companies, non-profits, and other organizations. I went there to give a talk on my group's research and our open source software to accelerate ML over multi-table data. But a larger goal was to take the pulse of industrial practice and network with the data scientists, ML engineers, and other researchers who were there. Some observations about the stuff I found most interesting at this conference:

  • ML has come of age:
    I already knew this, but the conference made it bluntly clear that systems for ML workloads, including for the end-to-end lifecycle, have exploded in popularity. Almost every session had plenty of talks from companies on how they are using ML for different business-critical tasks and what ML+data tools they were using, or from ML tool developers on what their tools do. The ML hype phase is over--ML, including deep learning, is now serious business on the critical path of almost every company there!

    Interestingly, this crowd seems to be well aware that ML algorithmics itself is just one part of the puzzle. The data systems infrastructure for sourcing features, building models, and deploying/monitoring models in production all seem disjointed and are thus a focus of further work. But almost every company I spoke to on these issues is rolling out its own in-house platform. This leads me to my next observation.

  • ML platforms craze:
    There were many "AI startups" and also larger companies at the expo that claim to be building a "data science/ML platform" of some sort, including H2O and SAS. It left me wondering how reusable and pluggable the components of such platforms are with existing infrastructure, especially given the incredible heterogeneity of datasets and ML use cases across companies.

    But one setting where automated platforms for end-to-end data preparation, feature extraction, and model building on structured data are indeed gaining traction is Salesforce. They gave a few interesting talks on Einstein, their AutoML platform that is apparently used by 1000s of their customers. Most of these are enterprise companies that cannot afford data scientists of their own. Thus, they give their datasets to Einstein, specify the prediction targets and some objectives, and let it build models for various common tasks such as sales forecasting, fraud detection, etc. To me, it seems Salesforce is quite a bit ahead of its rivals, certainly for structured data. They also open sourced an interesting library for automating data prep and feature extraction: TransmogrifAI.

    Another interesting open source ML platform presented there was Intel's "Analytics Zoo" to integrate TensorFlow-based modeling workflows in the Spark environment. It also includes some pre-trained deep net models and useful packaged "verticals" for different ML applications.

  • Serverless:
    I finally managed to learn more about serverless computing thanks to a tutorial. The speaker gave a fascinating analogy that made total sense: buying your own servers is like buying your own car; regular cloud computing is like getting a rental car; serverless is like using Lyft/Uber.

    Model serving has become the killer app for serverless due to its statelessness. But apparently, data prep/ETL workflows and more stateful MapReduce workflows are also increasingly being deployed on serverless infrastructure. The benefits of fine-grained resource elasticity and heterogeneity offered by serverless can help reduce resource costs. But the con is that software complexity goes up. Indeed, the speaker noted a caveat that most ML training workloads and other communication/state-intensive workloads are perhaps not (yet) a good fit for serverless. All this reminded me of this interesting CIDR'19 paper by Prof. Joe Hellerstein and co. Nevertheless, I think disaggregated and composable resource management, a generalization of serverless, seems like the inevitable evolution of the cloud.
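    The "model serving is the killer app for serverless" point is easy to make concrete. Below is a minimal sketch of a Lambda-style stateless inference handler; the model, its loader, and the event shape are all hypothetical stand-ins, not any real platform's API. The only state is a lazily cached model, which is exactly why serving fits the serverless model so well.

```python
import json

# Hypothetical stand-in for a real model artifact; in practice this would be
# fetched from object storage on cold start, not hard-coded.
_MODEL = None

def _load_model():
    # Cold-start cost: paid once per container, amortized over invocations.
    return {"weights": [0.5, -1.2], "bias": 0.1}

def score(features, model):
    # A toy linear model -- stands in for any stateless predictor.
    return sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]

def handler(event, context=None):
    """Stateless entry point: nothing persists between requests except the
    lazily cached model, so any container can serve any request."""
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_model()
    features = json.loads(event["body"])["features"]
    return {"statusCode": 200,
            "body": json.dumps({"score": score(features, _MODEL)})}
```

    Training, by contrast, needs large mutable state (model parameters) shared across workers, which is precisely what this programming model makes awkward.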

  • Everything for ML and ML for Everything?:
    Prof. Shafi Goldwasser of MIT gave an interesting keynote on how the worlds of ML and cryptography are coming together to enable secure ML applications. She mentioned some open research questions on both adapting ML to be more amenable to crypto primitives and creating new crypto techniques that are ML-aware. It is official folks: almost all other areas of computing (call it X) are working on "X for ML and ML for X" research! Heck, I even saw physicists working on "ML for physics and physics for ML"! :)


SysML 2019


This was the first archival year for this conference on "systems + ML" research. This whitepaper details the intellectual focus of SysML. There were about 32 talks, many from major schools and companies, including several from Google AI. One of my students, Supun Nakandala, presented a demonstration of Krypton, our tool for explaining CNN predictions more quickly using query optimization techniques. But apart from our paper, I came to SysML to get a feel for how this new community is shaping up and to network with the other attendees. I was also on the PC. I found it to be a refreshingly new experience to interact with folks from so many different areas under one roof: ML/AI, architecture, PL/compilers, databases, systems, HCI, etc.! The program naturally reflects the eclecticism of this emerging interdisciplinary community. Some observations about the stuff I found most interesting at this conference:

  • Pertinent industrial interest:
    There was a large number of ML engineers and researchers from Google, Facebook, Microsoft, Apple, etc. I also saw DeepMind and Tesla for the first time at a systems conference. This underscores an important aspect of research at this growing intersection: visibility among the "right" industrial audience. Most ML systems research so far has been published at SOSP/OSDI/NSDI, SIGMOD/VLDB, ISCA/HPCA, etc. But such broad conferences typically attract a more generic industrial presence that may or may not be pertinent for ML systems research. For instance, companies usually only send their relational DBMS or data warehousing folks to SIGMOD/VLDB, not their ML systems folks. SysML has clearly found a long-ignored sweet spot that is also growing rapidly.

  • Pipelining and parallelism on steroids:
    There were 4 main groups of papers: faster systems, new ML algorithms, ML debugging, and new ML programming frameworks. I will focus on the first, third, and fourth groups. The first group was largely from the networked/operating systems and architecture folks. The papers showed the power of two key systems tricks that long ago proved impactful in the RDBMS context: pipelining and parallel operators.

    Many papers aimed to reduce the network/communication overhead of distributed ML (e.g., parameter servers) by pipelining communication of parts of the model state with computation over other parts. This is akin to hiding memory/IO latency in single-node systems. While the ideas were interesting, the performance gains underwhelmed me (~30% at best?!). But then again, there is a cultural/expectations gap between the networked systems folks and the database systems folks. :)
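    The pipelining trick itself is simple to sketch. Here is a toy Python version (my own illustration, not any paper's system) where `time.sleep` stands in for gradient computation and for pushing a gradient chunk to a parameter server; the send of chunk i overlaps with the computation of chunk i+1 instead of running compute-all-then-send-all.

```python
import threading
import time

def compute_chunk(i):
    time.sleep(0.01)          # stand-in: compute gradients for chunk i
    return f"grad_{i}"

def send_chunk(grad, log):
    time.sleep(0.01)          # stand-in: push this chunk to a parameter server
    log.append(grad)

def pipelined_step(num_chunks):
    """Overlap sending chunk i with computing chunk i+1."""
    log, sender = [], None
    for i in range(num_chunks):
        grad = compute_chunk(i)
        if sender is not None:
            sender.join()     # previous send must finish before the next starts
        sender = threading.Thread(target=send_chunk, args=(grad, log))
        sender.start()        # send in the background while we keep computing
    if sender is not None:
        sender.join()
    return log
```

    With compute and send times roughly equal, this hides nearly all of the communication time behind computation, which is also why the achievable speedup is bounded by the slower of the two phases.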

    There were many papers on hardware-software co-designed stacks, mainly for vision. But I found this paper from Stanford particularly interesting. It shows that to maximize resource efficiency, we need different kinds of parallelism for different operators within a deep net. I suspect such auto-tuned hybrid parallelism may have implications for other data processing systems too.
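    As a back-of-the-envelope illustration of why per-operator ("hybrid") parallelism can win, here is a toy cost model; the cost formulas and numbers are my own simplifications, not the paper's. The intuition: a convolution (high FLOPs, few parameters) favors data parallelism, while a fully connected layer (many parameters, small activations) favors model parallelism, so a single strategy for the whole network leaves efficiency on the table.

```python
def pick_parallelism(ops, workers=4):
    """Toy per-operator strategy chooser: data parallelism pays to sync
    parameters; model parallelism pays to exchange activations. Costs are
    illustrative only (units: arbitrary 'work')."""
    plan = {}
    for name, (flops, params, activations) in ops.items():
        data_par = flops / workers + params        # replicate op, sync params
        model_par = flops / workers + activations  # split op, ship activations
        plan[name] = "data" if data_par <= model_par else "model"
    return plan

# Hypothetical network: a conv layer and a fully connected layer.
plan = pick_parallelism({
    "conv1": (1e9, 1e4, 1e6),  # many FLOPs, few params, large activations
    "fc1":   (1e8, 1e8, 1e4),  # fewer FLOPs, many params, small activations
})
```

    Even this crude model picks "data" for the conv and "model" for the fully connected layer; the real work is in searching such hybrid plans automatically.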

  • Debugging/programming frameworks for ML:
    These were a welcome relief from so many low-level systems papers! The only "data management for ML" paper was this one from Google that highlights issues in validating and debugging data-related issues in production ML settings. I was already familiar with this work and the TFX team. Such loosely coupled schema-guided approaches are crucial for dynamic heterogeneous environments where neither the data sources nor the model serving environments are under the ML engineer's control. Another interesting paper in this space was on enabling the software engineering practice of "continuous integration" for ML models. They reduce the labeled data requirements for reliably testing the accuracy of new ML models committed to a shared code repository.
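    The continuous-integration idea translates into a surprisingly small gate. Below is my own toy version of an accuracy-regression check (the function names, the slack threshold, and the sample-based estimator are all hypothetical, not the paper's method): a newly committed model passes the "build" only if it does not regress accuracy on a labeled sample beyond a tolerance.

```python
def accuracy(model_fn, labeled_sample):
    """Estimate accuracy of a predictor on a small labeled sample."""
    correct = sum(1 for x, y in labeled_sample if model_fn(x) == y)
    return correct / len(labeled_sample)

def ci_gate(new_model_fn, old_model_fn, labeled_sample, slack=0.02):
    """Accept a committed model only if it does not regress accuracy by more
    than `slack` -- the ML analogue of a unit test failing the build."""
    return (accuracy(new_model_fn, labeled_sample)
            >= accuracy(old_model_fn, labeled_sample) - slack)
```

    The paper's contribution, as I read it, is in making such checks reliable with far fewer labeled examples than a naive estimate like this would need.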

    Finally, the paper I enjoyed reading the most was this one on TensorFlow.js. It studies ML training and inference in a peculiar setting: browsers. They give many remarkable example applications that use this framework, including in ML education with interesting pedagogical implications for teaching ML to non-CS folks. More touchingly, another application built a deep net-powered gestural interface to convert sign language videos to speech. It is heartening to see that the SysML community cares about more than just building faster ML systems or improving business metrics--democratizing ML is a much broader goal!