Introduction
Machine learning (ML) is critical for deriving insights from data in numerous data-driven applications. An overwhelming majority of such applications dealing with structured data have data in the so-called normalized form. In other words, features needed for ML are spread across multiple tables, connected by database dependencies such as primary key-foreign key (PK-FK) dependencies. Since almost all ML tools expect the entire feature vector to be present in a single table, data scientists are forced to join the base tables to materialize/denormalize a single table. A more detailed example of such scenarios was discussed in my previous post.
As data management folks, we know all too well that joins often introduce redundancy in the data, which could blow up the size of the intermediate single table. This also causes runtime wastage because the ML over the single table often has significant computational redundancy. In fact, in one real-world instance at a major Web company that I spoke to, such joins before ML by an analytics team brought down the cluster they shared with other teams because they shot through their allocated storage! Such runtime and storage bottlenecks caused by this problem of "learning after joins" will likely get exacerbated as more analytics workloads move to the cloud, where every bit of extra time or storage purchased requires more $$$. Of course, this problem seems obvious in hindsight but I did not realize its severity until after speaking to practitioners.
Enter Factorized Machine Learning
We highlighted the above problem in our first paper, "Learning Generalized Models over Normalized Data," which appeared at SIGMOD'15 (PDF). The crux of our solution is the "factorized learning" technique, whose core insight is stunningly simple: decompose the ML computations and push them through joins to the base tables! Essentially, this applies the classical RDBMS idea of pushing selections/projections/aggregates through joins to ML. Of course, ML is not one algorithm and there are far too many to cover in a single piece of work. So, for tractability sake, we picked a simple but popular set of ML algorithms to study the feasibility and efficiency of our idea: generalized linear models (GLMs). These models, which include linear regression and logistic regression, essentially construct a hyperplane (the co-efficient vector w) in the d-dimensional feature space for regression or classification. An optimal hyperplane is defined by a loss function, denoted F, over the training data. GLMs are typically trained using gradient descent methods. For this initial work, we focused on the simple batch gradient descent (BGD) method, which is similar in its data access pattern to conjugate gradient and L-BFGS (I will discuss stochastic gradient and coordinate methods later).
Learning a GLM with BGD requires computing the gradient, ∇ F, of the GLM's loss function iteratively. At each iteration, fixing the co-efficient vector w, this is simply a vectorial aggregation over the training examples in which each term multiplies the feature vector x with a GLM-specific scalar function (denoted g) of the inner product of w with x: wTx. Now, the feature vector is actually a concatenation of feature sub-vectors from the base tables created by the join. Denoting the base tables as S and R, we can view x as [xS, xR]. The decomposition now becomes clear: wTx = wTSxR + wTSxR. But there is more: the gradient computation requires post-multiplying the entire feature vector with the scalar output of g. This means the push-down is more complicated than for SQL aggregates like SUM or AVG. Given these observations, the factorized learning technique virtually constructs itself, with the following figure illustrating the relational-style "query" plan/dataflow:
- One sequential pass over R to compute the partial inner products over R's features; store these in an in-memory associative array HS indexed by the ID of the R tuples (the join attribute)
- One sequential pass over S to compute the partial inner products over the features in S, while simultaneously looking up into HS to retrieve and add the pre-computed partial inner products, apply g over the full inner product, and update an entry in another in-memory associative array HR indexed by the foreign key (the join attribute) to store the GROUP BY SUM of g; also, multiply g with xS and aggregate these to obtain the portion of ∇ F over S's features.
- A second sequential pass over R to multiply the scalars in HR with xR and aggregate these to obtain the portion of ∇ F. over R's features.
The gradient value is technically unaffected, which means ML accuracy is not affected. But equally importantly, the above formulation lets us leverage decades of work on scalable relational dataflows, including RDBMSs, Hive, and Spark. Almost all of these systems let us define a user-defined aggregate (UDA) on top of a table. We thus implement each of the three steps as a UDA to realize factorized learning without affecting the scalability of the ML algorithm. Our prototype, named Orion, was built on top of PostgreSQL and Hive (subsequently extended to Spark and in-memory R). Orion is open-sourced on GitHub. Empirically, we found that factorized learning provides speed-ups over the regular materialized execution that are commensurate with the amount of redundancy introduced by the join.
Of course, several practical questions remain: Why not do a lazy join with an index (short response: it does not avoid computational redundancy but it creates a new performance trade-off)? What if HR/HS do not fit in memory (short response: we propose alternatives to scale with new trade-offs)? What if more than two tables are joined (short response: an optimal extension to star schema joins is NP-Hard and we introduce an effective heuristic)? What if the data is distributed (short response: factorized learning is easy to distribute)? All the technical details are in our SIGMOD'15 paper.
Extensions as a Means to Learn Advising
Apart from machine learning, this research was also a human learning experience for me! :) Orion provided the perfect setup to study extensions and generalizations with junior students both to teach them how to do research and for me to learn the ropes of being an advisor from my advisors. I realize I was fortunate in this regard because not many PhD students get such an opportunity. Anyway, with many brilliant undergrad and Masters students at Wisconsin, we extended factorized learning to a bunch of other classes of ML models:
- Boqun Yan and Mona Jalal helped build an R-based toolkit named Santoku that extended the idea to probabilistic classifiers like Naive Bayes and decision trees, which ultimately rely on a collection of SQL-style aggregates to estimate probabilities. Mona and I demonstrated Santoku at VLDB'15 (PDF and code on GitHub).
- Boqun Yan further worked on "factorizing" GLMs when using SGD. It turns out that there is no computational redundancy caused by joins, since SGD updates w after each example! The only savings are in IO/data movement reductions, which at scale could introduce a runtime-accuracy trade-off due to data re-ordering (a form of biased sampling). Mini-batch SGD bridges the gap between SGD and BGD.
- Zhiwei Fan worked on factorizing GLMs when using stochastic coordinate descent (SCD). Unlike SGD, SCD does have computational redundancy and factorized learning helps avoid it. However, it turns out the overheads caused by individual co-ordinate update dominate the savings in many cases. Once again, block CD bridges the gap between SCD and BGD.
- Fengan Li worked on extending factorized learning to clustering algorithms such as K-Means, DBSCAN, and HAC. Each presented its own new trade-offs due to differences in their data access patterns.
- Microsoft, who generously supported my dissertation work at the Gray Lab in Madison, helped me explore collaborations with many Microsoft teams that also faced the learning after joins problem. In particular, Robert McCann from their Web security team (who later became a co-author) gave me the opportunity to prototype our idea on the Cosmos platform and play with some large real datasets.
- In the background, I was also mulling over a "generic" way to extend factorized learning to other ML algorithms rather than take a piecemeal approach. My internship at IBM Almaden's SystemML team in 2011 showed that linear algebra provides a common substrate to express many ML algorithms and scaling linear algebra automatically scales all these ML algorithms. But I was not sure if there would be enough interest to invest more time into this idea.
Tales from SIGMOD and VLDB
Our paper had its share of reviewer troubles (don't we all?). It first got rejected at VLDB with the key complaints being that "no RDBMS vendor will implement this" (seriously!) and that it was "too straightforward" (if only I had a nickel ...). However, I trusted my advisors' sagely advice that a problem-setting paper like this will face such troubles (apparently, Jim Gray's now-famous data cube paper faced similar troubles!). True to their advice, after a rewrite of just the introduction, our paper got accepted to SIGMOD, with the reviewers now commending the paper as "elegant" and "exactly the kind of paper database folks should be writing about ML"! :) Ironically (or presciently?), our paper was scheduled for the "Advanced Query Optimization" session at SIGMOD. Apparently, ML has finally become a part of the bread-and-butter of our community!
At SIGMOD in Australia, the work was well-received and generated some excitement. In particular, Molham Aref, the head of the innovative database/analytics start-up LogicBlox, really liked our idea and explained how some of his company's retail customers faced this problem often. Molham later put me in touch with the rest of his top-notch team at LogicBlox and was incredibly gracious in advertising my work there. Interacting with them validated the utility of this line of work even more. I connected them to one my students, Zhiwei, who did an internship at LogicBlox and worked on some nice problems, including applications of factorized learning, with real customer datasets. Another development was that the IBM SystemML folks also seemed interested in this work. Matthias Boehm asked me if we can extend this factorize learning idea to ... wait for it ... linear algebra! :) The soft-spoken but incredibly deep Dan Suciu also asked if we have formalized the scope of the "language" that our factorized ML idea applies to. All of this feedback encouraged me to push forward with the idea of integration with linear algebra (more below).
At VLDB in Hawaii, I got to meet the super-nice Dan Olteanu, the creator of "factorized databases," which is the namesake for our technique. Interestingly, Dan was aware of our work and stayed in touch with me on this topic. Eventually, his group came out with an interesting new algorithm for linear regression over factorized databases, which generalizes the factorized learning idea to a wide class of join queries (their SIGMOD'16 paper). As an interesting co-incidence (or was it?), their paper was presented in the same session as my Hamlet paper! Another interesting co-incidence: Dan Olteanu had started working part time for LogicBlox and spread the word about my the Hamlet paper! Small world? :) The indefatigable Peter Bailis also attended our demo and suggested that I start blogging about my cool new research ideas. Of course, since I enjoy writing, I did not think twice about starting this blog although I do wish I blogged more often!
Morpheus: Generalizing Factorized Learning using Linear Algebra
Pre-occupied with teaching Wisconsin's undergrad database course and preparing for my job search, I was on the look out for another student to help generalize factorized ML using linear algebra (LA) as a formal representation language for ML. The intuition for this project is simple: many data scientists use LA and LA-based frameworks such as R and Matlab to use ML algorithms and write new ones. The data management community has embraced LA and R as key environments for ML workloads, with tools such as IBM's SystemML, Oracle R Enterprise, and SparkR providing physical data independence for LA scripts: the "data frame" or matrix can be a larger-than-memory HDFS file or an in-RDBMS table. Thus, such tools provide a mechanism to let data scientists focus on their LA scripts rather than worry about scalability or distribution. Alas, when given normalized data, even these LA tools expect a single denormalized "data matrix," leading to the learning after joins problem again.
Lingjiao Chen, whom I met at guest lecture I gave about Orion in Jeff's graduate database class, seemed the prefect fit for this new ambitious goal of solving the above problem: factorize linear algebra! Our insight is simple: since ML algorithms are just scripts in LA that use individual basic and derived LA operators, if we factorize these LA operators, we can automatically factorize many ML algorithms in one go. This could dramatically reduce the development overhead inherent in manually extending factorized learning to different ML algorithms. We set about creating a new framework and system that factorized linear algebra; we call our framework Morpheus and we recently released a pre-print of the Morpheus paper on arXiv. The following figure illustrates the relationship between Morpheus and prior work on factorized ML, including my own work on Orion.
Our approach presented three main technical challenges: generality, closure, and efficiency, which I discuss next. Unlike relational algebra, LA has far too many popular operators. We solve this issue by identifying a subset of LA that is large enough to let us cover a wide variety of ML algorithms, but small enough to be tractable. Also, unlike Orion, we wanted to handle both PK-FK joins and more general "M:N" joins. We do so by introducing a new logical multi-matrix data type to LA: the normalized matrix, which helps represent normalized data and encodes the join relationship using an indicator matrix. Essentially, we bring the classical database notion of logical data independence to LA sytems!
Closure means that we want to rewrite an LA script only into a different LA script; this ensures that we are not forced to modify the internals of the LA system used, which could make practical adoption easier. We do so by devising an extensive framework of algebraic rewrite rules to transform LA operators over a given denormalized matrix into a set of LA operators over the normalized matrix. Herein lies the meat of this work: some operators are trivial to rewrite and are reminescent of pushing relational aggregates through joins but some others present novel rewrite opportunities with no known counterparts in relational optimization. The paper goes into these technical details. Finally, we also identify new efficiency trade-offs in the rewrites and present a heuristic decision rule to predict if the rewrites will not improve runtime performance.
Overall, Morpheus can automatically factorize numerous ML algorithms. We illustrate several popular ones in the paper: logistic regression with gradient descent (as done in our Orion paper), linear regression (three algorithmic variants of this, including Dan Oltenau's SIGMOD'16 algorithm), K-Means clustering, and Gaussian non-negative matrix factorization for feature extraction. On the real-world normalized datasets from the Hamlet paper, we found Morpheus to yield speed-ups of even over an order of magnitude. On synthetic data for M:N joins, we see over two orders of magnitude speed-ups. All of this without forcing data scientists to learn factorized ML!
We have prototyped Morpheus as an easy-to-use library in R as well as on Oracle R Enterprise but the framework itself is abstract and applicable to any LA system. We plan to open-source Morpheus on R soon. Distributed extensions on top of SystemML and TensorFlow are also on their way.
Concluding Remarks
This whole story of the factorized learning idea, its follow-ons and extensions, and ultimately, its generalization is instructive (at least to me) of what is often a common theme in research: ideas rarely form and grow in isolation. Speaking and iterating with practitioners helps obtain new perspectives, evaluate the efficacy of research ideas in practice, obtain new insights, and identify new problems to work on for the future. And it is often satisfying to go beyond just a research or vision paper; building real prototype systems, open sourcing them, and demonstrating them is a valuable part of the process. And finally, research, at least in academia, is not just about ideas, it is also about people: the students, the advisors, the student-advisors (I was one!), the collaborators, the reviewers (ugh!), the practitioners you talk to, the people you meet at conferences, and everybody else you brainstorm with or obtain feedback from.
From a technical perspective, this line of work shows the continued importance of data management-inspired ideas in the context of ML no matter where the ML computations are done. Far from being the "plumbers" that just build faster ML tools, I think the data management community could drive an ever-closer union of these two fields to open up new problems that neither community addresses on its own but solving which could substantially benefit numerous data-driven applications and maybe even create new ones!
Bringing this post to a close, I summarize that simple insight that drove this whole line of work on factorized machine learning, or more generally, on learning "over joins" rather than "after joins" in one succinct stanza:
A sum of sums is the same sum,
Factorizing avoids repeating some.
So, learn after joins no more,
Do learn over joins herefore!
I invite you to read our papers on this topic (Orion, Santoku, and Morpheus) and/or check out the code (Orion and Santoku; Morpheus coming soon). Questions, suggestions, and comments -- critical or otherwise -- are welcome!