Friday, April 13, 2018

The Culture Wars of Data Management

Scientists are people. Despite their fervent protestations of objectivity, all scientists are prone to conflating subjective experience with objective truth, at least once in a while. Einstein himself helped fuel the early growth of quantum mechanics but then infamously dissed it. The "whole truth" is an elusive and enigmatic beast. Consequently, people (including scientists) often fight "tribal culture wars" under the illusion that they (and those who agree with them) are right and all others are wrong. Perhaps nothing captures this issue more eloquently than the timeless parable of the blind men and the elephant. This issue arises over and over again in all fields of human endeavor, including the sciences. In this post, I examine how this issue is affecting the "data management" research community. My motivation is to offer a contemplative analysis of the state of affairs, especially for the benefit of students and new researchers, not to prescribe definitive solutions.

This post is partly inspired by Pedro Domingos's article, "The Five Tribes of Machine Learning." He explains why there are deep "tribal" divisions within the machine learning (ML) community in terms of both the topics they pursue and their research culture. For instance, explosive battles often take place between the "Bayesians" (think statistics) and the "Connectionists" (think deep learning), two of ML's oldest tribes.

The data management field too has similar divisions, albeit not as strongly partitioned by tribes. It all came to a head last year when some prominent members of this community issued an ultimatum to SIGMOD and VLDB, the top data management research conferences. Criticizing the repeated unfair treatment of systems-oriented papers by reviewers (e.g., wrongly deeming a lack of theoretical analysis as low "technical novelty" or "technical depth," or crass dismissive statements such as "just engineering"), they threatened to fork off and start a new systems-oriented data management conference. Such brinkmanship is neither new nor unique to this field. In fact, break-ups often occur on account of "unfairness," e.g., IMC splitting from SIGCOMM, ICLR vs CVPR, Mobisys vs MobiCom, the list is long indeed! From my own experience, I agree with the core claim of the ultimatum. Apparently, the SIGMOD/VLDB leadership also agreed. The PC chairs of VLDB then emailed the whole PC (which I was on) with guidelines on how to fairly evaluate systems papers. Problem solved? Haha, no.

Why write this post? Who is it for?

It may appear puerile to put labels on entire "research cultures." But such cultural gaps are ubiquitous, even in physics. Acknowledging and understanding such gaps is the first step to either mitigating them or making peace with them. Without such understanding, I reckon the data management community might be on a death spiral to more fragmentation (or maybe that is inevitable due to other factors such as size, who knows?). I see such labels as a way of acknowledging the different cultures, tolerating them, and even celebrating the intellectual diversity. More importantly, it is crucial for students and new researchers to be aware of such gaps and why they exist. This could help them make more sense of paper rejections or even negative in-person interactions with other researchers. From my experience, many established researchers transcend the cultural gaps and often bridge multiple cultures. Even among those that stay within one culture, many are not antagonistic to the others. I just wish everyone would be more like these folks. I hope this post nudges more people, especially newcomers, towards that ideal world.

The Four Canonical Cultures (as I see it)

Based on my interactions with dozens of researchers, reading hundreds of papers, and reviewing for SIGMOD/VLDB for the last two years, I see at least 4 "canonical" cultures (to overload a classic DB term!). Unlike Pedro's tribes of ML, which are partitioned by areas (e.g., Connectionists study neural networks), it is misleading to delineate the divisions within the data management community by areas/topics because many areas, e.g., query optimization, have all four cultures (or at least more than one) represented. Thus, my split is based on the inherent expectations of each research culture, their methodologies, the non-data management fields they get inspiration from (including technical vocabulary and "hammers"), and the style and content of their papers. Such differences are more insidious than areas, which is perhaps why the "culture wars" of data management are more damaging than the tribal wars of ML. These cultures are not mutually exclusive; in fact, many researchers have successfully hybridized these cultures and all 2^4 possible combinations are represented at SIGMOD/VLDB to varying extents. Anyway, here are the 4 canonical cultures I see:
  • Formalist
  • Systemist
  • Heuristicist
  • Humanist

The Formalist Culture

The Formalists bring a theory-oriented view to data management. Their culture has been influential since the founding years of the data management field, going back to Ted Codd himself. Formalists draw inspiration from theoretical fields such as logic, math, statistics, and theoretical CS, especially combinatorial optimization, formal methods/PL theory, and ML theory. They seek a high degree of formal rigor in the characterization and communication of ideas, especially in papers. Common elements in Formalist papers are non-trivial theorems and proofs, typically of the "hardness" of problems (couched in the language of computational complexity theory), approximation or randomized algorithms, and formal analyses of the complexity and quality of the algorithms. Many pride themselves in the theoretical sophistication of their results.

Some Formalists also publish in non-data management venues such as SODA, LICS, and NIPS, but curiously, not so much in STOC/FOCS. Database theory (think PODS/ICDT) is a major part of this culture, but not all Formalists are database theoreticians, i.e., some publish regularly at SIGMOD/VLDB too. Some Formalists have a reputation for expecting a high degree of rigor in all data management papers.

The Systemist Culture

The Systemists bring a systems-oriented view to data management. Their culture has also been influential since the founding years. Many pioneers of this field, including Charlie Bachman, Michael Stonebraker, Jim Gray, and David DeWitt, can be considered Systemists. They draw inspiration from the computer systems fields writ large (operating/distributed systems, compilers and software engineering, networking, computer architecture, etc.). Interestingly, their ideas have reshaped some of those other fields, e.g., the concept of ACID reshaping distributed systems. Most Systemists seek a high degree of real-world practicality in ideas and care much less (compared to Formalists) for rigor in data management papers. Some even dislike seeing theorems. Common elements in Systemist papers are system architecture diagrams, discussions of system design decisions, and analyses of system trade-offs, typically with extensive experiments. They typically use system runtime or throughput metrics on synthetic workloads and datasets (e.g., TPC benchmarks). Scalability is often a key concern. Many pride themselves in the practical adoption of their systems, including via startups they found.

Some Systemists and Formalists have a reputation for waging "wars" against each other's culture or even other fields. The war between Bachman and Codd on the (de)merits of the relational data model is one example, as is the war between Stonebraker and Ullman on whether database theory research is needed at all. More recently, some Systemists have waged wars against the distributed systems field, e.g., this now-infamous blog post. Curiously, most Systemists tend not to publish in non-data management computer systems venues such as NSDI/OSDI/SOSP/ASPLOS/ISCA/etc.

The Heuristicist Culture

The Heuristicists draw inspiration from the fields of artificial intelligence (ML, natural language processing, etc.), but also logic, math, statistics, and theoretical CS. This culture started growing mostly from the 1990s. Popular topics where this culture is well-represented are data mining, Web/text data analysis, data cleaning, and data integration. Heuristicists design practical heuristic algorithms, typically without a high degree of rigor in the exploration of ideas (compared to Formalists) and without deep systems-oriented trade-off analyses (compared to Systemists). But there is substantial diversity in this culture. Many papers use rigor in communication and thus, seem closer to the Formalists. Others focus on complex algorithmic architectures and thus, seem closer to the Systemists. Many researchers bridge this culture with the Formalists, especially on the topic of data integration. But many problems in such topics are so "messy" that theoretical work alone goes only so far. So, many researchers also bridge this culture with the Systemists. Common elements in Heuristicist papers are math notation but no non-trivial theorems, large algorithm boxes or diagrams, and extensive experiments, typically with real-world datasets and workloads. They typically focus on quality metrics such as accuracy, precision, recall, AUROC, etc. Runtime or scalability metrics are less common.

Data cleaning and data integration have become central themes in data management research. But a large chunk of data mining researchers broke up with the SIGMOD/VLDB community and joined the newly created SIGKDD community along with applied AI/ML researchers. However, many data mining researchers still publish routinely at SIGMOD/VLDB along with SIGKDD/ICDM/etc. Similarly, many Web/text data researchers also publish at both SIGMOD/VLDB and WWW/AAAI/etc.

The Humanist Culture

The Humanist culture has also been around since the early years (e.g., query-by-example), but this culture too started growing mostly only from the 1990s. This culture draws inspiration from the fields of human computer interaction, programming languages, and cognitive science, but also parts of theoretical CS and computer systems. This culture puts humans that work with data at the center of the research world. Topics in which this culture is well-represented are new abstract programming/computation models for managing/processing data, new query interfaces, interactive data exploration/analysis tools, and data visualization. Common elements in Humanist papers are terms such as usability and productivity, user studies, and even links to demo videos of the tools they build. Remarkably, many Humanist papers overlap with one or more of the other cultures, which perhaps makes this culture the least dogmatic and most eclectic. Humanist papers often have multiple metrics, including quality, system runtime, and human effort (measured with "lines of code", human interaction time, interviews, etc.). Many Humanists also publish at CHI/UIST.

This culture seems to be growing, primarily by attracting people from the other cultures. Working on human-centered problems has a long history in this field, going back to Codd himself. Many researchers often seem to forget the fact that the relational model itself was created not to solve a "hard" theoretical problem, improve a system's performance, or design a heuristic algorithm, but rather to improve the productivity of humans that queried structured data.

Culture Wars and Extremism

By "wars," I mean the endless intellectual tussle for suzerainty over the research field. The most common way such wars cause damage is unfair negative evaluations of research papers because one is unable to see the merits of a paper from a different culture or a cultural hybrid. Many "extremist" Formalists and Heuristicists (and Formalist+Heuristicists) often dismiss many Systemist papers as "just engineering." Many extremist Formalists also dismiss many Heuristicist papers as "too ad hoc" or "too flaky." Many extremist Systemists dismiss many Formalist papers as "not practical" or "too mathy." Many extremist Formalists and Systemists (and Formalist+Systemists) dismiss many Humanist papers as "fluff" and "soft science." Many extremist Formalists, Heuristicists, and Systemists (and many cultural hybrids) also often dismiss many Systemist or Heuristicist papers that explore new important problems and propose initial solutions as "too straightforward" or "not novel" by conflating simplicity, an oft-exalted virtue in the real world, with a lack of novelty. Overall, one often ends up wrongly judging a fish by its ability to climb a tree. Sadly, such wars sometimes force researchers to add needlessly contrived content to papers just to appease such extremists.

Such tribal culture wars and extremism, whether deliberate or not, detract from fair and honest critiques that actually help advance the science. Personally, I find such wars ridiculous. So, please allow me to amuse myself (and hopefully, you) by ridiculing the extremist bigots of each culture that ridiculously diss other cultures and glorify only their own using the provocative meme of "X as seen by Y" (see this one about programmer wars first, if you do not know this meme). Since a 16x16 matrix is too big for me to construct, I restrict myself to the canonical 4x4. :) Hopefully, this will help students/researchers realize that they are not alone in getting caustic comments. I also hope this will cause people to think twice before engaging in such ridiculous wars themselves in the future. Behold, I present to you the culture wars of data management!

(Optional) Explanatory Caption (might be painfully obvious for some). In row-major order from the top-left. First row: Einstein (geniuses, of course), FSM (mushy false gods), Crashed truck (what a disaster!), and Big fat hacker/engineer. Second row: Lucius Malfoy (snooty pure-blood evil wizards destined to be defeated), Justice League (superheroes saving humans), Awkward nerd, and Zealots (close-minded bigots serving the devil). Third row: Kung Fu Panda 3 (googly-eyed admiration), Pretend-Superman, Star Trek (the future beckons), and Airplane pilots (so many moving parts!). Fourth row: Irrelevant contrived junk peddlers, Kids with fun toys (so cute!), Calvin-and-Hobbes's games (so naive!), and The One (the prophetic savior).

Making matters worse are "civil wars" within cultures, especially the Systemists. Some Systemists are so obsessed with relational DBMSs that they pooh-pooh any new data systems. Such insularity has caused much grief in the last decade, especially due to the rise of "Big Data" systems (think MapReduce/Hadoop or Spark) and "NoSQL" systems (think BigTable) from the distributed systems field. Of course, most Systemists acknowledge their mistakes and change their minds over time, but not without causing serious damage to the field. There is also a mini civil war among the Formalists between the logic and discrete math-oriented sub-culture and the statistics/ML theory-oriented sub-culture. With so many culture wars going on, I fear research on "data management for ML"/"ML systems," which is the area my own research focuses on, will be driven away from SIGMOD/VLDB. This area is increasingly considered important for the wider CS landscape and thus, for the data management community too. I will be raising these issues (among others) at a panel discussion at the DEEM Workshop at SIGMOD 2018. Rest assured, I will return to blog about how that goes.

Should the Four Cultures Stick Together or Break Up?

The data management/database/data systems community is not a "monoculture"; it never was and it never will be (almost surely). As a vertical slice of CS, it is "multicultural" and will always have high intellectual diversity. The four cultures may be irreconcilable, but they are complementary and can co-exist. The benefits of inter-cultural tolerance, cultural hybridization, and trans-cultural work are clear: cross-pollination of problems and ideas, trans-cultural partnerships and collaboration, infusion of ideas from each culture's favored non-data management fields, export of ideas from one culture via another to another CS field or non-CS disciplines, and so on. This sort of inter-cultural amity and partnership was/is the norm at the Database Group of the University of Wisconsin-Madison, where I went to graduate school, and the Database Lab of the University of California, San Diego, where I am on the faculty now, and many other database groups. There is a long tradition of research cultural hybridization and trans-cultural research that is practiced and even celebrated by many researchers, senior and junior alike.

Yet, to claim such culture wars do not exist is to be the proverbial ostrich that buries its head in the sand, while pretending we can all just sing kumbaya, pat ourselves on our backs, and get along with stiff upper lips is to be utterly naive. We have to acknowledge and accommodate the vast cultural gaps amicably lest the centrifugal forces caused by unfair research evaluation and tribal cultural wars spin out of control. This form of multiculturalism is a source of strength for SIGMOD/VLDB and sets them apart from related communities such as STOC, OSDI, ISCA, SIGKDD, and CHI. Of course, it will also defeat the point of inter-cultural amity if everyone is expected to work across cultures; it should be left to each researcher to pick a culture or cultural hybrid for their work based on their skills and taste.

Instead of trying to shame or bully any of the cultural groups into conformity to another culture, all groups should practice mutual respect. An analogy I can draw is with movie genres: one cannot coerce a person that only likes arthouse dramas to like big-budget blockbusters or vice versa. All 4 cultures and all cultural hybrids bring something different and valuable to the table of data management research. It will be nothing short of a catastrophe for SIGMOD/VLDB if the Systemists (or indeed, any other group) leave. While such a pyrrhic war has been averted for now, I fear it might turn into a Cold War that poisons the field further. Unfortunately, the current peer review processes of SIGMOD/VLDB are broken and are amplifying the centrifugal forces. Thankfully, they are aware of this issue and working to fix them soon. I am reasonably confident they will course correct but I doubt it will happen quickly or easily.

All that said, I could be wrong and perhaps the cultures are better off breaking up. After all, thanks to the Web, attending a conference is no longer needed to follow new research. Cutting-edge "data management" work is no longer published at SIGMOD/VLDB only; NSDI, OSDI, SIGKDD, NIPS, and more share the pie. Inter-cultural research exchange can happen across communities too. There are precedents for both area-based and culture-based splits/spinoffs: PODS (for database theory), CIDR (for "visiony" data systems), SoCC (for cloud computing), and SysML (for ML systems). Who can be 100% sure it is "bad" to have more splits/spinoffs?

Even if SIGMOD/VLDB remain multicultural, should they become more "federated" instead of "unitary" in terms of the research tracks? It is clearly unfair to have a random SODA (or SIGKDD) reviewer evaluate a random OSDI (or CHI) paper and vice versa--and yet, this is roughly the kind of unfairness caused by the culture wars that continue unabated within the SIGMOD/VLDB community. Is it not a form of "emotional abuse" of students to subject them to such toxic subjective culture wars instead of a resolute focus on the objective (de)merits of ideas? Does the SIGMOD/VLDB community really want to cut such a sorry figure against NSDI/OSDI/SOSP or NIPS/ICML or CHI or other communities that compete for bright students? I think the SIGMOD/VLDB community should openly and honestly tackle such contentious questions without acting like this or this. But I do not know the best avenue for spurring such conversations: a short workshop, a townhall or panel discussion at SIGMOD/VLDB, or something else, who knows?

Concluding Remarks

Scientists are people. It is crucial to practice objectivity and apply solid quality filters for research evaluation. But it is also crucial to be aware of one's own blind spots. Unknown unknowns are also impossible to fathom for anyone, no matter how knowledgeable--none of the blind men can say anything about parts of the elephant they cannot reach. This is all the more true for scientists at the cutting edge of the ever-expanding universe of knowledge. Also crucial is respecting subjective differences on intellectual practice, since the leap from objective facts to judgmental interpretation is far too often a leap of faith guided by subjective experience--even if two of the blind men grasp the same tail fur, one might call it soft on his hand and the other, coarse. Perhaps I am being a naive idealist, but I think it is imperative that we all inject ourselves with large doses of scientific humility, curiosity, civility, empathy, and magnanimity. Watching this video regularly might help! :) Unless there is credible verifiable evidence to the contrary, every scientist must be willing to admit that they could be wrong. Every one of us is (intellectually speaking) "blind" in some way, and we will remain blind no matter how much we learn. But we can all be less blind by talking with others that have an earnest world-view that is different from ours, not talking down to them.

PS: I do not claim that these 4 cultures are the only ones in data management. I am sure as the field keeps growing, we might see the addition of new cultures or the reorganization of existing ones. Perhaps I will write another post on this topic 20 years from now! If you have any thoughts on this topic or this post, please do leave your comments below.

ACKs: Special thanks to Julian McAuley for being a sounding board for this post. I also thank the following people (in alphabetical order of surnames) for giving me feedback on this post, including suggestions that strengthened it: Peter Bailis, Spyros Blanas, Vijay Chidambaram, Joe Hellerstein, Paris Koutris, Sam Madden, Yannis Papakonstantinou, Jignesh Patel, Christopher Re, Victor Vianu, Eugene Wu, and a few others that did not want to be acknowledged. My ACKs do not necessarily mean they endorse any of the opinions expressed in this post, which are my own.

Reactions (Contributed Paragraphs)

The post elicited a wide range of reactions from data management researchers. Some researchers have kindly contributed a paragraph to register their reactions here, both with and without attribution. I expect to have more contributions in due course. If you do not want to comment publicly, feel free to email me your reaction so that I can add it here as an anonymous contribution. I hope all this contributes to stirring more conversations on this important topic.

Jignesh Patel:
Interesting! Note Systemists don't abhor theory, they abhor theory for the sake of theory. Classic systems papers have some formalism, but just what is required to understand the system implications (think of the classic Gray locking paper). I don't think Systemists wage wars. Like true systems people, they are quick to identify a practical problem and speak up. Having said all that, you do bring out an important problem of tribal wars that is killing the community. Also, we actually have a really nice database systems community where the senior folks actually take criticism quite well. Arguments are welcome! Don't know for sure about the theory folks, but I think they are pretty open-minded too. Overall, we have a pretty nice community, and largely the right things happen in the long run. Ok--I'm tenured and I'm not as pessimistic as others are, perhaps as a result?

I liked how you have broken up the community into 4 tribes. I agree with most of it. However, on a higher-level feedback, here's what I think: I feel like this is perhaps triggered by bad/unfair reviews, which every one of us has dealt with. However, I personally see the poor reviewing issue (i.e., lack of tolerance) as only a symptom of a bigger problem that probably has a lot less to do with "culture" and more to do with a "mafia" mentality in the community. Awards/recognitions/opportunities are not distributed based on merit. Rather, advisors/friends look out for their own advisees/friends. For example, the more famous the advisor, the more opportunities for their students. That means most everyone else is simply an "academic orphan" in the community. Not all but the majority of senior people in the community who are in control of how recognitions/opportunities are distributed do NOT act on what they preach. In public, most of them talk about the importance of "impactful work" but when you read their own letters they are doing nothing more than "bean counting" when it comes to assessing impact. The traditional data management community is rapidly losing its relevance, and as such, everyone is trying to come up with a definition of what's a fake problem and what's real/worthy. How does this relate to low-quality reviews? The data management community is a hostile environment due to its contentious and unfair inner workings. In a hostile environment individuals aren't acting rationally, let alone fairly. For example, even PC chairs and area chairs are not always people who are deemed by the majority of the community as reasonable or even insightful.

Sunday, January 1, 2017

Research Story: Factorized Machine Learning

This post is a follow-up to my earlier one on whether joins are even needed for ML over multi-table datasets. When joins are indeed needed, they often become a major bottleneck in the end-to-end process of building ML models. I tackled this problem in my dissertation by proposing the paradigm of factorized machine learning. This problem also recently attracted the attention of a few other researchers. This post covers this recent line of work, presents the interesting stories behind its progression, and discusses some recent results.


Machine learning (ML) is critical for deriving insights from data in numerous data-driven applications. An overwhelming majority of such applications dealing with structured data have data in the so-called normalized form. In other words, features needed for ML are spread across multiple tables, connected by database dependencies such as primary key-foreign key (PK-FK) dependencies. Since almost all ML tools expect the entire feature vector to be present in a single table, data scientists are forced to join the base tables to materialize/denormalize a single table. A more detailed example of such scenarios was discussed in my previous post.

As data management folks, we know all too well that joins often introduce redundancy in the data, which could blow up the size of the intermediate single table. This also wastes runtime because the ML over the single table often has significant computational redundancy. In fact, in one real-world instance at a major Web company whose practitioners I spoke to, such joins before ML by an analytics team brought down the cluster they shared with other teams because they shot through their allocated storage! Such runtime and storage bottlenecks caused by this problem of "learning after joins" will likely get exacerbated as more analytics workloads move to the cloud, where every bit of extra time or storage purchased requires more $$$. Of course, this problem seems obvious in hindsight but I did not realize its severity until after speaking to practitioners.
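To make the redundancy concrete, here is a tiny self-contained sketch; the tables and numbers are made up for illustration (they are not from the paper). Each R tuple's features get copied once per matching S tuple when the join is materialized:

```python
# Hypothetical toy schema: fact table S (one row per training example, each with
# a foreign key into R) and dimension table R (2 tuples, 2 features each).
S = [{"fk": 0}, {"fk": 0}, {"fk": 0}, {"fk": 1}, {"fk": 1}, {"fk": 1}]
R = {0: {"xR1": 0.5, "xR2": 0.2}, 1: {"xR1": 0.7, "xR2": 0.9}}

# Materializing the join copies each R tuple's features once per matching S tuple.
T = [{**s, **R[s["fk"]]} for s in S]

stored_in_R = sum(len(feats) for feats in R.values())        # 4 feature values in R
stored_in_T = sum(k.startswith("xR") for t in T for k in t)  # 12 copies in the join
print(stored_in_T / stored_in_R)  # 3.0, i.e., a 3x blow-up of R's features
```

With real fan-outs of thousands or millions of S tuples per R tuple, this ratio is exactly what blows up storage and compute.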

Enter Factorized Machine Learning

We highlighted the above problem in our first paper, "Learning Generalized Models over Normalized Data," which appeared at SIGMOD'15 (PDF). The crux of our solution is the "factorized learning" technique, whose core insight is stunningly simple: decompose the ML computations and push them through joins to the base tables! Essentially, this applies the classical RDBMS idea of pushing selections/projections/aggregates through joins to ML. Of course, ML is not one algorithm and there are far too many to cover in a single piece of work. So, for tractability's sake, we picked a simple but popular set of ML algorithms to study the feasibility and efficiency of our idea: generalized linear models (GLMs). These models, which include linear regression and logistic regression, essentially construct a hyperplane (the coefficient vector w) in the d-dimensional feature space for regression or classification. An optimal hyperplane is defined by a loss function, denoted F, over the training data. GLMs are typically trained using gradient descent methods. For this initial work, we focused on the simple batch gradient descent (BGD) method, which is similar in its data access pattern to conjugate gradient and L-BFGS (I will discuss stochastic gradient and coordinate methods later).
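For readers less familiar with BGD, here is a minimal NumPy sketch of it for logistic regression (one of the GLMs above); the function name, learning rate, and fixed iteration count are my own illustrative choices, not Orion's:

```python
import numpy as np

def logistic_bgd(X, y, steps=200, lr=0.5):
    """Batch gradient descent (BGD) for logistic regression, a GLM.

    Each iteration makes one full pass over the data: the gradient is a
    vectorial aggregation over all n training examples, where each term
    scales the feature vector x_i by a scalar g(w^T x_i).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # g is the GLM-specific scalar per example: sigmoid(w^T x_i) - y_i
        g = 1.0 / (1.0 + np.exp(-(X @ w))) - y
        w -= lr * (X.T @ g) / n  # aggregate g_i * x_i, then take a step
    return w
```

Note how each iteration is a full-pass vectorial aggregation of g_i * x_i; that structure is what makes BGD amenable to relational-style push-down.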

Learning a GLM with BGD requires computing the gradient, ∇ F, of the GLM's loss function iteratively. At each iteration, fixing the coefficient vector w, this is simply a vectorial aggregation over the training examples in which each term multiplies the feature vector x with a GLM-specific scalar function (denoted g) of the inner product of w with x: wTx. Now, the feature vector is actually a concatenation of feature sub-vectors from the base tables created by the join. Denoting the base tables as S and R, we can view x as [xS, xR] and split w correspondingly as [wS, wR]. The decomposition now becomes clear: wTx = wSTxS + wRTxR. But there is more: the gradient computation requires post-multiplying the entire feature vector with the scalar output of g. This means the push-down is more complicated than for SQL aggregates like SUM or AVG. Given these observations, the factorized learning technique virtually constructs itself, with the following figure illustrating the relational-style "query" plan/dataflow:

  1. One sequential pass over R to compute the partial inner products over R's features; store these in an in-memory associative array HS indexed by the ID of the R tuples (the join attribute).
  2. One sequential pass over S to compute the partial inner products over the features in S, while simultaneously looking up into HS to retrieve and add the pre-computed partial inner products, apply g over the full inner product, and update an entry in another in-memory associative array HR indexed by the foreign key (the join attribute) to store the GROUP BY SUM of g; also, multiply g with xS and aggregate these to obtain the portion of ∇ F over S's features.
  3. A second sequential pass over R to multiply the scalars in HR with xR and aggregate these to obtain the portion of ∇ F over R's features.
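The three steps above can be sketched in NumPy for linear regression (squared loss, so g(a) = a - y); this is my minimal re-creation under those assumptions, with variable names mirroring the steps rather than Orion's actual code:

```python
import numpy as np

def factorized_gradient(S_fk, XS, yS, XR, w):
    """One BGD gradient over the join of S and R without materializing it.

    S_fk[i] is the foreign key of S tuple i into R; w is split as [wS, wR]
    to match the feature sub-vectors [xS, xR]. For concreteness this uses
    squared loss (linear regression), i.e., g(a) = a - y.
    """
    dS = XS.shape[1]
    wS, wR = w[:dS], w[dS:]
    # Pass 1 over R: partial inner products over R's features, keyed by tuple ID.
    HS = XR @ wR
    # Pass 2 over S: complete each inner product via a lookup into HS, apply g,
    # aggregate g_i * xS_i for the S-portion of the gradient, and GROUP BY the
    # foreign key to accumulate the SUM of g into HR.
    g = (XS @ wS + HS[S_fk]) - yS
    grad_S = XS.T @ g
    HR = np.zeros(XR.shape[0])
    np.add.at(HR, S_fk, g)  # unbuffered scatter-add: HR[k] = sum of g where fk == k
    # Pass 3 over R: multiply the grouped scalars with xR for the R-portion.
    grad_R = XR.T @ HR
    return np.concatenate([grad_S, grad_R])
```

The result matches X.T @ (X @ w - y) computed over the materialized table exactly, while each R tuple is read only twice per iteration regardless of the join's fan-out.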

The gradient value is technically unaffected, which means ML accuracy is not affected. But equally importantly, the above formulation lets us leverage decades of work on scalable relational dataflows, including RDBMSs, Hive, and Spark. Almost all of these systems let us define a user-defined aggregate (UDA) on top of a table. We thus implement each of the three steps as a UDA to realize factorized learning without affecting the scalability of the ML algorithm. Our prototype, named Orion, was built on top of PostgreSQL and Hive (subsequently extended to Spark and in-memory R). Orion is open-sourced on GitHub. Empirically, we found that factorized learning provides speed-ups over the regular materialized execution that are commensurate with the amount of redundancy introduced by the join.

Of course, several practical questions remain: Why not do a lazy join with an index (short response: it does not avoid computational redundancy but it creates a new performance trade-off)? What if HR/HS do not fit in memory (short response: we propose alternatives to scale with new trade-offs)? What if more than two tables are joined (short response: an optimal extension to star schema joins is NP-Hard and we introduce an effective heuristic)? What if the data is distributed (short response: factorized learning is easy to distribute)? All the technical details are in our SIGMOD'15 paper.

Extensions as a Means to Learn Advising

Apart from machine learning, this research was also a human learning experience for me! :) Orion provided the perfect setup to study extensions and generalizations with junior students both to teach them how to do research and for me to learn the ropes of being an advisor from my advisors. I realize I was fortunate in this regard because not many PhD students get such an opportunity. Anyway, with many brilliant undergrad and Masters students at Wisconsin, we extended factorized learning to a bunch of other classes of ML models:
  • Boqun Yan and Mona Jalal helped build an R-based toolkit named Santoku that extended the idea to probabilistic classifiers like Naive Bayes and decision trees, which ultimately rely on a collection of SQL-style aggregates to estimate probabilities. Mona and I demonstrated Santoku at VLDB'15 (PDF and code on GitHub).
  • Boqun Yan further worked on "factorizing" GLMs when using SGD. It turns out that there is no computational redundancy caused by joins, since SGD updates w after each example! The only savings are in IO/data movement reductions, which at scale could introduce a runtime-accuracy trade-off due to data re-ordering (a form of biased sampling). Mini-batch SGD bridges the gap between SGD and BGD.
  • Zhiwei Fan worked on factorizing GLMs when using stochastic coordinate descent (SCD). Unlike SGD, SCD does have computational redundancy and factorized learning helps avoid it. However, it turns out the overheads caused by individual coordinate updates dominate the savings in many cases. Once again, block CD bridges the gap between SCD and BGD.
  • Fengan Li worked on extending factorized learning to clustering algorithms such as K-Means, DBSCAN, and HAC. Each presented its own new trade-offs due to differences in their data access patterns.
  • Microsoft, who generously supported my dissertation work at the Gray Lab in Madison, helped me explore collaborations with many Microsoft teams that also faced the learning after joins problem. In particular, Robert McCann from their Web security team (who later became a co-author) gave me the opportunity to prototype our idea on the Cosmos platform and play with some large real datasets.
  • In the background, I was also mulling over a "generic" way to extend factorized learning to other ML algorithms rather than take a piecemeal approach. My internship at IBM Almaden's SystemML team in 2011 showed that linear algebra provides a common substrate to express many ML algorithms and scaling linear algebra automatically scales all these ML algorithms. But I was not sure if there would be enough interest to invest more time into this idea.

Tales from SIGMOD and VLDB

Our paper had its share of reviewer troubles (don't we all?). It first got rejected at VLDB with the key complaints being that "no RDBMS vendor will implement this" (seriously!) and that it was "too straightforward" (if only I had a nickel ...). However, I trusted my advisors' sage advice that a problem-setting paper like this would face such troubles (apparently, Jim Gray's now-famous data cube paper faced similar troubles!). True to their advice, after a rewrite of just the introduction, our paper got accepted to SIGMOD, with the reviewers now commending the paper as "elegant" and "exactly the kind of paper database folks should be writing about ML"! :) Ironically (or presciently?), our paper was scheduled for the "Advanced Query Optimization" session at SIGMOD. Apparently, ML has finally become a part of the bread-and-butter of our community!

At SIGMOD in Australia, the work was well-received and generated some excitement. In particular, Molham Aref, the head of the innovative database/analytics start-up LogicBlox, really liked our idea and explained how some of his company's retail customers faced this problem often. Molham later put me in touch with the rest of his top-notch team at LogicBlox and was incredibly gracious in advertising my work there. Interacting with them validated the utility of this line of work even more. I connected them to one of my students, Zhiwei, who did an internship at LogicBlox and worked on some nice problems, including applications of factorized learning, with real customer datasets. Another development was that the IBM SystemML folks also seemed interested in this work. Matthias Boehm asked me if we could extend this factorized learning idea to ... wait for it ... linear algebra! :) The soft-spoken but incredibly deep Dan Suciu also asked if we had formalized the scope of the "language" that our factorized ML idea applies to. All of this feedback encouraged me to push forward with the idea of integration with linear algebra (more below).

At VLDB in Hawaii, I got to meet the super-nice Dan Olteanu, the creator of "factorized databases," which is the namesake for our technique. Interestingly, Dan was aware of our work and stayed in touch with me on this topic. Eventually, his group came out with an interesting new algorithm for linear regression over factorized databases, which generalizes the factorized learning idea to a wide class of join queries (their SIGMOD'16 paper). As an interesting coincidence (or was it?), their paper was presented in the same session as my Hamlet paper! Another interesting coincidence: Dan Olteanu had started working part time for LogicBlox and spread the word about my Hamlet paper! Small world? :) The indefatigable Peter Bailis also attended our demo and suggested that I start blogging about my cool new research ideas. Of course, since I enjoy writing, I did not think twice about starting this blog although I do wish I blogged more often!

Morpheus: Generalizing Factorized Learning using Linear Algebra

Pre-occupied with teaching Wisconsin's undergrad database course and preparing for my job search, I was on the lookout for another student to help generalize factorized ML using linear algebra (LA) as a formal representation language for ML. The intuition for this project is simple: many data scientists use LA and LA-based frameworks such as R and Matlab to run ML algorithms and write new ones. The data management community has embraced LA and R as key environments for ML workloads, with tools such as IBM's SystemML, Oracle R Enterprise, and SparkR providing physical data independence for LA scripts: the "data frame" or matrix can be a larger-than-memory HDFS file or an in-RDBMS table. Thus, such tools provide a mechanism to let data scientists focus on their LA scripts rather than worry about scalability or distribution. Alas, when given normalized data, even these LA tools expect a single denormalized "data matrix," leading to the learning after joins problem again.

Lingjiao Chen, whom I met at a guest lecture I gave about Orion in Jeff's graduate database class, seemed the perfect fit for this new ambitious goal of solving the above problem: factorize linear algebra! Our insight is simple: since ML algorithms are just scripts in LA that use individual basic and derived LA operators, if we factorize these LA operators, we can automatically factorize many ML algorithms in one go. This could dramatically reduce the development overhead inherent in manually extending factorized learning to different ML algorithms. We set about creating a new framework and system that factorized linear algebra; we call our framework Morpheus and we recently released a pre-print of the Morpheus paper on arXiv. The following figure illustrates the relationship between Morpheus and prior work on factorized ML, including my own work on Orion.

Our approach presented three main technical challenges: generality, closure, and efficiency, which I discuss next. Unlike relational algebra, LA has far too many popular operators. We solve this issue by identifying a subset of LA that is large enough to let us cover a wide variety of ML algorithms, but small enough to be tractable. Also, unlike Orion, we wanted to handle both PK-FK joins and more general "M:N" joins. We do so by introducing a new logical multi-matrix data type to LA: the normalized matrix, which helps represent normalized data and encodes the join relationship using an indicator matrix. Essentially, we bring the classical database notion of logical data independence to LA systems!

Closure means that we want to rewrite an LA script only into a different LA script; this ensures that we are not forced to modify the internals of the LA system used, which could make practical adoption easier. We do so by devising an extensive framework of algebraic rewrite rules to transform LA operators over a given denormalized matrix into a set of LA operators over the normalized matrix. Herein lies the meat of this work: some operators are trivial to rewrite and are reminiscent of pushing relational aggregates through joins but some others present novel rewrite opportunities with no known counterparts in relational optimization. The paper goes into these technical details. Finally, we also identify new efficiency trade-offs in the rewrites and present a heuristic decision rule to predict when the rewrites will not improve runtime performance.
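To make the closure idea concrete, here is a minimal sketch (my own notation and class names, not the Morpheus API) of a normalized matrix for a single PK-FK join, with two rewritten operators. The denormalized matrix is logically T = [S, K @ R], where K is the 0/1 indicator matrix encoding the join:

```python
import numpy as np

class NormalizedMatrix:
    """Toy normalized matrix: logically T = [S, K @ R], where
    K[i, j] = 1 iff row i of S joins row j of R. T is never built."""
    def __init__(self, S, K, R):
        self.S, self.K, self.R = S, K, R

    def matvec(self, w):
        # Rewrite T @ w  ->  S @ wS + K @ (R @ wR): the multiply is pushed
        # through the join, avoiding redundant computation over repeated R rows.
        dS = self.S.shape[1]
        return self.S @ w[:dS] + self.K @ (self.R @ w[dS:])

    def rmatvec(self, v):
        # Rewrite T.T @ v  ->  [S.T @ v ; R.T @ (K.T @ v)]: K.T @ v is a
        # grouped aggregate over the join key, akin to a GROUP BY SUM.
        return np.concatenate([self.S.T @ v, self.R.T @ (self.K.T @ v)])
```

Both rewrites return exactly what the operator over the materialized T would, which is the closure property: an LA script over T becomes another LA script over S, K, and R.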

Overall, Morpheus can automatically factorize numerous ML algorithms. We illustrate several popular ones in the paper: logistic regression with gradient descent (as done in our Orion paper), linear regression (three algorithmic variants of this, including Dan Olteanu's SIGMOD'16 algorithm), K-Means clustering, and Gaussian non-negative matrix factorization for feature extraction. On the real-world normalized datasets from the Hamlet paper, we found Morpheus to yield speed-ups of over an order of magnitude. On synthetic data for M:N joins, we see over two orders of magnitude speed-ups. All of this without forcing data scientists to learn factorized ML!

We have prototyped Morpheus as an easy-to-use library in R as well as on Oracle R Enterprise but the framework itself is abstract and applicable to any LA system. We plan to open-source Morpheus on R soon. Distributed extensions on top of SystemML and TensorFlow are also on their way.

Concluding Remarks

This whole story of the factorized learning idea, its follow-ons and extensions, and ultimately, its generalization illustrates (at least to me) a common theme in research: ideas rarely form and grow in isolation. Speaking and iterating with practitioners helps obtain new perspectives, evaluate the efficacy of research ideas in practice, obtain new insights, and identify new problems to work on for the future. And it is often satisfying to go beyond just a research or vision paper; building real prototype systems, open sourcing them, and demonstrating them is a valuable part of the process. And finally, research, at least in academia, is not just about ideas, it is also about people: the students, the advisors, the student-advisors (I was one!), the collaborators, the reviewers (ugh!), the practitioners you talk to, the people you meet at conferences, and everybody else you brainstorm with or obtain feedback from.

From a technical perspective, this line of work shows the continued importance of data management-inspired ideas in the context of ML no matter where the ML computations are done. Far from being the "plumbers" that just build faster ML tools, I think the data management community could drive an ever-closer union of these two fields to open up new problems that neither community addresses on its own but solving which could substantially benefit numerous data-driven applications and maybe even create new ones!

Bringing this post to a close, I summarize the simple insight that drove this whole line of work on factorized machine learning, or more generally, on learning "over joins" rather than "after joins," in one succinct stanza:

A sum of sums is the same sum,
Factorizing avoids repeating some.
So, learn after joins no more,
Do learn over joins herefore!

I invite you to read our papers on this topic (Orion, Santoku, and Morpheus) and/or check out the code (Orion and Santoku; Morpheus coming soon). Questions, suggestions, and comments -- critical or otherwise -- are welcome!

Saturday, June 25, 2016

Research Paper: To Join or Not to Join?

There was once a paper,
Which got badly rejected at PODS.
Boldly, it went forth to SIGMOD,
Against it were overwhelming odds.
Branded "extremely controversial,"
Yet "profound," it waited a long while.
At last, deliverance was in sight,
Though many a reviewer it did rile!

This is a post about my SIGMOD 2016 paper titled To Join or Not to Join? Thinking Twice about Joins before Feature Selection. It is co-authored with my advisors, Jeff Naughton and Jignesh Patel, and our ML collaborator, Jerry Zhu.


Machine learning (ML) is all the rage in CS right now. The "traditional" view of training data in ML is that it is a single table with all the features about the entity to model. However, in most real-world applications, datasets are often multi-table and connected by primary key-foreign key (KFK) relationships. Thus, data scientists often join the base tables to gather as many features as they can for ML. Here is a concrete example (from the paper):

Example: Consider a data scientist at an insurance company modeling their customers using ML to prevent customer churn. They have a table with customer features such as their age, gender, income, employer, etc. One of those features, the employer's ID, is a foreign key (FK) that refers to a separate table with information about companies and other organizations. Given this situation, the data scientist instinctively joins these two tables to create a single table in which the feature vectors from the customer and employer tables are concatenated. They then use it to train their favorite ML model, e.g., Naive Bayes, logistic regression, neural networks, etc.

Joins before ML do not just arise in insurance; they are ubiquitous and arise in practically every data-driven application. Recommendation systems join tables with ratings, users, and movies. Web security systems join tables about login events, user accounts, and webpages. The list goes on and on. In all these cases, feature selection is typically performed implicitly or explicitly along with ML to help improve accuracy. For example, explicit greedy subset search techniques such as backward or forward selection are popular for Naive Bayes, while implicit L1 or L2 regularization is popular for logistic regression and neural networks.

In this paper, we ask a simple but radical question: Do we really "need" these joins before ML? The motivation for this question is a simple information theoretic result we prove in the paper: the features brought in by such KFK joins are redundant. In the ML literature, many feature selection methods explicitly search for redundant features and remove them. This motivates us to short-circuit the feature selection process using database schema information: simply avoid KFK joins before ML! Clearly, this could help improve runtime performance because several features are avoided before feature selection without even looking at the data. Avoiding the join processing cost also saves some time but this component is usually minor. Empirically, we saw speed-ups of up to two orders of magnitude (186x) on real-world datasets.

But the million dollar question for avoiding joins is: Will avoiding a KFK join reduce ML accuracy? This paper is an in-depth exploration of this rather surprising and counter-intuitive question that combines a statistical learning theory-based analysis with a database-style practical approach to avoiding joins in the real world.


Our first technical results show that information theory may not be perspicacious enough to explain whether avoiding joins would reduce accuracy or not. Thus, we resort to applying statistical learning theory, which explains that the prediction (test) error of an ML classifier can be decomposed into three components: bias, variance, and noise. Informally speaking, the bias quantifies how complex an ML model's hypothesis space is -- more complex models have lower bias. The variance quantifies the instability of the training process -- more complex models have higher variance (conversely, fixing the model, having more training examples lowers the variance). At the heart of ML is a trade-off between bias and variance, which is often illustrated as follows: as the model complexity increases, the bias decreases but the variance increases, which leads to higher test errors than train errors -- a situation known colloquially as "overfitting."

Thus, our question now becomes: What is the effect of avoiding a KFK join on bias and variance? The paper goes into the technical details and arrives at the following answer: Avoiding a join will not increase the bias but it might increase the variance. The crux of why this happens is a simple two-fold reason, which I explain briefly next. First, the foreign key (the employer's ID) encodes all information about the employer because all employer features are merely functions of the employer's ID. In database parlance, this is also called a functional dependency. Thus, any ML classifier, which is simply a function of the feature vector, can be recast as another function that only uses the employer's ID and avoids other employer features. Essentially, a function of a function is just another function! Thus, the hypothesis space of the model does not shrink when avoiding the join, which means the bias will not increase. In contrast, it might turn out that some specific employer features are sufficient to represent the true concept. But usually, the foreign key has a much larger domain than all the features it refers to (for example, there are only 50 states, but there are millions of employers). Thus, the hypothesis space will be far larger if the employer ID is used as a representative of the avoided employer features. Since the number of training examples is still the same whether or not we avoid the join, this implies that the variance might shoot up if we avoid the join.
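The "function of a function" argument can be made concrete with a tiny illustrative sketch (the names, values, and scoring function here are entirely hypothetical):

```python
# Hypothetical employer table: employer ID (the FK) -> employer feature vector.
# Because of the functional dependency, this lookup is a function of the FK.
employer_features = {101: (0.3, 1.0), 102: (0.7, 0.0)}

def h(x_customer, x_employer):
    """Some classifier scoring function over the concatenated feature vector."""
    return x_customer + sum(x_employer)

def h_recast(x_customer, employer_id):
    """The same classifier recast to use only the FK: a function (h) of a
    function (the FK -> features lookup) is just another function."""
    return h(x_customer, employer_features[employer_id])
```

Every classifier expressible over the employer features is thus expressible over the FK alone, which is why the bias cannot increase; the converse does not hold, which is why the variance might.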

How do we use the above theoretical analysis to help data scientists in practice? One practical way, which we take, is to bound the potential increase in variance and enable data scientists to apply a threshold on it; essentially, this creates a simple decision rule. We call this process avoiding the join "safely." Basically, the data scientist tolerates a certain risk of slightly higher test error in return for potentially much faster runtime performance. The paper goes into the details of how we adapt classical VC dimension-based bounds on variance from the learning theory literature for our purpose.

After a lot of analysis and conservative simplifications, we arrive at a stunningly simple and easy-to-check criterion in the end: the Tuple Ratio (TR) rule. Essentially, we only need to compare the ratio of the number of tuples in the tables that are joined: in our example, this is the ratio of the number of customers to the number of employers. If this ratio is above a certain threshold, then it is "safe" to avoid the join. A key distinguishing aspect of our approach is that our TR threshold is tied only to the VC dimension of an ML model and is independent of the dataset instance (unlike traditional ML hyper-parameters, which require tuning per instance). We tuned our rule using a simulation study and found that for an error tolerance of 0.001, a reasonable value for the TR threshold is 20.
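As a sketch, the resulting decision rule is almost embarrassingly simple (the function name and signature are mine; the threshold of 20 is the paper's tuned value for a 0.001 error tolerance):

```python
def safe_to_avoid_join(num_referencing_tuples, num_referenced_tuples,
                       tr_threshold=20.0):
    """Tuple Ratio (TR) rule, sketched: it is "safe" to avoid the KFK join if
    the ratio of referencing tuples (e.g., customers) to referenced tuples
    (e.g., employers) meets the threshold. No data scan is needed: only the
    two table cardinalities from the schema/catalog."""
    return (num_referencing_tuples / num_referenced_tuples) >= tr_threshold
```

For instance, a million customers referring to ten thousand employers gives TR = 100, well above 20, so the employer table's features could be dropped before feature selection even begins.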

An extensive empirical validation using 7 real-world datasets (14 joins between them) shows that the TR rule with a threshold of 20 works out of the box for both Naive Bayes and logistic regression combined with several different feature selection methods! Overall, it correctly avoids 7 out of 11 joins that are safe to avoid and correctly does not avoid 3 out of 3 joins that are not safe to avoid. The remaining 4 cases are missed opportunities to avoid joins, which happens because our decision rule is conservative.

Concluding Remarks

Our paper opens up a new connection between learning theory and joins, which are fundamental relational operations that arise during the process of feature engineering for ML over multi-table datasets. Much work remains to be done, however, and this paper is only the beginning. Our ideas assume that the ML model has a finite VC dimension (specifically, we assume a linear VC dimension). It is not clear what happens if we use models with higher (potentially infinite) VC dimensions, e.g., neural networks, RBF-kernel SVMs, or decision trees. We also assume discrete feature spaces. It is not clear what happens with numeric feature spaces. Several such interesting questions remain unanswered and answering them requires new research that truly lies at the intersection of data management and machine learning, both of which are critical for today's data-driven world.

Interestingly, the SIGMOD reviewers found our work "extremely controversial," as the opening poem alludes to. In fact, we got six reviewers instead of the usual three! In the first round, two gave us "champion" accepts, while the rest, including an external ML expert, gave us a reject. After the revision round, however, all six accepted our paper. The key bone of controversy was whether foreign keys can indeed be used as features for ML. The short answer is: it depends but in most applications, foreign keys are routinely used as features for ML. The paper explains this issue in more detail.

To Join or Not to Join,
That is the question.
Tuple Ratio, we did coin,
Use at your discretion!

I invite you to read our research paper. Questions, suggestions, and comments -- critical or otherwise -- are welcome! Controversy is even more welcome! :)


Project webpage.
Research paper at SIGMOD 2016.
Technical report.
Code on GitHub.
Real datasets.

Sunday, December 6, 2015

Vision Paper: Model Selection Management Systems

This is a post about my vision paper that appears in ACM SIGMOD Record, Dec 2015. It is titled Model Selection Management Systems: The Next Frontier of Advanced Analytics. It is co-authored with my advisors, Jeff Naughton and Jignesh Patel, and our collaborator from Microsoft, Robert McCann, who is an applied researcher and practitioner of ML.


The field of advanced analytics -- the intersection of data management and machine learning (ML) -- is undoubtedly booming. It seems that not a month goes by without news of a new startup in this already crowded space: Alation, Dato, Palantir, and Scaled Inference, only to name a few. Established companies such as IBM, Microsoft, Oracle, Pivotal, etc. are also investing heavily in building ML-related infrastructure software to improve support for moneyed enterprise users, while Baidu, Google, Facebook, and other Web companies are also building vast internal ML infrastructure. However, after a spurt of research in the early 2000s on scalable in-RDBMS data mining and ML, academic and research efforts in this space by the data management community seem to have stagnated and fragmented. As data management researchers, do we have anything new and non-trivial to contribute to this booming field other than just to build faster or more scalable toolkits that implement individual ML algorithms? Or is all the interesting work in this space under the sole purview of engineers in industry?


In our vision paper, we make the case for a new, cohesive, and potentially impactful direction of research in advanced analytics for the data management community. We start with the observation that building ML models in practice is seldom a one-shot "slam dunk". Based on numerous conversations with analysts at both enterprise and Web companies, we found that using ML is often an iterative exploratory process that involves three key challenging practical tasks -- feature engineering (FE), algorithm selection (AS), and parameter tuning (PT), collectively called model selection. Surprisingly, these three tasks have largely been overlooked by the data management community even though these are often the most time-consuming tasks for ML users. Consequently, there is little end-to-end systems support for this iterative process of model selection, which causes pain to analysts and wastes system resources. In our paper, we envision a unifying abstract framework for building systems that can support the end-to-end process of model selection and identify the research challenges posed by such systems.


To make the process of model selection easier and faster, we envision a unifying abstract framework based on the idea of a Model Selection Triple (MST) -- a combination of choices for FE, AS, and PT. While a large body of work in ML focuses on various theoretical aspects of model selection, in practice, analysts typically use an iterative exploratory process that combines the ML techniques and their domain-specific expertise. Nevertheless, this iterative process has structure. We divide it into three phases -- Steering, Execution, and Consumption. In the Steering phase, analysts specify the precise MST they want to explore. The Execution phase runs the MST and obtains the results. In the Consumption phase, analysts post-process the results to steer the next iteration. They then modify their MST and iterate. Overall, the model selection process is iterative and exploratory because the space of MSTs is usually infinite and analysts have no way of telling a priori which MST will yield "satisfactory" accuracy and/or insights. Alas, most existing ML systems force analysts to explore only one MST per iteration, which overburdens the analysts and wastes system time.

We envision a new class of analytics systems we call Model Selection Management Systems (MSMS) that enables analysts to explore a logically related set of MSTs per iteration. The abstraction of sets of MSTs acts as a "narrow" waist that helps us decouple the higher layers (how to specify them) from the lower layers (how to implement and optimize them) in order to make it easier to build an MSMS. Moreover, we explain how some existing research systems such as Columbus, MLBase, etc., can be subsumed by our unified framework. However, realizing our vision immediately poses the challenge of how to enable analysts to easily specify sets of MSTs and how to handle them efficiently. We discuss how repurposing three key ideas from database research -- declarativity, optimization, and provenance -- could help us improve the respective phase of an iteration. This can help reduce both the number of iterations and the time per iteration of the model selection process, thus making it easier and faster. However, as we identify in the paper, applying these three ideas to the model selection process raises several non-obvious research challenges:

  • What should the declarative interfaces to enable analysts to intuitively specify a set of MSTs look like? We discuss the tradeoffs in the design decisions involved.
  • What are the new system optimization opportunities that can exploit the set-oriented nature of specifying MSTs to reduce the runtime of an iteration? We discuss a slew of new optimization ideas that combine data management and ML techniques.
  • How to perform provenance management for ML to enable analysts to consume results easily and improve the iterations? We explain the need for a notion of "ML provenance" and discuss its applications.
Solving the above challenges requires new research in data processing algorithms, systems, and theory at the intersection of data management, ML, and HCI. In our recent research, we have already started tackling some of the technical questions. I hope the data management community joins us in our quest to make the end-to-end process of using machine learning for data-powered applications easier and faster!

I invite you to read our 6-page vision paper. Questions, suggestions, and comments -- critical or otherwise -- are welcome!


Vision paper.
Project webpage.
Associated survey to explain the gaps in the existing landscape of ML systems that motivated our vision.

Thursday, August 13, 2015

A poem on the Relational Data Model

We start off with a poem on the Relational Data Model, which truly transformed computing and our modern civilization!

The critical core of data applications,
Due to you, pointers we no longer chase.
Letting us get data with just declarations,
Incomparable you are in your grace.

Infusing a really revolutionary rigor,
Into a field that was fraught with pain.
Birthing a community filled with vigor,
And engaging our minds again and again.

Schools and banks, airlines and apps,
Almost all services benefit from you.
Yea, sans you, our economy might collapse,
'Tis indeed true and not just ballyhoo!

Rejecting you, some chose to walk away,
Hoping against hope you are not needed.
Repeating history and suffering dismay,
They crawled back to you and conceded.

Of all models, you, we did select,
To join forces with you was sensible.
For no matter what future we project,
You seem, in aggregate, indispensable.

Hello World!

So, I am finally joining the bandwagon of maintaining a research blog! Since my research interests revolve around data, this blog will carry articles and other posts that have something to do with data - how we store, query, analyze, understand, and more generally, use (and abuse!) data. From a "traditional" perspective, this would span the topics of data management (including databases), statistical machine learning, optimization, systems, and HCI. For those interested in tech blogs about data, there are many other excellent blogs as well (for example, this, this, and this). My posts are probably going to be a bit more eclectic. I expect to have five major kinds of posts (in no particular order):

1. Articles about my research in a language that is simpler/less technical than my papers.

2. Articles about other people's research that I find interesting/important.

3. Articles about general data applications and software that I find interesting/important.

4. Rants and raves about research events or my general experience as a researcher.

5. Poems about research in this field! :)

Better late than never to blog about data in this new golden age of data management and analysis! Feedback and discussions on my posts, online and offline, are always welcome. Thanks!