The Data Dossier: The Culture Wars of Data Management

Scientists are people. Despite their fervent protestations of objectivity, all scientists are prone to conflating subjective experience with objective truth, at least once in a while. Einstein himself famously resisted the probabilistic interpretation of quantum mechanics, even though his own early work had helped lay its foundations. The "whole truth" is an elusive and enigmatic beast. Consequently, people (including scientists) often fight "tribal culture wars" under the illusion that they (and those who agree with them) are right and all others are wrong. Perhaps nothing captures this issue more eloquently than the timeless parable of the blind men and the elephant. This issue arises over and over again in all fields of human endeavor, including the sciences. In this post, I examine how this issue is affecting the "data management" research community. My motivation is to offer a contemplative analysis of the state of affairs, especially for the benefit of students and new researchers, not to prescribe definitive solutions.

This post is partly inspired by Pedro Domingos's article, "The Five Tribes of Machine Learning." He explains why there are deep "tribal" divisions within the machine learning (ML) community in terms of both the topics they pursue and their research culture. For instance, explosive battles often take place between the "Bayesians" (think statistics) and the "Connectionists" (think deep learning), two of ML's oldest tribes.

The data management field has similar divisions too, albeit not as strongly partitioned by tribes. It all came to a head last year when some prominent members of this community issued an ultimatum to SIGMOD and VLDB, the top data management research conferences. Criticizing the repeated unfair treatment of systems-oriented papers by reviewers (e.g., wrongly deeming a lack of theoretical analysis as low "technical novelty" or "technical depth," or crass dismissive statements such as "just engineering"), they threatened to fork off and start a new systems-oriented data management conference. Such brinkmanship is neither new nor unique to this field. In fact, breakups often occur on account of "unfairness," e.g., IMC splitting from SIGCOMM, ICLR vs CVPR, MobiSys vs MobiCom; the list is long indeed! From my own experience, I agree with the core claim of the ultimatum. Apparently, the SIGMOD/VLDB leadership also agreed. The PC chairs of VLDB then emailed the whole PC (which I was on) with guidelines on how to fairly evaluate systems papers. Problem solved? Haha, no.

Why write this post? Who is it for?

It may appear puerile to put labels on entire "research cultures." But such cultural gaps are ubiquitous, even in physics. Acknowledging and understanding such gaps is the first step to either mitigating them or making peace with them. Without such understanding, I reckon the data management community might be on a death spiral to more fragmentation (or maybe that is inevitable due to other factors such as size, who knows?). I see such labels as a way of acknowledging the different cultures, tolerating them, and even celebrating the intellectual diversity. More importantly, it is crucial for students and new researchers to be aware of such gaps and why they exist. This could help them make more sense of paper rejections or even negative in-person interactions with other researchers. From my experience, many established researchers transcend the cultural gaps and often bridge multiple cultures. Even among those that stay within one culture, many are not antagonistic to the others. I just wish everyone would be more like these folks. I hope this post nudges more people, especially newcomers, towards that ideal world.

The Four Canonical Cultures (as I see it)

Based on my interactions with dozens of researchers, reading hundreds of papers, and reviewing for SIGMOD/VLDB for the last two years, I see at least 4 "canonical" cultures (to overload a classic DB term!). Unlike Pedro's tribes of ML, which are partitioned by areas (e.g., Connectionists study neural networks), it is misleading to delineate the divisions within the data management community by areas/topics because many areas, e.g., query optimization, have all four cultures (or at least more than one) represented. Thus, my split is based on the inherent expectations of each research culture, their methodologies, the non-data management fields they get inspiration from (including technical vocabulary and "hammers"), and the style and content of their papers. Such differences are more insidious than areas, which is perhaps why the "culture wars" of data management are more damaging than the tribal wars of ML. These cultures are not mutually exclusive; in fact, many researchers have successfully hybridized these cultures and all 2^4 possible combinations are represented at SIGMOD/VLDB to varying extents. Anyway, here are the 4 canonical cultures I see:

Formalist
Systemist
Heuristicist
Humanist

The Formalist Culture

The Formalists bring a theory-oriented view to data management. Their culture has been influential since the founding years of the data management field, going back to Ted Codd himself. Formalists draw inspiration from theoretical fields such as logic, math, statistics, and theoretical CS, especially combinatorial optimization, formal methods/PL theory, and ML theory. They seek a high degree of formal rigor in the characterization and communication of ideas, especially in papers. Common elements in Formalist papers are non-trivial theorems and proofs, typically of the "hardness" of problems (couched in the language of computational complexity theory), approximation or randomized algorithms, and formal analyses of the complexity and quality of the algorithms. Many pride themselves on the theoretical sophistication of their results.

Some Formalists also publish in non-data management venues such as SODA, LICS, and NIPS, but curiously, not so much in STOC/FOCS. Database theory (think PODS/ICDT) is a major part of this culture, but not all Formalists are database theoreticians, i.e., some publish regularly at SIGMOD/VLDB too. Some Formalists have a reputation for expecting a high degree of rigor in all data management papers.

The Systemist Culture

The Systemists bring a systems-oriented view to data management. Their culture has also been influential since the founding years. Many pioneers of this field, including Charlie Bachman, Michael Stonebraker, Jim Gray, and David DeWitt, can be considered Systemists. They draw inspiration from the computer systems fields, broadly construed (operating/distributed systems, compilers and software engineering, networking, computer architecture, etc.). Interestingly, their ideas have reshaped some of those other fields, e.g., the concept of ACID and distributed systems. Most Systemists seek a high degree of real-world practicality in ideas and care much less (compared to Formalists) for rigor in data management papers. Some even dislike seeing theorems. Common elements in Systemist papers are system architecture diagrams, discussions of system design decisions, and analyses of system trade-offs, typically with extensive experiments. They typically use system runtime or throughput metrics on synthetic workloads and datasets (e.g., TPC benchmarks). Scalability is often a key concern. Many pride themselves on the practical adoption of their systems, including via startups they found.

Some Systemists and Formalists have a reputation for waging "wars" against each other's culture or even other fields. The war between Bachman and Codd on the (de)merits of the relational data model is one example, as is the war between Stonebraker and Ullman on whether database theory research is needed at all. More recently, some Systemists have waged wars against the distributed systems field, e.g., this now-infamous blog post. Curiously, most Systemists tend not to publish in non-data management computer systems venues such as NSDI/OSDI/SOSP/ASPLOS/ISCA/etc.

The Heuristicist Culture

The Heuristicists draw inspiration from the fields of artificial intelligence (ML, natural language processing, etc.), but also logic, math, statistics, and theoretical CS. This culture started growing mostly from the 1990s. Popular topics where this culture is well-represented are data mining, Web/text data analysis, data cleaning, and data integration. Heuristicists design practical heuristic algorithms, typically without a high degree of rigor in the exploration of ideas (compared to Formalists) and without deep systems-oriented trade-off analyses (compared to Systemists). But there is substantial diversity in this culture. Many papers use rigor in communication and thus, seem closer to the Formalists. Others focus on complex algorithmic architectures and thus, seem closer to the Systemists. Many researchers bridge this culture with the Formalists, especially on the topic of data integration. But many problems in such topics are so "messy" that theoretical work alone goes only so far. So, many researchers also bridge this culture with the Systemists. Common elements in Heuristicist papers are math notation but no non-trivial theorems, large algorithm boxes or diagrams, and extensive experiments, typically with real-world datasets and workloads. They typically focus on quality metrics such as accuracy, precision, recall, AuROC, etc. Runtime or scalability metrics are less common.

Data cleaning and data integration have become central themes in data management research. But a large chunk of data mining researchers broke up with the SIGMOD/VLDB community and joined the newly created SIGKDD community along with applied AI/ML researchers. However, many data mining researchers still publish routinely at SIGMOD/VLDB along with SIGKDD/ICDM/etc. Similarly, many Web/text data researchers also publish at both SIGMOD/VLDB and WWW/AAAI/etc.

The Humanist Culture

The Humanist culture has also been around since the early years (e.g., query-by-example), but this culture too started growing mostly from the 1990s. This culture draws inspiration from the fields of human-computer interaction, programming languages, and cognitive science, but also parts of theoretical CS and computer systems. This culture puts the humans who work with data at the center of the research world. Topics in which this culture is well-represented are new abstract programming/computation models for managing/processing data, new query interfaces, interactive data exploration/analysis tools, and data visualization. Common elements in Humanist papers are terms such as usability and productivity, user studies, and even links to demo videos of the tools they build. Remarkably, many Humanist papers overlap with one or more of the other cultures, which perhaps makes this culture the least dogmatic and most eclectic. Humanist papers often have multiple metrics, including quality, system runtime, and human effort (measured with "lines of code", human interaction time, interviews, etc.). Many Humanists also publish at CHI/UIST.

This culture seems to be growing, primarily by attracting people from the other cultures. Working on human-centered problems has a long history in this field, going back to Codd himself. Many researchers often seem to forget the fact that the relational model itself was created not to solve a "hard" theoretical problem, improve a system's performance, or design a heuristic algorithm, but rather to improve the productivity of humans that queried structured data.

Culture Wars and Extremism

By "wars," I mean the endless intellectual tussle for suzerainty over the research field. The most common way such wars cause damage is unfair negative evaluations of research papers because one is unable to see the merits of a paper from a different culture or a cultural hybrid. Many "extremist" Formalists and Heuristicists (and Formalist+Heuristicists) often dismiss many Systemist papers as "just engineering." Many extremist Formalists also dismiss many Heuristicist papers as "too ad hoc" or "too flaky." Many extremist Systemists dismiss many Formalist papers as "not practical" or "too mathy." Many extremist Formalists and Systemists (and Formalist+Systemists) dismiss many Humanist papers as "fluff" and "soft science." Many extremist Formalists, Heuristicists, and Systemists (and many cultural hybrids) also often dismiss many Systemist or Heuristicist papers that explore new important problems and propose initial solutions as "too straightforward" or "not novel" by conflating simplicity, an oft-exalted virtue in the real world, with a lack of novelty. Overall, one often ends up wrongly judging a fish by its ability to climb a tree. Sadly, such wars sometimes force researchers to add needlessly contrived content to papers just to appease such extremists.

Such tribal culture wars and extremism, whether deliberate or not, detract from fair and honest critiques that actually help advance the science. Personally, I find such wars ridiculous. So, please allow me to amuse myself (and hopefully, you) by ridiculing the extremist bigots of each culture that ridiculously diss other cultures and glorify only their own using the provocative meme of "X as seen by Y" (see this one about programmer wars first, if you do not know this meme). Since a 16x16 matrix is too big for me to construct, I restrict myself to the canonical 4x4. :) Hopefully, this will help students/researchers realize that they are not alone in getting caustic comments. I also hope this will cause people to think twice before engaging in such ridiculous wars themselves in the future. Behold, I present to you the culture wars of data management!

(Optional) Explanatory Caption (might be painfully obvious for some). In row-major order from the top-left. First row: Einstein (geniuses, of course), FSM (mushy false gods), Crashed truck (what a disaster!), and Big fat hacker/engineer. Second row: Lucius Malfoy (snooty pure-blood evil wizards destined to be defeated), Justice League (superheroes saving humans), Awkward nerd, and Zealots (close-minded bigots serving the devil). Third row: Kung Fu Panda 3 (googly-eyed admiration), Pretend-Superman, Star Trek (the future beckons), and Airplane pilots (so many moving parts!). Fourth row: Irrelevant contrived junk peddlers, Kids with fun toys (so cute!), Calvin-and-Hobbes's games (so naive!), and The One (the prophetic savior).

Making matters worse are "civil wars" within cultures, especially among the Systemists. Some Systemists are so obsessed with relational DBMSs that they pooh-pooh any new data systems. Such insularity has caused much grief in the last decade, especially due to the rise of "Big Data" systems (think MapReduce/Hadoop or Spark) and "NoSQL" systems (think BigTable) from the distributed systems field. Of course, most Systemists acknowledge their mistakes and change their minds over time, but not without causing serious damage to the field. There is also a mini civil war among the Formalists between the logic and discrete math-oriented sub-culture and the statistics/ML theory-oriented sub-culture. With so many culture wars going on, I fear research on "data management for ML"/"ML systems," which is the area my own research focuses on, will be driven away from SIGMOD/VLDB. This area is increasingly considered important for the wider CS landscape and thus, for the data management community too. I will be raising these issues (among others) at a panel discussion at the DEEM Workshop at SIGMOD 2018. Rest assured, I will return to blog about how that goes.

Should the Four Cultures Stick Together or Break Up?

The data management/database/data systems community is not a "monoculture"; it never was and it never will be (almost surely). As a vertical slice of CS, it is "multicultural" and will always have high intellectual diversity. The four cultures may be irreconcilable, but they are complementary and can co-exist. The benefits of inter-cultural tolerance, cultural hybridization, and trans-cultural work are clear: cross-pollination of problems and ideas, trans-cultural partnerships and collaboration, infusion of ideas from each culture's favored non-data management fields, export of ideas from one culture via another to another CS field or non-CS disciplines, and so on. This sort of inter-cultural amity and partnership was/is the norm at the Database Group of the University of Wisconsin-Madison, where I went to graduate school, and the Database Lab of the University of California, San Diego, where I am on the faculty now, and many other database groups. There is a long tradition of research cultural hybridization and trans-cultural research that is practiced and even celebrated by many researchers, senior and junior alike.

Yet, to claim such culture wars do not exist is to be the proverbial ostrich that buries its head in the sand, while pretending we can all just sing kumbaya, pat ourselves on our backs, and get along with stiff upper lips is to be utterly naive. We have to acknowledge and accommodate the vast cultural gaps amicably lest the centrifugal forces caused by unfair research evaluation and tribal cultural wars spin out of control. This form of multiculturalism is a source of strength for SIGMOD/VLDB and sets them apart from related communities such as STOC, OSDI, ISCA, SIGKDD, and CHI. Of course, it will also defeat the point of inter-cultural amity if everyone is expected to work across cultures; it should be left to each researcher to pick a culture or cultural hybrid for their work based on their skills and taste.

Instead of trying to shame or bully any of the cultural groups into conformity to another culture, all groups should practice mutual respect. An analogy I can draw is with movie genres: one cannot coerce a person who only likes arthouse dramas to like big-budget blockbusters or vice versa. All 4 cultures and all cultural hybrids bring something different and valuable to the table of data management research. It will be nothing short of a catastrophe for SIGMOD/VLDB if the Systemists (or indeed, any other group) leave. While such a pyrrhic war has been averted for now, I fear it might turn into a Cold War that poisons the field further. Unfortunately, the current peer review processes of SIGMOD/VLDB are broken and are amplifying the centrifugal forces. Thankfully, the leadership is aware of this issue and is working to fix it soon. I am reasonably confident they will course-correct, but I doubt it will happen quickly or easily.

All that said, I could be wrong and perhaps the cultures are better off breaking up. After all, attending a conference is no longer needed for following new research, thanks to the Web. Cutting-edge "data management" work is no longer published at SIGMOD/VLDB only; NSDI, OSDI, SIGKDD, NIPS, and more share the pie. Inter-cultural research exchange can happen across communities too. There are precedents for both area-based and culture-based splits/spinoffs: PODS (for database theory), CIDR (for "visiony" data systems), SoCC (for cloud computing), and SysML (for ML systems). Who can be 100% sure it is "bad" to have more splits/spinoffs?

Even if SIGMOD/VLDB remain multicultural, should they become more "federated" instead of "unitary" in terms of the research tracks? It is clearly unfair to have a random SODA (or SIGKDD) reviewer evaluate a random OSDI (or CHI) paper and vice versa--and yet, this is roughly the kind of unfairness caused by the culture wars that continue unabated within the SIGMOD/VLDB community. Is it not a form of "emotional abuse" of students to subject them to such toxic subjective culture wars instead of a resolute focus on the objective (de)merits of ideas? Does the SIGMOD/VLDB community really want to cut such a sorry figure against NSDI/OSDI/SOSP or NIPS/ICML or CHI or other communities that compete for bright students? I think the SIGMOD/VLDB community should openly and honestly tackle such contentious questions without acting like this or this. But I do not know the best avenue for spurring such conversations: a short workshop, a townhall or panel discussion at SIGMOD/VLDB, or something else, who knows?

Concluding Remarks

Scientists are people. It is crucial to practice objectivity and apply solid quality filters for research evaluation. But it is also crucial to be aware of one's own blind spots. Unknown unknowns are also impossible to fathom for anyone, no matter how knowledgeable--none of the blind men can say anything about parts of the elephant they cannot reach. This is all the more true for scientists at the cutting edge of the ever-expanding universe of knowledge. Also crucial is respecting subjective differences on intellectual practice, since the leap from objective facts to judgmental interpretation is far too often a leap of faith guided by subjective experience--even if two of the blind men grasp the same tail fur, one might call it soft on his hand and the other, coarse. Perhaps I am being a naive idealist, but I think it is imperative that we all inject ourselves with large doses of scientific humility, curiosity, civility, empathy, and magnanimity. Watching this video regularly might help! :) Unless there is credible verifiable evidence to the contrary, every scientist must be willing to admit that they could be wrong. Every one of us is (intellectually speaking) "blind" in some way, and we will remain blind no matter how much we learn. But we can all be less blind by talking with others that have an earnest world-view that is different from ours, not talking down to them.

PS: I do not claim that these 4 cultures are the only ones in data management. I am sure as the field keeps growing, we might see the addition of new cultures or the reorganization of existing ones. Perhaps I will write another post on this topic 20 years from now! If you have any thoughts on this topic or this post, please do leave your comments below.

ACKs: Special thanks to Julian McAuley for being a sounding board for this post. I also thank the following people (in alphabetical order of surnames) for giving me feedback on this post, including suggestions that strengthened it: Peter Bailis, Spyros Blanas, Vijay Chidambaram, Joe Hellerstein, Paris Koutris, Sam Madden, Yannis Papakonstantinou, Jignesh Patel, Christopher Re, Victor Vianu, Eugene Wu, and a few others that did not want to be acknowledged. My ACKs do not necessarily mean they endorse any of the opinions expressed in this post, which are my own.

Reactions (Contributed Paragraphs)

The post elicited a wide range of reactions from data management researchers. Some researchers have kindly contributed a paragraph to register their reactions here, both with and without attribution. I expect to have more contributions in due course. If you do not want to comment publicly, feel free to email me your reaction so that I can add it here as an anonymous contribution. I hope all this contributes to stirring more conversations on this important topic.

Jignesh Patel:
Interesting! Note Systemists don't abhor theory, they abhor theory for the sake of theory. Classic systems papers have some formalism, but just what is required to understand the system implication (think of the classic Gray locking paper). I don't think Systemists wage wars. Like true system people, they are quick to identify practical problems, and speak up. Having said all that, you do bring out an important problem of tribal wars that is killing the community. Also, we actually have a really nice database systems community where the senior folks actually take criticism quite well. Arguments are welcome! Don't know for sure about the theory folks, but I think they are pretty open-minded too. Overall, we have a pretty nice community, and largely the right things happen in the long run. Ok--I'm tenured and I'm not as pessimistic as others are, perhaps as a result?

Anonymous:
I liked how you have broken up the community into 4 tribes. I agree with most of it. However, as higher-level feedback, here's what I think: I feel like this is perhaps triggered by bad/unfair reviews, which every one of us has dealt with. However, I personally see the poor reviewing issue (i.e., lack of tolerance) as only a symptom of the bigger problem, which probably has a lot less to do with "culture" and more to do with a "mafia" mentality in the community. Awards/Recognitions/Opportunities are not distributed based on merits. Rather, advisors/friends look out for their own advisees/friends. For example, the more famous the advisor, the more opportunities for their students. That means most everyone else is simply an "academic orphan" in the community. Not all but the majority of senior people in the community who are in control of how recognitions/opportunities are distributed do NOT act on what they preach. In public, most of them talk about the importance of "impactful work" but when you read their own letters they are doing nothing more than "bean counting" when it comes to assessing impact. The traditional data management community is rapidly losing its relevance, and as such, everyone is trying to come up with a definition of what's a fake problem and what's real/worthy. How does this relate to low-quality reviews? The data management community is a hostile environment due to its contentious and unfair inner workings. In a hostile environment individuals aren't acting rationally, let alone fairly. For example, even PC chairs and area chairs are not always people who are deemed by the majority of the community as reasonable or even insightful.

2 comments:

Unknown, April 15, 2018 at 5:44 AM
There is zero risk of a high-diversity, high-churn part of area chairs and reviewers creating a more balanced research curation process. We've achieved that at major conferences in our field (e.g. COLING 2018, ACL 2017) - and quickly - but it takes will and boldness. If those can't be found, then the sleepwalking continues.


Friday, April 13, 2018
