Friday, May 14, 2021

On Rejections in Academia


"Within light there is darkness, but do not try to understand that darkness;
Within darkness there is light, but do not look for that light."
-- Sandokai


Having been an academic for almost 5 years now, I find one of the most fascinating things about academia to be how rarely people talk about the rejections and failures they faced in their careers. Paradoxically, almost everyone says they faced rejections at least as frequently as (if not more frequently than) acceptances on papers, proposals, and honors/awards. And yet, most academic CVs have pages upon pages of publications, grants, and honors/awards and not a word on the rejections. I fear this status quo may be perpetuating survivorship bias and/or impostor syndrome, in turn exacerbating mental health issues endemic in academia, especially among people from underrepresented and/or marginalized groups. I have been speaking about these issues at UCSD, e.g., at this oSTEM panel discussion. I hope this post, and my new CV addendum described below, is a step toward changing this status quo.

In my opinion, acceptances and rejections are two sides of the same coin of intellectual progress, even though the latter are clearly painful. As long as I am confident that an evaluation process is likely fair and scrupulous, I take rejections in my stride (but if it is not, I do not sit silently either!). Likewise, competition is an inevitable reality in all walks of life. Interestingly, "curriculum vitae" literally means "course of life" in Latin. To me, success and failure are both integral parts of life, and equanimity is crucial for self-growth. So, why then do academics stay silent on rejections in their CVs? How do we find out the reason(s)? Run a basic Twitter poll, of course! :) About 5 in 7 picked "waste of time/space" as their reason. The rest picked "embarrassment", "fear", or "something else", but I did not see any specific reasons outlined for such fears or otherwise. The "waste" interpretation is likely rooted in academic incentive structures explicitly valuing only acceptances and ignoring rejections.

Of course, detailed lists of paper rejections may be boring. But summary statistics on rejections are certainly interesting, feasible, and IMO appropriate to share in a CV, the "living" documentation of one's professional life. Thankfully, I already maintain spreadsheets with details on my paper submissions and proposal/award applications. With simple spreadsheet formulae, I created the following summaries. I have added them as an addendum in my CV. I will update these periodically, just like the rest of my CV. I hope this data "schema" is helpful for more people considering starting such a practice.
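For anyone curious about the mechanics, here is a minimal sketch of how such summaries can also be tallied programmatically, assuming the submission log is exported as a CSV with hypothetical columns paper_id, year, venue, and decision (any log with these fields would work):

    # Toy sketch: tally paper decision statistics from a CSV export of a submission log.
    # The column names (paper_id, year, venue, decision) are hypothetical placeholders.
    import pandas as pd

    subs = pd.read_csv("paper_submissions.csv")

    # Decisions per year: counts of Accept / Revise / Reject.
    per_year = subs.pivot_table(index="year", columns="decision",
                                values="paper_id", aggfunc="count", fill_value=0)
    print(per_year)

    # Histogram of attempts per eventually accepted paper.
    attempts_per_paper = subs.groupby("paper_id").size()
    accepted_ids = subs.loc[subs["decision"] == "Accept", "paper_id"].unique()
    print(attempts_per_paper.loc[accepted_ids].value_counts().sort_index())

The same two tallies--decisions per year and attempts per eventually accepted paper--underlie the summaries below.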

NB: For the sake of simplicity, "paper" below includes all forms of peer-reviewed publications: full research papers at conferences, demo papers, workshop papers, posters/abstracts, and journal papers. An (eventually) accepted paper or awarded proposal will be counted more than once across these statistics if it was not accepted or awarded directly in the first attempt. I will also note that all but one of my accepted VLDB / SIGMOD papers went through a revision decision.


Summary Statistics on Paper Decisions


Summary Statistics on Proposal (Grant / Gift) Decisions


Rejections on Major Competitive Honors or Awards


2021:
Nonprofit: Sloan Research Fellowship
Industry: Google Research Scholar Award, Microsoft Research Faculty Fellowship, VMware Systems Research Award

2020:
Industry: Amazon Research Award, Sony Focused Research Award

2019:
Federal: NSF CAREER Award
Industry: Bloomberg Data Science Research Grant, Google Faculty Research Award, Microsoft Investigator Fellowship, Microsoft Research Faculty Fellowship, NetApp Faculty Fellowship

2018:
Federal: NSF CRII Award
Industry: Amazon Research Award, Bloomberg Data Science Research Grant, Facebook Research Award

2017:
Federal: NSF CAREER Award


Some Salient Aspects from the Above Summaries


To be honest, I did not expect to glean much from this exercise going in, except perhaps for "Ugh, why am I re-living all these rejections?!" :-/ Instead, many salient aspects stood out for me. Make of these what you want.

1) I was pleasantly surprised to see paper Rejects were not as dominant as I thought, although they were not far behind Accepts each year. Even if Rejects were dominant, I think seeing Accepts alongside like this will help me retain a sense of proportion.

2) Related to the above, it appears Revise decisions at VLDB / SIGMOD consistently played a positive role in mitigating the stark Accept-Reject dichotomy. As I said earlier, all but one of my accepts at VLDB/SIGMOD came after a revision. In fact, Sam Madden had highlighted this point in his comment on my recent blog post on the DB venues. So, yeah, I empathize with my non-DB area CS peers still fighting for their "Right to Revision", e.g., see this blog post. ¡Viva la Revisión! :)

3) Given the above, it is not surprising to me that my students (and I) preferred/prefer to submit our research to VLDB / SIGMOD. That said, we did dip our toes into many nearby venues--SysML/MLSys, KDD, ICML, and SOSP--albeit with no success at any of them so far. :-/

4) I'll admit it did put a smile on my face to see that two-thirds of my papers got accepted in the first attempt itself. :) I did not tally full research papers separately in that histogram but even so that bar remains the largest. At the other extreme, 2 papers faced Reject 3 times and 1 paper faced it 4 times (!) before an eventual Accept--all still at the "top tier" (VLDB / SIGMOD). This was only possible due to the amazing perseverance of my students. While I knew that persistence is crucial, I am thankful to my students for inspiring and reinforcing my own resolve through these rejections.

5) My proposal rejection statistics are clearly worse. But many senior faculty have told me it does not get better. :-/ Proposal acceptance rates in the US are a function of the (im)balance of competition vs research budgets. Interestingly, my submission counts to industry sources are higher than to federal agencies, although both numbers are high. I faced similar rejection rates across both types of sources.

6) My long list of award/honor rejections looks sad indeed. Then again, I went after most major ones, and more volume begets more rejections (and acceptances). In hindsight, I am not sure if pursuing all of them was prudent. For instance, Sloan and MSR demand bespoke summary proposals that require non-trivial time to craft. And one needs senior nominators and letter writers. Was all that time/effort (mine, my nominators', and my letter writers') worth it given the very high rejection rates? I do not know. :-/

7) Finally, if you are wondering which of these gazillion rejections was the most painful for me, I do have an answer: NSF CAREER Award. I tweeted all about it: first rejection, second rejection, and third attempt being funded. I cannot but wonder if "pity" played a role at NSF in the end but hey, it is funding all the same. I accepted all outcomes with humility, and I grew from all experiences. This one did lead me to learn a lot about how NSF rolls. Later on, NSF course-corrected on its CAREER criteria and I tweeted about that too, expressing my joy on how it will help US academia.

Anyway, I hope the above conversation on rejections, my summary statistics, and the highlights I spoke about were interesting and/or useful to you. I recognize some of it may come across as "humble brag" but hey, I will take humble bragging any day vs moping about my rejections. :) Feel free to let me know in the comments section if you have any thoughts/questions on all the above or have your own experiences to share on this whole topic.

Monday, March 8, 2021

The End of the DB Culture Wars and the New Boom

About 3 years ago I wrote this "blockbuster" blog post about what I saw as research "culture wars" within the database/data management/data systems (DB) community. If you have not read that post, please go read it first--I promise you won't be bored. :) I am now back to re-examine the state of affairs. As before, my motivation is to offer a contemplative analysis, especially for the benefit of students and new researchers, not prescribe conclusive takes.

TL;DR: My post was read in a wide array of research communities. The reactions were overwhelmingly affirmative. In just 2-3 years, SIGMOD/VLDB made praiseworthy and clinically precise changes to their peer review processes to tackle many issues, including the ones I spoke about. To my own (pleasant) surprise, I now think the grand culture wars of DB Land have ended. Not only did SIGMOD/VLDB not disintegrate, I believe they breathed new life into themselves to remain go-to venues for data-intensive computing. I explain some of the changes that I believe were among the most impactful. Of course, there are always more avenues for continual improvement on deeper quality aspects of reviews. I also offer my take on how the fast-growing venue, MLSys, is shaping up. Finally, as I had suspected in my previous post, the set of canonical DB cultures has grown to add a new member: The Societalists.


A Brief History of DB Culture Wars and My Blog Post


Circa 2017, the DB research community faced an identity crisis on "unfair" treatment of "systemsy" work at SIGMOD/VLDB. Thanks to some strategic brinkmanship by some leading names in DB systems research, the issue came to the attention of the wider DB research community. There were fears of a new Balkanization in DB land. Into this vexed territory walked I, a new faculty member at UC San Diego, just 1.5 years in. Concerned by the prospect of a prolonged civil war, I wrote my blog post to offer my unvarnished take on what I saw as the heart of the issue: lack of appreciation in peer review for the inherent multiculturalism of the DB area.

After much thought, I systematically decomposed "core" DB research into 4 canonical cultures. These are not based on the topics they study. Rather, they are set apart by their main progress metrics, their methodologies, technical vocabularies, nearest neighbors among non-DB CS areas, and overall style/content of their papers.


As I noted, hybridization among these cultures has always been common in the DB area and that continues to be strong. For instance, Systemist+Formalist and Systemist+Humanist hybrids are popular. However, SIGMOD/VLDB had sleepwalked into a peer review situation where people were made to review papers across cultures they were not trained to evaluate fairly. The analogy I gave was this: asking a random SODA (or KDD) reviewer to evaluate a random OSDI (or CHI) paper. I ran my post by multiple DB folks, both junior and senior. Many shared most of the concerns.

Curiously, some junior folks did caution me that my post may land me in trouble with the "powers that be." I brushed that off and went public anyway. :) In hindsight, I think I did so in part because the issue was so pressing and because my message was (IMO) important for raising awareness.

I was blown away by the reactions I got after publishing and tweeting about my post. Many young researchers and students in the DB area affirmed my post's message. Folks from many nearby areas such as NLP, Data Mining, Semantic Web, and HCI emailed me to thank me for writing it. Apparently, their areas had faced the same issue. They recalled peer review mechanisms their areas' venues had adopted to mitigate this issue. Remarkably, I even got affirmative emails from folks in disciplines far away from CS, including biology, public health, and physics!

Coincidentally, I was invited to speak at the SIGMOD 2018 New Researcher Symposium. I made this topic a key part of my talk. There were raised eyebrows and gasps, of course, but the message resonated. In the panel discussion, I noted how SIGMOD/VLDB are no longer the sole center of the data universe and how they must compete well against other areas' venues, e.g., NSDI and NeurIPS. Interestingly, the head of SIGMOD was at the panel too and addressed the issues head-on. This matter was also discussed at the SIGMOD business lunch. Clearly, SIGMOD/VLDB were seized of this matter. That was encouraging to me.

Fascinatingly, never once did I hear from anyone that what I wrote hurt the "reputation" of SIGMOD/VLDB. I was still invited to SIGMOD/VLDB PCs every year. Indeed, one of my ex-advisors, Jignesh Patel, had offered a public comment on my post that the DB community has always welcomed constructive criticism. Clearly, he was proven right.


Key Changes to SIGMOD/VLDB Peer Review Processes


Over the last few years, I got multiple vantage points to observe the PC process changes introduced by SIGMOD/VLDB and assess their impact. The first vantage point is as a reviewer: I served on the PCs of both SIGMOD and VLDB back-to-back for all of the last 3-4 years. The second is as an author: I have been consistently submitting and publishing at both venues all these years. The third is as an Associate Editor (metareviewer) for VLDB'21. I did not play a role in any decision making on these changes though. I now summarize the key changes that I believe were among the most helpful. I do not have access to objective data; my take is based on the (hundreds of) examples I got to see.

  • Toronto Paper Matching System (TPMS):
    In the past, papers were matched to reviewers primarily via manual bidding based on titles/abstracts, an error-prone and unscalable process. TPMS directly tackled the most notorious issue I saw with that flawed matching process: mismatch between the expertise/culture of a reviewer and the intellectual merits of a given paper. The pertinence and technical depth of reviews went up significantly. (A toy sketch of such automated affinity scoring appears after this list.)

  • Naming the Research Cultures:
    In the past, reviewers might misunderstand and fight over papers that were hybrids or from cultures other than their own. SIGMOD gave authors the option to explicitly label their paper as Systemist. VLDB went further and delineated multiple "flavors" of papers, straddling "foundations and algorithms", "systems", and "information architectures". Reviewers are also asked to identify these so that during discussion they can ensure that a fish is not judged by its ability to climb a tree.

  • More Policing and Power to Metareviewers:
    Any community needs policing to lay down the law. Peer review is no exception. In the past, I have seen uncivil terms like "marketing BS" and culture war terms like "just engineering" and "too much math" in reviews of my papers. SIGMOD/VLDB heads and PC Chairs of the last few years went the extra mile in creating more thorough and stricter guidelines for reviewers. Metareviewers/Associate Editors were empowered to demand edits to reviews if needed. The civility, fairness, and technical focus of reviews went up significantly.

    Of course, deeper quality issues will likely never disappear fully. SIGMOD/VLDB can benefit from continual monitoring and refinement of these aspects. There is always room for improvement. Innovative ideas on this front must be welcomed. But having submitted papers to SOSP, MLSys, ICML, and KDD, it is clear to me that no venue has solved this problem fully. Different research communities must be open-minded to learn from each others' processes.
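As promised above, here is a toy sketch of automated reviewer-paper affinity scoring in the spirit of TPMS. To be clear, this is not TPMS's actual algorithm; it only illustrates the general idea of scoring a paper's abstract against text drawn from each reviewer's past papers and ranking reviewers by similarity. All of the names and strings below are made up.

    # Toy reviewer-paper affinity scoring (illustrative only; not TPMS's actual method).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical reviewer "profiles": concatenated text from their past papers.
    reviewer_profiles = {
        "Reviewer A": "query optimization transactions storage engines indexing",
        "Reviewer B": "machine learning systems model training feature engineering",
    }
    paper_abstract = "a system for scalable feature engineering over relational data for ML"

    # Vectorize profiles and the abstract together; score reviewers by cosine similarity.
    docs = list(reviewer_profiles.values()) + [paper_abstract]
    tfidf = TfidfVectorizer().fit_transform(docs)
    affinities = cosine_similarity(tfidf[:-1], tfidf[-1:]).ravel()
    for reviewer, score in sorted(zip(reviewer_profiles, affinities), key=lambda p: -p[1]):
        print(f"{reviewer}: {score:.3f}")

My understanding is that TPMS builds far richer profiles from reviewers' actual publications, but the core intuition of complementing self-reported bids with computed affinities is the same.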

Amazingly, the above changes were achieved without major surgery to the PC structure. Both SIGMOD and VLDB retained a monolithic PC instead of a federated approach for the different cultures, which some other areas' venues follow. In cases of contentious or low-confidence reviews, extra reviewers were added elastically. Full PC meetings, which some other areas' venues hold, turned out not to be really needed once such extra oversight was added. But as a disclaimer, I have not seen other areas' venues as an insider, except for serving on the PC of MLSys/SysML twice. I now think full PC meetings are likely overkill, especially for multicultural areas like DB. That said, I may not know of peculiar issues in other areas that make them employ federated PCs, full PC meetings, etc.

None of the above would have been possible without the phenomenal efforts of the PC chairs of SIGMOD/VLDB over the last few years and the executive boards. I got to interact often with quite a few PC chairs. I was amazed by how much work it was and how deeply they cared about fixing what needed to be fixed for the benefit of all stakeholders: authors, reviewers, and metareviewers.


Healthy Competition and Peacetime DB Boom


So, why did SIGMOD/VLDB evolve so rapidly? Well, apart from the banal deontological answer of "right thing to do" I contend there is also a fun utilitarian answer: competition! :) Both Darwinian evolution and Capitalism 101 teach us that healthy competition is helpful because it punishes complacency and smugness, spurs liveliness and innovation, and ultimately prevents stagnation and decay. This is literally why even the stately NSF dubs proposal calls "competitions." Research publication venues are not exempt from this powerful phenomenon.

I suspect SIGMOD/VLDB had an "uh oh" moment when SysML (now MLSys) was launched as a new venue for research at the intersection of computing systems and ML. In fact, this matter was debated at this epic panel discussion at the DEEM Workshop at SIGMOD'18. The DEEM sub-community is at the intersection of the interests of SIGMOD/VLDB and MLSys. Later I attended the first archival SysML in 2019 and wrote about my positive experience as an attendee afterward. Full disclosure: I was invited, as a PC member, to be a part of its founding whitepaper.

MLSys is undoubtedly an exciting new venue with an eclectic mix of CS areas: ML/AI, compilers, computer architecture, systems, networking, security, DB, HCI, and more, although the first 4-5 areas in this list seem more dominant there. But having served on its PC twice and submitted once, it is clear to me that it is a work in progress. It has the potential to be a top-tier venue, its stated goal. But I am uncertain how it can escape the local optimum that its closest precedent, SoCC, is stuck in. Most faculty in the systems, DB, and networking areas I know still do not view a SoCC paper as on par with an SOSP/SIGMOD/SIGCOMM paper. MLSys also seems "overrun" by the Amazons, Googles, and Facebooks of the world. While industry presence is certainly important, my sense is that the "cachet" of a research publication venue is set largely by academics based on longer-term scientific merits, not by industry peddling their product features. Unless, of course, some papers published there end up being highly influential scientifically. I guess time will tell. Anyway, healthy competition is welcome in this space. Even the systems area has new competition: JSys, modeled on PVLDB and JMLR.

Apropos competition from MLSys, SIGMOD/VLDB did not sit idly by. VLDB'21 launched its "counter-offensive" with the exciting new Scalable Data Science (SDS) category under the Research Track. Data Science, which has a broader scope than just ML, is widely regarded in the DB world and beyond as an important interdisciplinary arena for DB research to amplify its impact on the world. A key part of the SDS rationale is to attract impactful work on data management and systems for ML and Data Science. OK, full disclosure again: VLDB roped me in as an inaugural Associate Editor of SDS to help shape it for VLDB'21. And I am staying on in that role for VLDB'22 as well. SIGMOD too has something similar now. It is anyone's guess as to how all this will play out this decade. But I, for one, am certainly excited to see all these initiatives across communities!

The above said, based on the recent SIGMOD/VLDB/CIDR events I attended, it is clear to me that "peacetime" among the DB cultures has also ushered in a new research boom. Apart from the "DB for ML/Data Science" line I mentioned above, the reverse direction of "ML for DB" is also booming. Both of these lines of work are inherently cross-cultural: Systemist+Heuristicist, Formalist+Heuristicist, Systemist+Heuristicist+Humanist, etc. Cloud-native data systems, data lake systems, emerging hardware for DBMSs, etc. are all booming areas. Much of that work will also be cross-cultural, involving theory, program synthesis, applied ML, etc. All will benefit from the multiculturalism of SIGMOD/VLDB.


A Fifth Canonical Culture: The Societalists


I concluded my "culture wars" post by noting that new DB cultures may emerge. It turns out a small fifth culture was hiding in plain sight but it is now growing: The Societalists. I now explain them in a manner similar to the other cultures.

The Societalists draw inspiration from the fields of humanities and social sciences (ethics, philosophy, economics, etc.) and law/policy but also logic, math, statistics, and theoretical CS. This culture puts the well-being of society at the center of the research world, going beyond individual humans. They differ from Humanists in that their work is largely moot if the world had only one human, while Humanist work is still very useful for that lone human. Popular topics where this culture is well-represented are data privacy and security, data microeconomics, and fairness/accountability/responsibility in data-driven decision making. Societalists study phenomena/algorithms/systems that are at odds with societal values/goals and try to reorient them to align instead. Many Societalists blend this culture with another culture, especially Formalists and Heuristicists. Common elements in Societalist papers are terms such as utility, accuracy tradeoffs, ethical dilemmas, "impossibility" results, bemoaning the state of society, and occasionally, virtue signalling. :) Societalist papers, not unlike Humanist papers, sometimes have multiple metrics, including accuracy, system runtime, (new) fairness metrics, and legal compliance. Many Societalists also publish at FAccT, AAAI, and ICML/NeurIPS.

This culture is now growing, primarily by attracting people from the other DB cultures, again akin to Humanists. I believe the fast-growing awareness of the importance of diversity, equity, and inclusion issues in CS and Data Science is contributing to this growth. Also important is the societal aspiration, at least in some democracies, of ensuring that data tech, especially ML/AI but also its intersection with DB, truly benefits all of society and does not cause (inadvertent) harm to individuals and/or groups, including those based on attributes with legal non-discrimination protection.

Since the grand culture wars of DB Land are now over (hopefully!), I do not see the need to expand my 4x4 matrix to 5x5--yay! But I will summarize the 5 canonical cultures with their most "self-aggrandizing" self-perception here for my records. :)

Concluding Remarks


I am quite impressed by how rapidly SIGMOD/VLDB evolved in response to the issues they faced recently, especially the research culture wars. I believe they have now set themselves up nicely to house much of the ongoing boom in data tech research. They have shown that even "old" top-tier venues can be nimble and innovative in peer review processes. Well, VLDB already showed that a decade ago with its groundbreaking (for CS) one-shot revision and multi-deadline ideas. Some of that has influenced many other venues, top-tier and otherwise, across multiple CS areas: UbiComp, CSCW, NSDI, EuroSys, CCS, Oakland, Usenix Security, NDSS, CHES, and more. Elsewhere the debate continues, e.g., see this blog post on ML/AI venues and this one on systems venues. Of course, eternal vigilance is a must to ensure major changes do not throw the baby out with the bathwater and also to monitor for new issues. But I, for one, think it is only fitting that venues that publish science use the scientific method to experiment with potentially highly impactful new ideas for their peer review processes and assess their impact carefully. Build on ideas that work well and discard those that do not; after all, isn't this how science itself progresses?


ACKs: I would like to thank many folks who gave me feedback on this post, including suggestions that strengthened it. An incomplete list includes Jignesh Patel, Joe Hellerstein, Sam Madden, Babak Salimi, Jeff Naughton, Yannis Papakonstantinou, Victor Vianu, Sebastian Schelter, Paris Koutris, Juan Sequeda, Julian McAuley, and Vijay Chidambaram. My ACKs do not necessarily mean they endorse any of the opinions expressed in this post, which are my own. Some of them have also contributed public comments below.



Reactions (Contributed Comments)


This post too elicited a wide range of reactions from data management researchers. Some have kindly contributed public comments to register their reactions here, both with and without attribution. I hope all this contributes to continuing the conversations on this important topic of peer review processes in computing research venues.



Jignesh Patel:


This looks good. One comment is that labels are what they are--sometimes too defining and unintentionally polarizing. Having said that, two key comments:

1. Increasingly we have researchers in our community who span across even these labels. This is a really good sign.

2. The view here is somewhat narrow of what the "world of research" looks like and does not include collaborations across fields like biology and humanities. Systemists, in your terminology here, are generally the ones that bridge outside the CS field for databases, and that is quite crucial if you take a large societal perspective.

Keep on blogging.



Joe Hellerstein:


Enjoyable read!

One of the reasons I went into databases was that it's a crosscut of CS in its nature. It was driven from user needs, math foundations and system concerns from the first. And more recently into societal-scale concerns as you wisely point out. I think this makes our field more interesting and broad than many. And in some sense it underlies a lot of what you're writing about.

It has some downsides. We're sometimes "ahead of our time" and perceived as nichey, then later have to suffer through mainstream CS rediscovering our basic knowledge later. And we can get diffuse as a community, as you've worried about in the past. But on balance I agree that times are pretty good and we're lucky to have had some community leaders innovate on process. PVLDB was a big win in particular.

Another thought. One viewpoint that's perhaps missing from your assessment is the longstanding connection between industry and academia in our field. This is much more widespread in CS now, but it's deep in our academic culture and continues to be a strength. Fields that have found applicability more recently could probably learn from some things we do in both directions.

Third thought: there's an underappreciated connection between databases and programming/sw-engineering/developer experience. Jim Gray understood this: e.g., the transaction concept as a general programming contract outside the scope of database storage and querying per se. State management is at the heart of all hard programming problems, and we have more to give there than I think we talk about. This is of increasing interest to me.



Sam Madden:


I do agree that review quality is up in the DB community--reviews are longer and more detailed--although I am still frustrated by the number of reviewers insisting on pseudoformalisms in systems papers (having just struggled with one of my students to add such nonsense in response to a reviewer comment).

I agree that going away from bidding to a matching system has been important, but from my point of view the single biggest innovation has been the widespread acceptance of a frequent and extensive revision process--reviewers are much more likely to give revisions. Having (mostly unsuccessfully) submitted a bunch of ML papers in the last two years, the SIGMOD/VLDB process is MUCH better, despite those ML papers using a similar review matching system. I attribute this largely to the revision process.



Babak Salimi:


I found it fascinating the way you broke down the research cultures in the SIGMOD/VLDB community into four tribes in your old post. Also, I appreciate the fact that you acknowledged the advent of an emerging line of work in databases that is society-centered.

Ever since I started publishing at and reviewing for SIGMOD/VLDB in 2018, the key changes in the reviewing process you referred to were in place. So I can only imagine the mind-blowing scale of the culture war in the DB community prior to that time. While I can see those changes played a constructive role in deescalating the war, based on my anecdotal experiences, I think the war is far from over. As a matter of fact, each of these (now five) tribes has its own subcultures and communities. There is a huge difference between the expectations of a Societalist-Formalist reviewer and those of a Societalist-Heuristicist one. I still see reviewers butcher submissions based on these cultural differences. I see decent submissions with novel and interesting contributions treated unfairly because of lack of theoretical analysis (read it as w/o theorems); I see submissions that look into problems with little practical relevance cherished because they are full of nontrivial theorems; I see interesting foundational papers rejected because of lack of evidence of scalability/applicability; I see impactful system-oriented papers referred to as a bundle of engineering tricks or an amalgam of existing ideas.

The steps taken to mitigate the situation were crucial and effective, but I think the war is still ongoing, maybe now within these subcultures. To alleviate the problems, we need to go beyond refining the submission distribution policy. We need to educate the reviewers with regard to these cultural differences and provide them with detailed rubrics and guidelines to make the reviewing process more objective. More importantly, we need to devise a systematic approach to evaluate review quality and hold reviewers accountable for writing irresponsible and narrow-minded reviews.



Juan Sequeda:


I'm thrilled to see the changes going on at SIGMOD/VLDB, which serve as an example for other CS communities. I also agree on the Societalist as a new culture. I'm eager to see the outcomes of researchers mixing within these cultures. In particular, I'm interested in a Humanist + Societalist mix because that is going to push data management into uncharted territory.



Julian McAuley:


As an outsider I wasn't aware of the "culture wars" outside of your posts. Reviewers having different views on things or coming from different backgrounds sounds pretty typical of many communities and usually isn't a catastrophic event. Sometimes it leads to the formation of new conferences (e.g. ICLR) though to call that a "war" is a bit of a stretch. Maybe I just find the term "war" overused (e.g. "War on X") but I understand that's a term you've used previously and the intention is to cause a stir.

I mostly think twitter/blogs are quite separate from "real life". I am incredulous that anyone would warn you that blogging about something would risk your professional reputation (at least for what seems like a fairly innocuous opinion)! Twitter (for e.g.) is full of people with deliberately controversial/outrageous/inflammatory opinions, most of which are just separate from their professional life. Maybe that one Googler got fired for writing about innate gender differences or whatever but I can't really think of any academics who've suffered negative career consequences from blogs / tweets. Long story short I maybe wouldn't sound so surprised that the community didn't implode and that your career wasn't left in shambles!

Does part of this succumb to "the documentary impulse", i.e., the human desire to form narratives around events? The alternative is simply that the community shifted in interests, some new conferences formed, and reviewing practices changed a little (as they do frequently at big conferences). Again maybe I'm put off by the "war" term. Of course I'm being too dense: I realize that it's a blog and that establishing a narrative is the point. Take, e.g., NeurIPS: old conference, various splinter conferences have emerged, the review process changes every year in quite experimental ways. The makeup of the community has also changed drastically over the years. Would you say ML is undergoing or has undergone a similar culture war? If it has it's been pretty tame and never felt like an existential threat.

Likewise, what's the evidence that it's over? I'm sure in the next three years there'll be some new splinter conferences, and some more changes to the review procedure. Being in a state of constant change seems like the norm.



Anonymous:


I personally think that reviewing can still be improved a lot at most conferences, so I wouldn't try to imply that it's "fixed", but it's hard to make it much better without a lot of work for people in the community. One of the ways to improve it would be to have more reviewers per paper, but that's obviously more work for the reviewers, so they have to agree to do it. Another way is to have more senior people as reviewers.

My most informative reviews are usually from SOSP/OSDI/NSDI/SIGCOMM, who have 5-6 reviewers per paper for papers that make it past the "first round", but those conferences are also a lot of work for the people involved, and it's questionable whether this extra reviewing really improves the quality of the papers overall. Maybe a faster-and-looser approach is better for the field. In any case though, it's been great to see VLDB and SIGMOD move toward these explicit tracks and use TPMS.

Tuesday, April 9, 2019

Conferences: Strata Data and SysML 2019

I returned last week from two awesome conferences in the Bay Area: Strata Data 2019 and SysML 2019. This post is about what I learned from these trips about the world of data+ML+systems, both in research and in practice. I will also give shout outs to talks/papers I found most interesting.

Strata Data 2019


This was my first major practitioners' conference. It featured 100s of talks with 1000s of attendees from 100s of companies, spanning enterprise companies, small and large Web companies, non-profits, and other organizations. I went there to give a talk on my group's research and our open source software to accelerate ML over multi-table data. But a larger goal was to take the pulse of industrial practice and network with the data scientists, ML engineers, and other researchers who were there. Some observations about stuff I found most interesting at this conference:

  • ML has come of age:
    I already knew this, but the conference made it bluntly clear that systems for ML workloads, including for the end-to-end lifecycle, have exploded in popularity. Almost every session had plenty of talks, either from companies on how they are using ML for different business-critical tasks and what ML+data tools they use, or from ML tool developers on what their tools do. The ML hype is dead--ML, including deep learning, is now serious business on the critical path of almost every company there!

    Interestingly, this crowd seems to be well aware that ML algorithmics itself is just one part of the puzzle. The data systems infrastructure for sourcing features, building models, and deploying/monitoring models in production seems disjointed and is thus a focus of further work. But almost every company I spoke to about these issues is rolling out its own in-house platform. This leads me to my next observation.

  • ML platforms craze:
    There were many "AI startups" and also larger companies at the expo that claim to be building a "data science/ML platform" of some sort, including H2O and SAS. It did leave me wondering how reusable and pluggable the components of such platforms are with existing infrastructure, especially given the incredible heterogeneity of the datasets and ML use cases across companies.

    But one setting where automated platforms for end-to-end data preparation, feature extraction, and model building on structured data are indeed gaining traction is SalesForce. They gave a few interesting talks on Einstein, their AutoML platform that is apparently used by 1000s of their customers. Most of these are enterprise companies that cannot afford to have data scientists of their own. Thus, they give their datasets to Einstein, specify the prediction targets and some objectives, and let it build models for various common tasks such as sales forecasting, fraud detection, etc. To me, it seems SalesForce is quite a bit ahead of its rivals, certainly for structured data. They also open sourced an interesting library for automating data prep and feature extraction: TransmogrifAI.

    Another interesting open source ML platform presented there was Intel's "Analytics Zoo" to integrate TensorFlow-based modeling workflows in the Spark environment. It also includes some pre-trained deep net models and useful packaged "verticals" for different ML applications.

  • Serverless:
    I finally managed to learn more about serverless computing thanks to a tutorial. The speaker gave a fascinating analogy that made total sense: buying your own servers is like buying your own car; regular cloud computing is like getting a rental car; serverless is like using Lyft/Uber.

    Model serving has become the killer app for serverless due to its statelessness (see the minimal serving sketch after this list). But apparently, data prep/ETL workflows and more stateful MapReduce workflows are also increasingly being deployed on serverless infrastructure. The benefits of fine-grained resource elasticity and heterogeneity offered by serverless can help reduce resource costs. But the con is that software complexity goes up. Indeed, the speaker noted a caveat that most ML training workloads and other communication/state-intensive workloads are perhaps not (yet) a good fit for serverless. All this reminded me of this interesting CIDR'19 paper by Prof. Joe Hellerstein and co. Nevertheless, I think disaggregated and composable resource management, a generalization of serverless, seems like the inevitable evolution of the cloud.

  • Everything for ML and ML for Everything?:
    Prof. Shafi Goldwasser of MIT gave an interesting keynote on how the worlds of ML and cryptography are coming together to enable secure ML applications. She mentioned some open research questions on both adapting ML to be more amenable to crypto primitives and creating new crypto techniques that are ML-aware. It is official folks: almost all other areas of computing (call it X) are working on "X for ML and ML for X" research! Heck, I even saw physicists working on "ML for physics and physics for ML"! :)
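Following up on the serverless observation above, here is a minimal sketch of why stateless model serving fits serverless so naturally: the function loads a pre-trained model once per warm container and then scores each request independently, carrying no state between invocations. The handler signature and event format below are assumptions (loosely Lambda-style), and "model.pkl" is a hypothetical pre-trained model.

    # Minimal sketch of stateless model serving on a serverless platform.
    # The handler signature/event format are assumptions (loosely Lambda-style);
    # "model.pkl" is a hypothetical pre-trained model with a predict() method.
    import json
    import pickle

    # Loaded once per warm container; every invocation below is otherwise stateless.
    with open("model.pkl", "rb") as f:
        MODEL = pickle.load(f)

    def handler(event, context):
        features = json.loads(event["body"])["features"]
        prediction = MODEL.predict([features])[0]
        return {"statusCode": 200,
                "body": json.dumps({"prediction": float(prediction)})}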


SysML 2019


This was the first archival year for this conference on "systems + ML" research. This whitepaper details the intellectual focus of SysML. There were about 32 talks, many from major schools and companies, including several from Google AI. One of my students, Supun Nakandala, presented a demonstration on Krypton, our tool for explaining CNN predictions more quickly using query optimization techniques. But apart from our paper, I came to SysML to get a feel for how this new community is shaping up and to network with the other attendees. I was on the PC; I found it to be a refreshingly new experience to interact with folks from so many different areas under one roof: ML/AI, architecture, PL/compilers, databases, systems, HCI, etc.! The program naturally reflects the eclecticism of this emerging interdisciplinary community. Some observations about stuff I found most interesting at this conference:

  • Pertinent industrial interest:
    There was a large number of ML engineers and researchers from Google, Facebook, Microsoft, Apple, etc. I also saw DeepMind and Tesla for the first time at a systems conference. This underscores an important aspect of research at this growing intersection: visibility among the "right" industrial audience. Most ML systems research so far has been published at SOSP/OSDI/NSDI, SIGMOD/VLDB, ISCA/HPCA, etc. But such broad conferences typically attract a more generic industrial presence that may or may not be pertinent for ML systems research. For instance, companies usually only send their relational DBMS or data warehousing folks to SIGMOD/VLDB, not ML systems folks. SysML has clearly found a long-ignored sweet spot that is also growing rapidly.

  • Pipelining and parallelism on steroids:
    There were 4 main groups of papers: faster systems, new ML algorithms, ML debugging, and new ML programming frameworks. I will focus on the first, third, and fourth groups. The first group was largely from the networked/operating systems and architecture folks. The papers showed the power of two key systems tricks that long ago proved impactful in the RDBMS context: pipelining and parallel operators.

    Many papers aimed to reduce the network/communication overhead of distributed ML (e.g., Parameter Servers) by pipelining the communication of parts of the model state with computation over other parts (see the toy overlap sketch after this list). This is like hiding memory/IO latency on single-node systems. While the ideas were interesting, the performance gains underwhelmed me (~30% is the largest?!). But then again, there is a cultural/expectations gap between the networked systems folks and the database systems folks. :)

    There were many papers on hardware-software co-designed stacks, mainly for vision. But I found this paper from Stanford particularly interesting. It shows that to maximize resource efficiency, we need different kinds of parallelism for different operators within a deep net. I suspect such auto-tuned hybrid parallelism may have implications for other data processing systems too.

  • Debugging/programming frameworks for ML:
    These were a welcome relief from so many low-level systems papers! The only "data management for ML" paper was this one from Google that highlights issues in validating and debugging data-related issues in production ML settings. I was already familiar with this work and the TFX team. Such loosely coupled schema-guided approaches are crucial for dynamic heterogeneous environments where neither the data sources nor the model serving environments are under the ML engineer's control. Another interesting paper in this space was on enabling the software engineering practice of "continuous integration" for ML models. They reduce the labeled data requirements for reliably testing the accuracy of new ML models committed to a shared code repository.

    Finally, the paper I enjoyed reading the most was this one on TensorFlow.js. It studies ML training and inference in a peculiar setting: browsers. They give many remarkable example applications that use this framework, including in ML education with interesting pedagogical implications for teaching ML to non-CS folks. More touchingly, another application built a deep net-powered gestural interface to convert sign language videos to speech. It is heartening to see that the SysML community cares about more than just building faster ML systems or improving business metrics--democratizing ML is a much broader goal!
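As promised above, here is a toy sketch of the pipelining idea from the first group of papers: overlap the "communication" of one layer's gradients with the "computation" of the next layer's gradients. Sleeps stand in for real backprop and network transfers, so this only illustrates the overlap itself, not any specific paper's system.

    # Toy sketch: overlap gradient "communication" with backward "computation".
    # time.sleep() stands in for real work; illustrative only.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def compute_gradient(layer):
        time.sleep(0.1)  # pretend backprop for this layer
        return f"grad_layer_{layer}"

    def send_gradient(grad):
        time.sleep(0.1)  # pretend transfer to a parameter server
        return grad + "_sent"

    layers = list(range(5, 0, -1))  # backprop visits layers from last to first

    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as comm:
        sends = []
        for layer in layers:
            grad = compute_gradient(layer)                  # compute this layer's gradient...
            sends.append(comm.submit(send_gradient, grad))  # ...while the previous send overlaps
        results = [s.result() for s in sends]
    print(results, f"{time.time() - start:.2f}s")  # ~0.6s overlapped vs ~1.0s fully sequential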

Monday, August 20, 2018

SIGMOD DEEM 2018 Panel Discussion

The ACM SIGMOD Second Workshop on Data Management for End-to-End Machine Learning (DEEM) was successfully held a few weeks ago in Houston, TX. The goal of DEEM is to bring together researchers and practitioners at the intersection of applied machine learning (ML) and data management/systems research to discuss data management/systems issues in ML systems and applications. This blog post gives an overview of DEEM'18 and a lighthearted summary of the exciting and informative panel discussion.

Overview of DEEM 2018


As per the SIGMOD Workshops chairs, DEEM'18 had 117 registrations--almost 50% more than the next largest workshop at SIGMOD'18, about thrice as large as a typical SIGMOD/VLDB workshop, and likely the highest in the history of SIGMOD/VLDB workshops! Clearly, this is a red-hot area and our program had stirred the curiosity of a great many people. The day of, "only" about 70-80 people showed up. Thanks to sponsorship from Amazon and Google, we also funded 4 student travel awards. There were 10 accepted papers, with 3 presented as long talks and 7 as short talks. These papers spanned many interesting topics, including new ML programming models, scalable ML on DB/dataflow systems, human-in-the-loop ML exploration tools, data labeling tools, and more. A variety of top schools and some companies were represented. All of this was made possible thanks to the hard work of a top-notch PC. We had 4 excellent invited keynotes/talks from both academia (Jens Dittrich and Joaquin Vanschoren) and industry (Martin Zinkevich from Google and Matei Zaharia with his Databricks hat).

Jens shared his thoughts on what DB folks can bring to ML systems research and education, as well as a recent line of work on using ML to improve RDBMS components. Joaquin spoke about his work on ML reproducibility, collaboration, and provenance management with the successful OpenML effort. Martin gave a unique talk on a topic seldom addressed in any research conference--how to navigate the space of objectives for ML-powered applications before even getting to the data or ML models. He used a very Google-y example of improving user engagement via click measurements. Finally, Matei spoke about the recently announced MLFlow project from Databricks for managing the lifecycle and provenance of ML models and pipelines.

Panel Discussion


Getting to the panel discussion itself, the topic was a hot-button issue: "ML/AI Systems and Applications: Is the SIGMOD/VLDB Community Losing Relevance?" I moderated it and my co-chair, Sebastian Schelter, also helped put together the agenda. To make the discussion entertaining, I played the devil's advocate and made the questions quite provocative. Apart from the 4 invited speakers, we had 2 additional panelists: Joey Gonzalez (faculty at UC Berkeley) and Manasi Vartak (PhD student at MIT), all of whom are working on DEEM-style research. Two previously confirmed invitees, Luna Dong and Ce Zhang, were unfortunately unable to make it to the workshop.

Photos from the workshop. L to R: (1) The DEEM audience. (2) The Panelists: Matei, Joaquin, Jens, Joey, and Manasi (Martin not pictured). (3) Advocatus Diaboli.


First off, I clarified that the "irrelevance" in our question was meant only in the context of ML systems/applications, not data management in general, eliciting laughter from the audience. After all, as long as there is data to manage, data management research is relevant, right? :) But with the dizzying hype around AI and deep learning, we saw the above question as timely. The discussion was supposed to have 9 questions across 3 topics--problem selection/research content, logistics/optics of publication venues, and student training. But the amount of discussion generated meant we could cover only 6 questions. I started with an overview of the history of ML systems, from SAS and R in the 1970s all the way to the DEEM community today. In the rest of this post, I summarize each question, its context/background, and the panel responses and discussion. For brevity's sake, I will not always identify who said what.

We started with two fun rapid-fire questions that I often use to put my students on the spot and gauge their technical worldview. What is a "database"? What is a "query"? The Merriam-Webster Dictionary says a database is just an "organized collection" of data, while a query is just a "request for information" against a database. Interestingly, almost all the panelists said similar things although the (wrong) definition that "a database is a system for managing data" did come up once. Most considered a query as a "program" run against a database. No relational. No structured. No system. Not even logic. The panel was off to a flying start!


Q1. Is "in-database ML" dead? Is "ML on dataflow systems" dead? Is the future of ML systems a fragmented mess of "domain-specific" tools for disparate ML tasks?


Context/Background:
There have been almost 2 decades of work on incorporating ML algorithms into RDBMSs and providing new APIs to support ML along with SQL querying. This avoids the need to copy data and offers other benefits of RDBMSs such as data parallelism. Alas, such in-RDBMS ML support largely failed commercially, according to Surajit Chaudhuri, a pioneer of in-RDBMS ML tools. At XLDB'18, he explained that a major reason was that ML users wanted a lot more tool support (latest models, iterative model selection, complex feature processing), which SAS and similar products already offered. Those vendors also recognized the importance of near-data execution and connected their tools to RDBMS servers, leveraging the user-defined functionality of such systems to reduce data copying. Moreover, statisticians and data scientists were unfamiliar with SQL and preferred the familiarity of SAS, R, and similar tools, sealing the fate of in-RDBMS ML support at that time. That said, as storage became cheaper, many enterprises no longer mind copying data to Hadoop clusters and using Mahout for ML. Anecdotally, some users also do this to reduce the load on their costly RDBMSs, which they use mainly for OLTP. Spark MLlib and similar "ML on dataflow systems" are now largely replacing Mahout. But in this "era of deep learning," the programming and execution architectures of both RDBMSs and Spark-style systems seem highly inadequate. So, most users stick to in-memory Python/R for standard ML and TensorFlow/PyTorch for deep learning. Thus, tackling data issues in ML workloads typically requires problem-specific tools. Perhaps ML is just too heterogeneous for a unified system. Hennessy and Patterson recently said that the future of the computer architecture community is in disparate "domain-specific architectures." Should the DEEM community be content with a similar future for ML systems with no unifying intellectual core like relational algebra/SQL for RDBMSs?
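To make the "in-database ML" idea concrete, here is a toy sketch that fits a simple least-squares line entirely inside the database: one SQL aggregate query computes the sufficient statistics (counts, sums, and sums of products), so only a handful of numbers, rather than the data itself, leave the engine. It uses sqlite3 and made-up data purely for illustration; the in-RDBMS ML products discussed above are of course far more sophisticated.

    # Toy "in-database ML": fit y = a + b*x from sufficient statistics computed in SQL.
    # sqlite3 and the sample data are for illustration only.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (x REAL, y REAL)")
    conn.executemany("INSERT INTO points VALUES (?, ?)",
                     [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)])

    # One aggregate query; only these 5 numbers leave the database.
    n, sx, sy, sxx, sxy = conn.execute(
        "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) FROM points").fetchone()

    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # least-squares slope
    a = (sy - b * sx) / n                          # least-squares intercept
    print(f"y ~= {a:.2f} + {b:.2f} * x")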

Panel Discussion Summary:
By design, this question induced a sharp polarization among the panel and set the tone for the rest of the discussion. Martin and Manasi agreed that in-database ML is pretty much dead and that custom problem-specific ML+data systems are the inevitable future. Manasi also opined that there will never be an equivalent of SQL for ML in the foreseeable future. Jens countered that even if RDBMSs are dead as an execution engine for ML, relational-style ideas will still be relevant for new custom ML systems, which everyone agreed with. Matei opined that while in-database ML is dying, if not already dead, ML-on-dataflow-systems is alive and well, since he finds many enterprise customers of Databricks adopting Spark MLlib. Joey weighed in, based on his experience spanning both ML-on-dataflow and custom ML systems, that both kinds of systems will co-exist, albeit with more emphasis on the latter. The overall consensus was that the application space for ML systems is indeed quite fragmented, with different operating constraints and environments dictating which tools people will use. Most panelists agreed that while in-database ML may no longer be a particularly promising research direction, ML on dataflow systems, which is a more general environment than RDBMSs, is still a promising avenue for new ideas and will still matter for certain kinds of ML workloads.


Q2. What are the major open research questions for the DEEM community? Is the DB community's success with RDBMSs a guiding light or just historical baggage?


Context/Background:
Many CS communities are undergoing "crises" and witnessing massive "paradigm shifts," to borrow Thomas Kuhn's famous words. Perhaps the best example is the natural language processing (NLP) community. Almost 3 decades of work on feature engineering for applied ML over text have been discarded in favor of new end-to-end deep learning approaches. Some NLP folks say they find the new paradigm refreshing and more productive, while others say they had to take therapy to soothe the trauma caused by this upheaval. In this backdrop, the DEEM community can approach research on ML systems using RDBMSs as a "guiding light" to tackle problems and propose ideas. But this philosophy is fraught with the infamous pitfall of the streetlight effect--only problems/ideas that are easy to connect with RDBMS-style work and/or appease "the RDBMS orthodoxy" will get attention instead of what is truly valuable for ML applications. This pitfall is a highway to practical irrelevance, wherein researchers publish papers that look good to each other, while the "real world" moves on. An alternative philosophy is a clean slate world view in exploring novel problems/ideas in ML systems. But this philosophy is fraught with the risk of repeating history, including wasteful past mistakes in data systems research. Is it even possible to get a judicious mix of both these philosophies?


Panel Discussion Summary:
The first part of the question elicited a wide range of responses. Overall, several major open research problems/topics were identified by the panel, especially the following:
  • More support and automation for data preparation and data cleaning pipelines in ML
  • Abstractions and systems for ML lifecycle and experimentation management
  • Efficient ML model serving and better integration of ML models with online applications
  • Better visualization support for debugging ML models and data
  • Frameworks to think about how to craft ML prediction objectives, especially beyond supervised ML
The second part of the question was met mostly with a measured response that RDBMS ideas will still matter in the context of ML systems but we need to pick and choose depending on the problem at hand. For instance, the DB community has long worked on data preparation, ETL, and data cleaning. But adapting them to ML workloads introduces new twists and requires new research in the ML context, not just routine application or extension of DB work. Joey and Martin also cautioned that it is important to study ML systems problems in the context that matters for ML users and developers, which might often require departing entirely from RDBMS-style ideas. The operating/distributed systems community routinely witnesses such changes. But it is likely that such changes will cause painful "culture shocks" for the DB and DEEM communities, given the stranglehold of the successful legacy of RDBMSs.


Q3. Why has 30 years of work in the DB community on ETL and data cleaning had almost no impact on data preparation for ML among practitioners?


Context/Background:
Recent surveys of real-world data scientists, e.g., this massive Kaggle survey and this CrowdFlower report, repeatedly show that collecting, integrating, transforming, cleaning, labeling, and generally organizing training data (often collectively called "data preparation") dominates their time and effort, up to even 80%. Clearly, this includes data cleaning and integration concerns, which the DB community has long worked on. And yet, none of the data scientists interviewed or anecdotally quizzed seem to have found any techniques or tools from this literature usable/useful for their work. To paraphrase Ihab Ilyas, a leading expert on data cleaning research, "decades of research, tons of papers, but very little success in practical adoption" characterizes this state of affairs. Perhaps data cleaning is just too heterogeneous and too dataset-specific--more like "death by a thousand cuts" rather than a "terrible swift sword." Perhaps there are just way too many interconnected concerns for a generic unified system to tackle, whether or not it applies ML algorithms internally. What hope is there for DB-style data cleaning work in the ML context when its success in the SQL context is itself so questionable?


Panel Discussion and Summary:
Naturally, the provocative phrasing of the question elicited smiles and raised eyebrows, as well as a heated discussion. I separate the discussion and summary to highlight the different perspectives.

Jens countered that the DB community's work on data cleaning, especially the lines of work that apply ML, has indeed had an impact or, at the least, looks very promising. Joey opined that part of the problem is with the term "cleaning." A lot of time is inevitably spent by practitioners on understanding and reshaping their data to suit their tools, to bring in domain knowledge, etc., but all such activities get grouped under the catch-all term "cleaning." Manasi agreed, adding that recasting the data representation for the ML task in peculiar ways also gets talked about as data cleaning or organization. Moreover, almost no ML or data mining curricula teach such data cleaning/prep issues, which skews perceptions. Jens agreed, adding that the "data cleaning" area has a major marketing problem, since it sounds so boring and janitorial. He suggested a clever play on words for naming: "bug data analytics"!

Joaquin said that it will be hard to eliminate humans completely and that better human-in-the-loop solutions are needed to reduce manual effort. Jens suggested that pushing more of the cleaning steps into the ML modeling itself could help, similar to what some tree-based models already do. Joaquin agreed that making ML models more robust to data issues is also promising. Joey cautioned that ML will not be a panacea, since it relies on useful signals being present in sufficient quantity in the data to achieve anything useful. One will still need domain expertise to guide the process and set the right objectives. Matei then interjected to opine that the DB community's work on SQL and dataflow tools has had a major impact on data prep and that without this work, the 60% in the survey may very well have been 95% or more! Martin then pondered if an excessive focus on "cleaning" the data is wrongheaded when one should actually be "fixing" the data-generating process, especially in Google-like settings, where most of the data is produced by software.
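To make the "push cleaning into the modeling" idea above a bit more concrete, here is a minimal sketch (my own illustration, not something shown at the panel) of bundling imputation and encoding into the model pipeline itself, so the same prep travels from training to serving. It assumes scikit-learn and pandas; the column names and toy data are hypothetical.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with missing values and mixed types.
df = pd.DataFrame({
    "age": [34, np.nan, 45, 23],
    "country": ["US", "IN", np.nan, "DE"],
    "clicked": [1, 0, 1, 0],
})
numeric, categorical = ["age"], ["country"]

prep = ColumnTransformer([
    # Impute, then scale numeric columns.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric),
    # Impute, then one-hot encode categorical columns.
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

# The "cleaning" steps ship with the model: one object handles both fit and
# predict, which reduces the train/serve skew the panelists alluded to.
model = Pipeline([("prep", prep), ("clf", LogisticRegression())])
model.fit(df[numeric + categorical], df["clicked"])
print(model.predict(pd.DataFrame({"age": [np.nan], "country": ["FR"]})))

The point of the sketch is not the specific estimators but the design choice: data prep becomes a versioned, testable part of the model artifact rather than a pile of ad hoc scripts.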

In summary, a major takeaway was that there is likely an unhelpful terminology confusion between researchers and practitioners in the data prep for ML arena, which could hinder progress. Another was to collaborate more with the ML community to make ML models more natively robust to dirty data. But there was consensus that data prep for ML, including cleaning/organizing/transforming data, will remain a core focus for the DEEM community.


Q4+Q5. Is the DB community too obsessed with (semi-)structured data and ignoring the deep learning revolution for unstructured data? On the other hand, is deep learning overrated for ML analytics outside of "Big Tech" (Google, Amazon, etc.), especially for enterprises?


Context/Background:
These 2 questions are based on 2 key findings from the massive Kaggle survey, shown in the screenshots below.
"Relational" data includes both tables and multi-variate time series. Text data is clearly ubiquitous too, while images are not far behind. Given this, should the DB community be thinking more holistically about data rather than shoehorning themselves to structured and semi-structued data? Since deep learning with CNNs and RNNs is the way to go for text and multimedia analytics, the DEEM community should look at deep learning more. But the survey also shows that the most popular models are still linear models, trees, SVMs, Bayes Nets, and ensembles, way above CNNs and RNNs ("neural networks" in this list are likely just classical MLPs). This is likely related to the previous finding--relational data dominates their use cases and interpretability/explainability/actionability are crucial, not just accuracy. GANs, the new darling of the ML world, are at a mere 3%. Overall, there is a huge mismatch in what ML researchers consider "sexy" and what is important for ML practitioners! One might think deep nets are still too new, but the deep learning hype has been around for half a decade before this survey was done. So, one can only conclude that deep learning is overrated for most enterprise ML uses cases. This is a massive rift between the enterprise world and Big Tech/Web companies like Google, Facebook, and Amazon, who are using and aggressively promoting deep learning.

Panel Discussion and Summary:
Once again, due to the amount of interesting discussion these questions generated, I separate the discussion and summary to highlight different perspectives.

Manasi started off by agreeing that deep learning for text, speech, images, and video is something the DEEM community should study more but opined that current deep learning methods do not work on relational data. Thus, there is lots of room for work in this context. Joey bluntly stated that deep learning is indeed overrated and that most real-world ML users will remain happy with linear models, trees, and ensembles (e.g., RandomForest and XGBoost)! He also joked that we should rename logistic regression as a "deep net of depth 1." The compute cost and labeled data needs of deep nets are impractical for many ML users. Deep nets are also expensive to serve/deploy en masse. While deep nets are useful for speech and images, most users will just download and reuse pre-trained deep nets from Google/Facebook/etc. for such data, not train their own. Overall, his position was that it is completely mistaken to surmise that the whole world will switch everything to deep learning. For good jovial measure, he then added that his students are all working on deep learning for their papers!

Martin agreed that ML users should start simple but then also try complex models, including deep nets, based on available resources. For images and text, deep nets that exploit their structure are becoming unbeatable. Another key benefit of deep nets is that they are "compact" artifacts. Thus, their serving-time memory access characteristics are better than those of many other models that need extensive and cumbersome data transformation pipelines for feature engineering. These older models are also a nightmare to port from training to serving. Joey then interjected saying that such serving benefits only hold for "medium-sized" nets, not the 100s of layers ML people go crazy over. Matei weighed in with an anecdote about a Databricks customer. Deep nets are increasingly being adopted for images (and for text to a lesser extent), which are often present along with structured data. One example is "Hotels.com" using CNNs for semantic deduplication of images for displaying on their webpage. But he also agreed that not everyone will train deep nets from scratch. For instance, pre-trained CNNs can be used as an image "featurizer" for transfer learning in many cases to greatly reduce both compute and data costs.
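To illustrate the "featurizer" idea just mentioned, here is a minimal sketch (again my own illustration, not from the panel) of using a frozen pre-trained CNN to extract image features and training only a small task-specific head on top. It assumes PyTorch and torchvision; the random tensors and the binary labels simply stand in for a real image dataset and task.

import torch
import torchvision

# Load a pre-trained ResNet-18 and freeze its weights.
backbone = torchvision.models.resnet18(pretrained=True)
backbone.fc = torch.nn.Identity()   # drop the ImageNet classification head
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)

# "Featurize" a batch of stand-in images: 8 RGB images of size 224x224.
images = torch.randn(8, 3, 224, 224)
with torch.no_grad():
    feats = backbone(images)        # shape: (8, 512)

# Train only a small head on the frozen features, e.g., for a binary task.
head = torch.nn.Linear(512, 2)
labels = torch.randint(0, 2, (8,))
opt = torch.optim.SGD(head.parameters(), lr=0.01)
loss = torch.nn.functional.cross_entropy(head(feats), labels)
loss.backward()
opt.step()
print(loss.item())

Because only the small head is trained, both the compute and the labeled-data requirements drop sharply, which is the cost argument made above.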

In summary, there was a consensus that the DB community should welcome more work on unstructured data and deep learning-based analytics. But there was also caution against getting carried away by the deep learning hype, alongside the levity that one should still work on deep learning anyway. The heterogeneity of ML use cases and requirements means that a diverse set of ML models will likely remain popular in practice for the foreseeable future.


Q6. Should the DEEM community break up with SIGMOD and join SysML? What more can we do to enhance the impact of DEEM-style work?


Context/Background:
The DB community's PC processes, as well as the kinds of research it values, are now hotly debated topics. Stonebraker declared recently that SIGMOD/VLDB PCs are unfair to systems-oriented research. While praiseworthy steps are being taken by SIGMOD/VLDB to reduce such DB research culture wars, will this situation lead to DEEM-style work falling through the cracks? SIGMOD/VLDB PC chairs also repeatedly face the issue of not having enough ML systems-related expertise/knowledge available on their PCs and the conflation of DEEM-style work with data mining work, worsening the reviewing issues. The SysML Conference was created recently as a home for ML systems research, including DEEM-style work, to avoid such issues. So, in terms of the potential for visibility, research impact, and fairness of research evaluation, is it better for DEEM to break up with SIGMOD and join SysML? What other venues are suitable for DEEM-style work?

Panel Discussion Summary:
This question drew both audible gasps and sniggers from the audience. It was also highly pertinent for at least 2 panelists. Joey helped run the MLSys/AISys workshop at NIPS/ICML and SOSP for the last few years and NIPS BigLearn for a few years before that. Matei helped start SysML in 2018 and is its PC chair for 2019.

Matei readily admitted that as a head of SysML, he will be delighted to have DEEM at SysML. According to him, each paper at SysML will get at least 1 expert reviewer each from ML and systems. The PC meeting will be in person. He opined that SysML is a good fit for DEEM, since DB-inspired work is a core focus of SysML. DB-style ideas can be impactful in the ML systems context as they were with "Big Data" systems. Moreover, since many well-known ML experts are involved with SysML, it offers more visibility among the ML community too. He also suggested that the smaller and more focused communities of NSDI and OSDI are other good options for DEEM-style work that is systems-oriented. SysML will be modeled on their processes. Joey countered that it is perhaps better to keep DEEM at SIGMOD to ensure ML-oriented work gets more visibility/attention in the DB world, which has a lot to offer. He also suggested the MLSys formula of colocating DEEM with 2 different venues, say, in alternate years. One could be SIGMOD and the other could be SysML or an ML venue like NIPS. He wondered if MLSys had overemphasized operating/distributed systems aspects of ML at the cost of other systems-oriented concerns. DEEM at SIGMOD can be a forum for ideas from all DB cultures and can be complementary to SysML.

Martin asked if it was harder to get DB/systems folks to work on ML concerns or vice versa. The latter is widely considered harder. Joey interjected to suggest that building strong artifacts aimed at ML users can improve visibility in the ML world, a la TensorFlow/PyTorch. Martin and Joaquin emphasized the need for solid standardized/benchmark datasets for DEEM-style work on data prep/cleaning/organization/etc. to be taken more seriously in the ML world. This is similar to the UCI repo and ImageNet, both of which boosted ML research. Manasi opined that DEEM-style work need not be so ML-oriented to be publishable at NIPS. But she suggested that SIGMOD/VLDB should create an "ML systems" track/area and add more ML expertise to their PCs. But focusing only on systems-oriented stuff in ML could lead to wasteful repetitions of other DB-style ideas. She said keeping DEEM at SIGMOD will help the DB world stay engaged. Matei had a caveat that without proper ML expertise on PCs, there is a danger of publishing papers that "look nice" but lack ML methodological rigor. While such papers will be discredited in the longer run, they will waste time/resources. Researchers interested in ML systems work should first understand ML well enough. The same holds for PCs. Martin concluded by saying that since a lot of ML systems research studies practical and industrially relevant problems, regardless of the venue, researchers interested in impact should talk to practitioners in industry.

In summary, while the panel was divided on whether DEEM should swap SIGMOD for SysML, they had good suggestions on increasing visibility and impact of DEEM-style work, including building good artifacts, dataset standardization, and bringing more ML experts to SIGMOD/VLDB/DEEM PCs. I closed the discussion declaring that as the DEEM organizers, we have no plans of leaving SIGMOD, since we are thoroughbred "database people," eliciting laughter from the panelists and the audience, likely in relief. :)


Concluding Remarks


Overall, the panel discussion was provocative and passionate but also insightful and constructive. Many of the audience members and panelists later opined that they too found the discussion educational and entertaining. The questions we could not cover included some industry trends and student training issues--another time then. A big thank you again to my fellow DEEM'18 organizers (Sebastian Schelter and Stephan Seufert), the steering committee, the PC, our invited speakers and panelists, the sponsors (Amazon and Google), the officials/volunteers of SIGMOD'18, and all the authors and attendees who made DEEM'18 such a success! I believe DEEM will be back at SIGMOD 2019.


Disclaimer:

We do not own the copyrights for the illustrations used in this article for educational purposes. We acknowledge the sources of the illustrations in order here: https://edtosavetheworld.com/2014/05/28/1-thomas-kuhn-the-structure-of-scientific-revolutions/, http://first-the-trousers.com/hello-world/, http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf, https://en.wikipedia.org/wiki/Lingchi#/media/File:Martyrdom_of_Joseph_Marchand.jpg, and https://www.kaggle.com/surveys/2017/. If a copyright owner wants their illustration removed, we are happy to oblige.

Friday, April 13, 2018

The Culture Wars of Data Management

Scientists are people. Despite their fervent protestations of objectivity, all scientists are prone to conflating subjective experience with objective truth, at least once in a while. Einstein himself infamously dissed quantum mechanics initially but then contributed to its growth. The "whole truth" is an elusive and enigmatic beast. Consequently, people (including scientists) often fight "tribal culture wars" under the illusion that they (and those who agree with them) are right and all others are wrong. Perhaps nothing captures this issue more eloquently than the timeless parable of the blind men and the elephant. This issue arises over and over again in all fields of human endeavor, including the sciences. In this post, I examine how this issue is affecting the "data management" research community. My motivation is to offer a contemplative analysis of the state of affairs, especially for the benefit of students and new researchers, not prescribe definitive solutions.


This post is partly inspired by Pedro Domingos's article, "The Five Tribes of Machine Learning." He explains why there are deep "tribal" divisions within the machine learning (ML) community in terms of both the topics they pursue and their research culture. For instance, explosive battles often take place between the "Bayesians" (think statistics) and the "Connectionists" (think deep learning), two of ML's oldest tribes.

The data management field too has similar divisions, albeit not as strongly partitioned by tribes. It all came to a head last year when some prominent members of this community issued an ultimatum to SIGMOD and VLDB, the top data management research conferences. Criticizing the repeated unfair treatment of systems-oriented papers by reviewers (e.g., wrongly deeming a lack of theoretical analysis as low "technical novelty" or "technical depth," or crass dismissive statements such as "just engineering"), they threatened to fork off and start a new systems-oriented data management conference. Such brinkmanship is neither new nor unique to this field. In fact, break-ups often occur on account of "unfairness," e.g., IMC splitting from SIGCOMM, ICLR vs CVPR, MobiSys vs MobiCom; the list is long indeed! From my own experience, I agree with the core claim of the ultimatum. Apparently, the SIGMOD/VLDB leadership also agreed. The PC chairs of VLDB then emailed the whole PC (which I was on) with guidelines on how to fairly evaluate systems papers. Problem solved? Haha, no.

Why write this post? Who is it for?


It may appear puerile to put labels on entire "research cultures." But such cultural gaps are ubiquitous, even in physics. Acknowledging and understanding such gaps is the first step to either mitigating them or making peace with them. Without such understanding, I reckon the data management community might be on a death spiral to more fragmentation (or maybe that is inevitable due to other factors such as size, who knows?). I see such labels as a way of acknowledging the different cultures, tolerating them, and even celebrating the intellectual diversity. More importantly, it is crucial for students and new researchers to be aware of such gaps and why they exist. This could help them make more sense of paper rejections or even negative in-person interactions with other researchers. From my experience, many established researchers transcend the cultural gaps and often bridge multiple cultures. Even among those that stay within one culture, many are not antagonistic to the others. I just wish everyone would be more like these folks. I hope this post nudges more people, especially newcomers, towards that ideal world.

The Four Canonical Cultures (as I see it)


Based on my interactions with dozens of researchers, reading hundreds of papers, and reviewing for SIGMOD/VLDB for the last two years, I see at least 4 "canonical" cultures (to overload a classic DB term!). Unlike Pedro's tribes of ML, which are partitioned by areas (e.g., Connectionists study neural networks), it is misleading to delineate the divisions within the data management community by areas/topics because many areas, e.g., query optimization, have all four cultures (or at least more than one) represented. Thus, my split is based on the inherent expectations of each research culture, their methodologies, the non-data management fields they draw inspiration from (including technical vocabulary and "hammers"), and the style and content of their papers. Such differences are more insidious than areas, which is perhaps why the "culture wars" of data management are more damaging than the tribal wars of ML. These cultures are not mutually exclusive; in fact, many researchers have successfully hybridized these cultures and all 2^4 possible combinations are represented at SIGMOD/VLDB to varying extents. Anyway, here are the 4 canonical cultures I see:
  • Formalist
  • Systemist
  • Heuristicist
  • Humanist

The Formalist Culture


The Formalists bring a theory-oriented view to data management. Their culture has been influential since the founding years of the data management field, going back to Ted Codd himself. Formalists draw inspiration from theoretical fields such as logic, math, statistics, and theoretical CS, especially combinatorial optimization, formal methods/PL theory, and ML theory. They seek a high degree of formal rigor in the characterization and communication of ideas, especially in papers. Common elements in Formalist papers are non-trivial theorems and proofs, typically of the "hardness" of problems (couched in the language of computational complexity theory), approximation or randomized algorithms, and formal analyses of the complexity and quality of the algorithms. Many pride themselves on the theoretical sophistication of their results.

Some Formalists also publish in non-data management venues such as SODA, LICS, and NIPS, but curiously, not so much in STOC/FOCS. Database theory (think PODS/ICDT) is a major part of this culture, but not all Formalists are database theoreticians, i.e., some publish regularly at SIGMOD/VLDB too. Some Formalists have a reputation for expecting a high degree of rigor in all data management papers.

The Systemist Culture


The Systemists bring a systems-oriented view to data management. Their culture has also been influential since the founding years. Many pioneers of this field, including Charlie Bachman, Michael Stonebraker, Jim Gray, and David DeWitt, can be considered Systemists. They draw inspiration from the computer systems fields broadly construed (operating/distributed systems, compilers and software engineering, networking, computer architecture, etc.). Interestingly, their ideas have reshaped some of those other fields, e.g., the concept of ACID influencing distributed systems. Most Systemists seek a high degree of real-world practicality in ideas and care much less (compared to Formalists) for rigor in data management papers. Some even dislike seeing theorems. Common elements in Systemist papers are system architecture diagrams, discussions of system design decisions, and analyses of system trade-offs, typically with extensive experiments. They typically use system runtime or throughput metrics on synthetic workloads and datasets (e.g., TPC benchmarks). Scalability is often a key concern. Many pride themselves on the practical adoption of their systems, including via startups they found.

Some Systemists and Formalists have a reputation for waging "wars" against each other's culture or even other fields. The war between Bachman and Codd on the (de)merits of the relational data model is one example, as is the war between Stonebraker and Ullman on whether database theory research is needed at all. More recently, some Systemists have waged wars against the distributed systems field, e.g., this now-infamous blog post. Curiously, most Systemists tend not to publish in non-data management computer systems venues such as NSDI/OSDI/SOSP/ASPLOS/ISCA/etc.

The Heuristicist Culture


The Heuristicists draw inspiration from the fields of artificial intelligence (ML, natural language processing, etc.), but also logic, math, statistics, and theoretical CS. This culture started growing mostly from the 1990s. Popular topics where this culture is well-represented are data mining, Web/text data analysis, data cleaning, and data integration. Heuristicists design practical heuristic algorithms, typically without a high degree of rigor in the exploration of ideas (compared to Formalists) and without deep systems-oriented trade-off analyses (compared to Systemists). But there is substantial diversity in this culture. Many papers use rigor in communication and thus seem closer to the Formalists. Others focus on complex algorithmic architectures and thus seem closer to the Systemists. Many researchers bridge this culture with the Formalists, especially on the topic of data integration. But many problems in such topics are so "messy" that theoretical work alone goes only so far. So, many researchers also bridge this culture with the Systemists. Common elements in Heuristicist papers are math notation but no non-trivial theorems, large algorithm boxes or diagrams, and extensive experiments, typically with real-world datasets and workloads. They typically focus on quality metrics such as accuracy, precision, recall, AUROC, etc. Runtime or scalability metrics are less common.

Data cleaning and data integration have become central themes in data management research. But a large chunk of data mining researchers broke up with the SIGMOD/VLDB community and joined the newly created SIGKDD community along with applied AI/ML researchers. However, many data mining researchers still publish routinely at SIGMOD/VLDB along with SIGKDD/ICDM/etc. Similarly, many Web/text data researchers also publish at both SIGMOD/VLDB and WWW/AAAI/etc.

The Humanist Culture


The Humanist culture has also been around since the early years (e.g., query-by-example), but this culture too started growing mostly from the 1990s. This culture draws inspiration from the fields of human-computer interaction, programming languages, and cognitive science, but also parts of theoretical CS and computer systems. This culture puts the humans that work with data at the center of the research world. Topics in which this culture is well-represented are new abstract programming/computation models for managing/processing data, new query interfaces, interactive data exploration/analysis tools, and data visualization. Common elements in Humanist papers are terms such as usability and productivity, user studies, and even links to demo videos of the tools they build. Remarkably, many Humanist papers overlap with one or more of the other cultures, which perhaps makes this culture the least dogmatic and most eclectic. Humanist papers often have multiple metrics, including quality, system runtime, and human effort (measured with "lines of code," human interaction time, interviews, etc.). Many Humanists also publish at CHI/UIST.

This culture seems to be growing, primarily by attracting people from the other cultures. Working on human-centered problems has a long history in this field, going back to Codd himself. Many researchers often seem to forget the fact that the relational model itself was created not to solve a "hard" theoretical problem, improve a system's performance, or design a heuristic algorithm, but rather to improve the productivity of humans that queried structured data.

Culture Wars and Extremism


By "wars," I mean the endless intellectual tussle for suzerainty over the research field. The most common way such wars cause damage is unfair negative evaluations of research papers because one is unable to see the merits of a paper from a different culture or a cultural hybrid. Many "extremist" Formalists and Heuristicists (and Formalist+Heuristicists) often dismiss many Systemist papers as "just engineering." Many extremist Formalists also dismiss many Heuristicist papers as "too ad hoc" or "too flaky." Many extremist Systemists dismiss many Formalist papers as "not practical" or "too mathy." Many extremist Formalists and Systemists (and Formalist+Systemists) dismiss many Humanist papers as "fluff" and "soft science." Many extremist Formalists, Heuristicists, and Systemists (and many cultural hybrids) also often dismiss many Systemist or Heuristicist papers that explore new important problems and propose initial solutions as "too straightforward" or "not novel" by conflating simplicity, an oft-exalted virtue in the real world, with a lack of novelty. Overall, one often ends up wrongly judging a fish by its ability to climb a tree. Sadly, such wars sometimes force researchers to add needlessly contrived content to papers just to appease such extremists.

Such tribal culture wars and extremism, whether deliberate or not, detract from fair and honest critiques that actually help advance the science. Personally, I find such wars ridiculous. So, please allow me to amuse myself (and hopefully, you) by ridiculing the extremist bigots of each culture who diss other cultures and glorify only their own, using the provocative meme of "X as seen by Y" (see this one about programmer wars first, if you do not know this meme). Since a 16x16 matrix is too big for me to construct, I restrict myself to the canonical 4x4. :) Hopefully, this will help students/researchers realize that they are not alone in getting caustic comments. I also hope this will cause people to think twice before engaging in such wars themselves in the future. Behold, I present to you the culture wars of data management!

(Optional) Explanatory Caption (might be painfully obvious for some). In row-major order from the top-left. First row: Einstein (geniuses, of course), FSM (mushy false gods), Crashed truck (what a disaster!), and Big fat hacker/engineer. Second row: Lucius Malfoy (snooty pure-blood evil wizards destined to be defeated), Justice League (superheroes saving humans), Awkward nerd, and Zealots (close-minded bigots serving the devil). Third row: Kung Fu Panda 3 (googly-eyed admiration), Pretend-Superman, Star Trek (the future beckons), and Airplane pilots (so many moving parts!). Fourth row: Irrelevant contrived junk peddlers, Kids with fun toys (so cute!), Calvin-and-Hobbes's games (so naive!), and The One (the prophetic savior).


Making matters worse are "civil wars" within cultures, especially among the Systemists. Some Systemists are so obsessed with relational DBMSs that they pooh-pooh any new data systems. Such insularity has caused much grief in the last decade, especially due to the rise of "Big Data" systems (think MapReduce/Hadoop or Spark) and "NoSQL" systems (think BigTable) from the distributed systems field. Of course, most Systemists acknowledge their mistakes and change their minds over time, but not without causing serious damage to the field. There is also a mini civil war among the Formalists between the logic and discrete math-oriented sub-culture and the statistics/ML theory-oriented sub-culture. With so many culture wars going on, I fear research on "data management for ML"/"ML systems," which is the area my own research focuses on, will be driven away from SIGMOD/VLDB. This area is increasingly considered important for the wider CS landscape and thus, for the data management community too. I will be raising these issues (among others) at a panel discussion at the DEEM Workshop at SIGMOD 2018. Rest assured, I will return to blog about how that goes.

Should the Four Cultures Stick Together or Break Up?


The data management/database/data systems community is not a "monoculture"; it never was and it never will be (almost surely). As a vertical slice of CS, it is "multicultural" and will always have high intellectual diversity. The four cultures may be irreconcilable, but they are complementary and can co-exist. The benefits of inter-cultural tolerance, cultural hybridization, and trans-cultural work are clear: cross-pollination of problems and ideas, trans-cultural partnerships and collaboration, infusion of ideas from each culture's favored non-data management fields, export of ideas from one culture via another to another CS field or non-CS disciplines, and so on. This sort of inter-cultural amity and partnership was/is the norm at the Database Group of the University of Wisconsin-Madison, where I went to graduate school, and the Database Lab of the University of California, San Diego, where I am on the faculty now, and many other database groups. There is a long tradition of research cultural hybridization and trans-cultural research that is practiced and even celebrated by many researchers, senior and junior alike.

Yet, to claim such culture wars do not exist is to be the proverbial ostrich that buries its head in the sand, while pretending we can all just sing kumbaya, pat ourselves on our backs, and get along with stiff upper lips is to be utterly naive. We have to acknowledge and accommodate the vast cultural gaps amicably lest the centrifugal forces caused by unfair research evaluation and tribal cultural wars spin out of control. This form of multiculturalism is a source of strength for SIGMOD/VLDB and sets them apart from related communities such as STOC, OSDI, ISCA, SIGKDD, and CHI. Of course, it will also defeat the point of inter-cultural amity if everyone is expected to work across cultures; it should be left to each researcher to pick a culture or cultural hybrid for their work based on their skills and taste.

Instead of trying to shame or bully any of the cultural groups into conformity with another culture, all groups should practice mutual respect. An analogy I can draw is with movie genres: one cannot coerce a person who only likes arthouse dramas to like big-budget blockbusters or vice versa. All 4 cultures and all cultural hybrids bring something different and valuable to the table of data management research. It will be nothing short of a catastrophe for SIGMOD/VLDB if the Systemists (or indeed, any other group) leave. While such a pyrrhic war has been averted for now, I fear it might turn into a Cold War that poisons the field further. Unfortunately, the current peer review processes of SIGMOD/VLDB are broken and are amplifying the centrifugal forces. Thankfully, they are aware of this issue and working to fix them soon. I am reasonably confident they will course correct but I doubt it will happen quickly or easily.

All that said, I could be wrong and perhaps the cultures are better off breaking up. After all, thanks to the Web, attending a conference is no longer needed to follow new research. Cutting-edge "data management" work is no longer published at SIGMOD/VLDB only; NSDI, OSDI, SIGKDD, NIPS, and more share the pie. Inter-cultural research exchange can happen across communities too. There are precedents for both area-based and culture-based splits/spinoffs: PODS (for database theory), CIDR (for "visiony" data systems), SoCC (for cloud computing), and SysML (for ML systems). Who can be 100% sure it is "bad" to have more splits/spinoffs?

Even if SIGMOD/VLDB remain multicultural, should they become more "federated" instead of "unitary" in terms of the research tracks? It is clearly unfair to have a random SODA (or SIGKDD) reviewer evaluate a random OSDI (or CHI) paper and vice versa--and yet, this is roughly the kind of unfairness caused by the culture wars that continue unabated within the SIGMOD/VLDB community. Is it not a form of "emotional abuse" of students to subject them to such toxic subjective culture wars instead of a resolute focus on the objective (de)merits of ideas? Does the SIGMOD/VLDB community really want to cut such a sorry figure against NSDI/OSDI/SOSP or NIPS/ICML or CHI or other communities that compete for bright students? I think the SIGMOD/VLDB community should openly and honestly tackle such contentious questions without acting like this or this. But I do not know the best avenue for spurring such conversations: a short workshop, a townhall or panel discussion at SIGMOD/VLDB, or something else, who knows?

Concluding Remarks


Scientists are people. It is crucial to practice objectivity and apply solid quality filters for research evaluation. But it is also crucial to be aware of one's own blind spots. Unknown unknowns are also impossible to fathom for anyone, no matter how knowledgeable--none of the blind men can say anything about parts of the elephant they cannot reach. This is all the more true for scientists at the cutting edge of the ever-expanding universe of knowledge. Also crucial is respecting subjective differences on intellectual practice, since the leap from objective facts to judgmental interpretation is far too often a leap of faith guided by subjective experience--even if two of the blind men grasp the same tail fur, one might call it soft on his hand and the other, coarse. Perhaps I am being a naive idealist, but I think it is imperative that we all inject ourselves with large doses of scientific humility, curiosity, civility, empathy, and magnanimity. Watching this video regularly might help! :) Unless there is credible verifiable evidence to the contrary, every scientist must be willing to admit that they could be wrong. Every one of us is (intellectually speaking) "blind" in some way, and we will remain blind no matter how much we learn. But we can all be less blind by talking with others that have an earnest world-view that is different from ours, not talking down to them.


PS: I do not claim that these 4 cultures are the only ones in data management. As the field keeps growing, we may well see the addition of new cultures or the reorganization of existing ones. Perhaps I will write another post on this topic 20 years from now! If you have any thoughts on this topic or this post, please do leave your comments below.

ACKs: Special thanks to Julian McAuley for being a sounding board for this post. I also thank the following people (in alphabetical order of surnames) for giving me feedback on this post, including suggestions that strengthened it: Peter Bailis, Spyros Blanas, Vijay Chidambaram, Joe Hellerstein, Paris Koutris, Sam Madden, Yannis Papakonstantinou, Jignesh Patel, Christopher Re, Victor Vianu, Eugene Wu, and a few others that did not want to be acknowledged. My ACKs do not necessarily mean they endorse any of the opinions expressed in this post, which are my own.



Reactions (Contributed Paragraphs)


The post elicited a wide range of reactions from data management researchers. Some researchers have kindly contributed a paragraph to register their reactions here, both with and without attribution. I expect to have more contributions in due course. If you do not want to comment publicly, feel free to email me your reaction so that I can add it here as an anonymous contribution. I hope all this contributes to stirring more conversations on this important topic.

Jignesh Patel:
Interesting! Note Systemists don't abhor theory; they abhor theory for the sake of theory. Classic systems papers have some formalism, but just what is required to understand the system implications (think of the classic Gray locking paper). I don't think Systemists wage wars. Like true systems people, they are quick to identify practical problems and speak up. Having said all that, you do bring out an important problem of tribal wars that is killing the community. Also, we actually have a really nice database systems community where the senior folks actually take criticism quite well. Arguments are welcome! Don't know for sure about the theory folks, but I think they are pretty open-minded too. Overall, we have a pretty nice community, and largely the right things happen in the long run. Ok--I'm tenured and I'm not as pessimistic as others are, perhaps as a result?

Anonymous:
I liked how you have broken up the community into 4 tribes. I agree with most of it. However, as higher-level feedback, here's what I think: I feel like this is perhaps triggered by bad/unfair reviews, which every one of us has dealt with. However, I personally see the poor reviewing issue (i.e., lack of tolerance) as only a symptom of a bigger problem that probably has a lot less to do with "culture" and more to do with a "mafia" mentality in the community. Awards/Recognitions/Opportunities are not distributed based on merits. Rather, advisors/friends look out for their own advisees/friends. For example, the more famous the advisor, the more opportunities for their students. That means most everyone else is simply an "academic orphan" in the community. Not all but the majority of senior people in the community who are in control of how recognitions/opportunities are distributed do NOT act on what they preach. In public, most of them talk about the importance of "impactful work" but when you read their own letters they are doing nothing more than "bean counting" when it comes to assessing impact. The traditional data management community is rapidly losing its relevance, and as such, everyone is trying to come up with a definition of what's a fake problem and what's real/worthy. How does this relate to low-quality reviews? The data management community is a hostile environment due to its contentious and unfair inner workings. In a hostile environment, individuals aren't acting rationally, let alone fairly. For example, even PC chairs and area chairs are not always people who are deemed by the majority of the community as reasonable or even insightful.