The Data Dossier: June 2016

There was once a paper,
Which got badly rejected at PODS.
Boldly, it went forth to SIGMOD,
Against it were overwhelming odds.
Branded "extremely controversial,"
Yet "profound," it waited a long while.
At last, deliverance was in sight,
Though many a reviewer it did rile!

This is a post about my SIGMOD 2016 paper titled To Join or Not to Join? Thinking Twice about Joins before Feature Selection. It is co-authored with my advisors, Jeff Naughton and Jignesh Patel, and our ML collaborator, Jerry Zhu.

Introduction

Machine learning (ML) is all the rage in CS right now. The "traditional" view of training data in ML is that it is a single table with all the features about the entity to model. However, in most real-world applications, datasets are often multi-table and connected by primary key-foreign key (KFK) relationships. Thus, data scientists often join the base tables to gather as many features as they can for ML. Here is a concrete example (from the paper):

Example: Consider a data scientist at an insurance company modeling their customers using ML to prevent customer churn. They have a table with customer features such as their age, gender, income, employer, etc. One of those features, the employer's ID is a foreign key (FK) that refers to a separate table with information about companies and other organization. Given this situation, the data scientist instinctively joins these two tables to create a single table in which the feature vectors from the customer and employer tables are concatenated. They then use it to train their favorite ML model, e.g., Naive Bayes, logistic regression, neural networks, etc.

Joins before ML do not just arise in insurance; they are ubiquitous and arise in practically every data-driven application. Recommendation systems join tables with ratings, users, and movies. Web security systems join tables about login events, user accounts, and webpages. The list goes on and on. In all these cases, feature selection is typically performed implicitly or explicitly along with ML to help improve accuracy. For example, explicit greedy subset search techniques such as backward or forward selection are popular for Naive Bayes, while implicit L1 or L2 regularization are popular for logistic regression and neural networks.

In this paper, we ask a simple but radical question: Do we really "need" these joins before ML? The motivation for this question is a simple information theoretic result we prove in the paper: the features brought in by such KFK joins are redundant. In the ML literature, many feature selection methods explicitly search for redundant features and remove them. This motivates us to short-circuit the feature selection process using database schema information: simply avoid KFK joins before ML! Clearly, this could help improve runtime performance because several features are avoided before feature selection without even looking at the data. Avoiding the join processing cost also saves some time but this component is usually minor. Empirically, we saw speed-ups of up to two orders of magnitude (186x) on real-world datasets.

But the million dollar question for avoiding joins is: Will avoiding a KFK join reduce ML accuracy? This paper is an in-depth exploration of this rather surprising and counter-intuitive question that combines a statistical learning theory-based analysis with a database-style practical approach to avoiding joins in the real world.

Summary

Our first technical results show that information theory may not be perspicacious enough to explain whether avoiding joins would reduce accuracy or not. Thus, we resort to applying statistical learning theory, which explains that the prediction (test) error of an ML classifier can be decomposed into three components: bias, variance, and noise. Informally speaking, the bias quantifies how complex an ML model's hypothesis space is -- more complex models have lower bias. The variance quantifies the instability of the training process -- more complex models have higher variance (conversely, fixing the model, having more training examples lowers the variance). At the heart of ML is a trade-off between bias and variance, which is often illustrated as follows: as the model complexity increases, the bias decreases but the variance increases, which leads to higher test errors than train errors -- a situation known colloquially as "overfitting."

Thus, our question now becomes: What is the effect of avoiding a KFK join on bias and variance? The paper goes into the technical details and arrives at the following answer: Avoiding a join will not increase the bias but it might increase the variance. The crux of why this happens is a simple two-fold reason, which I explain briefly next. First, the foreign key (the employer's ID) encodes all information about the employer because all employer features are merely functions of the employer's ID. In database parlance, this is also called a functional dependency. Thus, any ML classifier, which is simply a function of the feature vector, can be recast as another function that only uses the employer's ID and avoids other employer features. Essentially, a function of a function is just another function! Thus, the hypothesis space of the model does not shrink when avoiding the join, which means the bias will not increase. In contrast, it might turn out that some specific employer feature/features is/are sufficient to represent the true concept. But usually, the foreign key has a much larger domain than all features it refers to (for example, there are only 50 states, but there are millions of employers). Thus, the hypothesis will be far larger if the employer ID is used a representative of all employer features, which get avoided. Since the number of training examples is still the same whether or not we avoid the join, this implies that the variance might shoot up if we avoid the join.

How do we use the above theoretical analysis to help data scientists in practice? One practical way, which we take, is to bound the potential increase in variance and enable data scientists to apply a threshold on it; essentially, this creates a simple decision rule. We call this process avoiding the join "safely." Basically, the data scientist tolerates a certain risk of slightly higher test error in return for potentially much faster runtime performance. The paper goes into the details of how we adapt classical VC dimension-based bounds on variance from the learning theory literature for our purpose.

After a lot of analysis and conservative simplifications, we arrive at a stunningly simple and easy-to-check criterion in the end: the Tuple Ratio (TR) rule. Essentially, we only need to compare the ratio of the number of tuples in the tables that are joined: in our example, this is the ratio of the number of customers to the number of employers. If this ratio is above a certain threshold, then it is "safe" to avoid the join. A key distinguishing aspect of our approach is that our TR threshold is tied only to the VC dimension of an ML model and is independent of the dataset instance (unlike traditional ML hyper-paramaters, which require tuning per instance). We tuned our rule using a simulation study and found that for an error tolerance of 0.001, a reasonable value for the TR threshold is 20.

An extensive empirical validation using 7 real-world datasets (14 joins between them) shows that the TR rule with a threshold of 20 works out of the box for both Naive Bayes and logistic regression combined with several different feature selection methods! Overall, it correctly avoids 7 out of 11 joins that are safe to avoid and correctly does not avoid 3 out of 3 joins that are not safe to avoid. The remaining 4 cases are missed opportunities to avoid joins, which happens because our decision rule is conservative.

Concluding Remarks

Our paper opens up a new connection between learning theory and joins, which are fundamental relational operations that arise during the process of feature engineering for ML over multi-table datasets. Much work remains to be done, however, and this paper is only the beginning. Our ideas assume that the ML model has a finite VC dimension (specifically, we assume a linear VC dimension). It is not clear what happens if we use models with higher (potentially infinite) VC dimensions, e.g., neural networks, RBF-kernel SVMs, or decision trees. We also assume discrete feature spaces. It is not clear what happens with numeric features spaces. Several such interesting questions remain unanswered and answering them requires new research that truly lies at the intersection of data management and machine learning, both of which are critical for today's data-driven world.

Interestingly, the SIGMOD reviewers found our work "extremely controversial," as the opening poem alludes to. In fact, we got six reviewers instead of the usual three! In the first round, two gave us "champion" accepts, while the rest, including an external ML expert, gave us a reject. After the revision round, however, all six accepted our paper. The key bone of controversy was whether foreign keys can indeed be used as features for ML. The short answer is: it depends but in most applications, foreign keys are routinely used as features for ML. The paper explains this issue in more detail.

To Join or Not to Join,
That is the question.
Tuple Ratio, we did coin,
Use at your discretion!

I invite you to read our research paper. Questions, suggestions, and comments -- critical or otherwise -- are welcome! Controversy is even more welcome! :)

Links

Project webpage.
Research paper at SIGMOD 2016.
Technical report.
Code on GitHub.
Real datasets.

The Data Dossier

Saturday, June 25, 2016

Research Paper: To Join or Not to Join?

About Me

Blog Archive