## New Blog

I am now blogging at modsimopt.netlify.com.

## A Short Presentation on Probabilistic Programming and Variational Inference

## The Cult of Universality in Statistical Learning Theory

The question is frequently raised as to why the theory and practice of machine learning are so divergent. If you glance at any article about classification, chances are that you will find symbol upon lemma and equation upon inequality, making claims about bounds on the error rates that should putatively guide the engineer in the solution of her problem.

However, the situation seems to be that the engineer, having been forewarned by her pragmatic colleagues (or having checked a few herself) that these bounds are vacuous for most realistic problems, circumvents them altogether in her search for any useful nuggets in the article.

So why do these oft-ignored analyses still persist in a field that is composed largely of engineers? From my brief survey of the literature it seems that one (but, by no means, the only) reason is the needless preponderance of *worst-case thinking*. (Being a panglossian believer in the purity of science and of the intentions of its workers, I am immediately dismissing the cynical suggestion that these analyses are appended to an article only to intimidate the insecure reviewer.)

**The cult of universality**

An inventive engineer designs a learning algorithm for her problem of classifying birds from the recordings of their calls. She suspects that her algorithm is more generally applicable and sits down to analyze it formally. She vaguely recalls various neat *generalization error* bounds she learned about during her days at the university, and wonders if they are applicable.

The bounds made claims of the kind

“for my classifier whose complexity is $c$, if trained on $m$ examples, then for any distribution that generated the data, it is guaranteed that the

generalization error rate $\leq$ error rate on the training set + some function of $(c, m)$

with high probability”.

Some widely used measures of the complexity of a classifier are its VC dimension and its Rademacher complexity, both of which measure the ability of the classifier to separate *any* training set. The intuition is that if the classifier can imitate any arbitrary labeling of a set of vectors, it will generalize poorly.
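To get a feel for how loose such a bound can be, here is a small sketch that evaluates one commonly quoted textbook form of the VC bound, with the VC dimension as the complexity and a confidence parameter `delta`. The constants differ between references, so the exact formula and the function name `vc_bound_gap` are assumptions made for illustration only.

```python
import math

def vc_bound_gap(c, m, delta=0.05):
    """Gap term of one textbook form of the VC bound (constants are an assumption).

    With probability at least 1 - delta:
        generalization error <= training error + gap
    where c is the VC dimension and m the number of training examples.
    """
    return math.sqrt((c * (math.log(2 * m / c) + 1) + math.log(4 / delta)) / m)

# Complexity small, moderate, and comparable to the sample size.
for c, m in [(10, 1_000), (1_000, 10_000), (10_000, 10_000)]:
    print(f"c={c:>6}, m={m:>6}: gap = {vc_bound_gap(c, m):.2f}")
```

Once the complexity is comparable to the number of examples, the gap exceeds 1, so the bound says nothing about an error rate that can never exceed 1.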

Because of the phrase “for any distribution” in the statement of the bound, the bound is said to be *universally* applicable. It is this pursuit of *universality* which is a deplorable manifestation of worst-case thinking. It is tolerable in mathematicians that delight in pathologies, but can be debilitating in engineers.

The extent of pessimism induced by the requirement of universality is not well appreciated. The following example is designed to illustrate this by relaxing the requirement from “any distribution” to “any smooth distribution”, which is not much of a relaxation at all.

Assume that I have a small training data set in $\mathbb{R}^d$ drawn from a continuous distribution $p$. Assume further that $p$ is reasonably smooth.

I now build a linear classifier under some loss (say an SVM). I then take all the training examples that are misclassified by the linear classifier and memorize them along with their labels.

For a test vector $x$, if $x$ is within $\epsilon$ of a memorized training example I give it the label of the training example. Otherwise I use the linear classifier to obtain my prediction.

I can make $\epsilon$ very small and since the training examples will be in general position with probability one, this classification scheme is unambiguous.

This classifier will have zero error on all training sets and therefore will have high complexity according to the usual complexity measures like VC, Rademacher etc. However, if I ignore the contribution of the memorized points (which only play a role for a set of vanishingly small probability), I have a linear classifier.
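The construction is easy to sketch in code. The version below is a minimal illustration, not a faithful reproduction: a least-squares linear fit stands in for the SVM, and the class name `LinearPlusMemory`, the radius `eps`, and the synthetic data are all assumptions.

```python
import numpy as np

class LinearPlusMemory:
    """Linear classifier that also memorizes its own training mistakes."""

    def __init__(self, eps=1e-6):
        self.eps = eps  # radius around memorized points (assumed tiny)

    def fit(self, X, y):
        # Least-squares fit on +/-1 labels; a stand-in for the SVM in the text.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        self.w, *_ = np.linalg.lstsq(Xb, y.astype(float), rcond=None)
        wrong = self._linear(X) != y
        self.mem_X, self.mem_y = X[wrong], y[wrong]
        return self

    def _linear(self, X):
        Xb = np.hstack([X, np.ones((len(X), 1))])
        return np.where(Xb @ self.w >= 0, 1, -1)

    def predict(self, X):
        pred = self._linear(X)
        if len(self.mem_X):
            # Distance from every query point to every memorized point.
            dist = np.linalg.norm(X[:, None, :] - self.mem_X[None, :, :], axis=2)
            nearest = dist.argmin(axis=1)
            close = dist[np.arange(len(X)), nearest] <= self.eps
            pred[close] = self.mem_y[nearest[close]]
        return pred

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)  # noisy labels

clf = LinearPlusMemory().fit(X, y)
X_test = rng.normal(size=(2000, 2))
train_error = np.mean(clf.predict(X) != y)
disagreement = np.mean(clf.predict(X_test) != clf._linear(X_test))
print("training error:", train_error)
print("fraction of test points where memory changes the prediction:", disagreement)
```

The training error is zero by construction, yet on fresh test points the memorized examples essentially never come into play, so the predictions are those of the plain linear classifier.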

Therefore, although it is reasonable to expect any analysis to yield very similar bounds on the generalization error for a linear classifier and my linear+memorization classifier, the requirement of universality leads to vacuous bounds for the latter.

Even if I assume nothing more than smoothness, I do not know how to derive reasonable statements with the existing tools. And we almost always know much more about the data distributions!

To reiterate, checking one’s learning algorithm against the worst possible distribution is akin to designing a bicycle and checking how well it serves for holding up one’s pants.

**“The medicine bottle rules”**

Our engineer ponders these issues, muses about the “no free lunch” results that imply that for any two classifiers there are distributions for which either one of them is better than the other, and wonders about the philosophical distinction between *a priori* restricting the function space that the learning algorithm searches in, and *a priori* restricting the distributions that the learning algorithm is applicable to.

After a short nap, she decides on a sensible route for her analysis.

1. *State the restrictions on the distribution*. She shows that her algorithm will perform very well if her assumptions of the data distribution are satisfied. She further argues that the allowed distributions are still broad enough to cover many other problems.

2. *State to what extent the assumptions can be violated*. She analyzes how the quality of her algorithm degrades when the assumptions are satisfied only approximately.

3. *State which assumptions are necessary*. She analyzes the situations where her algorithm will definitely fail.

I believe that these are good rules to follow while analyzing classification algorithms. My professor George Nagy calls these the *medicine bottle rules*, because, as on a medicine label, we require information on how to administer the drug, what it is good for, what it is bad for, and perhaps on interesting side effects.

I do not claim to follow this advice unfailingly and I admit to some of the above crimes. I, however, do believe that medicine bottle analysis is vastly more useful than much of what passes for learning theory. I look forward to hearing from you, nimble reader, of your thoughts on the kinds of analyses you would care enough about to read.

## Random Fourier Features for Kernel Density Estimation

The NIPS paper Random Fourier Features for Large-scale Kernel Machines, by Rahimi and Recht, presents a method for randomized feature mapping where dot products in the transformed feature space approximate (a certain class of) positive definite (p.d.) kernels in the original space.

We know that for any p.d. kernel there exists a *deterministic* map that has the aforementioned property but it may be infinite dimensional. The paper presents results indicating that with the randomized map we can get away with only a “small” number of features (at least for a classification setting).

Before applying the method to density estimation let us review the relevant section of the paper briefly.

**Bochner’s Theorem and Random Fourier Features**

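Assume that we have data in $\mathbb{R}^d$ and a continuous p.d. kernel $K(x, y)$ defined for every pair of points $x, y \in \mathbb{R}^d$. Assume further that the kernel is shift-invariant, i.e., $K(x, y) = K(x - y)$, and that the kernel is scaled so that $K(0) = 1$.

The theorem by Bochner states that under the above conditions $K$ must be the Fourier transform of a non-negative measure on $\mathbb{R}^d$. In other words, there exists a probability density function $p(\omega)$ for $\omega \in \mathbb{R}^d$ such that

$$
\begin{aligned}
K(x - y) &= \int_{\mathbb{R}^d} p(\omega)\, e^{j \omega^\top (x - y)}\, d\omega \\
&= \int_{\mathbb{R}^d} p(\omega) \cos\!\left(\omega^\top (x - y)\right) d\omega \qquad (1) \\
&= E_{\omega}\!\left[\cos(\omega^\top x)\cos(\omega^\top y) + \sin(\omega^\top x)\sin(\omega^\top y)\right] \qquad (2)
\end{aligned}
$$

where (1) is because $K$ is real. Equation (2) says that if we draw a random vector $\omega$ according to $p(\omega)$ and form two vectors $\phi_\omega(x) = (\cos(\omega^\top x), \sin(\omega^\top x))$ and $\phi_\omega(y) = (\cos(\omega^\top y), \sin(\omega^\top y))$, then the expected value of $\langle \phi_\omega(x), \phi_\omega(y) \rangle$ is $K(x - y)$.

Therefore, for $x \in \mathbb{R}^d$, if we choose the transformation

$$
\phi(x) = \frac{1}{\sqrt{D}} \left( \cos(\omega_1^\top x), \sin(\omega_1^\top x), \ldots, \cos(\omega_D^\top x), \sin(\omega_D^\top x) \right)
$$

with $\omega_1, \ldots, \omega_D$ drawn according to $p(\omega)$, linear inner products in this transformed space will approximate $K(\cdot)$.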

**Gaussian RBF Kernel**

The Gaussian radial basis function kernel satisfies all the above conditions and we know that the Fourier transform of the Gaussian is another Gaussian (with the reciprocal variance). Therefore for “linearizing” the Gaussian r.b.f. kernel, we draw samples from a Gaussian distribution for the transformation.
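As a quick sanity check of this recipe, here is a minimal NumPy sketch, assuming the kernel is written as $\exp(-\|x - y\|^2 / (2\sigma^2))$ so that the frequencies are drawn from a Gaussian with standard deviation $1/\sigma$ per coordinate. The helper names `make_rff` and `gaussian_kernel`, the number of frequencies `D`, and the test data are illustrative choices, not taken from the paper.

```python
import numpy as np

def make_rff(d, D, sigma, rng):
    """Return a map x -> phi(x) in R^(2D) for the Gaussian kernel of scale sigma."""
    # Frequencies are Gaussian with standard deviation 1/sigma in each coordinate.
    W = rng.normal(scale=1.0 / sigma, size=(D, d))

    def phi(X):
        proj = X @ W.T
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

    return phi

def gaussian_kernel(X, Y, sigma):
    """Exact Gaussian RBF kernel matrix between the rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
sigma = 1.0
K_exact = gaussian_kernel(X, X, sigma)

for D in (10, 100, 10_000):
    phi = make_rff(d=2, D=D, sigma=sigma, rng=rng)
    K_approx = phi(X) @ phi(X).T
    print(f"D={D:>6}: max abs kernel error = {np.abs(K_approx - K_exact).max():.3f}")
```

The error shrinks roughly like $1/\sqrt{D}$, as expected for a Monte Carlo average. Using cosine and sine pairs makes the diagonal of the approximate kernel matrix exactly 1.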

**Parzen Window Density Estimation**

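Given a data set $\{x_1, \ldots, x_N\} \subset \mathbb{R}^d$, the so-called Parzen window probability density estimator is defined as follows

$$
\hat{p}(x) = \frac{1}{N h^d} \sum_{i=1}^{N} K\!\left(\frac{x - x_i}{h}\right)
$$

where $K$ is often a positive, symmetric, shift-invariant kernel and $h$ is the bandwidth parameter that controls the scale of influence of the data points.

A common kernel that is used for Parzen window density estimation is the Gaussian density. If we make the same choice we can apply our feature transformation to linearize the procedure. We have

$$
\hat{p}(x) \propto \frac{1}{N} \sum_{i=1}^{N} K(x - x_i) \approx \frac{1}{N} \sum_{i=1}^{N} \phi(x)^\top \phi(x_i) = \phi(x)^\top \left( \frac{1}{N} \sum_{i=1}^{N} \phi(x_i) \right)
$$

where $\phi$ is the random Fourier feature map and $h$ has been absorbed into the kernel variance.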

Therefore all we need to do is take the mean of the transformed data points and estimate the pdf at a new point to be (proportional to) the inner product of its transformed feature vector with the mean.
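A minimal sketch of this procedure, compared against the exact Gaussian Parzen window estimate, might look as follows. The mixture parameters, the scale `sigma`, the grid, and the values of `D` (the number of random frequencies) are assumptions chosen for illustration; they are not the settings used for the pictures in this post.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, sigma = 2, 60, 0.5

# A few dozen points from a two-component Gaussian mixture (assumed parameters).
X = np.vstack([rng.normal(loc=(-1.0, 0.0), scale=0.4, size=(N // 2, d)),
               rng.normal(loc=(1.0, 0.5), scale=0.4, size=(N // 2, d))])
norm_const = (2.0 * np.pi * sigma**2) ** (-d / 2)

def parzen(Q):
    """Exact Gaussian Parzen window estimate at the query points Q."""
    sq = ((Q[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return norm_const * np.exp(-sq / (2.0 * sigma**2)).mean(axis=1)

def rff_parzen(Q, D):
    """Approximate estimate: inner product with the mean feature vector."""
    W = rng.normal(scale=1.0 / sigma, size=(D, d))
    phi = lambda Z: np.hstack([np.cos(Z @ W.T), np.sin(Z @ W.T)]) / np.sqrt(D)
    mean_feature = phi(X).mean(axis=0)  # computed once from the data
    return norm_const * (phi(Q) @ mean_feature)

# Compare the two estimates on a grid around the data.
g = np.linspace(-3.0, 3.0, 40)
Q = np.array([(a, b) for a in g for b in g])
exact = parzen(Q)
errors = {D: np.abs(rff_parzen(Q, D) - exact).max() for D in (20, 200, 20_000)}
print(errors)
```

Note that the approximate estimate is a sum of cosines, so unlike the exact Parzen estimate it is not guaranteed to be non-negative or to decay away from the data.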

Of course, since the kernel value is only approximated by the inner product of the random Fourier features, we expect that the estimated pdf will differ from a plain unadorned Parzen window estimate. But different how?

**Experiments**

Below are some pictures showing how the method performs on some synthetic data. I generated a few dozen points from a mixture of Gaussians and plotted contours of the estimated pdf for the region around the points. I did this for several choices of $D$ (the number of random frequencies) and $\sigma$ (the scale parameter for the Gaussian kernel).

First let us check that the method performs as expected for large values of $D$, because the kernel value is then well approximated by the inner product of the Fourier features. The first 3 pictures are for a large $D$ and various values of $\sigma$.


Now let us see what happens when we decrease $D$. We expect that the error in approximating the kernel would lead to an obviously erroneous pdf. This is clearly evident for small $D$.


The following picture, for another choice of $D$ and $\sigma$, is even stranger.


**Discussion**

It seems that even for a simple 2D example, we need to compute a very large number of random Fourier features to make the estimated pdf accurate. (For this small example this is very wasteful, since a plain Parzen window estimate would require less memory and computation.)

However, the pictures do indicate that if the approach is to be used for outlier detection (aka novelty detection) *from a given data set*, we might be able to get away with a much smaller $D$. That is, even if the estimated pdf has a big error on the entire space, on the points from the data it seems to be reasonably accurate.

## Regularized Minimax on Synthetic Data

First I would like to mention that, since my last post, I came across the paper from 2005 on Robust Supervised Learning by J. Andrew Bagnell that proposed almost exactly the same regularized minimax algorithm as the one I derived. He motivates the problem slightly differently and weights each example separately and not based on types, but the details are essentially identical.

**Experiments on Synthetic Data**

I tried the algorithm on some synthetic data and a linear logistic regression model. The results are shown in the figures below.

In both examples, there are examples from two classes (red and blue). Each class is drawn from a mixture of two normal distributions (i.e., there are two *types* per class).

The types are shown as red squares and red circles, and blue diamonds and blue triangles. Class-conditionally the types have a skewed distribution. There are 9 times as many red squares as red circles, and 9 times as many blue diamonds as triangles.
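A data set of this shape is easy to generate. The sketch below is only an illustration of the setup: the means, spread, and counts are assumptions (only the 9-to-1 skew between types comes from the description above), and the plain logistic regression is fit by simple gradient descent so that its per-type error can be inspected.

```python
import numpy as np

rng = np.random.default_rng(0)

# (class label, type name, mean, count): 9-to-1 skew between types within a class.
# The means, unit spread, and counts are assumptions made for illustration.
types = [
    (+1, "red square", (-2.0, 0.0), 180),
    (+1, "red circle", (1.0, 2.0), 20),
    (-1, "blue diamond", (2.0, 0.0), 180),
    (-1, "blue triangle", (-1.0, -2.0), 20),
]

X, y, t = [], [], []
for type_id, (label, name, mean, count) in enumerate(types):
    X.append(rng.normal(loc=mean, scale=1.0, size=(count, 2)))
    y.append(np.full(count, label))
    t.append(np.full(count, type_id))
X, y, t = np.vstack(X), np.concatenate(y), np.concatenate(t)

def fit_logistic(X, y, steps=2000, lr=0.1):
    """Plain logistic regression by gradient descent on the mean log-loss."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        margins = y * (Xb @ w)
        grad = -(Xb * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= lr * grad
    return w

w = fit_logistic(X, y)
pred = np.where(np.hstack([X, np.ones((len(X), 1))]) @ w >= 0, 1, -1)

# Error rate of the plain classifier, broken down by type.
for type_id, (_, name, _, _) in enumerate(types):
    mask = t == type_id
    print(f"{name:>13}: {np.mean(pred[mask] != y[mask]):.2f} error on {mask.sum()} points")
```

Breaking the error down by type like this makes it visible when a classifier buys low overall error at the expense of the rare types.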

We would expect a plain logistic regression classifier to minimize the overall “error” on the training data.

However, since an adversary may assign a different set of costs to the various types (than those given by the type frequencies), a minimax classifier will hopefully try to avoid incurring a large number of errors on the most confusable types.

**Example 1**


Recall that as gamma decreases to zero, the adversary has more cost vectors at his disposal, meaning that the algorithm optimizes for a worse assignment of costs.

**Example 2**

**Discussion**

1. Notice that the minimax classifier trades off more errors on more frequent types for lower error on the less frequent ones. As we said before, this may be desirable if the type distribution in the training data is not representative of what is expected in the test data.

2. Unfortunately we didn’t quite get it to help on the named-entity recognition problem that motivated the work.