Curse of Dimensionality Explained

The Curse of Dimensionality, a term initially introduced by Richard Bellman[1], is a phenomenon that arises when applying machine learning algorithms to high-dimensional data.

Let's take a simple example as an illustration of the issue. Say we have a data set of observations X^T=\{x_1,\dots,x_{N}\}, where N=20 and x \in [0,20] is a scalar (D=1). We also have a set of target classes T^T=\{t_1,\dots,t_{N}\}, where t is a class variable expressing association with one of two possible classes - either "green" or "red" (t\in\{g,r\}).

We want to train a classifier that can output the probability of the n^{th} observation being of class "green":

\begin{equation}
p(t_n = g \mid x_n)
\end{equation}

Let's use a quite naïve approach and partition the space of X into four equally sized regions - R_1, R_2, R_3, and R_4 - as shown in the leftmost sub-plot of Figure 1.

Figure 1: The curse of dimensionality - as the number of dimensions increases, the number of regions grows exponentially

In the first region, i.e. the interval x \in [0,5], we have three observations of class red and one observation of class green, so

\begin{equation}
p(t_n = g \mid R_1) = \frac{1}{4} = 0.25
\end{equation}

We can apply the same approach to the remaining three regions and get

\begin{equation}
\begin{split}
p(t_n = g \mid R_2) = \frac{2}{3} \approx 0.67 \\
p(t_n = g \mid R_3) = \frac{4}{7} \approx 0.57 \\
p(t_n = g \mid R_4) = \frac{3}{6} = 0.5
\end{split}
\end{equation}

We now have a working, although quite simplistic, classifier. If we are presented with an unseen observation, all we have to do is determine its region in order to make a prediction for its class.
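
To make the procedure concrete, here is a minimal sketch of such a histogram-style classifier in Python (NumPy assumed). The observation values below are invented for illustration; only the per-region class counts mirror the example.

```python
import numpy as np

# Hypothetical 1-D data set: the observation values are made up, but the
# per-region class counts match the figure
# (R1: 3r/1g, R2: 1r/2g, R3: 3r/4g, R4: 3r/3g).
x = np.array([1.2, 2.5, 3.1, 4.4,                          # R1
              6.0, 7.3, 9.1,                               # R2
              10.2, 11.5, 12.0, 12.8, 13.3, 13.9, 14.6,    # R3
              15.4, 16.1, 17.0, 17.8, 18.5, 19.2])         # R4
t = np.array(list("rrrg" "rgg" "rrrgggg" "rrrggg"))

edges = np.linspace(0, 20, 5)          # region boundaries: 0, 5, 10, 15, 20
region = np.digitize(x, edges[1:-1])   # region index (0..3) per observation

# p(t = g | R_i): fraction of green observations in each region
p_green = np.array([(t[region == i] == "g").mean() for i in range(4)])
print(p_green)  # approximately [0.25, 0.67, 0.57, 0.5]

def predict(x_new):
    """Predict p(t = g) for an unseen observation by looking up its region."""
    return p_green[np.digitize(x_new, edges[1:-1])]
```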

Now let's increase the dimensionality of the data set X by making x a two-dimensional vector (D=2):

\begin{equation}
X^T = \{(x_{11}, x_{12}),\dots,(x_{N1}, x_{N2})\}
\end{equation}

Look at the middle sub-plot of Figure 1. We have the same number of observations, but the number of regions has increased to 16.

In the case of a one-dimensional x we had 20 observations across 4 regions, resulting in an average of \frac{20}{4}=5 observations per region. Using the same number of observations in a two-dimensional space reduces this number to \frac{20}{16}=1.25 observations per region.

We now have regions that contain only green observations (e.g. R_2: x_1 \in [5,10], x_2 \in [0,5]), where according to our classifier the probability of observing "green" will be 1 and the probability of observing "red" will be 0.

Another problem that we can immediately spot is that regions with zero observations emerge (e.g. R_1: x_1 \in [0,5], x_2 \in [0,5]). Our classifier is undefined in such regions, and we won't be able to use it to make any predictions.

Increasing the dimensionality further worsens the situation - adding a third dimension results in 64 regions and a density of \frac{20}{64} \approx 0.31 observations per region.
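
A few lines of Python make this exponential growth explicit (an illustrative sketch, keeping four partitions per axis and our fixed 20 observations):

```python
# Regions and average observations per region as dimensionality grows,
# with 4 partitions per axis and a fixed N = 20 observations.
N, partitions = 20, 4
for D in range(1, 5):
    regions = partitions ** D
    print(f"D={D}: {regions:4d} regions, {N / regions:.4f} observations/region")
# D=1:    4 regions, 5.0000 observations/region
# D=2:   16 regions, 1.2500 observations/region
# D=3:   64 regions, 0.3125 observations/region
# D=4:  256 regions, 0.0781 observations/region
```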

Hastie et al.[2] show that the sampling density is proportional to N^{\frac{1}{D}}. Let's see how many observations we need if we want to keep the one-dimensional density from our example in a three-dimensional space.

\begin{equation}
\begin{split}
20 ^ \frac{1}{1} = x ^ \frac{1}{3} \\
x = 8000
\end{split}
\end{equation}

This means we have to use 8,000 observations in three-dimensional space to get the same density as we would get from 20 observations in a one-dimensional space.
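
In other words, the required sample size grows as 20^D. A quick, illustrative computation:

```python
# Observations needed in D dimensions to match the sampling density of
# N1 = 20 observations in one dimension: N1^(1/1) = N^(1/D)  =>  N = N1^D
N1 = 20
for D in range(1, 5):
    print(f"D={D}: N = {N1 ** D:,}")
# D=1: N = 20
# D=2: N = 400
# D=3: N = 8,000
# D=4: N = 160,000
```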

This illustrates one of the key effects of the curse of dimensionality - as dimensionality increases, the data becomes sparse. We need to gather more observations in order to present the classification algorithm with good coverage of the space. If, however, we keep increasing the number of dimensions, the number of required observations quickly grows beyond what we can hope to gather.

Note: One could argue that, given the way we defined the "curse of dimensionality", this phenomenon is not tied to the data dimensionality alone, but is a joint problem involving a high number of dimensions and the inability of the processing algorithm to scale accordingly.

There is indeed work being done to make certain algorithms less susceptible to the curse. For example, Taylor and Einbeck[3] suggest methods that enable multivariate polynomial fitting in high-dimensional spaces.

There are certain approaches that can greatly alleviate the "curse of dimensionality".

Increasing the number of observations is the obvious way of tackling the issue, although it is not always viable in practice. Working from the opposite direction and reducing the number of dimensions directly often turns out to be a more pragmatic solution.

Principal Component Analysis (PCA) is probably the most common method used for dimensionality reduction.
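
As an illustrative sketch (assuming scikit-learn, with random data standing in for a real data set), PCA can project 50-dimensional observations down to their 5 strongest components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))        # 100 observations, 50 dimensions

pca = PCA(n_components=5)             # keep the 5 strongest components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # -> (100, 5)
print(pca.explained_variance_ratio_)  # variance captured per component
```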

Feature selection is another popular technique, whose primary goal is to select a subset of relevant features. Reducing the number of features at play not only improves prediction performance, but also reduces the necessary storage capacity and improves training times[4].
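
A minimal sketch using scikit-learn's SelectKBest on a synthetic data set (purely illustrative; univariate ANOVA F-scores serve as the relevance criterion here):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 50 features, of which only 5 are informative
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)  # keep 5 best-scoring features
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)                            # -> (200, 5)
```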

References

[1] Bellman, R.E., Adaptive Control Processes: A Guided Tour, Princeton Legacy Library, Princeton University Press, 1961
[2] Hastie, Trevor and Tibshirani, Robert and Friedman, Jerome, The Elements of Statistical Learning, Springer Series in Statistics, 2001
[3] Taylor, James and Einbeck, Jochen, Challenging the curse of dimensionality in multivariate local linear regression, Computational Statistics, 28.3, 2013
[4] Guyon, I. and Elisseeff, A., An Introduction to Variable and Feature Selection, J. Mach. Learn. Res., 3, 2003