The Curse of Dimensionality
One of the characteristics that separates modern data science from much of the earlier work is the sheer number of features used. Statistical models used to be considered large if they contained 10 to 20 features. Now, it’s not uncommon for a model to use 10 to 20 thousand features. The explosion comes from the fact that there is so much more data available, coupled with a marked decrease in computational costs.
Let’s take a look at how high-dimensional data complicates analyses. For simplicity, I’ll start off with two and three dimensions to gain a little intuition, but we’ll see that higher dimensions are even stranger than these first oddities suggest.
Most standard analyses can be thought of as either fitting a hyperplane that passes through the data (regression) or fitting a hyperplane that separates regions of space containing data (classification). In both cases, we need a hyperplane. For simplicity, I’ll talk about regression, but most of the problems apply equally to classification.
The first problem that we encounter is the difficulty of finding a unique solution. In three-dimensional space, we need at least three non-collinear points to determine a plane. In higher dimensions, we’ll be able to find exactly one hyperplane that fits all the points if we have as many points in general position as dimensions. However, this is often a luxury in data science. Sometimes we’ll have fewer points than dimensions, which means there are infinitely many hyperplanes that fit the data. If the number of points equals the number of features, an exact solution will exist but will be sensitive to any variation in a single point. We’ll often find we just don’t have enough data to make a robust fit to so many features. This is the first aspect of the curse of dimensionality.
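To make this concrete, here’s a minimal NumPy sketch (the sizes and random data are my own illustration, not tied to any particular dataset): with 5 points and 10 features, any vector from the null space of the data matrix can be added to a solution without changing the fit, so infinitely many hyperplanes pass through the data exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 data points but 10 features: the system X @ w = y is underdetermined.
X = rng.normal(size=(5, 10))
y = rng.normal(size=5)

# lstsq returns the minimum-norm solution; adding anything from the
# null space of X gives another exact fit, so the solution is not unique.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# Build a second exact solution by adding a null-space direction.
_, _, Vt = np.linalg.svd(X)
null_direction = Vt[-1]                   # orthogonal to the rows of X
w_alternative = w_min_norm + 3.0 * null_direction

print(np.allclose(X @ w_min_norm, y))     # True
print(np.allclose(X @ w_alternative, y))  # True -- a different hyperplane, same fit
```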
The second problem is more subtle, but it extends even to large datasets. The first intuition, which is only partially right, is to consider how many points are needed to fill a space. Say we want to fill the space between -10 and 10 in 1D space with steps of 1 unit. We need 21 points. For 2D space between -10 and 10 on each axis, we need 21 x 21 = 441 points. For 3D space, we need 9261. The number rises quickly as dimensions are added.
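The count grows exponentially, as 21 to the power of the dimension. A quick sketch (pure Python; the loop bound is just for illustration):

```python
# Number of grid points needed to cover [-10, 10] at unit spacing,
# as a function of the dimension d: 21 ** d.
for d in range(1, 7):
    print(d, 21 ** d)
# 1 21
# 2 441
# 3 9261
# 4 194481
# 5 4084101
# 6 85766121
```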
However, the problem is even more insidious. Let’s think about a point that lies at [0.5, 0.5] in 2D space and [0.5, 0.5, 0.5] in 3D space. What is the distance to the nearest integer point of a unit grid? In 2D it’s 0.707 units, but in 3D the distance is 0.866 units. This means the situation is worse than it looks at first glance: as the number of dimensions increases, even naively scaling up a grid is not enough. A random point is unlikely to be near any other point, yet there will be many points far away at nearly the same distance.
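In general, the distance from the centre of a unit grid cell to its nearest corner is √d / 2, which keeps growing with the dimension d. A short sketch to illustrate (the chosen dimensions are arbitrary):

```python
import math

# Distance from the centre of a unit grid cell ([0.5, ..., 0.5]) to the
# nearest integer grid point grows as sqrt(d) / 2.
for d in (1, 2, 3, 10, 100):
    print(d, round(math.sqrt(d) * 0.5, 3))
# 1 0.5
# 2 0.707
# 3 0.866
# 10 1.581
# 100 5.0
```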
Overall, the curse of dimensionality can wreak havoc on what appears to be a straightforward analysis. As the number of dimensions increases, the number of data points needed just to find a unique solution grows quickly. Even if one tries to mitigate this problem by adding more points along the new dimensions, the volume that each point represents grows as well. Since many algorithms implicitly or explicitly use nearby points to make a prediction, the increasing volume surrounding each point means that a nearby point is unlikely to exist. One must always be aware that the curse of dimensionality can make an analysis that works well in low dimensions fail miserably in high dimensions, even if one expands the sample space into the extra dimensions.
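As a final, rough illustration of that last point, the sketch below (using random uniform data of my own choosing, not from any dataset in the text) compares the nearest and farthest distances from a query point to a fixed-size sample. As the dimension grows, the ratio creeps toward 1, so “nearest” stops meaning much.

```python
import numpy as np

rng = np.random.default_rng(1)

# For a fixed number of random points, compare the nearest and farthest
# distances from one query point as the dimension grows.
n_points = 1000
for d in (2, 10, 100, 1000):
    points = rng.uniform(size=(n_points, d))
    query = rng.uniform(size=d)
    dists = np.linalg.norm(points - query, axis=1)
    # The ratio approaches 1 in high dimensions: everything is roughly
    # the same distance away.
    print(d, round(dists.min() / dists.max(), 3))
```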