In studying the genetic basis of a disease, it is now common to select a set of relevant genes G, and to measure how strongly they are expressed in cell samples from a group of patients, some healthy and some ill.1 The expression level of a gene is mapped to a value in [-1,1]. Each patient's data is then a vector with one entry per gene, or equivalently, a point in R^|G|. The size of G is frequently in the thousands, which makes the data high-dimensional by present standards.
Analyzing data in a high-dimensional space R^d is rife with challenges. First, many statistical estimators are accurate only when the number of data points (call it n) is orders of magnitude larger than d. Some models, such as histograms, have error rates of the form n^(-1/d), so that to halve the error, you need 2^d times as much data: bad news even for double-digit d. A second difficulty is computational, as typified by the ever-important 2-means clustering problem: divide a data set into two groups, each with a designated center, so as to minimize the average squared distance from data points to their closest centers. Naive approaches run in time n^d, which is astronomical even for smallish d, and NP-hardness results have dampened hopes for dramatically better algorithms. As a result, such problems are attacked with heuristics that offer no guarantees on their solutions. Understanding the quality of these schemes is doubly tricky because they are often justified on intuitive grounds, inevitably informed by experience of 2D and 3D space, while high-dimensional spaces are full of strange and counterintuitive effects.
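To make the heuristic character of such schemes concrete, here is a minimal sketch in Python (an illustration assumed here, not part of the article) of the standard Lloyd-style alternation for 2-means; the function name two_means and all parameter choices are hypothetical. It converges to a local optimum of the objective but carries no guarantee of finding the best clustering.

```python
import numpy as np

def two_means(points, iters=100, seed=0):
    """Lloyd-style alternation for 2-means; finds a local optimum only."""
    rng = np.random.default_rng(seed)
    # Start from two distinct data points as the initial centers.
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        # (keep the old center if a cluster becomes empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Report the average squared distance to the nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    cost = np.mean(dists.min(axis=1) ** 2)
    return centers, dists.argmin(axis=1), cost

# Example: 500 points in a 50-dimensional space.
X = np.random.default_rng(1).normal(size=(500, 50))
_, _, cost = two_means(X)
print(cost)
```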
Such complications are collectively referred to as the curse of dimension: a superstition that the murky realm of high dimension brings bad luck. But mathematical analysis is starting to clear away the haze. A pleasant surprise is that the counterintuitive geometry of high-dimensional space, once properly characterized, can be exploited to defeat other facets of the curse.
Even a solid ball, the simplest of objects, has unusual aspects in high dimension. For d = 2 or d = 3, a set of points picked at random from the unit ball B_d = {x in R^d : ||x|| ≤ 1} will have some significant fraction near the origin, say within distance ½ of it. But we wouldn't find even one such point in high dimension unless we were to draw exponentially many points from the ball. This is because points at distance r < 1 from the origin constitute an r^d fraction of B_d, and this fraction goes rapidly to zero with rising dimension. For large d, the supposedly filled-in ball B_d is in fact well approximated by a thin, hollow shell, {x in R^d : 1−ε ≤ ||x|| ≤ 1} for ε = O(1/d).
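The vanishing mass near the origin is easy to check numerically. The following short experiment, an illustrative sketch rather than anything from the article, samples points uniformly from B_d and measures the fraction lying within distance ½ of the origin; that fraction is (½)^d, so it collapses as d grows.

```python
import numpy as np

def sample_unit_ball(n, d, rng):
    # Uniform sampling from B_d: a random direction scaled by U^(1/d),
    # where U is uniform on [0, 1].
    g = rng.normal(size=(n, d))
    directions = g / np.linalg.norm(g, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)
    return directions * radii

rng = np.random.default_rng(0)
for d in (2, 3, 10, 50):
    pts = sample_unit_ball(100_000, d, rng)
    r = np.linalg.norm(pts, axis=1)
    # Roughly 0.25, 0.125, 0.001, and essentially 0 as d grows.
    print(d, np.mean(r <= 0.5))
```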
Here's another curiosity. Suppose we need lots of vectors in R^d that are orthogonal (at right angles) to each other. How many can we get? Exactly d, because this is the maximum number of linearly independent vectors. But if we only need them approximately orthogonal, with angles that need not be exactly 90 degrees but 90 ± ε, then we can find exponentially many vectors. A collection of exp(O(ε^2 d)) vectors picked at random from the surface of B_d will with high probability satisfy the angle constraint.
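A quick simulation, again only an illustrative sketch under assumed parameter choices, makes this tangible: pairwise angles among a few hundred random unit vectors in a thousand dimensions all land close to 90 degrees.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200
g = rng.normal(size=(n, d))
# Normalize to get points on the surface of the unit ball.
vectors = g / np.linalg.norm(g, axis=1, keepdims=True)

# Pairwise cosines (off-diagonal entries); angle = arccos(cosine).
cos = vectors @ vectors.T
off_diag = cos[~np.eye(n, dtype=bool)]
angles = np.degrees(np.arccos(np.clip(off_diag, -1.0, 1.0)))
print(angles.min(), angles.max())   # all pairwise angles close to 90 degrees
```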
These examples hint at the strangeness of high-dimensional space. However, such effects do not directly help with data analysis because they pertain to very specialized sets of points (those chosen randomly from the unit ball), whereas real data sets might look different. The trick is to take an arbitrary data set and then add randomness to it in such a way that the outcome is helpful and predictable.
An early groundbreaking result of this kind was Dvoretzky's theorem.2 Let K be a convex body in R^d: for instance, the feasible region of a linear program with d variables. K could be enormously complicated. But a random slice through K (of appropriate dimension) will with high probability be almost ellipsoidal. More precisely, it will contain an ellipsoid E and be contained within (1+o(1))E.
A more recent result is the Johnson-Lindenstrauss theorem.3 Take any n points in Euclidean space of arbitrarily high dimension. If they are projected into a random subspace of O(log n) dimensions, distances between points will with high probability be almost perfectly preserved. Since clustering and other forms of data analysis frequently depend only on interpoint distances, the dimension of data sets can automatically be reduced to O(log n).
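For intuition, here is a minimal sketch (assumed for illustration, not taken from the paper under discussion) of the plain dense-Gaussian form of Johnson-Lindenstrauss projection; the sizes n, d, and k below are hypothetical choices. This baseline is precisely what Ailon and Chazelle's structured transform speeds up.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, d, k = 100, 10_000, 400           # k on the order of log(n)/eps^2

X = rng.normal(size=(n, d))          # stand-in data; any point set works
P = rng.normal(size=(d, k)) / np.sqrt(k)   # random projection, scaled so
Y = X @ P                            # squared lengths are preserved in expectation

# Compare every pairwise distance before and after projection.
ratios = [
    np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    for i, j in combinations(range(n), 2)
]
print(min(ratios), max(ratios))      # both close to 1: distances nearly preserved
```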
The following paper by Nir Ailon and Bernard Chazelle, "Faster Dimension Reduction," demonstrates an ingenious variant of this theorem that allows the projection to be carried out especially fast.
1. Alizadeh, A. et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403 (2000), 503–511.
2. Dvoretzky, A. Some results on convex bodies and Banach spaces. In Proceedings of the International Symposium on Linear Spaces (Jerusalem, 1961), 123–160.
3. Johnson, W. and Lindenstrauss, J. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26 (1984), 189–206.