Suppose that one observes an incomplete subset of entries selected from a low-rank matrix. When is it possible to complete the matrix and recover the entries that have not been seen? We demonstrate that in very general settings, one can perfectly recover all of the missing entries from most sufficiently large subsets by solving a convex programming problem that finds the matrix with the minimum nuclear norm agreeing with the observed entries. The techniques used in this analysis draw upon parallels in the field of compressed sensing, demonstrating that objects other than signals and images can be perfectly reconstructed from very limited information.
In many practical problems of interest, one would like to recover a matrix from a sampling of its entries. As a motivating example, consider the task of inferring answers in a partially filled out survey in which questions are asked to a collection of individuals. Then we can form a matrix where the rows index the individuals and the columns index the questions. We collect data to fill out this table, but unfortunately, many questions are left unanswered. Is it possible to make an educated guess about what the missing answers should be? How can one make such a guess? Formally, we may view this problem as follows. We are interested in recovering a data matrix M with n1 rows and n2 columns but have access to only m of its entries, where m is much smaller than the total number of entries, n1n2. Can one recover the matrix M from m of its entries? In general, everyone would agree that this is impossible without some additional information.
In many instances, however, the matrix we wish to recover is known to be structured in the sense that it is low-rank or approximately low-rank. (We recall for completeness that a matrix has rank r if its rows or columns span an r-dimensional space.) Consider the following two scenarios as prototypical examples.
These examples are of course far from exhaustive and there are many other problems which fall in this general category.
Suppose for simplicity that we wish to recover a square n × n matrix M of rank r. Although M contains n² numbers, our assumption that its rank is r means that it can be represented exactly by its singular value decomposition (SVD)
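    M = Σ_{1≤k≤r} σk uk vk^T = UΣV^T,        (1.1)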
where V^T denotes the transpose of V. Σ is an r × r diagonal matrix with real, positive entries σk > 0. U is an n × r matrix with orthonormal columns u1, ..., ur; that is, uk^T uk = 1 and ui^T uj = 0 if i ≠ j. V is also n × r with orthonormal columns v1, ..., vr. The column space of M is spanned by the columns of U, and the row space is spanned by the columns of V.
The number of degrees of freedom associated with a rank r matrix M is r(2n − r). To see this, note that Σ has r nonzero entries, and U and V each have nr total entries. Since U and V each satisfy r(r + 1)/2 orthogonality constraints, the total number of degrees of freedom is r + 2nr − r(r + 1) = r(2n − r). Thus, when r is much smaller than n, there are significantly fewer degrees of freedom than the size of M would suggest. The question is then whether M can be recovered from a suitably chosen sampling of its entries without collecting n² measurements.
In this paper, we demonstrate that most low-rank matrices can indeed be recovered from a very sparse sampling of their entries. In Section 2, we summarize the main results of our paper, highlighting the necessary assumptions, algorithmic ingredients, and theoretical foundations of reconstructing matrices from a given collection of entries. In Section 3, we survey subsequent developments in this area, including refinements and important extensions of our theory. We close with a discussion of further progress and advances in low-rank and sparse modeling.
Which matrices?
In general, one cannot hope to be able to recover a low-rank matrix from a sample of its entries. Consider the rank 1 matrix M equal to
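    M = e1 x^T,   x ∈ ℝ^n,        (2.1)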
where here and throughout, ei is the ith canonical basis vector in Euclidean space (the vector with all entries equal to 0 but the ith equal to 1). The matrix M has the entries of x along its first row and all the other entries are 0. Clearly, this matrix cannot be recovered from a sampling of its entries unless we see all of the entries in the first row. As another example, the matrix e1 en^T has a 1 in the (1, n) entry and 0s everywhere else. If we do not see this upper-right corner, then we cannot distinguish the matrix from the all-0s matrix.
Even if it is impossible to recover all low-rank matrices from a set of sampled entries, can one recover most of them? To investigate this possibility, we introduce a simple model of low-rank matrices.
DEFINITION 2.1. Let M be a rank r matrix with SVD defined by (1.1). Then we say that M belongs to the random orthogonal model if the family {uk}1≤k≤r is selected uniformly at random among all families of r orthonormal vectors, and similarly for {vk}1≤k≤r. The two families may or may not be independent of each other. We make no assumptions about the singular values, σk.
If a matrix is sampled from the random orthogonal model, then we would expect most of the entries to be non-zero. This model is convenient in the sense that it is both very concrete and simple, and useful in the sense that it will help us fix the main ideas. In the sequel, however, we will consider far more general models. The question for now is whether or not one can recover such a generic matrix from a sampling of its entries.
Which sampling sets?
Clearly, one cannot hope to reconstruct any low-rank matrix M, even of rank 1, if the sampling set avoids any column or row of M. Suppose that M is of rank 1 and of the form xy^T with x, y ∈ ℝ^n, so that the (i, j) entry is given by Mij = xi yj. Then, if we do not have samples from the first row, one could never infer the value of the first component x1, as no information about x1 is observed. There is, of course, nothing special about the first row, and this argument extends to any row or column. To have any hope of recovering an unknown matrix, one needs to have access to at least one observation per row and one observation per column.
This example demonstrates that there are sampling sets where one would not even be able to recover matrices of rank 1. But what happens for typical sampling sets? Can one recover a low-rank matrix from almost all sampling sets of cardinality m? Formally, suppose that the set Ω of locations corresponding to the observed entries ((i, j) ∈ Ω if Mij is observed) is a set of cardinality m sampled uniformly at random. Then, can one recover a generic low-rank matrix M, perhaps with very large probability, from the knowledge of the value of its entries in the set Ω?
Which algorithm?
If the number of measurements is sufficiently large, and if the entries are close to uniformly distributed, one might hope that there is only one low-rank matrix with these entries. If this were true, one would want to recover the data matrix by solving the optimization problem
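    minimize    rank(X)
    subject to  Xij = Mij,   (i, j) ∈ Ω,        (2.2)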
where X is the decision variable and rank(X) is equal to the rank of the matrix X. The program (2.2) is a common sense approach which simply seeks the simplest explanation fitting the observed data. If there were only one low-rank object fitting the data, the solution of (2.2) would recover M perfectly. This is unfortunately of little practical use, because not only is this optimization problem NP-hard but also all known algorithms which provide exact solutions require time doubly exponential in the dimension n of the matrix in both theory and practice.
If a matrix has rank r, then it has exactly r nonzero singular values so that the rank function in (2.2) is simply the number of nonvanishing singular values. In this paper, we consider an alternative which minimizes the sum of the singular values over the constraint set. This sum is called the nuclear norm,
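    ||X||_* := Σ_{1≤k≤n} σk(X),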
where, here and below, σk(X) denotes the kth largest singular value of X. The heuristic optimization we study is then given by
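    minimize    ||X||_*
    subject to  Xij = Mij,   (i, j) ∈ Ω.        (2.3)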
Whereas the rank function is equal to the number of non-vanishing singular values, the nuclear norm equals their sum. The nuclear norm is to the rank functional what the convex l1 norm is to the l0 norm in the area of sparse signal recovery. The main point here is that the nuclear norm is a convex function and can be optimized efficiently via semidefinite programming.14
There are many norms one could define for a given matrix. The operator norm is the largest singular value. The Frobenius norm is equal to the square root of the sum of the squares of the entries. This norm is akin to the standard Euclidean norm on a real vector space. Why should the nuclear norm provide lower rank solutions than either of these two more commonly studied norms?
One can gain further intuition by analyzing the geometric structure of the nuclear norm ball. The unit nuclear norm ball is precisely the convex hull of the rank 1 matrices of unit Frobenius norm. The nuclear norm minimization problem (2.3) can be interpreted as inflating the unit ball until it just touches the affine space Xij = Mij. Such an intersection will occur at an extreme point of the nuclear norm ball, and these extreme points are sparse convex combinations of rank 1 matrices. That is, the extreme points of the nuclear norm ball have low rank. This phenomenon is depicted graphically in Figure 1. There, we plot the unit ball of the nuclear norm for matrices parametrized as
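    [ x   y ]
    [ y   z ].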
The extreme points of this cylindrical object are the rank 1 matrices with unit Frobenius norm. The red line in this figure is a "random," one-dimensional, affine subspace which, as expected, intersects the nuclear norm ball at a rank 1 matrix.
As further motivation, an interesting connection exists between the nuclear norm and popular algorithms in data-mining and collaborative filtering. In these fields, researchers commonly aim to find an explicit factorization X = LRT that agrees with the measured entries. Here L and R are n × k matrices. Since there are many possible such factorizations that might agree with the observations, a common approach searches for matrices L and R that have Frobenius norm as small as possible, that is, the solution of the optimization problem
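    minimize    (||L||F^2 + ||R||F^2)/2
    subject to  X = LR^T
                Xij = Mij,   (i, j) ∈ Ω,        (2.4)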
where we are minimizing with respect to L ∈ ℝ^{n×k}, R ∈ ℝ^{n×k}, and X ∈ ℝ^{n×n}, and ||·||F denotes the Frobenius norm. Surprisingly, the optimization problem (2.4) is equivalent to minimization of the nuclear norm subject to the same equality constraints provided k is chosen to be larger than the rank of the optimum of the nuclear norm problem (2.3).30
To get an intuition for this equivalence, take any matrix X of rank k. Suppose its SVD is X = UΣV^T. If we set L := UΣ^{1/2} and R := VΣ^{1/2}, we see that
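    (||L||F^2 + ||R||F^2)/2 = (1/2) Σ_j σj (Σi Uij^2 + Σi Vij^2) = Σ_j σj = ||X||_*,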
because Σi Uij^2 = Σi Vij^2 = 1 for all j. Thus, the optimal solution of (2.3), factored in this way, is feasible for (2.4) and achieves an objective value equal to its nuclear norm, so the optimal value of (2.4) is no larger than that of (2.3). The full equivalence can be seen via an appeal to semidefinite programming and can be found in Recht et al.30
The main advantage of this reformulation (2.4) is to substantially decrease the number of decision variables from n² to 2nr. For large problems, this leads to a significant reduction in computation time, such that very large instances can be solved on a desktop computer. On the other hand, the formulation (2.4) is nonconvex and thus potentially has local minima that are not globally optimal. Nonetheless, this factored approximation (2.4) of the nuclear norm is one of the most successful stand-alone approaches to solving the Netflix Prize problem.16,24 Indeed, it was one of the foundational components of the winning team's prediction engine.
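To make the factored approach concrete, here is a minimal sketch of regularized alternating least squares on the observed entries, a penalized variant of (2.4). This is our own illustrative code (the function name, regularization weight, and iteration count are arbitrary choices), not the software used by Netflix Prize contestants.

```python
import numpy as np

def als_complete(M_obs, mask, k=5, lam=0.1, iters=50, seed=0):
    """Fit M ~ L R^T on the observed entries (mask == True) by alternating
    least squares, penalizing ||L||_F^2 + ||R||_F^2 as in (2.4)."""
    rng = np.random.default_rng(seed)
    n1, n2 = M_obs.shape
    L = rng.standard_normal((n1, k))
    R = rng.standard_normal((n2, k))
    reg = lam * np.eye(k)
    for _ in range(iters):
        for i in range(n1):                      # update the row factors
            cols = np.flatnonzero(mask[i])
            if cols.size:
                Rj = R[cols]
                L[i] = np.linalg.solve(Rj.T @ Rj + reg, Rj.T @ M_obs[i, cols])
        for j in range(n2):                      # update the column factors
            rows = np.flatnonzero(mask[:, j])
            if rows.size:
                Li = L[rows]
                R[j] = np.linalg.solve(Li.T @ Li + reg, Li.T @ M_obs[rows, j])
    return L @ R.T
```

Each inner update is a small k × k least squares solve, which is what makes the factored formulation attractive for very large instances.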
2.1. Main results
As seen in our first example (2.1), it is impossible to recover a matrix which is equal to 0 in nearly all of its entries unless we see all the entries of the matrix. This is particularly likely if the singular vectors of a matrix M have most of their mass concentrated in a few coordinates. For instance, consider the rank 2 symmetric matrix M given by
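    M = σ1 u1 u1^T + σ2 u2 u2^T,   u1 = (e1 + e2)/√2,   u2 = (e1 − e2)/√2,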
where the singular values are arbitrary. Then, this matrix vanishes everywhere except in the top-left 2 × 2 corner, and one would basically need to see all the entries of M to be able to recover this matrix exactly. There is an endless list of examples of this sort. Hence, we arrive at the notion that the singular vectors need to be sufficiently spread across all components, that is, uncorrelated with the standard basis, in order to minimize the number of observations needed to recover a low-rank matrix. This motivates the following definition.
DEFINITION 2.2. Let U be a subspace of ℝ^n of dimension r and PU be the orthogonal projection onto U. Then the coherence of U (vis-à-vis the standard basis (ei)) is defined to be
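    μ(U) := (n/r) max_{1≤i≤n} ||PU ei||^2.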
Note that for any subspace, the smallest μ(U) can be is 1, achieved, for example, if U is spanned by vectors whose entries all have magnitude 1/√n. The largest possible value for μ(U) is n/r, which would correspond to any subspace that contains a standard basis element. Matrices whose column and row spaces have low coherence are likely not to vanish in too many entries and are our most likely candidates for matrices that are recoverable from a few samples. As we discuss below, subspaces sampled from the random orthogonal model (Definition 2.1) have nearly minimal coherence.
To state our main result, we introduce two assumptions about an n1 × n2, rank r matrix M whose SVD is given by (1.1) and with column and row spaces denoted by U and V, respectively.
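    A0  The coherences obey max(μ(U), μ(V)) ≤ μ0 for some positive μ0.
    A1  The n1 × n2 matrix Σ_{1≤k≤r} uk vk^T has entries bounded by μ1 √(r/(n1 n2)) in absolute value, for some positive μ1.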
These definitions implicitly define two critical parameters, μ0 and μ1. These μ's may depend on r and on n1, n2. Moreover, note that A1 always holds with μ1 = μ0√r, since the (i, j)th entry of the matrix Σ_{1≤k≤r} uk vk^T is given by Σ_{1≤k≤r} uik vjk, and by the Cauchy–Schwarz inequality,
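    |Σ_{1≤k≤r} uik vjk| ≤ √(Σ_{1≤k≤r} uik^2) √(Σ_{1≤k≤r} vjk^2) = ||PU ei|| ||PV ej|| ≤ μ0 r/√(n1 n2).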
Hence, for sufficiently small ranks, μ1 is comparable to μ0. We say that a subspace U ⊂ ℝ^n is incoherent with the standard basis if μ(U) is at most logarithmic in n. As we show in the full version of this paper, for larger ranks, both subspaces selected from the uniform distribution and spaces constructed as the span of singular vectors with bounded entries are not only incoherent with the standard basis but also obey A1 with high probability for values of μ1 at most logarithmic in n1 and/or n2.
We are now in a position to state our main result: if a matrix has row and column spaces that are incoherent with the standard basis, then nuclear norm minimization can recover this matrix from a random sampling of a small number of entries.
THEOREM 2.3. Let M be an n1 × n2 matrix of rank r obeying A0 and A1 and put n = max(n1, n2). Suppose we observe m entries of M with locations sampled uniformly at random. Then there exist constants C, c such that if
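    m ≥ C max(μ1^2, μ0^{1/2} μ1, μ0 n^{1/4}) nr β log n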
for some β > 2, then the minimizer to the problem (2.3) is unique and equal to M with probability at least 1 − cn^{−β}. For r ≤ μ0^{−1} n^{1/5}, this estimate can be improved to
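    m ≥ C μ0 n^{6/5} r β log n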
with the same probability of success.
Theorem 2.3, proven in the full version of this paper, asserts that if the coherence is low, few samples are required to recover M. For example, if μ0 is a small constant and the rank is not too large, then the recovery is exact with large probability provided that
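    m ≥ C n^{6/5} r log n.        (2.5)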
We give two illustrative examples of matrices with incoherent column and row spaces. This list is by no means exhaustive.
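The first is the random orthogonal model of Definition 2.1, whose column and row spaces, as noted above, have nearly minimal coherence. The second concerns matrices whose singular vectors have entries of roughly uniform magnitude, in the sense that

    max_{i,k} |uik|^2 ≤ μB/n   and   max_{i,k} |vik|^2 ≤ μB/n        (2.6)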
for some value of μB = O(1). Then, the maximum coherence is at most μB since μ(U) ≤ μB and μ(V) ≤ μB. Further, we show in the full version of this paper that A1 holds most of the time with μ1 = O(μB√(log n)). Thus, for matrices with singular vectors obeying (2.6), the recovery is exact provided that m obeys (2.5) for values of the rank not exceeding μB^{−1} n^{1/5}.
2.2. Numerical validation
To demonstrate the practical applicability of the nuclear norm heuristic for recovering low-rank matrices from their entries, we conducted a series of numerical experiments for a variety of matrix sizes n, ranks r, and numbers of entries m. For each (n, m, r) triple, we repeated the following procedure 50 times. We generated M, an n × n matrix of rank r, by sampling two n × r factors ML and MR with i.i.d. Gaussian entries and setting M = ML MR^T. We sampled a subset Ω of m entries uniformly at random. Then, the nuclear norm minimization problem was solved using the semidefinite programming solver SeDuMi.33 We declared M to be recovered if the solution returned by the solver, Xopt, satisfied ||Xopt − M||F/||M||F < 10^{−3}. Figure 2 shows the results of these experiments for n = 50. The x-axis corresponds to the fraction of the entries of the matrix that are revealed to the SDP solver. The y-axis corresponds to the ratio between the dimension of the set of rank r matrices, dr = r(2n − r), and the number of measurements m.
Note that the axes range from 0 to 1 as a value >1 on the x-axis corresponds to an overdetermined linear system where the semidefinite program always succeeds, and a value > 1 on the y-axis corresponds to when there are an infinite number of rank r matrices with the provided entries. The color of each cell in the figures reflects the empirical recovery rate of the 50 runs (scaled between 0 and 1). Interestingly, the experiments reveal very similar plots for different n, suggesting that our theoretical upper bounds on recovery may be rather conservative.
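A single trial of the experiment above can be reproduced in a few lines with an off-the-shelf convex modeling package. The sketch below uses CVXPY rather than the SeDuMi-based setup of the original experiments; the function name and default parameters are our own choices.

```python
import numpy as np
import cvxpy as cp

def one_trial(n=50, r=2, m=800, seed=0, tol=1e-3):
    """Draw a random rank-r matrix, reveal m entries uniformly at random,
    solve the nuclear norm problem (2.3), and test for recovery."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))
    mask = np.zeros((n, n))
    mask.flat[rng.choice(n * n, size=m, replace=False)] = 1.0  # observed set

    X = cp.Variable((n, n))
    constraints = [cp.multiply(mask, X) == mask * M]           # agree on observed entries
    cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve()

    rel_err = np.linalg.norm(X.value - M, "fro") / np.linalg.norm(M, "fro")
    return rel_err < tol
```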
For a second experiment, we generated random positive semidefinite matrices and tried to recover them from their entries using the nuclear norm heuristic. As above, we repeated the same procedure 50 times for each (n, m, r) triple. We generated M, an n × n positive semidefinite matrix of rank r, by sampling an n × r factor MF with i.i.d. Gaussian entries and setting M = MF MF^T. We sampled a subset Ω of m entries uniformly at random. Then, we solved the nuclear norm minimization problem with an additional constraint that the decision variable be positive semidefinite. Figure 2(b) shows the results of these experiments for n = 50. The x-axis again corresponds to the fraction of the entries of the matrix that are revealed to the solver, but, in this case, the number of measurements is divided by Dn = n(n + 1)/2, the number of unique entries in a symmetric matrix, and the dimension of the set of rank r matrices is dr = nr − r(r − 1)/2. The color of each cell is chosen in the same fashion as in the experiment with full matrices. Interestingly, the recovery region is much larger for positive semidefinite matrices, and future work is needed to investigate whether the theoretical scaling is also more favorable in this scenario of low-rank matrix completion.
These phase transition diagrams reveal a considerably smaller region of parameter space than the Gaussian models studied in Recht et al.30 In the experiments in Recht et al.,30 M was generated in the same fashion as above, but, in the place of sampling entries, we generated m random Gaussian projections of the data (see the discussion in Section 2.4). In these experiments, the recovery regime is far larger than that in the case of sampling entries, but this is not particularly surprising as each Gaussian observation measures a contribution from every entry in the matrix M.
2.3. More general bases
Our main result (Theorem 2.3) extends to a variety of other low-rank matrix completion problems beyond the sampling of entries. Indeed, suppose we have two orthonormal bases f1, ..., fn and g1, ..., gn of ℝ^n, and that we are interested in solving the rank minimization problem
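    minimize    rank(X)
    subject to  fi^T X gj = fi^T M gj,   (i, j) ∈ Ω.        (2.7)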
The machine learning community's interest in specialized algorithms for multiclass and multitask learning provides a motivating example (see, e.g., Amit et al.1 and Argyriou et al.2). In multiclass learning, the goal is to build multiple classifiers with the same training data to distinguish between more than two categories. For example, in face recognition, one might want to classify whether an image patch corresponds to an eye, nose, or mouth. In multitask learning, we have a large set of data and a variety of different classification tasks, but, for each task, only partial subsets of the data are relevant. For instance, in activity recognition, we may have acquired sets of observations of multiple subjects and want to determine if each observed person is walking or running. However, a different classifier is desired for each individual, and it is not clear how having access to the full collection of observations can improve classification performance. Multitask learning aims to take advantage of access to the full database to improve performance on individual tasks. A description of how to apply our results to the multiclass setting can be found in the full version of this paper.
To see that our theorem provides conditions under which (2.7) can be solved via nuclear norm minimization, note that there exist unitary transformations F and G such that ej = Ffj and ej = Ggj for each j = 1,..., n. Hence,
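    fi^T M gj = (F fi)^T (F M G^T) (G gj) = ei^T (F M G^T) ej = (F M G^T)ij.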
Then, if the conditions of Theorem 2.3 hold for the matrix FMG^T, it is immediate that nuclear norm minimization finds the unique optimal solution of (2.7) when we are provided a large enough random collection of the inner products fi^T M gj. In other words, all that is needed is that the column and row spaces of M be, respectively, incoherent with the bases (fi) and (gi).
2.4. Connections, alternatives, and prior art
Nuclear norm minimization is a recent heuristic introduced by Fazel14 and is an extension of the trace heuristic often used in control theory; see, for example, Beck and D'Andrea.3 Indeed, when the matrix variable is symmetric and positive semidefinite, the nuclear norm of X is the sum of the (nonnegative) eigenvalues and thus equal to the trace of X. Hence, for positive semidefinite unknowns, X, (2.3) becomes the semidefinite program
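    minimize    trace(X)
    subject to  Xij = Mij,   (i, j) ∈ Ω
                X ⪰ 0.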
Even for the general matrix M, which may not be positive definite or even symmetric, the nuclear norm heuristic can be formulated in terms of semidefinite programming. The program (2.3) is equivalent to
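    minimize    (trace(W1) + trace(W2))/2
    subject to  Xij = Mij,   (i, j) ∈ Ω
                [ W1    X  ]
                [ X^T   W2 ]  ⪰  0,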
with optimization variables X, W1, and W2 (see, e.g., Fazel14). There are many efficient algorithms and high-quality software packages available for solving these types of problems.
Our work is inspired by results in the emerging field of compressive sampling or compressed sensing, a new paradigm for acquiring information about objects of interest from what appears to be a highly incomplete set of measurements.8,13 In practice, this means that high-resolution images can be captured with fewer sensors, or that signal acquisition can be accelerated by orders of magnitude in biomedical applications, simply by taking far fewer specially coded samples. Mathematically speaking, we wish to reconstruct a signal x ∈ ℝ^n from a small number of measurements y = Φx, y ∈ ℝ^m, with m much smaller than n; that is, we have far fewer equations than unknowns. In general, one cannot hope to reconstruct x, but assume now that the object we wish to recover is known to be structured in the sense that it is sparse (or approximately sparse). This means that the unknown object depends upon a smaller number of unknown parameters. Then, it has been shown that l1 minimization (minimizing the sum of the absolute values of x subject to the linear constraints y = Φx) allows recovery of sparse signals from remarkably few measurements.8 If Φ is chosen randomly from a suitable distribution, then with very high probability, all sparse signals with about k nonzero entries can be recovered from on the order of k log n measurements. For instance, if x is k-sparse in the Fourier domain, that is, x is a superposition of k sinusoids, then it can be perfectly recovered with high probability, by l1 minimization, from the knowledge of about k log n of its entries sampled uniformly at random.
From this viewpoint, the results in this paper greatly extend the theory of compressed sensing by showing that other types of interesting objects or structures, beyond sparse signals and images, can be recovered from a limited set of measurements. Moreover, the techniques for proving our main results build upon ideas from the compressed sensing literature together with powerful probabilistic tools for bounding norms of operators between Banach spaces.
Also, our notion of coherence generalizes the concept of the same name in compressive sensing. Notably, the authors Candès and Romberg7 introduce the notion of the coherence of a unitary transformation U; the coherence of U is simply proportional to max_{j,k} |Ujk|^2. This quantity plays a crucial role in determining the minimal sampling rate necessary to recover a k-sparse signal by l1 minimization.
In Recht et al.,30 the authors studied the nuclear norm heuristic applied to a related problem where partial information about a matrix M is available from m equations of the form
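    bk = Σ_{ij} Aij^(k) Mij,   k = 1, ..., m.        (2.8)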
Here, for each k, {Aij^(k)}_{ij} is an i.i.d. sequence of Gaussian or Bernoulli random variables and the sequences {A^(k)} are also independent of each other (the sequences {A^(k)} and {bk} are available to the analyst). Building on the concept of restricted isometry in the context of sparse signal recovery, Recht et al.30 establish the first sufficient conditions for which the nuclear norm heuristic returns the minimum rank element in the constraint set. The authors prove that the heuristic succeeds with large probability whenever the number m of available measurements is greater than a constant times 2nr log n for n × n matrices of rank r. These results do not generalize to the matrix completion problem of interest to us in this paper. The measurements in (2.8) give some information about all the entries of M, whereas in our problem information about most of the entries is simply not available. As a consequence, our methods are quite different and require more involved probabilistic analysis.
Our work also has close connections with the study of stochastic algorithms for low-rank matrix approximation. In this body of work, one is interested in sampling some entries of a matrix in order to construct an approximate factorization of this matrix. Typically, it is assumed that one may sample any subset of entries but would like to minimize the computational complexity involved in constructing an approximation. Pioneering work in this area appears in Frieze et al.15 and Liberty et al.,25 and an extensive survey of these methods can be found in Halko et al.19 While this body of work also uses similar foundational theory of random matrices, our modeling assumptions are fundamentally different. Here, we are primarily concerned with the scenario where we have very limited control over which entries of the matrix we can observe. In the examples described in the introduction, one does not have access to all of the entries of the matrix due to systemic constraints. Surprisingly, our results demonstrate that low-rank matrices can be recovered exactly from almost all sufficiently large subsets of entries. However, when we have the ability to sample entries at will, the algorithmic recovery schemes become considerably more efficient. We see our results as complementary extremes of the sort of access one may have to the entries of a matrix.
Indeed, when the sampling can be chosen in specially designed patterns, the exact recovery problem becomes dramatically simpler. For example, suppose that M is generic and that we precisely observe every entry in the first r rows and columns of the matrix.
Write M in block form as
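    M = [ M11   M12 ]
        [ M21   M22 ],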
with M11 an r × r matrix. In the special case that M11 is invertible and M has rank r, it is easy to verify that M22 = M21 M11^{−1} M12. One can prove this identity by forming the SVD of M. That is, if M is generic, the upper r × r block is invertible, and we observe every entry in the first r rows and columns, we can recover M. This result immediately generalizes to the case where one observes r rows and r columns and the r × r matrix at the intersection of the observed rows and columns is invertible. Algorithms in the stochastic low-rank matrix approximation literature are essentially no more complicated than this simple algorithm. They use randomness to add numerical robustness and to guarantee that the sampled entries span the row/column space of the matrix to be acquired.
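The identity can be checked numerically in a few lines; the sketch below (our own, with arbitrary dimensions) draws a random, hence generic, rank-r matrix and assumes, as in the text, that its top-left r × r block is invertible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 8, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # generic rank-r matrix

M11, M12 = M[:r, :r], M[:r, r:]     # observed: first r rows and columns
M21, M22 = M[r:, :r], M[r:, r:]     # M22 is the unobserved block

M22_hat = M21 @ np.linalg.solve(M11, M12)   # M21 M11^{-1} M12
print(np.allclose(M22_hat, M22))            # True for generic M
```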
Our original article announced the possibility of various refinements and extensions, and invited researchers to develop the new field of matrix completion. We are pleased to see that the area of low-rank modeling and matrix completion has been quite active, and the field is growing at a very fast pace. In fact, there are so many new and exciting results recently developed that it is unfortunately impossible to review them all here. Below, we survey selected progress that has occurred since our original submission.
3.1. Improvements and other approaches
The results discussed in Section 2.1 show that under suitable conditions, one can reconstruct an n × n matrix of rank r from a small number, m, of its sampled entries provided that m is on the order of n^{1.2} r log n, at least for moderate values of the rank. One would like to know whether better results hold, in the sense that exact matrix recovery would be guaranteed with a reduced number of measurements. In particular, recall that an n × n matrix of rank r depends on (2n − r)r degrees of freedom; is it possible to recover most low-rank matrices from on the order of nr randomly selected entries? Can the sample size be merely proportional to the true complexity of the low-rank object we wish to recover?
In this direction, we would like to emphasize that there is nothing in the approach of our original paper that stands in the way of stronger results. Our proof architecture requires bounding an infinite matrix series in the operator norm. We develop a bound on the spectral norm of each of the first four terms of this series and a general argument to bound the remainder of the series in the full version of this paper. Presumably, one could bound higher order terms by the same techniques. Getting an appropriate bound on the fifth term would lower the exponent of n from 6/5 to 7/6. The appropriate bound on the sixth term would further lower the exponent to 8/7, and so on. To obtain an optimal result, one would need to bound O(log n) terms.
Following this main idea, the authors Candès and Tao9 reduced the upper bound on the number of required measurements to O(nr log^6(n)) using a combinatorial argument to bound precisely this particular series. Their results depend on some additional assumptions, including a "strong incoherence condition" that is more restrictive than the one defined in Section 2.1. However, they also show that no algorithm could succeed with high probability if fewer than Θ(nr log(n)) entries were observed.
An unexpected and clever method for approximating this infinite matrix series was invented in Gross et al.18 This new approach used powerful large deviation bounds from quantum information theory combined with an iterative construction that circumvented much of the combinatorics necessary for the proof in Candès and Tao.9 This approach is dramatically simpler than previous approaches, and, using this technique, it was shown that O(nr log^2(n)) entries were sufficient for exact matrix completion in Gross17 and Recht.29 In Recht,29 the leading constant was even upper bounded by 64.
From a very different perspective, the authors in Keshavan et al.21 provided a non-convex algorithm for low-rank matrix recovery. Here, the authors analyze a gradient descent scheme over the Grassmannian manifold of subspaces. Using some of the techniques developed in the full version of this paper, the authors show that this nonconvex problem is actually convex in a neighborhood of the true low-rank matrix provided the number of observed entries exceeds O(n log(n)) and the rank is less than log(n). This provides the asymptotically tightest bound on the number of entries required for recovery, but the authors need to assume that the singular values of the unknown low-rank matrix are all of order unity and that the rank is less than log(n) for their results to hold.
3.2. Toward a more general theory
More general measurement models. In our original work, we anticipated in Section 1.3 that our results would extend to the case where one observes a small number of arbitrary linear functionals of a hidden matrix M. Set N = n² and let A1, ..., AN be an orthonormal basis for the linear space of n × n matrices with the usual inner product 〈X, Y〉 = trace(X^T Y). Then, we predicted that our results should also apply to the rank minimization problem
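    minimize    rank(X)
    subject to  〈Ak, X〉 = 〈Ak, M〉,   k ∈ Ω,        (3.1)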
where Ω ⊂ {1, ..., N} is selected uniformly at random. In fact, (3.1) is (2.2) when the orthobasis is the canonical basis (ei ej^T)_{1≤i,j≤n}. We conjectured that those low-rank matrices that have small inner product with all the basis elements Ak may be recoverable by nuclear norm minimization.
This conjecture was proven to be true by Gross,17 where a general definition of coherence was provided, and it was shown that the same number of measurements sufficed for reconstruction under this modified definition. Additionally, in Gross et al.,18 it was shown that the Pauli basis, studied in quantum information theory, is incoherent with respect to any low-rank matrix. This fact follows because all of the matrices in the Pauli basis are mutually orthogonal and unitary. Hence, the Pauli basis is a deterministic collection of matrices such that a random subset of these matrices can be used to reconstruct any low-rank matrix. This result has been applied to propose new methods in quantum-state tomography, where one aims to determine the state of some quantum mechanical system with as few measurements or experiments as possible.
Matrix completion with noise. All of the results described above concern the problem of exact matrix completion, where we have perfect information about the entries of the matrix to be reconstructed. Of course, in almost all real-world problems, we can only gain access to noisy samples of the entries of the matrices we would like to recover. Fortunately, many authors have investigated the stability of matrix completion when the observations are noisy. The first such result6 uses a stability argument based on convexity to guarantee accurate recovery from noisy data. Several subsequent works studied this problem under different matrix models. The work in Keshavan et al.22 gives near-optimal bounds provided the unknown matrix obeys additional assumptions which say that the singular values are all about the same size. In Negahban and Wainwright,28 error bounds are derived provided the matrix is not spiky; that is to say, assuming that all the entries have about the same magnitude. We additionally invite interested readers to peruse Koltchinskii et al.,23 which very recently introduced powerful results under yet another slightly different matrix model.
Algorithmic innovations. While it was known that the nuclear norm problem could be efficiently solved by semidefinite programming, the results of Recht et al.30 and the full version of this paper have inspired the development of many special purpose algorithms to rapidly minimize the nuclear norm. For example, in Cai et al.4 and Ma et al.,27 the authors show that many of the fast first-order methods developed for compressed sensing problems can be adapted to solve large-scale matrix completion problems. These algorithms are projected gradient algorithms which operate by alternately correcting the predictions on the observed entries and soft-thresholding the singular values of the iterate. Using accelerated gradient schemes, the authors Ji and Ye20 and Toh and Yun34 have improved these algorithms to provide very fast implementations of nuclear norm minimization. These codes enable the solution of problems with hundreds of thousands of rows and columns in a few hours on a standard workstation. Moreover, the numerical experiments in these works confirm that nuclear norm minimization can successfully recover very large low-rank matrices from a number of sampled entries on the order of 3 to 5 times the number of latent degrees of freedom. Additional experiments demonstrate that low-rank matrices can be robustly recovered under significant additive noise using nuclear norm minimization.
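To illustrate the alternation between data fitting and singular value soft-thresholding, here is a minimal sketch in the spirit of the singular value thresholding method of Cai et al.4; the threshold, step size, and iteration count below are illustrative placeholders rather than tuned values from the cited papers.

```python
import numpy as np

def svt_complete(M_obs, mask, tau=None, delta=1.2, iters=300):
    """Alternate between soft-thresholding the singular values of the iterate
    and correcting its predictions on the observed entries (mask == True)."""
    M_obs = np.asarray(M_obs, dtype=float)
    n1, n2 = M_obs.shape
    if tau is None:
        tau = 5.0 * np.sqrt(n1 * n2)                 # heuristic threshold
    Y = np.zeros((n1, n2))
    X = np.zeros((n1, n2))
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        X = (U * np.maximum(s - tau, 0.0)) @ Vt      # shrink the singular values
        Y = Y + delta * mask * (M_obs - X)           # enforce the observed entries
    return X
```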
3.3. From sparsity to rank and beyond
In concert with Recht et al.,30 our work on matrix completion crystallized some of the foundational ideas of compressed sensing. We were able to extend the notion of sparsity to the much more general concept of matrix rank, and situate the main ideas of compressed sensing in a dramatically broader context.
One exciting new development since the appearance of our original paper shows that the notions of sparsity and rank are in some sense orthogonal. If a matrix can be written as a sum of a low-rank matrix and a sparse matrix, then these two matrices can be identified and deconvolved from their sum. Deterministic conditions required for such an algorithm to work were provided in Chandrasekaran et al.12 A randomized analysis in Candes et al.5 provided sharper recovery guarantees and furnished a new method for Robust Principal Components Analysis, demonstrating that principal components could be constructed even in the presence of a large number of outliers. Moreover, the results of Chandrasekaran et al.12 were extended to provide convex algorithms for identifying Gaussian graphical models in Chandrasekaran et al.10 Prior art in this area had resorted to nonconvex heuristics based on Expectation-Maximization with no provable guarantees. It is quite surprising that, under very modest assumptions, a convex algorithm can solve a hidden variable estimation problem in multivariate statistics.
This subsequent research has shown that there is much more work to be done in this area. Work in compressed sensing, matrix completion, and their generalizations has shown that convex optimization can be used to solve a myriad of hard identification problems at nearly optimal rates. But the picture is likely much broader than what we currently understand. There are likely notions of simplicity beyond rank and sparsity that can also be leveraged in high-dimensional data analysis to open new frontiers in low-rate sampling. New work in Chandrasekaran et al.11 develops a unified program for recovering simple signals and objects from incomplete information, illustrating a general approach for translating expert domain knowledge into convex optimization algorithms. This work not only generalizes prior art on compressed sensing and matrix completion but also provides several new models where low-rate sampling can recover specially structured models. Such new developments suggest that we have only begun to scratch the surface of the types of models and objects that may be recovered from highly incomplete information.
1. Amit, Y., Fink, M., Srebro, N., Ullman, S. Uncovering shared structures in multiclass classification. In Proceedings of the Twenty-fourth International Conference on Machine Learning (2007).
2. Argyriou, A., Evgeniou, T., Pontil, M. Multi-task feature learning. In Neural Information Processing Systems, 2007.
3. Beck, C., D'Andrea, R. Computational study and comparisons of LFT reducibility methods. In Proceedings of the American Control Conference (1998).
4. Cai, J.F., Candès, E.J., Shen, Z. A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20, 4 (2008), 1956–1982.
5. Candes, E.J., Li, X., Ma, Y., Wright, J. Robust principal component analysis? Submitted for publication. Preprint available at http://www-stat.stanford.edu/~candes/papers/RobustPCA.pdf, 2009.
6. Candès, E.J., Plan, Y. Matrix completion with noise. Proc. IEEE 98, 6 (2009), 925–936.
7. Candès, E.J., Romberg, J. Sparsity and incoherence in compressive sampling. Inverse Probl. 23, 3 (2007), 969–985.
8. Candès, E.J., Romberg, J., Tao, T. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inform. Theor. 52, 2 (2006), 489–509.
9. Candès, E.J., Tao, T. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inform. Theor. 56, 5 (2009), 2053–2080.
10. Chandrasekaran, V., Parrilo, P.A., Willsky, A.S. Latent variable graphical model selection via convex optimization. Submitted for publication. Preprint available at http://ssg.mit.edu/~venkatc/cpw_lgm_preprint10.pdf, 2010.
11. Chandrasekaran, V., Recht, B., Parrilo, P.A., Willsky, A. The convex geometry of linear inverse problems. Submitted for publication. Preprint available at arxiv.org/1012.0621, 2010.
12. Chandrasekaran, V., Sanghavi, S., Parrilo, P.A., Willsky, A.S. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim. 21, 2 (2011), 572–596.
13. Donoho, D.L. Compressed sensing. IEEE Trans. Inform. Theor. 52, 4 (2006), 1289–1306.
14. Fazel, M. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
15. Frieze, A., Kannan, R., Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM 51, 6 (2004).
16. Funk, S. Netflix update: Try this at home, December 2006. http://sifter.org/~simon/journal/20061211.html.
17. Gross, D. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inform. Theor. 57 (2011), 1548–1566.
18. Gross, D., Liu, Y.K., Flammia, S.T., Becker, S., Eisert, J. Quantum state tomography via compressed sensing. Phys. Rev. Lett. 105, 15 (2010), 150401.
19. Halko, N., Martinsson, P.G., Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Rev. 53, 2 (2011), 217–288.
20. Ji, S., Ye, J. An accelerated gradient method for trace norm minimization. In Proceedings of the ICML (2009).
21. Keshavan, R.H., Montanari, A., Oh, S. Matrix completion from a few entries. IEEE Trans. Inform. Theor. 56, 6 (2009), 2980–2998.
22. Keshavan, R.H., Montanari, A., Oh, S. Matrix completion from noisy entries. J. Mach. Learn. Res. 11 (2009), 2057–2078.
23. Koltchinskii, V., Tsybakov, A.B., Lounici, K. Nuclear norm penalization and optimal rates for noisy low rank matrix completion. Submitted for publication. Preprint available at arxiv.org/abs/1011.6256, 2010.
24. Koren, Y., Bell, R., Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
25. Liberty, E., Woolfe, F., Martinsson, P.G., Rokhlin, V., Tygert, M. Randomized algorithms for the low-rank approximation of matrices. Proc. Nat. Acad. Sci. 104, 51 (2007), 20167–20172.
26. Linial, N., London, E., Rabinovich, Y. The geometry of graphs and some of its algorithmic applications. Combinatorica 15 (1995), 215–245.
27. Ma, S., Goldfarb, D., Chen, L. Fixed point and Bregman iterative methods for matrix rank minimization. Math. Program. 128, 1 (2011), 321–353.
28. Negahban, S., Wainwright, M.J. Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Submitted for publication. Preprint available at arxiv.org/abs/1009.2118, 2010.
29. Recht, B. A simpler approach to matrix completion. J. Mach. Learn. Res. (2009). To appear, 2012. Preprint available at http://arxiv.org/abs/0910.0651.
30. Recht, B., Fazel, M., Parrilo, P. Guaranteed minimum rank solutions of matrix equations via nuclear norm minimization. SIAM Rev. 52, 3 (2010), 471–501.
31. Rennie, J.D.M., Srebro, N. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the International Conference of Machine Learning (2005).
32. So, A.M.C., Ye, Y. Theory of semidefinite programming for sensor network localization. Math. Program., Ser. B 109 (2007), 367–384.
33. Sturm, J.F. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim. Meth. Software 11–12 (1999), 625–653.
34. Toh, K.-C., Yun, S. An accelerated proximal gradient algorithm for nuclear norm regularized least squares problems. Pac. J. Optim. 6 (2010), 615–640.
The original version of this paper was published in Foundations of Computational Mathematics 9, 6 (2009), 717–772.
Figure 1. Unit ball of the nuclear norm for symmetric 2 × 2 matrices. The red line depicts a random one-dimensional affine space. Such a subspace will generically intersect a sufficiently large nuclear norm ball at a rank one matrix.
Figure 2. Recovery of full matrices from their entries. For each (n, m, r) triple, we repeated the following procedure 50 times. A matrix M of rank r and a subset of m entries were selected at random. Then, we solved the nuclear norm minimization for X subject to Xij = Mij on the selected entries. We declared M to be recovered if ||Xopt − M||F/||M||F < 10^{−3}. The results are shown for (a) general 50 × 50 matrices and (b) 50 × 50 positive semidefinite matrices. The color of each cell reflects the empirical recovery rate (scaled between 0 and 1). White denotes perfect recovery in all experiments, and black denotes failure for all experiments.