Don’t invert that matrix

There is hardly ever a good reason to invert a matrix.

What do you do if you need to solve Ax = b where A is an n × n matrix? Isn’t the solution A⁻¹b? Yes, theoretically. But that doesn’t mean you need to actually find A⁻¹. Solving the equation Ax = b is faster than finding A⁻¹. Books might write the problem as x = A⁻¹b, but that doesn’t mean they expect you to calculate it that way.

What if you have to solve Ax = b for a lot of different b’s? Surely then it’s worthwhile to find A⁻¹. No. The first time you solve Ax = b, you factor A and save that factorization. Then when you solve for the next b, the answer comes much faster. (Factorization takes O(n³) operations. But once the matrix is factored, solving Ax = b takes only O(n²) operations. Suppose n = 1,000. This says that once you’ve solved Ax = b for one b, the equation can be solved again for a new b 1,000 times faster than the first one. Buy one get one free.)
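
For concreteness, here’s a minimal sketch of the factor-once, solve-many pattern using SciPy’s LU routines (the random matrix and right-hand sides are just stand-ins for your own data):

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    n = 1000
    A = np.random.rand(n, n)         # stand-in for your matrix
    lu, piv = lu_factor(A)           # O(n^3), done once

    b1 = np.random.rand(n)
    b2 = np.random.rand(n)
    x1 = lu_solve((lu, piv), b1)     # O(n^2) per right-hand side
    x2 = lu_solve((lu, piv), b2)     # reuses the factorization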

What if, against advice, you’ve computed A⁻¹? Now you might as well use it, right? No, you’re still better off solving Ax = b than multiplying by A⁻¹, even if the computation of A⁻¹ came for free. Solving the system is more numerically accurate than performing the matrix multiplication.
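
If you want to see this for yourself, here’s a small NumPy experiment (the Vandermonde matrix is just a convenient ill-conditioned example; exact numbers will vary, but the solve typically has the smaller residual):

    import numpy as np

    A = np.vander(np.linspace(1, 2, 12))    # a moderately ill-conditioned matrix
    x_true = np.ones(12)
    b = A @ x_true

    x_solve = np.linalg.solve(A, b)         # factor and solve
    x_inv = np.linalg.inv(A) @ b            # invert, then multiply

    print(np.linalg.norm(A @ x_solve - b))  # residual of the solve
    print(np.linalg.norm(A @ x_inv - b))    # residual of invert-and-multiply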

It is common in applications to solve Ax = b even though there’s not enough memory to store A⁻¹. For example, suppose n = 1,000,000 for the matrix A but A has a special sparse structure — say it’s banded — so that all but a few million entries of A are zero. Then A can easily be stored in memory and Ax = b can be solved very quickly. But in general A⁻¹ would be dense. That is, nearly all of the 1,000,000,000,000 entries of the matrix would be non-zero. Storing A requires megabytes of memory, but storing A⁻¹ would require terabytes of memory.
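
To make the storage point concrete, here’s a sketch using SciPy’s sparse machinery, with a synthetic tridiagonal matrix standing in for a banded A:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    n = 1_000_000
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")  # ~3n nonzeros

    b = np.ones(n)
    x = spsolve(A, b)   # sparse factorization; the dense inverse is never formed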

Related post: Applied linear algebra

106 thoughts on “Don’t invert that matrix”

  1. Great post. But you missed the best part: Ax = b may have a (unique) solution even when A is not invertible!

    Anyhow, my experience has been that once you are down to inverting matrices, things have gone badly for you. Better rethink your models.

  2. If you want to calculate a projection matrix (e.g., for linear regression), you may need to do some inverting (http://en.wikipedia.org/wiki/Projection_matrix). For numerical stability, I often use the Moore-Penrose pseudoinverse (http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse) via SVD. If the original matrix is sparse, you may be able to truncate the SVD and store the component matrices (U, Sigma, V) in relatively little space. Most math packages (e.g., MATLAB) come with SVD, so it’s easy to do this.

  3. Is it possible to do econometrics w/o inverting matrices? (That is, w/o the software doing it when using estimators that are presented theoretically as involving inversion).

  4. @Stuart What would you do with the inverse once you had it? If you’re going to multiply a matrix by it, then just solve the system.

    To put it another way, pretend you have a software object AInverse. You can’t access the inverse directly, but AInverse has methods on it to let you do whatever you want: MultiplyByVector, GetDeterminant, etc. But when you look inside AInverse, there’s no inverse inside, just code that acts as if there were an inverse inside.

  5. There are some applications of matrix multiplication to cryptosystems (the classical example is the Hill cipher; a more modern application is AES/Rijndael). In those settings the inverse of a matrix is actually a useful object — namely, it’s the decryption key for the cipher. So it’s something that a user of the system would want to compute and have lying around.

    A mitigating factor here is that the linear algebra is usually done over a finite field, so the inverse computations could take significantly less time.

  6. John, in molecular orbital theory, one of the key steps is S⁻¹ × E × S where E is the eigenvector matrix and S is the overlapped orbital matrix to test for an orthogonal set. I haven’t dealt with the math in years, but I notice the technique still seems to be used in the public domain(?) program MINDO. I wouldn’t have a clue how to attack this problem without inverting the matrix S.

  7. @Robert: I assume that since encryption works over a finite field, the inverse is known exactly and accuracy isn’t an issue. But I wonder whether ideas from numerical linear algebra could make cryptosystems more efficient. Maybe deep down in code somewhere there is an LU factorization that you don’t see in the theory.

    @Jason et al: I’m not saying there’s never a reason to invert a matrix. There may be, but I don’t think I’ve ever needed to (except on a homework assignment).

    @Gene: In your example, you multiply S⁻¹ by the matrix ES. You could solve Sx = b for each column b of the matrix ES. I’m not familiar with your particular problem, but I wouldn’t be surprised if there’s some special structure you could exploit to do even better.

  8. The only thing I can think of is getting an asymptotic covariance matrix from a likelihood optimization (i.e. for doing subsequent inference). Usually, that involves inverting the Hessian. Or maybe there’s a better way for that too?

  9. I know that MATLAB has right division and left division operators. IIRC they solve several linear systems instead of inverting and multiplying.

  10. In the next to last paragraph (“Now you might as well use it, right?”…) there’s some advice that’s not always right IMHO. If you have to do a lot of calculations of A^{-1} b for many different b’s, the invert-and-multiply approach, although not numerically as accurate as back-substitution, will be much faster (all inner products, no divisions). You often don’t care about the ultimate in accuracy, or know A is well-conditioned anyway.
    This is the typical situation in computations of Gaussian probabilities, which require x’ A^{-1} x for a positive-definite A. (You don’t use the vanilla inverse in this case… you use a Cholesky decomposition followed by inverse, but the principle is the same.)

  11. @Michael: Good point about inner products. Both matrix multiplication and back-substitution require O(n^2) operations, but with a large matrix the multiplication could be faster because it uses memory more efficiently etc.

  12. If the matrix inverse is in any way going to be used for “solving” a system of linear equations, then it’s not accurate, stable, or efficient to solve the system using the inverse matrix; the system can be solved using a variety of direct (say LU, QR, SVD or Cholesky), semidirect (say, conjugate gradients), or iterative (say multigrid or overrelaxation) methods depending on the type of the problem (consistent linear system, least squares, etc.), special properties of the matrix involved (symmetric, triangular, Toeplitz, bi-diagonal, positive definite, etc.), size of the matrix, frequency with which the matrix is updated (time-invariant problem, time-dependent problem, such as system identification/control problems, etc.), stability vs. efficiency considerations, among other things. (In MATLAB, an excellent implementation for direct methods is Tim Davis’s factorize available on FEx, Don’t let that INV go past your eyes; to solve that system, FACTORIZE!.)

    There are, however, rare cases where the inverse matrix itself is needed. One such case is calculating the standard errors of the estimates, in an identification problem, based on the inverse of the Fisher information matrix. Another is computing the coefficients of an exponential time differencing (ETD) Runge-Kutta scheme via complex contour quadrature—to avoid catastrophic cancellation due to roundoff in the operator $(\cdot)^{-1}(e^{(\cdot)} - 1)$ or its higher order variants—when the spatial discretization operator is nondiagonal. You may read more about this latter case here: Complex Magic, Part 3: Function Evaluation via Integration.

  13. We use inv(Fisher information matrix) for Cramer-Rao statistical bounds on probabilistic estimators all the time. We usually look at the condition number of the matrix to make sure it’s invertible first (the SVD of a singular FIM can give us the basis of the null space, that is, the combinations of the columns of the FIM that yield zero and thus prevent inversion).

  14. Once you have the inverse of the Fisher information matrix (or any other matrix) what do you do with it?

    Multiply it by something? Then you could use back-substitution.

    Find its determinant? You could find the determinant of the original matrix and take its reciprocal.

    Find its eigenvalues or eigenvectors? A and A⁻¹ have the same eigenvectors, and the eigenvalues of A⁻¹ are the reciprocals of the eigenvalues of A. (A quick numerical check appears below.)

    If an application needs A⁻¹, it must do something unusual with it because the usual operations can be carried out other ways.
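
    A quick numerical check of the eigenvalue fact, on a small symmetric example in NumPy:

        import numpy as np

        A = np.array([[2.0, 1.0],
                      [1.0, 3.0]])

        evals_A = np.linalg.eigvalsh(A)
        evals_Ainv = np.linalg.eigvalsh(np.linalg.inv(A))

        print(np.sort(1.0 / evals_A))   # same numbers as...
        print(np.sort(evals_Ainv))      # ...the eigenvalues of A⁻¹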

  15. In econometrics sometimes one needs so-called sandwich estimators, which involve matrix terms like A^{-1} B A^{-1}, where A and B are square (n x n) dense matrices. Is there any way to completely avoid the matrix inversion here?

  16. @Matt: What do you do with the matrix A^{-1} B A^{-1} once you’ve got it? Maybe there’s a way to accomplish that task without computing A^{-1}.

  17. @John: eventually I access the elements on the diagonals of the resulting matrix (and perform simple ops, like taking square root) — those are used directly (e.g. as standard errors).

  18. Is there a way to quantify how much worse solving using the inverse is? I have been challenged in the past and could not provide evidence that it was really not a good idea. If I remember correctly, I don’t think I could even find this “common knowledge” in Golub and Van Loan, never mind a comparative analysis.

  19. @Matt: That’s interesting. I’d never seen that problem. Are your A and B matrices large? Sparse?

    Here’s one way you could find the ith column of A^{-1} B A^{-1} without computing A^{-1}. This approach may not be a good idea in the context of your problem. It’s just an idea.

    Define x_i = A^{-1} B A^{-1} e_i where e_i is the column vector with a “1” in the ith position and zeros everywhere else. Note that x_i is the ith column of A^{-1} B A^{-1}. Then A x_i = B (A^{-1} e_i). Compute A^{-1} e_i by solving Ax = e_i, then multiply the result on the left by B, then solve for x_i. That would let you find one column of A^{-1} B A^{-1} without having to compute the entire matrix. This would be a win if you don’t need the entire matrix, but perhaps not if you do need the entire matrix. It could also be a win if A has a special structure.
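
    Here’s a rough sketch of that idea in SciPy (random A and B as stand-ins): factor A once, then two solves and one multiply per column.

        import numpy as np
        from scipy.linalg import lu_factor, lu_solve

        rng = np.random.default_rng(0)
        n = 500
        A = rng.standard_normal((n, n))
        B = rng.standard_normal((n, n))

        lu, piv = lu_factor(A)             # factor A once, reuse for every column

        def column_of_AinvBAinv(i):
            """i-th column of A^{-1} B A^{-1}, via two solves instead of an inverse."""
            e_i = np.zeros(n)
            e_i[i] = 1.0
            y = lu_solve((lu, piv), e_i)   # y = A^{-1} e_i
            return lu_solve((lu, piv), B @ y)

        col0 = column_of_AinvBAinv(0)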

  20. @John: Thanks for the post. Unfortunately, I need all of the diagonal elements. All of the matrices involved are dense, however, in the end I’m not interested in the off-diagonal entries of the resulting matrix and would be happy to disregard them (and save some computation time!). Still, haven’t found any interesting/smart way to solve that.

    Currently I’m even explicitly calculating A^{-1} (even though doing explicit inverses is a no-no from a numerical analysis point of view…) and pre- & postmultiplying B. I suppose I could use LU factorization to calculate (A^{-1}B) and then postmultiply by (this time explicitly obtained) A^{-1}, but I’m skeptical regarding the gains in doing so. Still, I have this subconscious feeling that I’m doing this in a sub-optimal way, especially since in the end I only need the diagonal entries…

  21. It turned out to be somewhat more complicated, after reviewing the circa 1970 FORTRAN code. The term is S^{-1/2} C S^{1/2}. I’m not sure if this makes any difference to your thesis/assertion. I just powered through the code way back in 1970. I did run it by Chuck Katholi, my numerical analysis professor, and we made some changes to search for the largest off-diagonal elements to force convergence a little faster.

  22. Francisco Gimenez

    What about using a Kalman filter, where you need to invert potentially large matrices to do updates?

  23. @Francisco Gimenez
    The $A^T A$ involved in the Kalman filter is a block tridiagonal matrix. At every step, the lower right block of this matrix is changed. Although the inverse matrix appears in the Kalman filter formulas, implementations never use the inverse, as that would be unstable and inefficient. Instead, the factorization $A^T A = LDL^T$ is updated—hence the name “square root” filter. (To be exact, the $LDL^T$ approach is not used either. QR decomposition is used to orthogonalize the columns of A to achieve more stability.)

  24. What would you do with the inverse once you had it? If you’re going to multiply a matrix by it, then just solve the system.

    I wouldn’t do anything with the inverse; I’m certainly not inverting large matrices by hand. I’m just asking whether econometrics software (SAS, STATA, etc.) is actually calculating all of the matrix inversions that are seemingly required by econometric theory?

  25. @Jason Adams, I am wondering whether it makes sense to HAVE the inverse if the matrix involved is small (let’s say 3×3 or 4×4) and you need to apply it many times.
    Could it be? I don’t know. Maybe John C. knows the answer.

  26. From a completely pragmatic point of view, many software systems expect linear transformations in the form of a matrix. To be able to reuse such components without (complicated, time-consuming, possibly license-violating) modification, you’d expect that (when sufficiently accurate), an inverse would be just fine for lots of applications.

    Perhaps it’s not a perfect solution, but isn’t it said that the perfect is the enemy of the good?

    Not that this is really the kind of reason you were looking for, I suppose…

  27. @Eamon, you give a very good reason. If you need the inverse, not for your own personal use, but in order to hand it to other software expecting a matrix as input, then you need to invert the matrix.

  28. Very interesting. Do you know of any method to invert a matrix for use in grid computing? For example, a method that splits a larger matrix into smaller ones. Each smaller matrix would be processed by a node and, at the end, the results would be delivered to the main node, which would join the pieces and deliver the final result (something like that – sorry for my English…).

  29. John: Big fan of your articles, and I always learn something new. Thanks.

    Frequently one comes across cases where the matrix has some special structure. A common example is matrices belonging to the orthogonal group, where the inverse is simply the transpose. You don’t even have to compute/store the transpose as libraries such as BLAS take in a parameter that specifies whether you want to use the transpose. Similarly, in numpy you can use matrix.T to create a view.
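
    For example, applying the inverse of a rotation matrix in NumPy costs nothing extra, since the transpose is just a view:

        import numpy as np

        theta = 0.3
        Q = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])   # orthogonal: inverse = transpose

        x = np.array([1.0, 2.0])
        y = Q @ x
        print(Q.T @ y)   # Q.T is a view, no copy; this recovers x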

  30. I think Stuart and Fasih’s points are good ones: calculating standard errors in statistics often requires inverting a matrix. If we want to find the least squares fit for the overdetermined system Y = XB, then B = (X’X)^{-1} X’Y.
    To calculate the standard error for our estimate of B, we need to know the diagonal elements of (X’X)^{-1} (see the sketch below).

    This same basic problem comes up in maximum likelihood estimation. The standard errors of the coefficient estimates come from the inverse of the Hessian matrix for the optimization problem.

    http://en.wikipedia.org/wiki/Cram%C3%A9r-Rao_inequality
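
    For what it’s worth, here’s a sketch (with a synthetic X and Y, using NumPy/SciPy) of getting those diagonal entries from a QR factorization rather than from an explicitly inverted X’X:

        import numpy as np
        from scipy.linalg import qr, solve_triangular

        rng = np.random.default_rng(0)
        X = rng.standard_normal((100, 3))                     # design matrix
        Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

        Q, R = qr(X, mode="economic")                         # X = QR, R upper triangular

        B = solve_triangular(R, Q.T @ Y)                      # coefficients: solve R B = Q'Y

        # (X'X)^{-1} = R^{-1} R^{-T}, so its diagonal is the row sums of squares of R^{-1},
        # obtained from a triangular solve rather than a general inverse.
        Rinv = solve_triangular(R, np.eye(R.shape[0]))
        diag_XtX_inv = (Rinv ** 2).sum(axis=1)

        resid = Y - X @ B
        sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
        std_errors = np.sqrt(sigma2 * diag_XtX_inv)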

  31. Short of computer applications (which I don’t bother with) and the Markov chain, I have never found a practical use for matrices, whatsoever. Perhaps I’m not well versed enough with higher level math, or perhaps the entire subject is unnecessary. I mean, don’t we learn Linear Algebra in the 9th grade when we first solve simultaneous linear equations? Isn’t that the same thing? The same can be achieved, Ax = b → x = A⁻¹b, using regular algebraic expressions as opposed to matrices. I’ve never really seen the point – the mathematical point, I mean – except to simplify certain concepts for those who cannot otherwise wrap their minds around it in regular algebra due to symbol-overload or whatever else overwhelms them.

  32. Perhaps the subject already exists, I’m not sure, but how do you solve non-linear equations using matrices? Is it even possible? I mean, isn’t that why matrix algebra is also called linear algebra, because the equations they represent are linear? That is another reason, from my perspective, that matrices are ridiculous. They lack the power to resolve non-linear equations – which are far more complicated and practical and common in the real world, I think. Is there a non-linear equation method using linear/matrix algebra? Or is regular pen and paper algebra better in every way?

  33. Nonlinear systems are routinely solved by a sequence of linear problems. Think of Newton’s method for finding roots, in just one variable. The method essentially says replace a function by its linear approximation (i.e. the line with slope given by the derivative) and see where that linear approximation crosses the x-axis. You can do the analogous thing in several variables, using a sequence of linear approximations.
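
    Here’s a small sketch of the idea in NumPy: each Newton step solves a linear system with the Jacobian rather than inverting it. (The example system is made up.)

        import numpy as np

        def newton(F, J, x0, tol=1e-10, max_iter=50):
            """Solve F(x) = 0 by a sequence of linear solves; no inverse Jacobian."""
            x = np.asarray(x0, dtype=float)
            for _ in range(max_iter):
                step = np.linalg.solve(J(x), F(x))   # solve J(x) step = F(x)
                x = x - step
                if np.linalg.norm(step) < tol:
                    break
            return x

        # Example: intersect the circle x^2 + y^2 = 4 with the parabola y = x^2
        F = lambda v: np.array([v[0]**2 + v[1]**2 - 4.0, v[1] - v[0]**2])
        J = lambda v: np.array([[2*v[0], 2*v[1]], [-2*v[0], 1.0]])
        print(newton(F, J, [1.0, 1.0]))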

  34. I need to transform vectors and points between two coordinate systems. The most elegant solution seemed to be to invert the transformation matrix and multiply with that. I am aware that I could probably special-case it in real life, but performance-wise it doesn’t matter; also, I wanted to understand matrices better and write a general solution. So I set out to find matrix inversion code to my liking. I found none that could satisfy me completely for my purpose (I wanted constant time and space usage, the fewest possible moves in memory and no loops, and the determinant calculated as a byproduct, not separately – a kind of trivially most efficient version without special instructions), so I wrote one. I failed at it miserably…

    If I have a transformation matrix and there is a way to build the transformation matrix that undoes all the affine transforms without matrix inversion, I am all for it at this point. Could you point me in the right direction?

    My code by the way: http://pastebin.com/t5qtELDB
    Matrices are stored row-major; multiplication, transpose, and scalar multiplication are trivial and not included.

    I think I completely failed the cofactor matrix calculation, I just don’t know why… I used the fact that I can choose any row or column to expand when calculating determinants…

  35. If you’re talking about 3D coordinates, you’re probably not too concerned with accuracy or efficiency. In that case, inverting the 3×3 matrices is fine and may be conceptually easier. But that approach doesn’t scale well to larger matrices.

  36. Thanks!

    I thought the same, just wasn’t sure. I like the robustness and mathematical clarity of matrix inversion in this case.

    I have to tell you, if I hadn’t written a kind of API for others (and myself) to work in, I would rather not use matrix inversion if my life depended on it… But it needs to be simple, easily understood code (more important than performance).

    I can’t ask that you tell me what is wrong with my code, but could you (or someone else) tell me if I’m right in the assumption that 4×4 matrix inversion can be done in (roughly) matrix multiplication time, using only 6+6 (for the upper and lower halves) 2×2 determinants to do it?
    Or have I completely misunderstood something?

  37. Sorry! There is nothing wrong with that code. I spent 3 days of my life trying to find out what was wrong with it, and what I did not check was whether my matrix-by-vector function was really doing its job as it should… Well, I do understand matrices better than before…

    You may delete my comments if you wish.

  38. Consider the problem Ax = b with A expressed to 2 significant decimal digits of accuracy (entries like 2.65, 0.35, 2.85, 0.49, etc.), say of rank 6, and x representing fractions as a column vector (entries like 0.25, 0.15, 0.51, all less than 1 and adding up to 1). Then b will be accurate to only 2 decimal digits, even though Ax may have 4 decimal digits from the A*x computation.

    When inversion is applied to the above scenario, by any method, x = A⁻¹b, what accuracy can be expected in the model retrieved as fractions?

    Can x be accurate to more than one significant decimal digit in the above case where A is expressed in 2 significant digits?

    hari

  39. A number of folks have asked for the diagonal of inv(A’A).

    Using the singular value decomposition, A = U S V’, with U’U = V’V = I, S diagonal,


    A'A = (USV')'USV' = VSU'USV' = VSSV'
    inv(A'A) = inv(VSSV') = inv(V') inv(SS) inv(V) = V inv(SS) V'

    Since S is diagonal,

    inv(A'A) = V inv(S) inv(S)' V' = (V inv(S)) (V inv(S))'

    For any matrix X, diag(XX’) is the vector of row sums of the squares of the entries of X (equivalently, the column sums for X’). With S diagonal, the inverse is just element-wise 1/S, so you can compute the diagonal entries of inv(A’A) by dividing each column of V by the corresponding diagonal entry of S, squaring, and summing across each row.
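
    In NumPy the recipe above comes down to a couple of lines (small random A just for illustration):

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((50, 5))

        U, s, Vt = np.linalg.svd(A, full_matrices=False)

        # columns of V scaled by 1/s, squared, then summed across each row
        diag_inv = ((Vt.T / s) ** 2).sum(axis=1)

        # sanity check against the explicit inverse (small example only)
        print(np.allclose(diag_inv, np.diag(np.linalg.inv(A.T @ A))))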

    Paul

  40. In Interval Arithmetic it makes perfect sense to compute (floating point approximations to) the inverse. Needless to say, these approximations are computed using the LU decomposition or such. The best known scheme to compute an enclosure of the solution of some (nonlinear) system of (possibly degenerate) interval equations is based on the so-called Krawczyk Operator and relies on the explicit computation of an (approximation to an) inverse matrix.

  41. Hi,

    I take it the discussion here is about very large matrices? Or does it apply to small 4×4 matrices as well? For example, I have an induction motor simulation in MATLAB which involves a 4 by 4 matrix, and I was wondering if this factorisation could speed up the simulation? And if so, where can I learn about factorisation of matrices?

    Cheers,

  42. I did have large matrices in mind when I wrote this. However, the same considerations apply to small matrices, though to a lesser extent.

    To learn about matrix factorizations, Matrix Computations by Golub and Van Loan is a classic.

  43. The linear algebra I am doing is very simple: basically I want to create a spline fit to a large set of points, and I decided to try using some averaging between quintic polynomial fits to subsets of 6 points each. The fitting procedure will be repeated say 10^9 times during a program run. Reading this article motivated me not to just invert the relevant matrix. I wouldn’t want to just apply an inversion routine – performance is crucial – but I did consider inverting it by hand and applying that.

    Instead I am writing LU by hand, and trying my best to exploit the special property involved. Viz., the polynomial takes the form a0 + a1(x-xmid) + a2(x-xmid)^2 + … + a5(x-xmid)^5, where if the points are x0, …, x5 then xmid is the average of x2 and x3. Therefore, where d corresponds to distance from xmid, one row is 1, d2, d2^2, d2^3, d2^4, d2^5, and another row is 1, -d2, d2^2, -d2^3, d2^4, -d2^5. And the first column is all 1’s. Other than that I’m not sure there is much to exploit.
    I would prefer not to write a general LU routine (à la Press et al.) because of the pivoting business – I’m trying to keep this really straightforward as well as fast.

    I just wondered, since this specific problem is obviously a very common one, does anyone have a standard, numerically stable algorithm for doing it? (This reinventing of the wheel is time-consuming, and unfortunately I have to go to work, so I can only work on this voluntary project at the weekend.)

  44. Fitting splines leads to banded linear systems. These can be solved very quickly with special algorithms. Solving a banded system takes O(N) operations whereas Gaussian elimination on a dense matrix requires O(N^3) operations.
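
    For example, a tridiagonal system of the sort spline fitting produces can be solved with scipy.linalg.solve_banded, which stores the matrix by its diagonals (the system below is made up for illustration):

        import numpy as np
        from scipy.linalg import solve_banded

        n = 10_000
        ab = np.zeros((3, n))   # the matrix stored by diagonals ("ab" form)
        ab[0, 1:] = 1.0         # superdiagonal
        ab[1, :] = 4.0          # main diagonal
        ab[2, :-1] = 1.0        # subdiagonal

        b = np.ones(n)
        x = solve_banded((1, 1), ab, b)   # O(N) time and memory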

  45. You have assumed a priori that solving Ax = b is THE reason anyone might want to invert a matrix. What if, for example, you want to find the probability density function of a large set of correlated Gaussian random variables, and you have a very large covariance matrix? You need to take the covariance matrix and do what with it? Guess! That’s right! You need to invert it! Look up multivariate Gaussian on Wikipedia if you don’t believe me.

Comments are closed.