Don’t invert that matrix

There is hardly ever a good reason to invert a matrix.

What do you do if you need to solve Ax = b where A is an n × n matrix? Isn't the solution A⁻¹b? Yes, theoretically. But that doesn't mean you need to actually find A⁻¹. Solving the equation Ax = b is faster than finding A⁻¹. Books might write the problem as x = A⁻¹b, but that doesn't mean they expect you to calculate it that way.
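
In code, this is the difference between calling a linear solver and forming the inverse. A minimal sketch in Python with NumPy (the matrix here is a made-up example, nonsingular with probability 1):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    A = rng.standard_normal((n, n))   # example matrix
    b = rng.standard_normal(n)

    x = np.linalg.solve(A, b)         # solve A x = b directly,
    # rather than x = np.linalg.inv(A) @ b, which is slower and less accurate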

What if you have to solve Ax = b for a lot of different b's? Surely then it's worthwhile to find A⁻¹. No. The first time you solve Ax = b, you factor A and save that factorization. Then when you solve for the next b, the answer comes much faster. (Factorization takes O(n³) operations. But once the matrix is factored, solving Ax = b takes only O(n²) operations. Suppose n = 1,000. This says that once you've solved Ax = b for one b, the equation can be solved again for a new b 1,000 times faster than the first one. Buy one get one free.)
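
A minimal sketch of the factor-once-and-reuse idea, assuming SciPy; the matrix and right-hand sides are made up for illustration:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    rng = np.random.default_rng(1)
    n = 1000
    A = rng.standard_normal((n, n))

    # Pay the O(n^3) cost once: compute and save the LU factorization of A.
    lu, piv = lu_factor(A)

    # Each additional right-hand side then costs only O(n^2).
    for _ in range(5):
        b = rng.standard_normal(n)
        x = lu_solve((lu, piv), b)    # solve A x = b using the saved factors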

What if, against advice, you've computed A⁻¹? Now you might as well use it, right? No, you're still better off solving Ax = b than multiplying by A⁻¹, even if the computation of A⁻¹ came for free. Solving the system is more numerically accurate than performing the matrix multiplication.
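
One quick way to see the accuracy difference for yourself; this is only a numerical experiment, not a proof, and the Hilbert matrix is just a convenient ill-conditioned example:

    import numpy as np
    from scipy.linalg import hilbert

    n = 10
    A = hilbert(n)                    # classically ill-conditioned matrix
    x_true = np.ones(n)
    b = A @ x_true

    x_solve = np.linalg.solve(A, b)   # solve the system directly
    x_inv = np.linalg.inv(A) @ b      # multiply by the explicit inverse

    print(np.linalg.norm(A @ x_solve - b))   # residual of the direct solve
    print(np.linalg.norm(A @ x_inv - b))     # residual via the inverse, typically larger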

It is common in applications to solve Ax = b even though there's not enough memory to store A⁻¹. For example, suppose n = 1,000,000 for the matrix A but A has a special sparse structure — say it's banded — so that all but a few million entries of A are zero. Then A can easily be stored in memory and Ax = b can be solved very quickly. But in general A⁻¹ would be dense. That is, nearly all of the 1,000,000,000,000 entries of the matrix would be non-zero. Storing A requires megabytes of memory, but storing A⁻¹ would require terabytes of memory.
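
A sketch of what the sparse case looks like in practice, assuming SciPy; the size and the tridiagonal (banded) structure are made up for illustration:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    n = 1_000_000
    # A tridiagonal matrix: only about 3n nonzero entries are ever stored.
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")

    b = np.ones(n)
    x = spsolve(A, b)   # solves A x = b without ever forming the dense inverse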

Related posts:

Floating point numbers are a leaky abstraction
Ten surprises from numerical linear algebra

84 thoughts on “Don’t invert that matrix”

  1. Fitting splines leads to banded linear systems. These can be solved very quickly with special algorithms. Solving a banded system takes O(N) operations whereas Gaussian elimination on a dense matrix requires O(N^3) operations.
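
A minimal sketch of a banded solve along the lines of the comment above, assuming SciPy; the tridiagonal system (one sub- and one super-diagonal) is made up purely to show the calling convention:

    import numpy as np
    from scipy.linalg import solve_banded

    n = 10_000
    # Banded storage for a tridiagonal matrix: row 0 holds the superdiagonal,
    # row 1 the main diagonal, row 2 the subdiagonal (corner entries unused).
    ab = np.zeros((3, n))
    ab[0, 1:] = -1.0
    ab[1, :] = 2.0
    ab[2, :-1] = -1.0

    b = np.ones(n)
    x = solve_banded((1, 1), ab, b)   # O(n) work for a fixed bandwidth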

  2. You have assumed a priori that solving Ax=b is THE reason anyone might want to invert a matrix. What if, for example, you want to find the probability density function of a large set of correlated Gaussian random variables, and you have a very large covariance matrix? You need to take the covariance matrix and do what with it? Guess! That’s right! You need to invert it! Look up multivariate Gaussian on Wikipedia if you don’t believe me.

  3. @zortharg : They went over this in comment 11. You invert a matrix and keep it, but it’s the matrix from the Cholesky decomposition, not the original covariance matrix. You should read all the comments before posting your own.

  4. John :

    Thanks. I missed your comment (!) but in the end I ended up doing exactly this, using the band-diagonal LU algorithm in Press et al (not sure to whom they credited it). This was for a genuine quintic spline curve, not a weighted average. In practice it worked about 10x slower than just taking derivatives using a local quadratic approximation, but the improvement was worthwhile in certain respects.
    Cheers,

    Warwick

  5. I haven’t read every comment, but I have a simple question: isn’t inverting A equivalent to solving A x_1 = b_1, …, A x_n = b_n for b_1 = (1,0,…,0), b_2 = (0,1,0,…,0), etc.?
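
That equivalence is exactly right, and it is why computing A^-1 amounts to n separate solves. A small check of it in Python (the random matrix is just for illustration):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 5
    A = rng.standard_normal((n, n))   # small example matrix

    # Solve A x_i = e_i for each standard basis vector e_i; the solutions,
    # stacked as columns, reproduce A^-1.
    I = np.eye(n)
    cols = [np.linalg.solve(A, I[:, i]) for i in range(n)]
    A_inv = np.column_stack(cols)

    print(np.allclose(A_inv, np.linalg.inv(A)))   # True, up to rounding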

  6. Question: What if I need to multiply something by the inverse, instead of just solving linear equations? For example, say I have a positive definite symmetric matrix Sigma. If I need to calculate a multivariate normal density with this matrix as the covariance, I need to do something like (X^T * inverse(Sigma) * X), where X is a column vector. Can I do this w/o explicitly taking the inverse?

  7. @Dave: Yes, you can do that with the Cholesky decomposition. I.e., decompose Sigma = L L^T with L lower triangular, solve L y = x for y (which is fast because L is triangular), and then compute y^T y. Or you could invert L (which is easy because it’s triangular). This is mentioned in comment 11.
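
A minimal Python sketch of the trick described in the comment above, assuming SciPy; the positive definite covariance matrix is made up for illustration:

    import numpy as np
    from scipy.linalg import cholesky, solve_triangular

    rng = np.random.default_rng(3)
    n = 4
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T + n * np.eye(n)   # example symmetric positive definite matrix
    x = rng.standard_normal(n)

    # Sigma = L L^T with L lower triangular, so x^T Sigma^{-1} x = y^T y
    # where L y = x: one triangular solve, no explicit inverse needed.
    L = cholesky(Sigma, lower=True)
    y = solve_triangular(L, x, lower=True)
    quad = y @ y

    print(np.isclose(quad, x @ np.linalg.solve(Sigma, x)))   # True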

  8. It’s clear from the comments above that solving a system of linear equations is NOT best done by finding the inverse of A. So, then what IS the best way to solve them? I assume some of the acronyms above (e.g. LU) hold the answer. Sorry, I know you guys are obviously light years ahead of me, but in linear algebra I only learned the “invert the matrix” method. Could you point me to some intro sources?
    Thanks!
    Scott

  9. The “L” and “U” in LU stand for lower-triangular and upper-triangular. It’s a way of factoring a matrix A into the product of two special matrices L and U.

    As for resources, I’d recommend any of the linear algebra books Ward Cheney has written. His latest book is very expensive, but you could look for a used copy or an earlier edition.
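
A concrete picture of the LU factorization mentioned in the comment above; this is only a small illustration, assuming SciPy, with a made-up matrix:

    import numpy as np
    from scipy.linalg import lu

    rng = np.random.default_rng(4)
    A = rng.standard_normal((4, 4))    # small example matrix

    # Factor A into a permutation P, lower-triangular L, and upper-triangular U.
    P, L, U = lu(A)
    print(np.allclose(A, P @ L @ U))   # True: A = P L U

    # Solving A x = b then reduces to two cheap triangular solves with L and U.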

  10. Thanks so much for your quick response, John – appreciate your willingness to help a newbie out! I’ve checked out Ward’s books, and I guess the one titled “Linear Algebra: Theory and Applications” would be a good place to start, and I assume it would address the techniques you are describing. Looks like the 2nd edition just came out, so maybe I can pick the first one up at a discount, since I don’t have $160 at hand right now. :-). Until then I’ll keep checking out your great blog, as I attempt to make the shift from someone with a pretty good traditional background in stat and research methods to someone like you guys who actually knows how this stuff functions at a deep level – to the point of knowing things like how best to calculate standard deviation (one of your blog entries) – something I always assumed was very cut and dried! Anyway, thanks!

  11. Thanks for the advice. I now see why my MATLAB script is more stable than my C equivalent. In MATLAB I use b/A, but in C I explicitly calculate the inverse.

  12. I came across this article because my equations for Kalman filtering call for a matrix inversion, and I’m not sure about the invertibility of the matrix sum in question. The following matrix is part of the update step:

    (HPH’ + W)^-1

    where H is my output map, P is the estimate of the error covariance, and W is the covariance matrix for the measurement noise. I gather that there’s no way to be sure of the invertibility of this sum (besides, though I guess that P and W are invertible, I know that H in my case does not have full row rank, since m > n). Here is the entire equation for the state update.

    x_{k+1|k+1} = x_{k+1|k} + PH'(HPH' + W)^-1(y - Hx_{k+1|k})

    Anyway, that’s the background for this comment; I thought my matrix was not guaranteed to be invertible. I’m now looking for tips on how to implement this equation. Comment #28 states that “implementations [of the Kalman filter] never use the inverse,” yet the first one I looked at does (http://code.google.com/p/efficient-java-matrix-library/wiki/KalmanFilterExamples). Other comments above mention Cholesky decomposition for multiplication tasks, so I’m guessing that might be the approach to take. Though now, after reading a Wikipedia entry, I’m guessing this can’t be the right approach since I don’t think the sum will necessarily be Hermitian and positive definite. What do I use instead of the inverse here?

  13. If the matrix is noninvertible then there is not exactly one solution.
    If it is invertible then you can use LU.
    I expect others more learned will add their comments.

  14. @Warwick There is not expected to be an exact solution to the system. We’ve been using Gauss-Newton (ordinary least squares) to compute the state. I am working on an alternative with Kalman filtering. So, I guess I still have the question of invertibility, whether or not I end up doing that or some other method (such as LU)?

  15. I still will need to know what to do about the inversion, but it occurs to me that the matrix

    S = (HPH’ + W)

    is not guaranteed to be invertible even if H, P, and W are. I don’t understand how this update equation can be used then (I’m not saying it can’t, I just don’t see how one is to use an update with an inversion that may not be possible at every step).

  16. I would have thought (could be wrong) that for the sum of invertibles (with physical meanings) to be noninvertible implies some kind of special relationship between the summands. Given your physical definitions I don’t see how you’d get that. Degenerate matrices don’t occur by chance.

    If your equations do not have a solution (overdetermined), then you can get the least-squares fit with a pseudoinverse, and I guess there are then similar decompositions for that. I don’t know whether a Kalman filter should lead to insoluble equations.

    When you say m > n, I am taking it that means your observation y is an m-vector but your underlying process x is an n-vector. Correct?

  17. @Matt I know that a covariance matrix is positive semidefinite, but I’m not sure about a) the characteristics of HPH’, and b) the characteristics of the sum of two positive semidefinite matrices.

    @Warwick You are correct about my meaning for m > n. As per your comment on invertibility, I don’t know that I have a sum of invertibles. HH’ is definitely not invertible; I am not sure about HPH’. The current method is to get a least-squares fit with a pseudoinverse.

    —–

    I hate to clog the comments section like this, so bear with me. The following excerpt bears directly on my application. It covers the implementation question I started off asking, but I still don’t see how the matrix sum is guaranteed to be symmetric and positive definite. Any thoughts?

    From _Kalman Filtering: Theory and Practice Using MATLAB_, second edition, by Grewal and Andrews:

    Cholesky decomposition provides an efficient and numerically stable method for solving equations of the form AX = Y when A is a symmetric, positive-definite matrix. The modified Cholesky decomposition is even better, because it avoids taking scalar square roots. It is the recommended method for forming the term

    (HPH’ + R)^-1 H

    in the conventional Kalman filter without explicitly inverting a matrix. That is, if one decomposes HPH’ + R as UDU’, then

    (UDU’)(HPH’ + R)^-1 H = H.

    It then suffices to solve

    UDU’X = H

    for X.

    I’d say that this answers the implementation question I posed. There’s still the matter of invertibility of

    HPH’ + W

    but this page is for alternatives to matrix inversion so I’ll leave off posting any more questions here. Thanks Matt and Warwick!
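
A minimal sketch of the recipe quoted above, but using an ordinary Cholesky factorization from SciPy rather than the UDU' form the book describes; the dimensions and matrices are made-up stand-ins, with W taken to be positive definite so that the sum is invertible:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(5)
    m, n = 6, 4                        # example sizes with m > n, as in the question
    H = rng.standard_normal((m, n))
    P = np.eye(n)                      # stand-in state covariance (positive definite)
    W = 0.1 * np.eye(m)                # stand-in measurement noise covariance (positive definite)

    S = H @ P @ H.T + W                # HPH' + W; positive definite because W is

    # Form (HPH' + W)^-1 H by solving S X = H with the Cholesky factors,
    # instead of explicitly inverting S.
    c, low = cho_factor(S)
    X = cho_solve((c, low), H)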

  18. If P is PSD, then so is HPH’. Just consider y’HPH’y = (H’y)’P(H’y). Since x’Px is nonnegative for any x, it is nonnegative in particular for x = H’y. So y’HPH’y is nonnegative for every y, which means HPH’ is PSD.

    The sum of two PSD matrices is always PSD (the positive semidefinite matrices form a cone in the vector space of all symmetric matrices; cones are closed under sums). So if P and W are PSD, then so is HPH’ + W. And if W is positive definite (i.e., PSD and nonsingular), then HPH’ + W is positive definite and therefore invertible.

  19. OK I could be wrong about this, but a back-of-envelope reckoning indicates that for a vector z,
    z^T H H^T z = sum over the columns h_j of H of (z · h_j)^2
    So according to that, HH^T simply cannot have a nontrivial kernel, unless H itself is something strange.
    However, I could always be wrong.

    If I am correct that HH^T has to be invertible, then it follows that HPH^T is invertible.

    I don’t see physically why your stepping equation should be insoluble when m > n. In fact I think it should not be insoluble even if m < n, as the Kalman filter is just giving you the best estimate.

  20. Hi John,

    Do you know of a Python implementation of the ‘save the factors’ idea that you mentioned in your article (MATLABician is also referring to a similar .. endeavour in Matlab, see comment 14)?

    Can you say more about the various possible direct methods for inverting matrices? The article mentioned in comment 71 by Yang seems to say that, apart from the efficiency issue, using A^-1 is not so inaccurate. Anyway, even if the rest of your article is still valid, that article calls the penultimate paragraph into question.
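
On the Python question in the comment above: SciPy supports the save-the-factors pattern directly. A minimal sketch for a sparse matrix (the tridiagonal example is made up; scipy.linalg.lu_factor and lu_solve play the same role for dense matrices):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    rng = np.random.default_rng(6)
    n = 100_000
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")  # example sparse matrix

    lu = splu(A)               # factor once and keep the factorization object

    for _ in range(3):
        b = rng.standard_normal(n)
        x = lu.solve(b)        # reuse the saved factors for each new right-hand side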

  21. Suppose you must solve a system Ax = b with a constant A (with respect to time) over a sequence of time steps, as in the unsteady finite element case. Then isn’t it more efficient to compute A^-1 once and for all than to solve Ax = b at each time step?

  22. h.m. It’s more efficient to factor the matrix A once, but not to invert it. If you do invert it, you still have to multiply by that inverse. Back-substitution with the factored matrix takes fewer operations than a matrix-vector multiply.

    Another problem with inverting a matrix is that the inverse of a sparse matrix is generally dense. In finite elements and many other applications, matrices are sparse and benefit from iterative methods that are faster than Gaussian elimination.

  23. John
    you said ‘In finite elements and many other applications, matrices are sparse and benefit from iterative methods.’ Do you mean an iterative method for inverting or solving the system is more efficient than LU factorization in these applications?

  24. h.m. Yes, absolutely. An LU factorization takes the same amount of time no matter the content of the matrix. But iterative methods, based on matrix-vector products, can run much faster when the matrix is sparse, i.e. mostly zeros. If the zero elements are in a structure that software can exploit, there’s no need to store them or to explicitly multiply by them.
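
A minimal sketch of an iterative solve on a sparse matrix, assuming SciPy; the matrix is a made-up symmetric, diagonally dominant example so that conjugate gradient converges quickly:

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import cg

    n = 1_000_000
    # Sparse, symmetric, diagonally dominant example: only about 3n entries stored.
    main = 4.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

    b = np.ones(n)
    # Conjugate gradient needs only matrix-vector products with A, so the cost per
    # iteration is proportional to the number of nonzeros; A^-1 is never formed.
    x, info = cg(A, b)   # info == 0 indicates convergence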

Comments are closed.