Don’t invert that matrix

There is hardly ever a good reason to invert a matrix.

What do you do if you need to solve Ax = b where A is an n × n matrix? Isn’t the solution A⁻¹b? Yes, theoretically. But that doesn’t mean you need to actually compute A⁻¹. Solving the equation Ax = b is faster than finding A⁻¹. Books may write the solution as x = A⁻¹b, but that doesn’t mean they expect you to calculate it that way.
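
A minimal sketch of the difference in Python (assuming NumPy is available; the matrix and right-hand side below are random, purely for illustration):

import numpy as np

rng = np.random.default_rng(0)
n = 2000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

x_solve = np.linalg.solve(A, b)          # solve Ax = b directly (LU factorization under the hood)
x_inv = np.linalg.inv(A) @ b             # form the inverse first: slower and less accurate

print(np.linalg.norm(A @ x_solve - b))   # residual of the direct solve
print(np.linalg.norm(A @ x_inv - b))     # typically a noticeably larger residual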

What if you have to solve Ax = b for a lot of different b’s? Surely then it’s worthwhile to find A⁻¹. No. The first time you solve Ax = b, you factor A and save that factorization. Then when you solve for the next b, the answer comes much faster. (Factorization takes O(n³) operations. But once the matrix is factored, solving Ax = b takes only O(n²) operations. Suppose n = 1,000. This means that once you’ve solved Ax = b for one b, the equation can be solved again for a new b 1,000 times faster than the first time. Buy one get one free.)
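
A sketch of the factor-once, solve-many idea, assuming SciPy is available (the matrix and right-hand sides are made up):

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n = 1000
A = rng.standard_normal((n, n))

lu, piv = lu_factor(A)              # O(n^3) factorization, paid once

for _ in range(100):                # many right-hand sides
    b = rng.standard_normal(n)
    x = lu_solve((lu, piv), b)      # O(n^2) per right-hand side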

What if, against this advice, you’ve computed A⁻¹? Now you might as well use it, right? No, you’re still better off solving Ax = b than multiplying by A⁻¹, even if the computation of A⁻¹ came for free. Solving the system is more numerically accurate than performing the matrix multiplication.

It is common in applications to solve Ax = b even though there’s not enough memory to store A⁻¹. For example, suppose A is a 1,000,000 × 1,000,000 matrix with a special sparse structure — say it’s banded — so that all but a few million entries of A are zero. Then A can easily be stored in memory and Ax = b can be solved very quickly. But in general A⁻¹ would be dense. That is, nearly all of its 1,000,000,000,000 entries would be non-zero. Storing A requires megabytes of memory, but storing A⁻¹ would require terabytes of memory.
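
A small sketch of the banded case in Python, assuming SciPy is available (a pentadiagonal matrix at a more modest size stands in for A):

import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import spsolve

n = 100_000
# Banded matrix: only five nonzero diagonals are ever stored.
A = diags([1.0, -2.0, 8.0, -2.0, 1.0], offsets=[-2, -1, 0, 1, 2],
          shape=(n, n), format="csc")
b = np.ones(n)

x = spsolve(A, b)   # fast and memory-friendly: the band structure is exploited
# Forming inv(A) would produce an n-by-n dense matrix, far too large to store.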

Related post: Applied linear algebra

106 thoughts on “Don’t invert that matrix”

  1. @zortharg: They went over this in comment 11. You invert a matrix and keep it, but it’s the matrix from the Cholesky decomposition, not the original covariance matrix. You should read all the comments before posting your own.

  2. John:

    Thanks. I missed your comment (!) but in the end I did exactly this, using the band-diagonal LU algorithm in Press et al. (not sure to whom they credited it). This was for a genuine quintic spline curve, not a weighted average. In practice it ran about 10x slower than just taking derivatives using a local quadratic approximation, but the improvement was worthwhile in certain respects.
    Cheers,

    Warwick

  3. I haven’t read every comment, but I have a simple question: isn’t inverting A equivalent to solving Ax_1 = b_1, …, Ax_n = b_n for b_1 = (1,0,…,0), b_2 = (0,1,0,…,0), etc.?

  4. Question: What if I need to multiply something by the inverse, instead of just solving linear equations? For example, say I have a positive definite symmetric matrix Sigma. If I need to calculate a multivariate normal density with this matrix as the covariance, I need to do something like (X^T * inverse(Sigma) * X), where X is a column vector. Can I do this w/o explicitly taking the inverse?

  5. @Dave: Yes, you can do that with a Cholesky decomposition. I.e., decompose as $latex \Sigma = LL^T$, solve $latex Ly=x$ for $latex y$ (which is fast because $latex L$ is triangular), and then compute $latex y^Ty$. Or you could invert $latex L$ (which is easy because it’s triangular). This is mentioned in comment 11.
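
    A minimal sketch of this in Python, assuming NumPy and SciPy are available (the covariance matrix below is made up for illustration):

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(2)
    n = 5
    M = rng.standard_normal((n, n))
    Sigma = M @ M.T + n * np.eye(n)          # a positive definite covariance matrix
    x = rng.standard_normal(n)

    L = np.linalg.cholesky(Sigma)            # Sigma = L L', with L lower-triangular
    y = solve_triangular(L, x, lower=True)   # solve L y = x (cheap: L is triangular)
    quad_form = y @ y                        # equals x' Sigma^{-1} x

    print(np.isclose(quad_form, x @ np.linalg.solve(Sigma, x)))   # True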

  6. It’s clear from the comments above that solving a system of linear equations is NOT best done by finding the inverse of A. So, then what IS the best way to solve them? I assume some of the acronyms above (e.g., LU) hold the answer. Sorry, I know you guys are obviously light years ahead of me, but in linear algebra I only learned the “invert the matrix” method. Could you point me to some intro sources?
    Thanks!
    Scott

  7. The “L” and “U” in LU stand for lower-triangular and upper-triangular. It’s a way of factoring a matrix A into the product of two special matrices L and U.

    As for resources, I’d recommend any of the linear algebra books Ward Cheney has written. His latest book is very expensive, but you could look for a used copy or an earlier edition.
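
    As a quick illustration, here is a minimal Python sketch, assuming SciPy is available (the 3×3 matrix is arbitrary):

    import numpy as np
    from scipy.linalg import lu

    A = np.array([[2., 1., 1.],
                  [4., 3., 3.],
                  [8., 7., 9.]])
    P, L, U = lu(A)                    # A = P L U, with P a permutation matrix from pivoting

    print(np.allclose(P @ L @ U, A))   # True: the factors reproduce A
    print(L)                           # lower-triangular factor
    print(U)                           # upper-triangular factor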

  8. Thanks so much for your quick response, John – I appreciate your willingness to help a newbie out! I’ve checked out Ward’s books, and I guess the one titled “Linear Algebra: Theory and Applications” would be a good place to start; I assume it addresses the techniques you are describing. Looks like the 2nd edition just came out, so maybe I can pick the first one up at a discount, since I don’t have $160 at hand right now. :-) Until then I’ll keep checking out your great blog, as I attempt to make the shift from someone with a pretty good traditional background in stats and research methods to someone like you guys who actually knows how this stuff works at a deep level – to the point of knowing things like how best to calculate standard deviation (one of your blog entries) – something I always assumed was very cut and dried! Anyway, thanks!

  9. Thanks for the advice. I now see why my MATLAB script is more stable than my C equivalent. In MATLAB I use b/A, but in C I explicitly calculate the inverse.

  10. I came across this article because my equations for Kalman filtering call for a matrix inversion, and I’m not sure about the invertibility of the matrix sum in question. The following matrix is part of the update step:

    (HPH’ + W)^-1

    where H is my output map, P is the estimate of the error covariance, and W is the covariance matrix for the measurement noise. I gather that there’s no way to be sure of the invertibility of this sum (besides, though I guess that P and W are invertible, I know that H in my case does not have full row rank, since m > n). Here is the entire equation for the state update.

    x_{k+1|k+1} = x_{k+1|k} + PH’(HPH’ + W)^-1(y – Hx_{k+1|k})

    Anyway, that’s the background for this comment; I thought my matrix was not guaranteed to be invertible. I’m now looking for tips on how to implement this equation. Comment #28 states that “implementations [of the Kalman filter] never use the inverse,” yet the first one I looked at does (http://code.google.com/p/efficient-java-matrix-library/wiki/KalmanFilterExamples). Other comments above mention Cholesky decomposition for multiplication tasks, so I’m guessing that might be the approach to take. Though now, after reading a Wikipedia entry, I’m guessing this can’t be the right approach, since I don’t think the sum will necessarily be Hermitian positive definite. What do I use instead of the inverse here?

  11. If the matrix is noninvertible then there is not exactly one solution.
    If it is invertible then you can use LU.
    I expect others more learned will add their comments.

  12. @Warwick There is not expected to be an exact solution to the system. We’ve been using Gauss-Newton (ordinary least squares) to compute the state. I am working on an alternative with Kalman filtering. So, I guess I still have the question of invertibility, whether or not I end up doing that or some other method (such as LU)?

  13. I still will need to know what to do about the inversion, but it occurs to me that the matrix

    S = (HPH’ + W)

    is not guaranteed to be invertible even if H, P, and W are. I don’t understand how this update equation can be used then (I’m not saying it can’t, I just don’t see how one is to use an update with an inversion that may not be possible at every step).

  14. I would have thought (could be wrong) that for the sum of invertibles (with physical meanings) to be noninvertible implies some kind of special relationship between the summands. Given your physical definitions I don’t see how you’d get that. Degenerate matrices don’t occur by chance.

    If your equations do not have a solution (overdetermined), then you can get the least-squares fit with a pseudoinverse, and I guess there are then similar decompositions for that. I don’t know whether a Kalman filter should lead to insoluble equations.

    When you say m > n, I am taking it that means your observation y is an m-vector but your underlying process x is an n-vector. Correct?

  15. @Matt I know that a covariance matrix is positive semidefinite, but I’m not sure about a) the characteristics of HPH’, and b) the characteristics of the sum of two positive semidefinite matrices.

    @Warwick You are correct about my meaning for m > n. As for your comment on invertibility, I don’t know that I have a sum of invertibles. HH’ is definitely not invertible; I am not sure about HPH’. The current method is to get a least-squares fit with a pseudoinverse.

    —–

    I hate to clog the comments section like this, so bear with me. The following excerpt bears directly on my application. It covers the implementation question I started off asking, but I still don’t see how the matrix sum is guaranteed to be symmetric positive-definite. Any thoughts?

    From _Kalman Filtering: Theory and Practice Using MATLAB_, second edition, by Grewal and Andrews:

    Cholesky decomposition provides an efficient and numerically stable method for solving equations of the form AX = Y when A is a symmetric, positive-definite matrix. The modified Cholesky decomposition is even better, because it avoids taking scalar square roots. It is the recommended method for forming the term

    (HPH’ + R)^-1 H

    in the conventional Kalman filter without explicitly inverting a matrix. That is, if one decomposes HPH’ + R as UDU’, then

    (UDU’)(HPH’ + R)^-1 H = H.

    It then suffices to solve

    UDU’X = H

    for X.

    I’d say that this answers the implementation question I posed. There’s still the matter of invertibility of

    HPH’ + W

    but this page is for alternatives to matrix inversion so I’ll leave off posting any more questions here. Thanks Matt and Warwick!
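
    For what it’s worth, a minimal sketch of that kind of solve in Python, assuming SciPy and using a plain Cholesky factorization rather than the book’s modified (UDU’) form; H, P, and R below are purely illustrative:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve

    rng = np.random.default_rng(3)
    n, m = 4, 6                         # state dimension n, measurement dimension m
    H = rng.standard_normal((m, n))
    P = np.eye(n)                       # prior state covariance (illustrative)
    R = 0.1 * np.eye(m)                 # measurement noise covariance (illustrative)

    S = H @ P @ H.T + R                 # innovation covariance, symmetric positive definite here
    c = cho_factor(S)
    K = cho_solve(c, H @ P).T           # K = P H' S^{-1}, with no explicit inverse of S

    print(np.allclose(K, P @ H.T @ np.linalg.inv(S)))   # True, up to round-off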

  16. If P is PSD, then so is HPH’. Just consider y’HPH’y. x’Px is nonnegative for any x, so in particular it is nonnegative for x = H’y. Since y’HPH’y is nonnegative for every y, HPH’ is PSD.

    The sum of two PSD matrices is always PSD (the positive semidefinite matrices form a cone in the vector space of all symmetric matrices, and cones are closed under addition). So if P and W are PSD, then so is HPH’ + W. If W is positive definite, or if P is positive definite and H has full row rank, then HPH’ + W is positive definite and therefore invertible.

  17. OK I could be wrong about this, but a back-of-envelope reckoning indicates that for a vector z,
    z^T H H^T z = Σ_j (z · h_j)^2, where the h_j are the columns of H.
    So according to that, HH^T simply cannot have a nontrivial kernel, unless H itself is something strange.
    However, I could always be wrong.

    If I am correct that HH^T has to be invertible, then it follows that HPH^T is invertible.

    I don’t see physically why your stepping equation should be insoluble when m > n. In fact I think it should not be insoluble even if m < n, as the Kalman filter is just giving you the best estimate.

  18. Hi John,

    Do you know of a Python implementation of the ‘save the factors’ idea that you mentioned in your article? (MATLABician also refers to a similar endeavour in MATLAB; see comment 14.)

    Can you say more about the various possible direct methods of inverting matrices? The article mentioned in comment 71 by Yang seems to suggest that, efficiency aside, using $A^{-1}$ is not so inaccurate. Anyway, even if the rest of your article is still valid, that article calls the penultimate paragraph into question.

  19. Suppose you must solve a system Ax = b with a constant A (with respect to time) over a sequence of time steps, as in the unsteady finite element case. Then isn’t it more efficient to compute A^-1 once and for all than to solve Ax = b at each time step?

  20. h.m. It’s more efficient to factor the matrix A once, but not to invert it. If you do invert it, you still have to multiply by that inverse. Back-substitution with the factored matrix takes fewer operations than a matrix-vector multiply.

    Another problem with inverting a matrix is that the inverse of a sparse matrix is generally dense. In finite elements and many other applications, matrices are sparse and benefit from iterative methods that are faster than Gaussian elimination.

  21. John
    you said, ‘In finite elements and many other applications, matrices are sparse and benefit from iterative methods.’ Do you mean that an iterative method for inverting or for solving the system is more efficient than LU factorization in these applications?

  22. h.m. Yes, absolutely. An LU factorization takes the same amount of time no matter the content of the matrix. But iterative methods, based on matrix-vector products, can run much faster when the matrix is sparse, i.e., largely zeros. If the zero elements are in a structure that software can exploit, there’s no need to store them or to explicitly multiply by them.
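
    A small sketch of an iterative solve on a sparse matrix in Python, assuming SciPy is available (the tridiagonal matrix below is just an example):

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import cg

    n = 50_000
    A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    x, info = cg(A, b)                       # conjugate gradients: only matrix-vector products with A
    print(info, np.linalg.norm(A @ x - b))   # info == 0 means it converged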

  23. What if I want to look at the data and model resolution matrices? I would need to compute the inverse of the matrix G: N = G*inv(G); R = inv(G)*G.
    Unfortunately my G is non-square and large… any way around this, or to speed up the computation of inv(G)? Is pinv a good way?

  24. Contrary to the discussion here, I will say that inverting a matrix is not always a bad idea when it has to be implemented on a VLIW-architecture embedded processor.
    I have to solve the equation Ax = b, where A is a 3×3 matrix and b is a 3×1 vector, and I have to do this many, many times, for different A and b.

    The problem with LU decomposition and back substitution is that you cannot execute multiple operations in parallel. Before starting a row operation, the previous operations have to finish. This creates strong dependencies between operations, so the multiple ALU units in the processor cannot run in parallel.

    Whereas if I compute the inverse of A using a cofactor-based method, all the cofactors can be scheduled on the multiple ALU units, so the overall operation finishes faster.

  25. Hello dear all, here is my problem. Could you give me some suggestions?
    I want to solve a linear system Ax = B, and the original A is a 3200×3200 matrix, which is not sparse. I understand that LU or Cholesky can be used to solve this, but my problem is: each time, the size of B is different (for example, if I delete one value from B, then the corresponding row and column of A are also deleted). I have nearly 1000 such B’s, so with LU/Cholesky I would have to factor nearly 1000 times. I want to ask whether a method exists to speed up the computation when the sizes of A and B change?

  26. Sir John, today I was calculating the determinant of a 6×6 matrix in MATLAB installed on my laptop and I got zero, but the result was not zero when I calculated the same determinant in MATLAB installed on my friend’s laptop, although theoretically it is zero.
    Would you explain why this is so?

    Thanks for your earlier blog posts. I learned many practical facts about linear algebra.

  27. John,

    Since you point out that computing A^-1 essentially means one is going down the wrong path, perhaps you could shed some light on the “correct” path to take when discretizing a system matrix A by the zero-order hold method, for the purpose of obtaining the discretized system coefficients used in a state observer.

    ZOH discretization involves computing the matrix exponential, using for example the iterative method described in “Matrix Computations” by Golub and van Loan. At the end, matrix division is performed using… at least in my implementation… A^-1.

  28. Hello dear all, here is my problem. Could you give me some suggestions?
    I want to solve a linear system Ax = B, where the original A is a 300×300 matrix that is square, sparse, and banded, with both the lower and upper bandwidths equal to 4. What is the best method for solving this in MATLAB?
    Thank you

  29. Hello dear all, here is my problem. Could you give me some suggestions?
    I want to solve a linear system Ax = B, where the original A is a 3000×3000 matrix that is square, sparse, and banded, with both the lower and upper bandwidths equal to 4. What is the best method for solving this in MATLAB?
    Thank you

  30. I have to solve a system of linear equations Ax = b where A is a symmetric positive definite matrix. Currently I am using the conjugate gradient method to solve the problem. Is there any other method which can speed up the process?

  31. I have systems of equations that fit nicely in an Ax=b matrix format and usually solve them by inverting A. They’re relatively small (also sparse but not banded and not necessarily positive definite), so this works well. However, I have to make incremental changes in the values of A (say, change two of the three non-zero values in one row) and find the corresponding x and have to do this many, many times. Can factoring help me preserve the work from previous iterations and reduce my ridiculous run times?

  32. If you change A then you have a new problem and so you can’t reuse the old factorization.

    On the other hand, if your change to A is small, say A’ = A + E where E has small norm, then a solution x to Ax = b may be approximately a solution to A’x = b. If you’re solving the latter by an iterative method, then you could give it a head start by using the old x as your starting point.
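
    A rough sketch of the warm-start idea in Python, assuming SciPy and a symmetric positive definite A so that conjugate gradients applies (the matrices are illustrative):

    import numpy as np
    from scipy.sparse import diags, identity
    from scipy.sparse.linalg import cg

    n = 10_000
    A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    b = np.ones(n)

    x_old, _ = cg(A, b)                      # solve the original system

    E = 1e-3 * identity(n, format="csr")     # a small symmetric perturbation
    x_new, info = cg(A + E, b, x0=x_old)     # warm start from the old solution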

  33. Great Post!

    Here is a follow-up question: what if I have a matrix A and vector b and I need the quantity b’A^{-1}b. Of course I can get A^{-1} explicitly and compute the product. This is very unstable, even in small examples. What would be the “solving the system” formulation of this problem?

  34. Hi John, thanks for the great post.

    I have a question: Assuming that M and N have the same size, which would be faster?

    (I). Solving (Mx = b) 100 times with the same b but different Ms,
    (II). Solving (Nx = c) and (Nx = d) 100 times with the same N but different c,d.

    As you pointed out, N^-1 could be calculated and used repeatedly in case (II). This way, my guess is that case (II) may be faster.

    Thanks in advance.

  35. How do you solve Ax = b if A is under-constrained? (I.e., if I need the minimum-norm solution, or the equivalent via the pseudo-inverse. I assume you don’t compute the pseudo-inverse, going by your post.)

  36. In [A]{X} = [b], for m different right-hand sides, and where T is the clock cycle time on a typical AMD chip,
    the time would be (8/3)n^3 T + 8mn^2 T by the decomposition (factoring) method,
    and the time would be (32/3)n^3 T + 2mn^2 T by finding the inverse first via the decomposition method and then using matrix multiplication.
    For m > 4n/3, the latter is computationally more efficient (round-off issues aside).

  37. >Storing A requires megabytes of memory, but storing A⁻¹ would require terabytes of memory

    I do not understand: A and A^-1 have the same size

  38. Theoretically A and A inverse have the same size. But they don’t require the same amount of storage. The inverse of a sparse matrix is generally a dense matrix.

    Suppose A is a million by million matrix, but only contains non-zero entries on the main diagonal and two diagonals above and below. Then storing A requires storing 5,000,000 numbers. But storing A inverse will usually require storing 1,000,000,000,000 numbers.
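
    A small-scale illustration in Python, assuming SciPy (a 1,000 × 1,000 tridiagonal matrix stands in for the million × million case):

    import numpy as np
    from scipy.sparse import diags

    n = 1000
    A = diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
    print(A.nnz)                      # 2998 stored numbers, roughly 3n

    A_inv = np.linalg.inv(A.toarray())
    print(np.count_nonzero(A_inv))    # 1000000: every entry of the inverse is nonzero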
