Python for data analysis

I recommend using Python for data analysis, and I recommend Wes McKinney’s book Python for Data Analysis.

I prefer Python to R for mathematical computing because mathematical computing doesn’t exist in a vacuum; there’s always other stuff to do. I find doing mathematical programming in a general-purpose language is easier than doing general-purpose programming in a mathematical language. Also, general-purpose languages like Python have larger user bases, are better designed, have better tool support, etc.
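
As a rough illustration of that “other stuff,” here is a minimal sketch using only the Python standard library. It finds the CSV files in a directory, reads one numeric column from each, and summarizes it. The function name `summarize_csv_dir`, the column name `value`, and the file layout are hypothetical, chosen only for this example.

```python
import csv
import statistics
from pathlib import Path


def summarize_csv_dir(directory, column="value"):
    """Return {file name: (mean, stdev)} for one numeric column of each CSV.

    Hypothetical helper: the directory layout and the column name are
    assumptions made for this sketch.
    """
    summary = {}
    # The "other stuff": locating and opening files is ordinary
    # general-purpose programming.
    for path in sorted(Path(directory).glob("*.csv")):
        with path.open(newline="") as f:
            values = [float(row[column]) for row in csv.DictReader(f)]
        # The mathematical part is a small fraction of the program.
        # A sample standard deviation needs at least two values.
        if len(values) >= 2:
            summary[path.name] = (
                statistics.mean(values),
                statistics.stdev(values),
            )
    return summary
```

Most of the function is file handling rather than mathematics, which is the point: the numerical step is one line among the surrounding general-purpose code.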

Python per se doesn’t have everything you need for mathematical computing. You need to combine several tools and libraries, typically at least SciPy, matplotlib, and IPython. Because there are different pieces involved, it’s hard to find one source to explain using them all together. Also, even with the three additional components mentioned before, there is a need for additional software for working with structured data.

Wes McKinney developed the pandas library to give Python “rich data structures and functions designed to make working with structured data fast, easy, and expressive.” And now he has addressed the need for unified exposition by writing a single book that describes how to use the Python mathematical computing stack. Importantly, the book covers two recent developments that make Python more competitive with other environments for data analysis: enhancements to IPython and Wes’ own pandas project.
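
For a flavor of what working with structured data in pandas looks like, here is a minimal sketch. The data, the column names, and the `mean_by_group` helper are made up for illustration and are not taken from the book; only the `DataFrame` constructor and the `groupby` call are pandas itself.

```python
import pandas as pd


def mean_by_group(df, key, value):
    """Average one column within each group of another (hypothetical helper)."""
    return df.groupby(key)[value].mean()


# Made-up measurements: a label column and a numeric column.
df = pd.DataFrame(
    {
        "group": ["a", "a", "b", "b", "b"],
        "score": [1.0, 3.0, 2.0, 4.0, 6.0],
    }
)

# The result is a pandas Series indexed by the group labels.
means = mean_by_group(df, "group", "score")
```

The labeled columns and the one-line group-and-aggregate step are the kind of thing the quoted description is pointing at.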

Python for Data Analysis is available for pre-order. I don’t know when the book will be available but Amazon lists the publication date as October 29. My review copy was a PDF, but at least one paper copy has been spotted in the wild:

Wes McKinney holding his book at O’Reilly’s Strata Conference. Photo posted on Twitter yesterday.

15 thoughts on “Python for data analysis”

  1. I can appreciate your comment that you “…prefer Python to R for mathematical computing because mathematical computing doesn’t exist in a vacuum; there’s always other stuff to do.”

    I program in SAS but sometimes find I need to do work beyond SAS’ intent. For example, I often want to interact with the OS (e.g. files and directories) and while I can do it in SAS, it’s often not as elegant as doing the same in Python.

    I’m not finished, but I have gone through the first bit of ‘Python for Data Analysis’. For me, doing mathematical programming in SAS is far more productive than Python. But perhaps I would have felt differently if I had the pandas library at the time I was learning SAS.

  2. In The Innovator’s Dilemma, Clayton Christensen discussed “disruptive innovation”. Loosely, it is how a large, established company finds the soft underbelly of its market being eaten away by a small, scrappy, less capable player outgrowing its previously ignorable niche.

    One has to wonder: does that foretell the future of large, established systems like SAS, Matlab, etc., getting trounced from below by powerful, domain-specific libraries embedded in general purpose scripting languages?

  3. “Python per se doesn’t have everything you need for mathematical computing. You need to combine several tools and libraries, typically at least SciPy, matplotlib, and IPython. Because there are different pieces involved, it’s hard to find one source to explain using them all together.”

    We’re working on defining and promoting the ‘Scipy Stack’, a collection of packages including numpy, scipy, matplotlib, ipython, pandas, sympy and nose. My hope is that this will encourage us to see these more as parts of a complete system, and develop things like better documentation.

  4. @Rick Bryan Tools for numerical computing in Python definitely pose a threat to established players in the data analysis arena. But the stack isn’t yet a good product experience. No one has yet done with Python what e.g. Revolution Analytics has done with R. ContinuumIO might be the one to fill that niche.

  5. Would this be a good book for a newb at machine learning or more accurately someone who is interested in machine learning? I know how to program but it seems difficult to find a resource about machine learning that doesn’t make my brain hurt within the first 30 minutes.

    Thanks for the post.

    Nick

  6. Nickthedude: I wouldn’t call Wes’ book a machine learning book per se, but it does cover things someone should know before they get into machine learning, so maybe it would be a good place to start. See also my review of Machine Learning in Action for an ML book using Python. There’s also ML for Hackers if you’re OK with examples in R.

  7. I use R on a daily basis and Python around once a month and, from my point of view, the main barriers to adoption are: (i) the lack of a complete, standard download that includes the basic elements to get you going with *data analysis*, (ii) easy package (library) installation and update from inside the system, (iii) ggplot2, (iv) good explanations for integrating the lot through IPython. It’s a bit silly that I needed to buy Wes’s book (which is excellent, by the way) to learn how to use it properly.

    The Python experience feels more like bits and pieces glued together rather than using a unified system (although some people may say the same about R).

  8. Hi

    Thanks for posting your thoughts.

    Given your opinion, can you please give some short code examples showing where Python is better than R? That way I can more fully understand what you mean. They are, after all, quite extensive computing environments with many different use cases.

  9. Leon: In my opinion, the benefit of Python doesn’t show up in short code examples. Python is better than R for things that don’t fit into short examples.

    If R can do what you want with little programming, and your work doesn’t have to connect with anything else, R’s great. But when you have to write complicated programming logic, or connect with non-statistical code, that’s when Python shines.

  10. It is interesting; most people I know still use FORTRAN (the other group uses C++, since that is what GEANT4 is in) for their mathematical work. Now, I don’t know many, but apparently 1) When you get into complex things you WILL notice that Python is an interpreted language, as when your simulation already runs overnight a 10% (or whatever) speed hit starts getting noticeable, and 2) Everyone already knows FORTRAN, and doesn’t want to learn something new.

    oh, and 3) All the legacy code is in FORTRAN.

  11. If performance matters, I’ll use a compiled language like C++ or C#.

    You can get some speedup from Python by using NumPy well, calling high-level Python wrappers around compiled library code. And you can replace part of your Python code with Cython code, a variation on Python that compiles to C. But I usually just write C++ rather than optimizing Python. But I could imagine a situation in which I’d put more effort into staying in Python.

    I last wrote FORTRAN sometime around 1995 and I don’t intend to ever write it again.

  12. I understand that FORTRAN90 and subsequent editions have done a lot to modernize it.

    What I don’t get is this: I was talking to someone who was using FORTRAN, and he had implemented his own complex numbers. Isn’t the whole point of using FORTRAN to get built-in support for complex numbers and other such useful things?

  13. I agree – it is easier, in general, to integrate Python code with the rest of a code base than R. But with good APIs one can get very far…
