Python for data analysis

I recommend using Python for data analysis, and I recommend Wes McKinney’s book Python for Data Analysis.

I prefer Python to R for mathematical computing because mathematical computing doesn’t exist in a vacuum; there’s always other stuff to do. I find doing mathematical programming in a general-purpose language is easier than doing general-purpose programming in a mathematical language. Also, general-purpose languages like Python have larger user bases, are better designed, have better tool support, etc.

Python per se doesn’t have everything you need for mathematical computing. You need to combine several tools and libraries, typically at least SciPy, matplotlib, and IPython. Because there are different pieces involved, it’s hard to find one source to explain using them all together. Also, even with these three components, you still need additional software for working with structured data.

Wes McKinney developed the pandas library to give Python “rich data structures and functions designed to make working with structured data fast, easy, and expressive.” And now he has addressed the need for unified exposition by writing a single book that describes how to use the Python mathematical computing stack. Importantly, the book covers two recent developments that make Python more competitive with other environments for data analysis: enhancements to IPython and Wes’ own pandas project.
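
To give a flavor of what “working with structured data” looks like in pandas, here is a minimal sketch. It is not taken from the book; the column names and numbers are invented purely for illustration. It builds a small table, averages one column within groups, and then reshapes the same data into a site-by-year grid.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
df = pd.DataFrame(
    {
        "site": ["north", "north", "south", "south", "south"],
        "year": [2011, 2012, 2011, 2012, 2012],
        "yield_kg": [10.0, 12.0, 7.0, 9.0, 11.0],
    }
)

# Split the rows by site, then average the numeric column within each group.
mean_by_site = df.groupby("site")["yield_kg"].mean()

# Reshape to one row per site and one column per year,
# averaging when a (site, year) pair occurs more than once.
by_year = df.pivot_table(
    index="site", columns="year", values="yield_kg", aggfunc="mean"
)

print(mean_by_site)
print(by_year)
```

Labeled rows and columns are the point here: the grouping and reshaping are expressed in terms of column names rather than array positions.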

Python for Data Analysis is available for pre-order. I don’t know when the book will be available, but Amazon lists the publication date as October 29. My review copy was a PDF, but at least one paper copy has been spotted in the wild:

Wes McKinney holding his book at O’Reilly’s Strata Conference. Photo posted on Twitter yesterday.

15 thoughts on “Python for data analysis”

Jared

24 October 2012 at 08:55

I can appreciate your comment that you “…prefer Python to R for mathematical computing because mathematical computing doesn’t exist in a vacuum; there’s always other stuff to do.”

I program in SAS but sometimes find I need to do work beyond SAS’ intent. For example, I often want to interact with the OS (i.e. files and directories) and while I can do it in SAS, it’s often not as elegant as doing the same in Python.
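
As a rough illustration of the kind of file-and-directory chore mentioned above, here is a small sketch that uses only Python’s standard library. The function name `count_by_extension` and the sample file names are made up for the example; the script creates its own temporary directory so it can run anywhere.

```python
import os
import tempfile
from collections import Counter


def count_by_extension(root):
    """Count the files under root (recursively), keyed by file extension."""
    counts = Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower()
            counts[ext or "(none)"] += 1
    return counts


def write_file(path, text):
    """Create a small text file at path."""
    with open(path, "w") as f:
        f.write(text)


# Build a throwaway directory tree (hypothetical file names).
with tempfile.TemporaryDirectory() as tmp:
    os.mkdir(os.path.join(tmp, "raw"))
    write_file(os.path.join(tmp, "raw", "jan.csv"), "x,y\n1,2\n")
    write_file(os.path.join(tmp, "raw", "feb.CSV"), "x,y\n3,4\n")
    write_file(os.path.join(tmp, "notes.txt"), "check feb totals\n")
    write_file(os.path.join(tmp, "README"), "scratch data\n")
    counts = count_by_extension(tmp)

print(dict(counts))
```

The same loop could just as easily move, rename, or open each file, which is where a general-purpose language tends to feel comfortable.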

I’m not finished, but I have gone through the first bit of ‘Python for Data Analysis’. For me, doing mathematical programming in SAS is far more productive than Python. But perhaps I would have felt differently if I had the pandas library at the time I was learning SAS.

Rick Bryan

24 October 2012 at 09:27

In The Innovator’s Dilemma, Clayton Christensen discussed “disruptive innovation”. Loosely, it describes how a large, established company finds the soft underbelly of its market eaten away by a small, scrappy, less capable player that outgrows its previously ignorable niche.

One has to wonder: does that foretell the future of large, established systems like SAS, Matlab, etc., getting trounced from below by powerful, domain-specific libraries embedded in general purpose scripting languages?

Thomas Kluyver

24 October 2012 at 10:17

“Python per se doesn’t have everything you need for mathematical computing. You need to combine several tools and libraries, typically at least SciPy, matplotlib, and IPython. Because there are different pieces involved, it’s hard to find one source to explain using them all together.”

We’re working on defining and promoting the ‘Scipy Stack’, a collection of packages including numpy, scipy, matplotlib, ipython, pandas, sympy and nose. My hope is that this will encourage us to see these more as parts of a complete system, and develop things like better documentation.

Mark Hepburn

24 October 2012 at 15:53

I pre-ordered mine directly from O’Reilly, and got the shipment notification yesterday.

Paul

24 October 2012 at 22:41

I’m not sure if you’ve used Spyder but it comes pretty close to replacing a basic MATLAB…

Adam Klein

25 October 2012 at 11:01

@Rick Bryan Tools for numerical computing in Python definitely pose a threat to established players in the data analysis arena. But the stack isn’t yet a good product experience. No one has yet done with Python what, for example, Revolution Analytics has done with R. ContinuumIO might be the one to fill that niche.

Nickthedude

25 October 2012 at 10:43

Would this be a good book for a newb at machine learning or more accurately someone who is interested in machine learning? I know how to program but it seems difficult to find a resource about machine learning that doesn’t make my brain hurt within the first 30 minutes.

Thanks for the post.

Nick

John

25 October 2012 at 13:46

Nickthedude: I wouldn’t call Wes’ book a machine learning book per se, but it does cover things someone should know before they get into machine learning, so maybe it would be a good place to start. See also my review of Machine Learning in Action for an ML book using Python. There’s also ML for Hackers if you’re OK with examples in R.

Luis

26 October 2012 at 15:10

I use R on a daily basis and Python around once a month and, from my point of view, the main barriers to adoption are the lack of: (i) a complete, standard download that includes the basic elements to get you going with *data analysis*, (ii) easy package (library) installation and update from inside the system, (iii) ggplot2, and (iv) good explanations for integrating the lot through IPython. It’s a bit silly that I needed to buy Wes’s book (which is excellent, by the way) to learn how to use it properly.

The Python experience feels more like bits and pieces glued together rather than using a unified system (although some people may say the same about R).

Leon du Toit

27 October 2012 at 08:17

Thanks for posting your thoughts.

Given your opinion, can you please give some short code examples showing where Python is better than R. That way I can more fully understand what you mean. They are, after all, quite extensive computing environments with many different use cases.

John

27 October 2012 at 08:34

Leon: In my opinion, the benefit of Python doesn’t show up in short code examples. Python is better than R for things that don’t fit into short examples.

If R can do what you want with little programming, and your work doesn’t have to connect with anything else, R’s great. But when you have to write complicated programming logic, or connect with non-statistical code, that’s when Python shines.

Canageek

27 October 2012 at 18:53

It is interesting; most people I know still use FORTRAN (the other group uses C++, since that is what GEANT4 is in) for their mathematical work. Now, I don’t know many, but apparently 1) when you get into complex things you WILL notice that Python is an interpreted language, since when your simulation already runs overnight a 10% (or whatever) speed hit becomes noticeable, and 2) everyone already knows FORTRAN and doesn’t want to learn something new.

Oh, and 3) all the legacy code is in FORTRAN.

John

27 October 2012 at 19:20

If performance matters, I’ll use a compiled language like C++ or C#.

You can get some speedup from Python by using NumPy well, that is, by calling high-level Python wrappers around compiled library code. You can also replace part of your Python code with Cython code, a variation on Python that compiles to C. I usually just write C++ rather than optimizing Python, though I could imagine a situation in which I’d put more effort into staying in Python.
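
Here is a minimal sketch of what “using NumPy well” can mean. The function names and test data are invented for illustration, and no timing figures are claimed. Both functions compute the same sum of squares; the first does every multiply and add in the interpreter, while the second hands the whole array to a single compiled routine.

```python
import numpy as np


def sum_of_squares_loop(values):
    """Pure-Python loop: each multiply and add runs in the interpreter."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def sum_of_squares_numpy(values):
    """Same computation, done by NumPy's compiled code in one call."""
    arr = np.asarray(values, dtype=float)
    return float(np.dot(arr, arr))


# Hypothetical input: 0.0, 0.5, 1.0, ..., 499.5
data = [0.5 * i for i in range(1000)]

slow = sum_of_squares_loop(data)
fast = sum_of_squares_numpy(data)

print(slow, fast)
```

How much the second version helps depends on the array size and the machine, so it is worth measuring (for example with the standard `timeit` module) before restructuring real code.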

I last wrote FORTRAN sometime around 1995 and I don’t intend to ever write it again.

Canageek

27 October 2012 at 19:46

I understand that FORTRAN90 and subsequent editions have done a lot to modernize it.

What I don’t get is that I was talking to someone who was using FORTRAN, and he had implemented his own complex numbers. Isn’t the whole point of using FORTRAN to get built-in support for complex numbers and other such useful things?

Leon du Toit

28 October 2012 at 06:49

I agree – it is easier, in general, to integrate Python code with the rest of a code base than R. But with good APIs one can get very far…
