Scientific computing in Python is expanding and maturing rapidly. Last week at the SciPy 2015 conference there were about twice as many people as when I’d last gone to the conference in 2013.
You can get some idea of the rapid development of the scientific Python stack and its future direction by watching the final keynote of the conference by Jake VanderPlas.
I used Python for a while before I discovered that there were so many Python libraries for scientific computing. At the time I was considering learning Ruby or some other scripting language, but I committed to Python when I found out that Python has far more libraries for the kind of work I do than other languages do. It felt like I’d discovered a secret hoard of code. I expect it would be easier today to discover the scientific Python stack. (It really is becoming more of an integrated stack, not just a collection of isolated libraries. This is one of the themes in the keynote above.)
When people ask me why I use Python, rather than languages like Matlab or R, my response is that I do a mixture of mathematical programming and general programming. I’d rather do mathematics in a general programming language than do general programming in a mathematical language.
One of the drawbacks of Python, relative to C++ and related languages, is speed. This is a problem in languages like R as well. However, with Python there are ways to speed up code without completely rewriting it, such as Cython and Numba. The only reliable way I’ve found to make R much faster is to rewrite it in another language.
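As a rough illustration of speeding up code without rewriting it, here is a minimal sketch using Numba's `njit` decorator. The function `sum_of_squares` is a made-up example, not something from the talks, and the fallback decorator is only there so the snippet still runs (uncompiled) on a machine without Numba installed.

```python
try:
    from numba import njit  # optional: compiles the function to machine code
except ImportError:
    # Assumption for this sketch: without Numba, fall back to plain Python.
    def njit(func):
        return func


@njit
def sum_of_squares(n):
    """Sum i*i for i in 0..n-1 with an explicit loop (hypothetical example)."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


print(sum_of_squares(10))
```

The body of the function is ordinary Python; the only change needed to try the compiled version is the decorator. How much faster it runs depends on the loop and the machine, so it is worth timing both versions on your own code.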
Another drawback of Python until recently was that data manipulation and exploration were not as convenient as one would hope. However, that has changed due to developments such as Pandas, initiated by Wes McKinney. For more on how that came to be and where it’s going, see his keynote from the second day of SciPy 2015.
It’s not clear why Python has become the language of choice for so many people in scientific computing. Maybe if people like Travis Oliphant had decided to use some other language for scientific programming years ago, we’d all be using that language now. Python wasn’t intended to be a scientific programming language. And as Jake VanderPlas points out in his keynote, Python still is not a scientific programming language, but the foundation for a scientific programming stack. Maybe Python’s strength is that it’s not a scientific language. It has drawn more computer scientists to contribute to the core language than it would have if it had been more of a domain-specific language.
* * *
If you’d like help moving to the Python stack, please let me know.
I think Python caught on because it is (a) easily extensible with C (and FORTRAN!) and (b) easy to read.
Tcl was the extension language of choice a while back. At the time, Tcl didn’t have the loadable library capabilities like Python did. Why wasn’t NumPy instead NumTcl? Would be an interesting question to ask Mr. Oliphant.
Perl is a beautiful language but can be hard to read. Python is not as elegant but has a much more shallow learning curve and is much easier to read.
Was there a popular scripting language at the time Numerical Python was started? Ruby?
I believe I have read your rationale for using Python, “I’d rather do mathematics in a general programming language than do general programming in a mathematical language,” before. As a statement of preference order [(generality, math ability) > (math ability, generality)], there’s not much to debate. However, I have faced the same dilemma before (when choosing between R and Python, mostly), and have come to the opposite conclusion. My reasoning is that, in choosing a language, I look at my most binding constraint and the benefit resulting from relaxing it (i.e., its shadow price). The binding constraint for R may be generality, or flexibility of the language, with more general-purpose packages. How much would I benefit from a more flexible language? Not much at all, since I usually deploy packages to a few users, or Shiny apps for the same user base, and R is just fine for this. If anything, R’s speed is more of an issue. For Python, the binding constraint is the lack of very specialized and well-tested libraries, and the lower integration of data-friendly constructs in the language. How much are they binding? In some (many) cases, a lot, since statistical analysis takes most of my time, and developing libraries would be very time-consuming. Still, in some cases I prefer Python because of the speed afforded by Numba and PyPy. But if speed and language generality were my main concerns, I would be coding neither in R nor in Python. This suggests that there’s an interplay between a language’s versatility and its specific strengths, and that your dichotomy doesn’t quite capture it.
So I am not so sold on the “generality” argument. But maybe it’s because I am not being general enough.
We use Python a lot. It’s become the de facto common scripting language in my field (computational neuroscience), and as you say, the numerical and scientific libraries are great.
I think it’s fair to say that at least in science, people are really using SciPy, Pandas, Sage, NEST, PyNN, Brian and so on. Python the language is just along for the ride. It could just as easily have been Ruby, Perl or whatever (and I don’t, personally, think Python was the best choice).
That, I believe, is a major reason the take-up of Python 3 is so slow: Many Python users don’t actually care about Python, and don’t see a point in having to relearn things that work fine for them already. If the language itself doesn’t matter, it’s just a pure waste of time to learn new idioms and to update all your existing code to be compatible.
I think that the dominance of Python is largely due to lucky timing on its part. Python came along to unify the simple programming style of web languages such as PHP at a time when personal computers were finally capable of running a heavy-duty scripting language for bigger problems. Actually, my personal view is that the switch of Red Hat to using Python for all OS scripting was probably the biggest boon to the Python community; this led to an explosion in good developers in the language, which led to the increase in libraries. The scientists just followed along because they didn’t like programming in C and Fortran anymore. Some of the better programmers may have chosen R and other languages, but those looking for quick scriptability were drawn to Python (let’s face it, R has a terribly inconsistent programming syntax).
Julia is coming along now. If it had been around just a few years earlier I don’t think that scientific Python would have taken off. There was a massive influx of scientists to the Python community maybe 4-5 years ago. These were people following in the footsteps of the early adopters, who finally saw a technology which was mature, which wasn’t going to disappear overnight, and which allowed for much faster prototyping than traditional languages. I would like to see Julia or a similar language take over, as I’m a purist from the perspective that numerical computing in a language that doesn’t understand numbers natively seems folly to me… but it will be a slow burn for now, I think.
John, you may also want to include S Anand’s talk on Faster Data Processing in Python in this post about dealing with Python performance issues.
Links to video and code: https://twitter.com/sanand0/status/522694197103452160
Talk proposal: https://in.pycon.org/funnel/2014/165-faster-data-processing-in-python/
Scientific uses of Python are much older than NumPy. In fact, NumPy was the fusion of two older array libraries, Numeric and numarray. Numeric goes back to 1996, as do the first published scientific applications of Python. I have been doing molecular simulations in Python since 1997, and I never looked back to my Fortran past.
What made scientists look seriously at Python in the early days (around 1995) was the combination of a readable syntax, an already rather complete standard library (“batteries included”), and easy interfacing with C and Fortran code. The two fundamental tools that made scientific Python take off were Numeric (the array library) and the interface generator SWIG. Numeric was (and still is) as much an interfacing tool as an array library, because its arrays have the same memory layout as C or Fortran arrays. Back then, speed was hardly an issue; Python was “just” the glue language that tied together C or Fortran libraries.
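To make the memory-layout point concrete, here is a small sketch that uses only the standard library (`array` and `ctypes`) as a stand-in; it is not Numeric or NumPy code. It shows a Python array whose buffer can be viewed as a plain C `double[]` without copying, which is the property that makes handing data to C or Fortran routines cheap.

```python
import array
import ctypes

# A contiguous block of C doubles owned by Python.
values = array.array("d", [1.0, 2.0, 3.0])

# View the same memory as a C double[3]; no data is copied.
c_view = (ctypes.c_double * len(values)).from_buffer(values)

# Writing through the C view changes the Python array, because both
# names refer to the same underlying memory.
c_view[0] = 10.0

print(list(values))
```

A C library function expecting a `double *` could be passed `c_view` directly through `ctypes`. Array libraries apply the same idea to multidimensional data.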
The set of libraries available to do things, without developing and vetting original code, is many times larger for R than for Python. In my world, at least, analyst and quantitative engineer labor costs are MUCH bigger than costs for computers or even time. Accordingly, if some package for R implements exactly what you need, it’s a win over something that needs to be developed afresh.
Moreover, the speed comparison is getting lame. Increasingly, there are high performance implementations of uncompiled, bare R, taking advantage of networked hardware, such as those using the parallel package in R. If the problem is such that it succumbs by throwing additional servers at it, that sounds like a good thing to me.
Then there are implementations like Revolution Analytics R.