Sunday, 4 March 2018

Faster Python for data science and scientific computing

Scientific computing and HPC developers will probably be familiar with Intel's compiler suite, which can be used to compile your C, C++ and Fortran code instead of the free GCC compilers and can often deliver significant performance improvements without changing a single line of code. Further improvements can be made by swapping out (generally fantastic) open source maths libraries such as ATLAS or BLAS for the equivalent functionality in Intel's MKL (Math Kernel Library). Again, this is usually just a matter of building your existing code against Intel's library, and it can result in very impressive speed gains for very little work.

What has this to do with Python? Most of Python's best-known data science and scientific computing libraries are written in C/C++, with a thin wrapper that lets them be called easily from Python. If you've ever wondered why NumPy, SciPy, scikit-learn and pandas are so much faster than writing the same code yourself in native Python, it's because all of the work in a function like np.multiply() is actually carried out in C "under the hood".
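To make the "under the hood" point concrete, here is a minimal sketch that times the same element-wise multiplication done by an interpreted Python loop and by np.multiply(). The function names `multiply_pure_python` and `compare_multiply` and the array size are my own choices for illustration, and the size of the gap will depend on your machine and NumPy build.

```python
import timeit


def multiply_pure_python(a, b):
    """Element-wise product computed by an interpreted Python loop."""
    return [x * y for x, y in zip(a, b)]


def compare_multiply(n=1_000_000, repeats=5):
    """Return (pure_python_seconds, numpy_seconds) for multiplying n pairs."""
    # Imported lazily so the pure-Python half still runs without NumPy.
    import numpy as np

    a_list = [float(i) for i in range(n)]
    b_list = [float(i) + 1.0 for i in range(n)]
    a_arr = np.array(a_list)
    b_arr = np.array(b_list)

    # Sanity check: both routes must give the same answer.
    assert np.allclose(multiply_pure_python(a_list, b_list),
                       np.multiply(a_arr, b_arr))

    # Take the best of several runs to reduce timing noise.
    python_time = min(timeit.repeat(
        lambda: multiply_pure_python(a_list, b_list), number=1, repeat=repeats))
    numpy_time = min(timeit.repeat(
        lambda: np.multiply(a_arr, b_arr), number=1, repeat=repeats))
    return python_time, numpy_time
```

Calling `compare_multiply()` and dividing the first number by the second gives a rough idea of how much of the work the compiled code is saving you.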

Previously, if you had a licence for Intel's compiler suite you could compile these Python libraries yourself and take advantage of Intel's speed boost in your Python applications, but this required both familiarity with compiling C code and an expensive licence. However, Intel have now made available a free, pre-compiled Python distribution with all the major packages (NumPy, SciPy, pandas etc.), based on the popular Anaconda distribution. According to KDnuggets, Intel have also rewritten some common functions entirely for further optimization. In particular, it looks like NumPy and SciPy's FFT (Fast Fourier Transform) functions have been enhanced significantly. Depending on your workload, using this distribution could boost the execution speed of these libraries by 10-50% without the need for any code change.
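If you want to check which maths library your current NumPy was built against, one rough approach is to capture the output of numpy.show_config() and look for a mention of MKL. This is only a sketch: the helper names `mentions_mkl` and `numpy_build_info` are made up for this post, and a plain substring search is a heuristic, since the exact layout of the report varies between NumPy versions.

```python
import contextlib
import io


def mentions_mkl(config_text):
    """Return True if a build-configuration dump mentions MKL anywhere."""
    return "mkl" in config_text.lower()


def numpy_build_info():
    """Capture the text that numpy.show_config() prints to stdout."""
    import numpy as np  # assumed to be installed

    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()
```

Running `mentions_mkl(numpy_build_info())` in a stock environment and again in the Intel distribution should show whether the BLAS/LAPACK backend has actually changed.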

If you're interested in optimizing Python code that you wrote yourself and that isn't covered by any existing (C-implemented) library, check out Cython as a way of implementing the most performance-sensitive parts of your code in C. Unlike using the Intel distribution linked above, converting part of your code to use Cython can take some development work. However, even when using the free GCC compilers you'll see a significant increase in speed over native Python code.
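Before converting anything, it helps to know which parts really are the most performance sensitive. The sketch below uses the standard library's cProfile to find them; it is not Cython itself, just one way to pick candidates for it. `slow_distance_sum` is a hypothetical hot spot invented for this example, and `profile_report` is a helper name of my own.

```python
import cProfile
import io
import pstats


def slow_distance_sum(points):
    """Hypothetical hot spot: a nested pure-Python loop over all point pairs."""
    total = 0.0
    for x1, y1 in points:
        for x2, y2 in points:
            total += ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
    return total


def profile_report(func, *args, top=5):
    """Run func under cProfile and return (result, text report of the top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Functions that dominate the report and consist mostly of tight numeric loops, like the one above, are usually the ones worth the effort of moving to Cython.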

Monday, 5 February 2018

Pip unable to download packages when running in an Ubuntu Docker image on Kubernetes

I recently ran into a problem where pip was unable to download packages while running in a Docker image on a Kubernetes pod. The issue seemed to be that it could not resolve the repository to download from, most likely because of a DNS or networking problem inside either Docker or Kubernetes. The solution turned out to be to create a file at /etc/docker/daemon.json and enter Google's DNS servers as follows:

{ "dns": ["", ""] }
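If you build images from a script, the same file can be generated rather than written by hand. The following is a minimal sketch, not an official Docker tool: `write_docker_dns_config` is a hypothetical helper, and it merges the "dns" key into any settings already present in the file. The addresses in the snippet above are blank in this copy of the post; Google's public resolvers are 8.8.8.8 and 8.8.4.4, and using them in the usage note is my assumption about what was intended.

```python
import json
from pathlib import Path


def write_docker_dns_config(dns_servers, path="/etc/docker/daemon.json"):
    """Merge a "dns" list into daemon.json, creating the file if needed."""
    config_path = Path(path)
    config = {}
    if config_path.exists():
        # Keep any settings that are already in the file.
        existing = config_path.read_text().strip()
        if existing:
            config = json.loads(existing)

    config["dns"] = list(dns_servers)

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config
```

For example, `write_docker_dns_config(["8.8.8.8", "8.8.4.4"])` (assumed addresses, see above) would need to run as root, and before the Docker daemon starts, for the setting to be picked up.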

I was working from an Ubuntu base image, so I created the file as above before installing and starting Docker. Keep in mind that Docker images usually don't contain systemd, so it's not all that easy to restart Docker once you have installed it; creating the configuration first saves you that trouble. You can find more information on this at

AWS Keyspaces - Managed Cassandra review

AWS recently went live with Keyspaces, their managed version of Cassandra. This service is primarily a...