Cython

October 24, 2013
categories: tools, programming

Mainly for my own records, I thought I could start writing short articles on the tools I find really useful in my work. It's somewhere to put links to useful pages, and notes on things to watch out for next time. First up, since it's what I've been playing with in the last week, Cython.

Scientific computing has historically been dominated by languages like Fortran and C, because they are efficient and computers were slow. Higher-level, more abstract languages like Python are now increasingly being used for number crunching. They are quicker to write code in, easier to learn and understand, and if they run a bit slower then who really cares? Most of the time computers are now fast enough for the overhead to be acceptable, and your time as a researcher is a lot more valuable than some extra computer time. This is not a particularly new trend, but it has accelerated in the last 10 years or so as computers have got faster and compilers better at optimising these languages. There are libraries like NumPy and SciPy which allow Python to perform numerical calculations at very close to the speed of C or Fortran. The standard approach is becoming: Write your code in a language like Python first, then write bits of it in C or Fortran if you really need it to run faster. What Cython does is make this process almost embarrassingly easy. There's a paper discussing scientific applications here.

The language of choice in Magnetic Confinement Fusion labs has been IDL for a long time, but there is a drive now to move towards using Python instead. To help this process along I wrote an interface to the MAST tokamak data access system for Python. Unfortunately I didn't know about Cython at the time, so I wrote it in C. It's not particularly large, but at 950 lines (640 lines of actual code according to sloccount, the rest comments etc.) it's not exactly concise either. Defining Python classes in C is also quite prone to mistakes which result in odd behaviour, so when the time came to extend the library significantly, something had to be done.

Fortunately, Alex M suggested Cython, and after a read through the excellent documentation I set about recreating my interface code. The difference is fairly impressive: the new code looks pretty much like Python (so is fairly easy to read), and is only 204 lines long (102 of code), rather than 950. As an example, one of the functions in the library is "getIdamProperty", which just looks up whether a setting is true or false. The C code needed to be able to use this from Python was (stripped down a bit):

static PyObject*
idam_getProperty(PyObject *self, PyObject *args)
{
  const char* prop;
  int val;

  if(!PyArg_ParseTuple(args, "s", &prop))
    return NULL;

  val = getIdamProperty(prop);

  return Py_BuildValue("i", val);
}

static PyMethodDef IdamMethods[] = {

  {"getProperty",  idam_getProperty, METH_VARARGS,
   "Get a property for client/server behavior"},

  {NULL, NULL, 0, NULL}        /* Sentinel */
};

PyMODINIT_FUNC
initidam(void)
{
  PyObject *m;

  m = Py_InitModule3("idam", IdamMethods, "IDAM data access module.");
  ...
}

Using Cython, the same thing is accomplished by:

cdef extern from "idamclient.h":
    bint getIdamProperty(const char *property)

def getProperty(property):
    "Get a property for client/server behavior"
    return getIdamProperty(property)

NumPy arrays

A few things were not entirely clear when using Cython, particularly passing NumPy arrays to the C code. This is fairly crucial in scientific code, but there seem to be many ways to get a C array pointer (e.g. float or double) from a NumPy array object, only some of which seemed to work. Attempting to use Numpy ".data" or ".ctypes.data" members as suggested here resulted in errors along the lines of "Python object cannot be converted to float*". What worked was to use the NumPy C API as suggested here.

An artificial, but fairly typical example is a C library with functions to get the size of an array, and functions to fill given arrays with data. First you just copy and paste the functions you want to call from the C header file (here "mylibrary.h") into your Cython code:

cdef extern from "mylibrary.h":
   int getArraySize()             # Returns how much data there is
   void getArrayData(float* data) # Given an array, fill it with data

then import NumPy Python and C interfaces:

import numpy as np   # For the Python interface
cimport numpy as np  # For the C interface

Then call the functions from what looks like Python code:

# Get the size of the array (calling C)
size = getArraySize()

# Create a NumPy array of the right size (Python)
arr = np.empty(size, dtype=np.float32)

# Pass the underlying C pointer (calling C)
getArrayData(<float*> np.PyArray_DATA(arr))