Multidimensional Iterators in NumPy > Key Challenges in N-Dimensional Array Operations

19. Multidimensional Iterators in NumPy

Travis E. Oliphant

Numpy is an optional package for the python language that provides a powerful N-dimensional array object. An N-dimensional array is a data structure that uses N integers, or indices, to access individual elements. It is a useful model for a wide variety of data processed by a computer.

For example, a one-dimensional array can store the samples of a sound wave, a two-dimensional array can store a grayscale image, a three-dimensional array can store a color image (with one of the dimensions having a length of 3 or 4), and a four-dimensional array can store the value of pressure in a room during a concert. Even higher-dimensional arrays are often useful.

NumPy provides an environment for the mathematical and structural manipulation of arrays of arbitrary dimensions. These manipulations are at the heart of much scientific, engineering, and multimedia code that routinely deals with large amounts of data. Being able to perform these mathematical and structural manipulations in a high-level language can considerably simplify the development and later reuse of these algorithms.

NumPy provides an assortment of mathematical calculations that can be done on arrays, as well as providing very simple syntax for structural operations. As a result, Python (with NumPy) can be successfully used for the development of significant and fast-performing engineering and scientific code.

One feature that allows fast structural manipulation is that any subregion of a NumPy array can be selected using the concept of slicing. In Python, a slice is defined by a starting index, a stopping index, and a stride, using the notation start:stop:stride (inside square brackets).

For example, suppose we want to crop and shrink a 656x498 image to a 160x120 image by selecting a region in the middle of the image. If the image is held in the NumPy array, im, this operation can be performed using:

	im2=im[8:-8:4, 9:-9:4]

An important feature of NumPy, however, is that the new image selected in this way actually shares data with the underlying image. A copy is not performed. This can be an important optimization when calculating with large data sets where indiscriminate copying can overwhelm the computing resources.

19.1. Key Challenges in N-Dimensional Array Operations

In order to provide fast implementations of all mathematical operations, NumPy implements loops (in C) that work quickly over an array or several arrays of any number of dimensions. Writing such generic code that works quickly on arrays of arbitrary dimension can be a mind-stretching task. It may be easy to write a for loop to process all the elements of a one-dimensional array, or two nested for loops to process all the elements of a two-dimensional array. Indeed, if you know ahead of time how many dimensions the array consists of, you can use the right number of for loops to directly loop over all of the elements of the array. But how do you write a general forloop that will process all of the elements of an N-dimensional array when N can be an arbitrary integer?

There are two basic solutions to this problem. One solution is to use recursion by thinking about the problem in terms of a recursive case and a base case. Thus, if copy_ND(a, b, N) is a function that copies an N-dimensional array pointed to by b to another N-dimensional array pointed to by a, a simple recursive implementation might look like:

	if (N==0)
	    copy memory from b to a
	    return
	set up ptr_to_a and ptr_to_b
	for n=0 to size of first dimension of a and b
	    copy_ND(ptr_to_a, ptr_to_b, N-1)
	    add stride_a[0] to ptr_to_a and stride_b[0] to ptr_b

Notice the use of the single for loop and the check for the base case that stops the recursion when N reaches 0.

It is not always easy to think about how to write every algorithm as a recursive algorithm, even though the code just shown can often be used as a starting model. Recursion also requires the use of a function call at each iteration of the loop. So it can be all too easy for recursion to create slow code, unless some base-case optimization is performed (such as stopping when N==1 and doing the memory copy in a for loop locally).

Most languages will not do that kind of optimization automatically, so an elegant-looking recursive solution might end up looking much more contrived by the time optimizations are added.

In addition, many algorithms require the storage of intermediate information that will be used during later recursions. For example, what if the maximum or minimum value in the array must be tracked? Typically, such values become part of the recursive call structure and are passed along as arguments in the recursive call. In the end, each algorithm that uses recursion must be written in a slightly different way. Thus, it is hard to provide the programmer with additional simplifying tools for recursive solutions.

Instead of using recursion, NumPy uses iteration to accomplish most of its N-dimensional algorithms. Every recursive solution can be written using an iterative solution. Iterators are an abstraction that simplifies thinking about these algorithms. Therefore, using iterators, N-dimensional routines can be developed that run quickly, while the code can still be read and understood using a single looping structure.

An iterator is an abstract concept that encapsulates the idea of walking through all of the elements of an array with just one loop. In Python itself, iterators are objects that can be used as the predicate of any for loop. For example:

	for x in iterobj:
	    process(x)

will run the function process on all of the elements of iterobj. The most important requirement of iterobj is that it has some way to get its "next" element. Thus, the concept of an iterator refocuses the problem of looping over all the elements of a data structure to one of finding the next element.

In order to understand how iterators are implemented and used in NumPy, it is crucial to have at least some conception of how NumPy views the memory in an N-dimensional array. The next section should clarify this point.