einsum2

This is a python package that contains a parallel implementation for a subset of numpy.einsum functionality.

numpy.einsum is a fantastic function for multiplying numpy arrays. However, numpy.dot and numpy.tensordot are typically faster, especially if numpy is linked to a parallel implementation of BLAS: then numpy.dot and numpy.tensordot will take advantage of multiple CPU cores, whereas numpy.einsum remains single-threaded.

The trouble is, some einsum products are impossible to express as dot or tensordot. For example,

numpy.einsum("ijm,kmi->ijk", A, B)

returns a tensor C with C[i,j,k] = ∑_m A[i,j,m] B[k,m,i]. This cannot be expressed as a dot or tensordot because the shared axis i is included in the output C.

This operation can be expressed in terms of numpy.matmul; in particular, this example is equivalent to numpy.matmul(A, np.transpose(B)). However, on my machine, numpy.matmul does not appear to take advantage of parallel BLAS with multiple cores. This may eventually change (last year Intel introduced batched GEMM operations), but in the meantime, you can use einsum2 to parallelize such einsum calls.
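To see this equivalence concretely, here is a small sanity check using only numpy; the shapes are arbitrary and chosen just for illustration:

    import numpy as np

    A = np.random.rand(3, 4, 5)   # axes i, j, m
    B = np.random.rand(6, 5, 3)   # axes k, m, i

    C1 = np.einsum("ijm,kmi->ijk", A, B)
    # np.transpose(B) reverses all axes, giving shape (3, 5, 6) = (i, m, k);
    # matmul then broadcasts over the shared axis i and contracts over m.
    C2 = np.matmul(A, np.transpose(B))
    assert np.allclose(C1, C2)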

einsum2 is also compatible with the autograd package for automatic differentiation.

Installation

Prerequisites are a C compiler that supports OpenMP, Cython, numpy, and pip. Assuming you satisfy these requirements, you can install with

pip install .

at the top-level directory of einsum2.

If you have problems installing or running, it's probably because of your C compiler. On OSX in particular, the default compiler clang does not support OpenMP and thus won't work. I've also occasionally run into issues using Anaconda Python on Linux, apparently because of the old gcc-4 it was compiled with; however, I'm currently unable to reproduce this error.

In case you are having issues with the C compiler, it is recommended to install and run einsum2 in a virtual environment with a compatible C compiler. For example, if you use Anaconda Python, you can do the following:

  1. Create a new virtual environment named env_name with conda create -n env_name python=3.5 anaconda (alternatively, you can use python=2.7 anaconda).
  2. Switch to the new environment with source activate env_name.
  3. Do conda install gcc to install the Anaconda distribution of gcc, which both supports OpenMP and is fully compatible with the Anaconda Python. Note this clobbers your system gcc, which is why we are using a virtual environment here.
  4. Finally, install einsum2 with CC=gcc pip install . (the CC=gcc is required on OSX, otherwise the installation will try to use the default clang compiler that does not support OpenMP).

To test whether the parallelization is working, try sh test/profile_einsum2.sh; you should see a speedup for the function calls using multiple threads (OMP_NUM_THREADS > 1).
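If you prefer to time things by hand, the following minimal sketch (assuming einsum2 is installed, with arbitrary array sizes) can be saved as bench.py and run under different thread counts. Note that OMP_NUM_THREADS must be set before launching Python, since OpenMP reads it at startup:

    import time
    import numpy as np
    from einsum2 import einsum2

    A = np.random.rand(100, 100, 100)
    B = np.random.rand(100, 100, 100)

    start = time.time()
    einsum2("ijm,kmi->ijk", A, B)
    print("elapsed: %.2f seconds" % (time.time() - start))

For example, compare OMP_NUM_THREADS=1 python bench.py against OMP_NUM_THREADS=4 python bench.py.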

Usage

einsum2

This computes einsum products that can be expressed in terms of matmul, transpose, and reshape operations. It can either have the form

einsum2.einsum2(a, a_sublist, b, b_sublist, out_sublist)

or

einsum2.einsum2(subscripts_str, a, b)

Here a and b are tensors. In the first format, a_sublist and b_sublist label the indices of a and b, while out_sublist gives the indices of the output array. In the second format, subscripts_str is a string specifying the subscripts of a, b, and the output, and looks like a_subs,b_subs->out_subs.

The official numpy.einsum documentation provides more details about specifying the subscripts. As an example, the following two calls are equivalent:

einsum2(A, [0,1,2,3], B, [4,1,0], [0,2,4])

and

einsum2("ijkm,nji->ikn", A, B)

Both return a tensor C with C[i,k,n] = ∑_{j,m} A[i,j,k,m] B[n,j,i].
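As a quick (hypothetical) sanity check that both forms agree with each other and with numpy.einsum:

    import numpy as np
    from einsum2 import einsum2

    A = np.random.rand(3, 4, 5, 6)   # axes i, j, k, m
    B = np.random.rand(7, 4, 3)      # axes n, j, i

    C1 = einsum2(A, [0, 1, 2, 3], B, [4, 1, 0], [0, 2, 4])
    C2 = einsum2("ijkm,nji->ikn", A, B)
    assert np.allclose(C1, C2)
    assert np.allclose(C1, np.einsum("ijkm,nji->ikn", A, B))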

Unlike numpy.einsum, we allow the subscripts in the first format to be a list of arbitrary hashable keys, so

einsum2(A, ["zero","one","two","three], B, ["four","one","zero"], ["zero","two","four"])

would also be allowed.

einsum2 has several limitations compared to numpy.einsum: it only operates on two input arrays, does not allow diagonal operations (repeated subscripts on the same array), and requires the output subscripts to always be specified.

Unlike the standard einsum, einsum2 will perform computations in parallel. The number of parallel threads is selected automatically, but you can also control this with the environment variable OMP_NUM_THREADS.

To perform the parallel computation, einsum2 will use numpy.dot if possible; otherwise, it will use a parallel for loop. The advantage of using numpy.dot is that it calls BLAS, which is much faster than a for loop. However, you need to make sure numpy is compiled against a parallel BLAS implementation such as MKL or OpenBLAS. Most packaged, precompiled versions of numpy (e.g. Anaconda Python) take care of this for you.
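To check which BLAS your numpy build is linked against, you can inspect its build configuration:

    import numpy as np

    np.show_config()   # look for MKL or OpenBLAS in the reported libraries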

batched_dot

This is a parallel implementation of numpy.matmul. More specifically, for 3-tensors a and b,

einsum2.batched_dot(a, b)

computes numpy.matmul(a,b) in parallel.

batched_dot is currently implemented only for a and b that are 3-tensors. If the leading dimension has length 1, then batched_dot will use numpy.dot to take advantage of BLAS.
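A minimal usage sketch, with arbitrary shapes:

    import numpy as np
    from einsum2 import batched_dot

    a = np.random.rand(8, 3, 4)
    b = np.random.rand(8, 4, 5)

    c = batched_dot(a, b)   # shape (8, 3, 5)
    assert np.allclose(c, np.matmul(a, b))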

einsum1

This is a convenience function for einsum operations on a single array. In particular,

einsum2.einsum1(in_arr, in_sublist, out_sublist)

returns an array out_arr derived from in_arr, with subscripts given by out_sublist: all subscripts of in_sublist not in out_sublist are summed out, and the remaining axes of in_arr are rearranged to match out_sublist.

Like einsum2, arbitrary keys are allowed to label the subscripts in einsum1. Also like einsum2, repeated subscripts (i.e. diagonal operations) are not supported.
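For example, summing out one subscript and reordering the remaining axes might look like this (a sketch, assuming einsum1 is importable from the package top level):

    import numpy as np
    from einsum2 import einsum1

    A = np.random.rand(3, 4, 5)   # axes i, j, k

    # Sum out j, then reorder the remaining axes to (k, i)
    out = einsum1(A, ["i", "j", "k"], ["k", "i"])
    assert np.allclose(out, np.einsum("ijk->ki", A))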


License

MIT License

