# Minimizing overhead due to the large number of Numpy dot calls

## It depends on the size of the matrices

For larger n×n matrices (roughly size 20×20 and above), a BLAS call from compiled code is faster; for smaller matrices, custom Numba or Cython kernels are usually faster.

The following method generates custom dot functions for given input shapes. This also makes it possible to benefit from compiler optimizations like loop unrolling, which are especially important for small matrices.

Note that generating and compiling one kernel takes approximately 1 s, so make sure to call the generator only if you really have to.
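One way to honor that advice is to memoize the generated kernels per shape, so the compile cost is paid at most once per `(x, y, z)` combination. Below is a minimal sketch of such a cache; to keep it self-contained (no Numba required), a plain-NumPy stand-in replaces the compiled kernel that `gen_dot_nm` from the next section would return:

```python
import numpy as np

# Stand-in for the generator below: in the real code this would
# return a Numba-compiled kernel and take ~1 s to compile.
def gen_dot_nm(x, y, z):
    def dot(A, B):
        return np.einsum("nik,nkj->nij", A, B)
    return dot

_kernel_cache = {}

def get_dot(x, y, z):
    # Generate/compile only on the first request for this shape;
    # later calls return the cached kernel immediately.
    key = (x, y, z)
    if key not in _kernel_cache:
        _kernel_cache[key] = gen_dot_nm(x, y, z)
    return _kernel_cache[key]

dot22 = get_dot(2, 2, 2)
assert get_dot(2, 2, 2) is dot22  # second lookup reuses the cached kernel
```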

Generator function

``````
import numpy as np
import numba as nb

def gen_dot_nm(x,y,z):
    #small kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_numba(A,B):
        """
        calculate dot product for (x,y)x(y,z)
        """
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]

        assert A.shape[1]==x
        assert B.shape[1]==y
        assert B.shape[2]==z

        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            for i in range(x):
                for j in range(z):
                    acc=0.
                    for k in range(y):
                        acc+=A[ii,i,k]*B[ii,k,j]
                    res[ii,i,j]=acc
        return res

    #large kernels
    @nb.njit(fastmath=True,parallel=True)
    def dot_BLAS(A,B):
        assert A.shape[0]==B.shape[0]
        assert A.shape[2]==B.shape[1]

        res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
        for ii in nb.prange(A.shape[0]):
            res[ii]=np.dot(A[ii],B[ii])
        return res

    #for square matrices above size 20
    #calling BLAS is faster
    if x>=20 or y>=20 or z>=20:
        return dot_BLAS
    else:
        return dot_numba
``````

Usage example

``````A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)

dot22=gen_dot_nm(2,2,2)
X=dot22(A,B)
%timeit X3=dot22(A,B)
#5.94 µs ± 21.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
``````
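The generated kernel can be sanity-checked against NumPy's own batched matmul. The helper `dot_ref` below is a hypothetical pure-NumPy reference (no Numba needed) that computes the same batched product as `dot22`:

```python
import numpy as np

# Hypothetical pure-NumPy reference for the batched product:
# one np.dot per submatrix, same result as the generated kernel.
def dot_ref(A, B):
    res = np.empty((A.shape[0], A.shape[1], B.shape[2]), dtype=A.dtype)
    for ii in range(A.shape[0]):
        res[ii] = np.dot(A[ii], B[ii])
    return res

A = np.random.rand(1000, 2, 2)
B = np.random.rand(1000, 2, 2)
assert np.allclose(dot_ref(A, B), A @ B)  # agrees with batched matmul
```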

Another alternative, though with more work involved, would be to use special BLAS implementations that create custom kernels for very small matrices just in time, and then call these kernels from C.

Example

``````
import numpy as np
import numba as nb

#Don't use this for larger submatrices
@nb.njit(fastmath=True,parallel=True)
def dot(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[2]==B.shape[1]

    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        for i in range(A.shape[1]):
            for j in range(B.shape[2]):
                acc=0.
                for k in range(B.shape[1]):
                    acc+=A[ii,i,k]*B[ii,k,j]
                res[ii,i,j]=acc
    return res

@nb.njit(fastmath=True,parallel=True)
def dot_22(A,B):
    assert A.shape[0]==B.shape[0]
    assert A.shape[1]==2
    assert A.shape[2]==2
    assert B.shape[1]==2
    assert B.shape[2]==2

    res=np.empty((A.shape[0],A.shape[1],B.shape[2]),dtype=A.dtype)
    for ii in nb.prange(A.shape[0]):
        res[ii,0,0]=A[ii,0,0]*B[ii,0,0]+A[ii,0,1]*B[ii,1,0]
        res[ii,0,1]=A[ii,0,0]*B[ii,0,1]+A[ii,0,1]*B[ii,1,1]
        res[ii,1,0]=A[ii,1,0]*B[ii,0,0]+A[ii,1,1]*B[ii,1,0]
        res[ii,1,1]=A[ii,1,0]*B[ii,0,1]+A[ii,1,1]*B[ii,1,1]
    return res
``````

Timings

``````A=np.random.rand(1000,2,2)
B=np.random.rand(1000,2,2)

X=A@B
X2=np.einsum("xik,xkj->xij",A,B)