How to get faster code than numpy.dot for matrix multiplication?

np.dot dispatches to BLAS when NumPy has been compiled to use BLAS, a BLAS implementation is available at run-time, your data has one of the dtypes float32, float64, complex64 or complex128, and the data is suitably aligned in memory. Otherwise, it falls back to its own, slow, matrix multiplication routine. Checking your BLAS linkage is … Read more
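A quick way to inspect your BLAS linkage is `np.__config__.show()` (a sketch; the exact sections printed depend on your NumPy version and build):

```python
import numpy as np

# Show which BLAS/LAPACK libraries NumPy was built against.
# The output format varies between NumPy versions and installs.
np.__config__.show()

# np.dot only takes the fast BLAS path for eligible dtypes such as
# float64; object arrays, for instance, fall back to a generic loop.
a = np.random.rand(3, 3)    # float64: eligible for BLAS dispatch
b = np.random.rand(3, 3)
print(a.dot(b))
```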

Why list comprehension is much faster than numpy for multiplying arrays?

Creation of numpy arrays is much slower than creation of lists: In [153]: %timeit a = [[2,3,5],[3,6,2],[1,3,2]] 1000000 loops, best of 3: 308 ns per loop In [154]: %timeit a = np.array([[2,3,5],[3,6,2],[1,3,2]]) 100000 loops, best of 3: 2.27 µs per loop There can also be fixed costs incurred by NumPy function calls before the meat of … Read more
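For context, here is the kind of pure-Python multiply such benchmarks compare against np.dot (a sketch; the question's exact code isn't reproduced here). For tiny inputs it produces the same values while avoiding NumPy's array-creation and dispatch costs:

```python
import numpy as np

a = [[2, 3, 5], [3, 6, 2], [1, 3, 2]]
b = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # identity, so the product equals a

# Pure-Python 3x3 matrix product via a nested list comprehension.
product = [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]

print(product)
print(np.dot(np.array(a), np.array(b)))   # same values as the list version
```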

Why is there huge performance hit in 2048×2048 versus 2047×2047 array multiplication?

This probably has to do with conflicts in your L2 cache. Cache misses on matice1 are not the problem, because it is accessed sequentially. However, for matice2, if a full column fits in L2 (i.e. when you access matice2[0, 0], matice2[1, 0], matice2[2, 0] … etc., nothing gets evicted), then there is no problem with cache … Read more
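The power-of-two effect can be sketched with a toy cache model (assumptions: 64-byte lines, a 64-set cache, 8-byte elements, row-major storage). Walking down one column of a 2048-wide matrix hits the same cache set on every access, while a 2047-wide matrix spreads the column over all sets:

```python
LINE = 64    # cache line size in bytes (assumed)
SETS = 64    # number of cache sets (assumed, toy model)
ITEM = 8     # bytes per float64 element

def sets_touched(ncols, nrows):
    """Distinct cache sets hit while reading one column top to bottom
    of a row-major nrows x ncols float64 matrix."""
    stride = ncols * ITEM                      # bytes between column entries
    return len({(i * stride // LINE) % SETS for i in range(nrows)})

print(sets_touched(2048, 2048))   # stride is a multiple of LINE*SETS -> 1 set
print(sets_touched(2047, 2047))   # alignment broken -> all 64 sets used
```

With every column access mapping to one set, the 2048-wide case evicts its own data constantly, which is exactly the conflict-miss behavior the answer describes.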

bsxfun implementation in matrix multiplication

Send x to the third dimension, so that singleton expansion would come into effect when bsxfun is used for multiplication with A, extending the product result to the third dimension. Then, perform the bsxfun multiplication – val = bsxfun(@times,A,permute(x,[3 1 2])) Now, val is a 3D matrix and the desired output is expected to be … Read more
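In NumPy terms, the same singleton-expansion trick is plain broadcasting (a sketch, not the original MATLAB code; it assumes A is m×n and x is n×p, matching the bsxfun dimensions):

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)     # m x n
x = np.arange(12.0).reshape(3, 4)    # n x p

# Broadcast an elementwise product into 3D, mirroring
# bsxfun(@times, A, permute(x, [3 1 2])): val[i, j, k] = A[i, j] * x[j, k].
val = A[:, :, None] * x[None, :, :]  # shape (m, n, p)

# Summing out the shared dimension recovers the matrix product A @ x.
result = val.sum(axis=1)
print(np.allclose(result, A @ x))    # True
```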

Minimizing overhead due to the large number of Numpy dot calls

It depends on the size of the matrices. Edit: For larger n×n matrices (approx. size 20 and above), a BLAS call from compiled code is faster; for smaller matrices, custom Numba or Cython kernels are usually faster. The following method generates custom dot functions for given input shapes. With this method it is also possible to benefit … Read more
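A complementary way to cut per-call overhead, distinct from the custom-kernel generation the answer describes, is to stack the small matrices and issue one batched matmul (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 4, 4))   # 1000 small 4x4 matrices
B = rng.random((1000, 4, 4))

# Many tiny np.dot calls: Python-level overhead dominates the arithmetic.
loop_result = np.array([a.dot(b) for a, b in zip(A, B)])

# One batched matmul over the leading axis: same math, one call.
batched_result = A @ B

print(np.allclose(loop_result, batched_result))   # True
```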

Why is matrix multiplication faster with numpy than with ctypes in Python?

NumPy uses a highly optimized, carefully tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for generic matrix multiplication). You can look up the original by searching for dgemm.f (it’s in Netlib). The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, … Read more
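For reference, dgemm computes C ← αAB + βC. A NumPy sketch of those semantics (parameter names follow the BLAS convention; real implementations block and vectorize this heavily):

```python
import numpy as np

def gemm(alpha, A, B, beta, C):
    """Reference semantics of BLAS xGEMM: C <- alpha * A @ B + beta * C."""
    return alpha * (A @ B) + beta * C

rng = np.random.default_rng(1)
A = rng.random((3, 4))
B = rng.random((4, 2))
C = rng.random((3, 2))

# With alpha=1, beta=0 this reduces to the plain matrix product.
print(np.allclose(gemm(1.0, A, B, 0.0, C), A @ B))   # True
```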