I will *start* by saying that the power of Pandas and NumPy arrays is derived from high-performance **vectorised** calculations on numeric arrays.^{1} The entire point of vectorised calculations is to avoid Python-level loops by moving calculations to highly optimised C code and utilising contiguous memory blocks.^{2}

### Python-level loops

Now we can look at some timings. Below are **all** Python-level loops which produce either `pd.Series`, `np.ndarray` or `list` objects containing the same values. For the purposes of assignment to a series within a dataframe, the results are comparable.

```
# Python 3.6.5, NumPy 1.14.3, Pandas 0.23.0
import numpy as np
import pandas as pd

np.random.seed(0)
N = 10**5

# Assumed setup (not shown in the original); the quoted timings come from the author's own data.
def divide(a, b):
    return a / b if b != 0 else 0.0

df = pd.DataFrame({'A': np.random.rand(N), 'B': np.random.randint(0, 3, N)})

%timeit list(map(divide, df['A'], df['B']))                                   # 43.9 ms
%timeit np.vectorize(divide)(df['A'], df['B'])                                # 48.1 ms
%timeit [divide(a, b) for a, b in zip(df['A'], df['B'])]                      # 49.4 ms
%timeit [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]     # 112 ms
%timeit df.apply(lambda row: divide(*row), axis=1, raw=True)                  # 760 ms
%timeit df.apply(lambda row: divide(row['A'], row['B']), axis=1)              # 4.83 s
%timeit [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]  # 11.6 s
```
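A minimal sketch, assuming a scalar `divide` that returns 0 for a zero divisor and a small hypothetical `df`, checks that these loop variants really do produce the same values:

```python
import numpy as np
import pandas as pd


def divide(a, b):
    # Assumed scalar implementation: 0 for a zero divisor, a / b otherwise.
    return a / b if b != 0 else 0.0


# Hypothetical sample data; the timings above use 10**5 rows.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0, 4.0], 'B': [2.0, 0.0, 4.0, 0.5]})

via_map = list(map(divide, df['A'], df['B']))
via_vectorize = np.vectorize(divide)(df['A'], df['B'])
via_zip = [divide(a, b) for a, b in zip(df['A'], df['B'])]
via_itertuples = [divide(a, b) for a, b in df[['A', 'B']].itertuples(index=False)]
via_apply_raw = df.apply(lambda row: divide(*row), axis=1, raw=True)
via_apply = df.apply(lambda row: divide(row['A'], row['B']), axis=1)
via_iterrows = [divide(row['A'], row['B']) for _, row in df[['A', 'B']].iterrows()]

results = [via_map, via_vectorize, via_zip, via_itertuples,
           via_apply_raw, via_apply, via_iterrows]
expected = [0.5, 0.0, 0.75, 8.0]

# Each result is a list, ndarray or Series, but the values agree.
all_equal = all(np.allclose(r, expected) for r in results)
print(all_equal)
```

The container types differ (`list`, `np.ndarray`, `pd.Series`), which is why the comparison is made on values only.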

Some takeaways:

- The `tuple`-based methods (the first 4) are anywhere from several times to a couple of orders of magnitude faster than the `pd.Series`-based methods (the last 3).
- The `np.vectorize`, list comprehension + `zip` and `map` methods, i.e. the top 3, all have roughly the same performance. This is because they use `tuple` *and* bypass some Pandas overhead from `pd.DataFrame.itertuples`.
- There is a significant speed improvement from using `raw=True` with `pd.DataFrame.apply` versus without. This option feeds NumPy arrays to the custom function instead of `pd.Series` objects.
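To make the distinction concrete, here is a small sketch (on a hypothetical two-row dataframe) that inspects what each iteration style hands to your function:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [2.0, 0.0]})

# First item produced by each iteration style.
zip_item = next(zip(df['A'], df['B']))
tuple_item = next(df[['A', 'B']].itertuples(index=False))
_, row_item = next(df[['A', 'B']].iterrows())

print(type(zip_item).__name__)        # plain tuple
print(isinstance(tuple_item, tuple))  # namedtuple, still a tuple subclass
print(type(row_item).__name__)        # a full Series built for every row
```

`iterrows` has to construct a new `pd.Series` for every row, which is consistent with it sitting at the bottom of the timings.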

#### `pd.DataFrame.apply`: just another loop

To see *exactly* the objects Pandas passes around, you can amend your function trivially:

```
def foo(row):
    print(type(row))
    assert False  # because you only need to see this once

df.apply(lambda row: foo(row), axis=1)
```

Output: `<class 'pandas.core.series.Series'>`. Creating, passing and querying a Pandas series object carries significant overheads relative to NumPy arrays. This shouldn't be a surprise: Pandas series include a decent amount of scaffolding to hold an index, values, attributes, etc.

Do the same exercise again with `raw=True` and you'll see `<class 'numpy.ndarray'>`. All this is described in the docs, but seeing it is more convincing.
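If you would rather not abort with an assertion, a variation on the same idea is to collect the type names in a set. This is only a sketch; `record_types` and `probe` are hypothetical helper names and the dataframe is made up:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [2.0, 4.0]})


def record_types(frame, raw):
    """Return the set of type names that `apply` passes to the row function."""
    seen = set()

    def probe(row):
        seen.add(type(row).__name__)
        return 0.0  # return a scalar so apply can assemble a result

    frame.apply(probe, axis=1, raw=raw)
    return seen


default_types = record_types(df, raw=False)
raw_types = record_types(df, raw=True)
print(default_types, raw_types)
```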

#### `np.vectorize`: fake vectorisation

The docs for `np.vectorize` have the following note:

> The vectorized function evaluates `pyfunc` over successive tuples of the input arrays like the python map function, except it uses the broadcasting rules of numpy.

The “broadcasting rules” are irrelevant here, since the input arrays have the same dimensions. The parallel to `map` is instructive, since the `map` version above has almost identical performance. The source code shows what’s happening: `np.vectorize` converts your input function into a Universal function (“ufunc”) via `np.frompyfunc`. There is some optimisation, e.g. caching, which can lead to some performance improvement.
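The relationship between `map`, `np.vectorize` and `np.frompyfunc` can be sketched as follows. The scalar `divide` and the sample arrays are assumptions for illustration; the last line shows the one thing `map` does not give you, namely broadcasting a scalar against an array:

```python
import numpy as np


def divide(a, b):
    # Assumed scalar implementation: 0 for a zero divisor, a / b otherwise.
    return a / b if b != 0 else 0.0


# Hypothetical sample arrays.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

via_map = np.array(list(map(divide, a, b)))
via_vectorize = np.vectorize(divide)(a, b)

# frompyfunc(func, nin, nout) builds a ufunc that calls the Python function
# element by element; its output array has object dtype.
via_frompyfunc = np.frompyfunc(divide, 2, 1)(a, b)

# Broadcasting: the scalar 2.0 is paired with every element of `a`.
broadcast = np.vectorize(divide)(a, 2.0)

print(via_vectorize.dtype, via_frompyfunc.dtype)
```

In every case `divide` is still called once per element from Python, which is why none of these is vectorisation in the performance sense.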

In short, `np.vectorize` does what a Python-level loop *should* do, but `pd.DataFrame.apply` adds a chunky overhead. There’s no JIT-compilation which you see with `numba` (see below). It’s just a convenience.

### True vectorisation: what you *should* use

Why aren’t the above differences mentioned anywhere? Because the performance of truly vectorised calculations makes them irrelevant:

```
%timeit np.where(df['B'] == 0, 0, df['A'] / df['B']) # 1.17 ms
%timeit (df['A'] / df['B']).replace([np.inf, -np.inf], 0) # 1.96 ms
```

Yes, that’s ~40x faster than the fastest of the above loopy solutions. Either of these is acceptable. In my opinion, the first is succinct, readable and efficient. Only look at other methods, e.g. `numba` below, if performance is critical and this is part of your bottleneck.
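A small sketch on hypothetical data suggests one case where the two expressions differ: when both `A` and `B` are zero, the division yields `NaN` rather than infinity, so the `replace` version leaves it untouched while the `np.where` version returns 0:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, including a 0 / 0 row (index 2).
df = pd.DataFrame({'A': [1.0, 2.0, 0.0, -3.0], 'B': [2.0, 0.0, 0.0, 0.0]})

# np.where picks 0 wherever the divisor is zero; returns an ndarray.
via_where = np.where(df['B'] == 0, 0, df['A'] / df['B'])

# replace only catches +/-inf, so the 0 / 0 row stays NaN; returns a Series.
via_replace = (df['A'] / df['B']).replace([np.inf, -np.inf], 0)

print(via_where)
print(via_replace.tolist())
```

If your data can contain `0 / 0`, the `np.where` form is the one that matches the behaviour of a scalar function returning 0 for any zero divisor.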

#### `numba.njit`: greater efficiency

When loops *are* considered viable they are usually optimised via `numba` with underlying NumPy arrays to move as much as possible to compiled code.

Indeed, `numba` improves performance to *microseconds*. Without some cumbersome work, it will be difficult to get much more efficient than this.

```
import numpy as np
from numba import njit

@njit
def divide(a, b):
    # Loop over plain NumPy arrays; numba compiles this to machine code.
    res = np.empty(a.shape)
    for i in range(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0
    return res

%timeit divide(df['A'].values, df['B'].values)  # 717 µs
```

Using `@njit(parallel=True)` may provide a further boost for larger arrays.
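As a rough sketch of what the parallel variant might look like: explicit loops generally need `numba.prange` in place of `range` for `parallel=True` to split them across threads. The name `divide_parallel` and the sample arrays are hypothetical, and the `ImportError` fallback is only there so the snippet still runs (without any speed-up) where `numba` is not installed:

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (an assumption for portability): plain Python, no compilation.
    prange = range

    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda func: func


@njit(parallel=True)
def divide_parallel(a, b):
    res = np.empty(a.shape)
    # prange marks the iterations as independent, so they may run in parallel.
    for i in prange(len(a)):
        if b[i] != 0:
            res[i] = a[i] / b[i]
        else:
            res[i] = 0.0
    return res


# Hypothetical sample arrays.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 0.0, 4.0, 0.5])
result = divide_parallel(a, b)
print(result)
```

Threading has its own overhead, so whether this beats the serial `@njit` version depends on array size and hardware; time it on your own data before adopting it.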

^{1} Numeric types include: `int`, `float`, `datetime`, `bool`, `category`. They *exclude* `object` dtype and can be held in contiguous memory blocks.

^{2} There are at least 2 reasons why NumPy operations are efficient versus Python:

- Everything in Python is an object. This includes, unlike C, numbers. Python types therefore have an overhead which does not exist with native C types.
- NumPy methods are usually C-based. In addition, optimised algorithms are used where possible.
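The per-object overhead in the first point can be glimpsed with a short sketch. The exact size of a Python `float` object is a CPython implementation detail, so no specific figure is assumed here beyond it being larger than the 8 bytes a `float64` element occupies:

```python
import sys

import numpy as np

values = [float(i) for i in range(1000)]
arr = np.array(values)                       # float64, one contiguous block
object_arr = np.array(values, dtype=object)  # array of references to Python floats

python_float_bytes = sys.getsizeof(values[0])  # size of one Python float object
numpy_item_bytes = arr.itemsize                # bytes per float64 element

print(python_float_bytes, numpy_item_bytes)
print(arr.dtype, object_arr.dtype, arr.flags['C_CONTIGUOUS'])
```

An `object` array stores references to full Python objects rather than raw numbers, which is why it is excluded from the numeric types in footnote 1.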