Thanks for everyone’s replies. Here is another attempt at summarizing them. Pardon if I say too many “obvious” things: I knew nothing about least squares before, so everything was new to me.

## NOT polynomial interpolation

Polynomial interpolation is fitting a polynomial of degree `n` given `n+1` data points, e.g. finding a cubic that passes exactly through four given points. As said in the question, this was not what I wanted: I had a lot of points and wanted a small-degree polynomial (which will only *approximately* fit, unless we’ve been lucky). But since some of the answers insisted on talking about it, I should mention them 🙂 Lagrange polynomial, Vandermonde matrix, etc.
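For contrast with what follows, here is a minimal sketch of interpolation via the Vandermonde matrix. The four data points are made up for illustration (they happen to lie on 1 + x + x^3); with four points and four unknown coefficients, the cubic is pinned down exactly and nothing is being approximated.

```python
import numpy as np

# Four hypothetical data points, invented for this example.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 3.0, 11.0, 31.0])

# Vandermonde matrix: row i is [1, x_i, x_i^2, x_i^3].
V = np.vander(xs, N=4, increasing=True)

# Four equations in four unknowns, so the system is square and
# (for distinct x values) has exactly one solution.
coeffs = np.linalg.solve(V, ys)

# The interpolating cubic reproduces every data point up to rounding.
fitted = V @ coeffs
print(coeffs)
print(np.allclose(fitted, ys))
```

With many more points than coefficients, the matrix is no longer square and there is generally no exact solution, which is where least squares comes in.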

## What is least-squares?

“Least squares” is a particular definition/criterion/“metric” of “how well” a polynomial fits. (There are others, but this is the simplest.) Say you are trying to fit a polynomial

p(x,y) = a + bx + cy + dx^{2} + ey^{2} + fxy

to some given data points (x_{i},y_{i},Z_{i}) (where “Z_{i}” was “f(x_{i},y_{i})” in the question). With least-squares the problem is to find the “best” coefficients (a,b,c,d,e,f), such that what is minimized (kept “least”) is the “sum of squared residuals”, namely

S = ∑_{i} (a + bx_{i} + cy_{i} + dx_{i}^{2} + ey_{i}^{2} + fx_{i}y_{i} – Z_{i})^{2}
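As a sketch of what S measures, the following evaluates it directly for a given choice of coefficients. The function name `sum_squared_residuals` and the four sample points are my own, chosen so that Z = (x+y)^2 exactly.

```python
def sum_squared_residuals(coeffs, points):
    """Return S for p(x,y) = a + bx + cy + dx^2 + ey^2 + fxy."""
    a, b, c, d, e, f = coeffs
    total = 0.0
    for x, y, z in points:
        p = a + b*x + c*y + d*x**2 + e*y**2 + f*x*y
        total += (p - z)**2  # squared residual at this point
    return total

# Hypothetical data: (x, y, Z) with Z = (x + y)^2.
points = [(0, 0, 0), (1, 0, 1), (0, 1, 1), (1, 1, 4)]

exact = (0, 0, 0, 1, 1, 2)   # x^2 + y^2 + 2xy fits perfectly
worse = (0, 0, 0, 1, 1, 0)   # dropping the xy term misses the point (1, 1)

print(sum_squared_residuals(exact, points))
print(sum_squared_residuals(worse, points))
```

Least squares is then the search for the coefficients that make this number as small as possible.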

## Theory

The important idea is that if you look at S as a function of (a,b,c,d,e,f), then S is minimized at a point at which its gradient is 0. This means that for example ∂S/∂f=0, i.e. that

∑_{i}2(a + … + fx_{i}y_{i} – Z_{i})x_{i}y_{i} = 0

and similar equations for a, b, c, d, e.

Note that these are just linear equations in a…f. So we can solve them with Gaussian elimination or any of the usual methods.
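A minimal sketch of that recipe, assuming NumPy and some made-up noise-free data: if A is the matrix whose row i is (1, x_i, y_i, x_i^2, y_i^2, x_i y_i), then setting the gradient to zero gives the linear system (AᵀA)c = AᵀZ, which `numpy.linalg.solve` handles with an LU factorization (i.e. Gaussian elimination with pivoting).

```python
import numpy as np

# Made-up sample points; Z is the noise-free value of (x + y)^2.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)
y = rng.uniform(0, 10, size=30)
Z = (x + y)**2

# Columns match the coefficient order (a, b, c, d, e, f).
A = np.column_stack([np.ones_like(x), x, y, x**2, y**2, x*y])

# Gradient of S is 2 A^T (A c - Z); setting it to zero gives
# the six linear equations (A^T A) c = A^T Z.
coeffs = np.linalg.solve(A.T @ A, A.T @ Z)

print(np.round(coeffs, 6))  # expect roughly (0, 0, 0, 1, 1, 2)
```

Forming AᵀA explicitly is the textbook route but can lose accuracy when A is badly conditioned, so library routines usually solve the same problem by other means.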

This is still called “linear least squares”, because although the function we wanted was a quadratic polynomial, it is still linear *in the parameters* (a,b,c,d,e,f). Note that the same thing works when we want p(x,y) to be any “linear combination” of *arbitrary* functions f_{j}, instead of just a polynomial (= “linear combination of monomials”).
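To illustrate the “arbitrary functions” point, here is a sketch with a single variable t and three basis functions of my own choosing (1, sin t, e^t); nothing here is polynomial, but the fit is still linear in the weights, so the same machinery applies.

```python
import numpy as np

# Hypothetical basis functions f_j; any functions would do.
basis = [lambda t: np.ones_like(t), np.sin, np.exp]

# Made-up data generated from known weights (3, 2, -0.5), no noise.
t = np.linspace(0, 2, 25)
data = 3.0 + 2.0*np.sin(t) - 0.5*np.exp(t)

# Column j of A holds f_j evaluated at every sample point.
A = np.column_stack([f(t) for f in basis])

# lstsq returns (solution, residuals, rank, singular values).
weights, *_ = np.linalg.lstsq(A, data, rcond=None)

print(np.round(weights, 6))
```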

## Code

For the univariate case (when there is only one variable x, so the f_{j} are monomials x^{j}), there is NumPy’s `polyfit`:

```
>>> import numpy
>>> xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> ys = [1.1, 3.9, 11.2, 21.5, 34.8, 51, 70.2, 92.3, 117.4, 145.5]
>>> p = numpy.poly1d(numpy.polyfit(xs, ys, deg=2))
>>> print(p)
       2
1.517 x + 2.483 x + 0.4927
```

For the multivariate case, or linear least squares in general, there is SciPy’s `linalg.lstsq`. As explained in its documentation, it takes a matrix A of the values f_{j}(**x**_{i}). (In theory, the solution it returns is the Moore-Penrose pseudoinverse of A applied to Z.) With our above example involving (x_{i},y_{i},Z_{i}), fitting a polynomial means the f_{j} are the monomials in x and y, i.e. products of a power of x and a power of y. The following finds the best quadratic (or best polynomial of any other degree, if you change the “degree = 2” line):

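```
from scipy import linalg
import random

# Random sample points, with Z = (x+y)^2 plus a little noise
n = 20
x = [100*random.random() for i in range(n)]
y = [100*random.random() for i in range(n)]
Z = [(x[i]+y[i])**2 + 0.01*random.random() for i in range(n)]

degree = 2

# Row i of A holds every monomial x^xd * y^yd of total degree <= degree
A = []
for i in range(n):
    A.append([])
    for xd in range(degree+1):
        for yd in range(degree+1-xd):
            A[i].append((x[i]**xd)*(y[i]**yd))  # f_j(x_i)

# lstsq returns (coefficients, residues, rank, singular values)
c, _, _, _ = linalg.lstsq(A, Z)

# Print the coefficients in the same order as the columns of A
j = 0
for xd in range(0, degree+1):
    for yd in range(0, degree+1-xd):
        print(" + (%.2f)x^%dy^%d" % (c[j], xd, yd), end="")
        j += 1
print()
```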

prints

```
+ (0.01)x^0y^0 + (-0.00)x^0y^1 + (1.00)x^0y^2 + (-0.00)x^1y^0 + (2.00)x^1y^1 + (1.00)x^2y^0
```

so it has discovered that the polynomial is x^{2}+2xy+y^{2}+0.01. [The last term is sometimes -0.01 and sometimes 0, which is to be expected because of the random noise we added.]

Alternatives to Python+Numpy/Scipy are R and Computer Algebra Systems: Sage, Mathematica, Matlab, Maple. Even Excel might be able to do it. Numerical Recipes discusses methods to implement it ourselves (in C, Fortran).

## Concerns

- It is strongly influenced by **how the points are chosen**. When I had `x=y=range(20)` instead of the random points, it always produced 1.33x^{2}+1.33xy+1.33y^{2}, which was puzzling… until I realised that because I always had `x[i]=y[i]`, the polynomials were the same on those points: x^{2}+2xy+y^{2} = 4x^{2} = (4/3)(x^{2}+xy+y^{2}). So the moral is that it is important to choose the points carefully to get the “right” polynomial. (If you can choose, you should choose Chebyshev nodes for polynomial interpolation; not sure if the same is true for least squares as well.)
- **Overfitting**: higher-degree polynomials can always fit the data at least as well. If you change the `degree` to 3 or 4 or 5, it still mostly recognizes the same quadratic polynomial (coefficients are 0 for higher-degree terms), but for larger degrees it starts fitting higher-degree polynomials. But even with degree 6, taking larger n (more data points instead of 20, say 200) still fits the quadratic polynomial. So the moral is to avoid overfitting, for which it might help to take as many data points as possible.
- There might be issues of numerical stability I don’t fully understand.
- If you don’t need a polynomial, you can obtain better fits with other kinds of functions, e.g. splines (piecewise polynomials).
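The first concern can be detected mechanically. As a sketch (using NumPy’s `lstsq` rather than SciPy’s, and an `rcond` cutoff of 1e-10 that I picked arbitrarily), the rank returned alongside the coefficients shows that with `x[i]=y[i]` the six monomial columns only span three independent directions, so the data cannot tell x^{2}, xy and y^{2} apart:

```python
import numpy as np

degree = 2
x = np.arange(20, dtype=float)
y = x.copy()          # degenerate choice: every point has x[i] == y[i]
Z = (x + y)**2        # noise-free values of x^2 + 2xy + y^2

# Same column order as the code above:
# (xd, yd) = (0,0), (0,1), (0,2), (1,0), (1,1), (2,0).
A = np.column_stack([x**xd * y**yd
                     for xd in range(degree + 1)
                     for yd in range(degree + 1 - xd)])

# Singular values below rcond * (largest singular value) count as zero;
# lstsq then returns the minimum-norm solution among all exact fits.
c, _, rank, _ = np.linalg.lstsq(A, Z, rcond=1e-10)

print("columns:", A.shape[1], "rank:", rank)
print(np.round(c, 2))
```

A rank smaller than the number of columns is a sign that the sample points do not determine the polynomial uniquely; the solver then splits the weight evenly between indistinguishable terms, which is where the 1.33 values come from.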