Let’s take an example of a transform, sklearn.preprocessing.StandardScaler.
From the docs, this will:
Standardize features by removing the mean and scaling to unit variance
Suppose you’re working with code like the following.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# X is features, y is label
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.33, random_state=42
)
When you call StandardScaler.fit(X_train), the scaler computes the mean and standard deviation of each feature in X_train. Calling .transform() then standardizes the features by subtracting the mean and dividing by the standard deviation. For convenience, these two calls can be combined into a single step with fit_transform().
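To make the mechanics concrete, here is a minimal sketch (the toy values for X_train are purely illustrative) showing that transform() is per-feature subtraction of the mean and division by the standard deviation, and that fit_transform() is equivalent to the two calls in sequence:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for X_train (illustrative values only)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)                  # learns per-feature mean and std
X_scaled = scaler.transform(X_train)

# transform() is equivalent to (X - mean) / std, computed per feature
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(X_scaled, manual))  # True

# fit_transform() does both steps at once
print(np.allclose(scaler.fit_transform(X_train), X_scaled))  # True
```

Note that scikit-learn uses the population standard deviation (the NumPy default), which is why the manual computation matches exactly.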
The reason you want to fit the scaler using only the training data is that you don't want to bias your model with information from the test data. If you call fit() on your test data, you compute a new mean and standard deviation for each feature. In theory these values would be very similar if your test and train sets have the same distribution, but in practice this is typically not the case. Instead, you want to transform the test data using only the parameters computed on the training data.
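Putting it all together, the correct pattern is to fit on the training split and reuse those learned parameters for the test split. A minimal sketch, using randomly generated data in place of the X and y above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for X and y (illustrative values only)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean/std

# The training features are now exactly standardized...
print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # True

# ...while the test features are close to, but not exactly, mean zero,
# because they were scaled with the training set's parameters
print(X_test_scaled.mean(axis=0))
```

The small residual offset in the test means is expected and harmless; it simply reflects that the test set was not allowed to influence the scaling parameters.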