Let’s take an example of a transform, sklearn.preprocessing.StandardScaler.
From the docs, this will:
Standardize features by removing the mean and scaling to unit variance
Suppose you’re working with code like the following.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# X is features, y is label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
When you call StandardScaler().fit(X_train), the scaler calculates the mean and standard deviation of each feature in X_train. Calling .transform() then standardizes the features by subtracting the mean and dividing by the standard deviation. For convenience, these two steps can be combined into a single call with .fit_transform().
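To make this concrete, here's a small sketch using a made-up feature matrix (the values are invented for illustration). It shows that fit() followed by transform() produces the same result as fit_transform(), and that the learned statistics are exposed on the fitted scaler:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative feature matrix (values made up for this example)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [3.0, 30.0]])

scaler = StandardScaler()
scaler.fit(X_train)                 # learns per-feature mean and variance
X_scaled = scaler.transform(X_train)

# fit_transform does both steps in one call and gives the same result
X_scaled2 = StandardScaler().fit_transform(X_train)

print(scaler.mean_)                        # per-feature means: [ 2. 20.]
print(np.allclose(X_scaled, X_scaled2))    # True
```

After fitting, the learned parameters are available as attributes like mean_ and var_, which is what makes it possible to reuse them on new data later.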
The reason you want to fit the scaler using only the training data is that you don't want to bias your model with information from the test data. If you called fit() on the test data as well, you'd compute a new mean and standard deviation for each feature. In theory these values would be very similar if your test and train sets came from the same distribution, but in practice that's typically not the case. Instead, you want to transform the test data using only the parameters computed on the training data.
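The correct pattern can be sketched as follows. The small arrays here are stand-ins for the X_train/X_test produced by train_test_split above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test data (stand-ins for a real split)
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training mean/std

# The test data is standardized with the *training* statistics,
# so a test value equal to the training mean maps to 0, and values
# far outside the training range map far from 0.
print(X_test_scaled)
```

Note that fit_transform() is only ever called on the training set; the test set sees transform() alone, so no information about the test distribution leaks into the scaling parameters.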