train-test-split – Make Me Engineer

Should Feature Selection be done before Train-Test Split or after?

May 14, 2023 by Tarik

It is not difficult to demonstrate why selecting features on the whole dataset (i.e. before splitting into train and test sets) can lead you astray. Here is one such demonstration using random dummy data with Python and scikit-learn:

    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics …
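The listing in this excerpt is cut off, so what follows is a separate minimal sketch of the same idea rather than the author's own code. It assumes purely random features and labels, so no real signal exists and an honest test accuracy should hover around chance. The array sizes, `k=25`, the seeds, and the helper `fit_and_score` are illustrative choices, not taken from the post.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Random dummy data: the labels are independent of every feature.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10000))
y = rng.integers(0, 2, size=500)


def fit_and_score(X_train, X_test, y_train, y_test):
    """Hypothetical helper: fit a classifier and return test accuracy."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Leaky order: select features on ALL rows, then split.
# The selector has already seen the labels of the future test rows.
X_selected = SelectKBest(f_classif, k=25).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_selected, y, test_size=0.25, random_state=0
)
acc_leaky = fit_and_score(X_tr, X_te, y_tr, y_te)

# Sound order: split first, fit the selector on the training rows only,
# then apply the same fitted selector to the test rows.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
selector = SelectKBest(f_classif, k=25).fit(X_tr, y_tr)
acc_honest = fit_and_score(
    selector.transform(X_tr), selector.transform(X_te), y_tr, y_te
)

print(f"selection before split: {acc_leaky:.2f}")
print(f"selection after split:  {acc_honest:.2f}")
```

Because the data are noise, any accuracy clearly above chance in the first variant can only come from the selector having seen the test labels. In practice, wrapping the selector and the classifier in a scikit-learn `Pipeline` is one way to keep the selection step inside the training data automatically.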

Order between using validation, training and test sets

November 6, 2022 by Tarik

The Wikipedia article is not wrong; in my own experience, this is a frequent point of confusion among newcomers to ML. There are two separate ways of approaching the problem: either you use an explicit validation set for hyperparameter search and tuning, or you use cross-validation. So, the standard point is that you …
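The two approaches named above can be sketched side by side. This is an illustrative example under stated assumptions, not the post's own code: the synthetic dataset, the candidate values in `candidate_C`, the split sizes, and the use of logistic regression are all hypothetical choices. In both variants the test set is touched only once, after the hyperparameter has been chosen.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data and a small hyperparameter grid (both assumptions).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
candidate_C = [0.01, 0.1, 1.0, 10.0]

# Hold out the test set first; it plays no part in tuning.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Approach 1: explicit validation set carved out of the development data.
X_train, X_val, y_train, y_val = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=0
)
val_scores = {}
for C in candidate_C:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_scores[C] = model.score(X_val, y_val)
best_C_holdout = max(val_scores, key=val_scores.get)
holdout_model = LogisticRegression(C=best_C_holdout, max_iter=1000)
holdout_model.fit(X_train, y_train)
test_acc_holdout = holdout_model.score(X_test, y_test)

# Approach 2: cross-validation on the development data, with no separate
# validation set. GridSearchCV refits the best setting on all of X_dev.
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": candidate_C}, cv=5
)
search.fit(X_dev, y_dev)
best_C_cv = search.best_params_["C"]
test_acc_cv = search.score(X_test, y_test)

print(f"validation set:   C={best_C_holdout}, test accuracy={test_acc_holdout:.2f}")
print(f"cross-validation: C={best_C_cv}, test accuracy={test_acc_cv:.2f}")
```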