Should Feature Selection be done before Train-Test Split or after?

It is not actually difficult to demonstrate why using the whole dataset (i.e. before splitting into train/test) for selecting features can lead you astray. Here is one such demonstration using random dummy data with Python and scikit-learn:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics …
```

Read more
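The demonstration is truncated above, but it presumably contrasts selecting features on the full data against selecting them on the training fold only. Here is a minimal sketch of that idea; the sample sizes, `k`, seed, and print labels are my own assumptions, not necessarily the original author's:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)

# Pure noise: 100 samples, 10,000 features, random binary labels.
# Any "predictive" feature found here is predictive by chance alone.
X = rng.randn(100, 10000)
y = rng.choice([0, 1], size=100)

# Leaky: select features on the FULL dataset, then split.
# The selector has already seen the test labels, so information leaks
# and the reported test accuracy is optimistically inflated.
X_sel = SelectKBest(k=20).fit_transform(X, y)
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.3, random_state=42)
clf = LogisticRegression().fit(X_tr, y_tr)
print("selection before split:", accuracy_score(y_te, clf.predict(X_te)))

# Correct: split first, fit the selector on the training fold only.
# On pure noise this should score near chance (~0.5).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
selector = SelectKBest(k=20).fit(X_tr, y_tr)
clf = LogisticRegression().fit(selector.transform(X_tr), y_tr)
print("selection after split:",
      accuracy_score(y_te, clf.predict(selector.transform(X_te))))
```

Because the data are pure noise, the gap between the two printed accuracies is entirely due to the leakage, which is the point of the demonstration.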

How are feature_importances in RandomForestClassifier determined?

There are indeed several ways to get feature “importances”. As is often the case, there is no strict consensus about what this word means. In scikit-learn, we implement the importance as described in [1] (often cited, but unfortunately rarely read…). It is sometimes called “gini importance” or “mean decrease impurity” and is defined as the total decrease in … Read more
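To make the quoted definition concrete, here is a minimal sketch of how these impurity-based importances are exposed in scikit-learn; the dataset and hyperparameters are illustrative assumptions on my part:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

# feature_importances_ holds the mean decrease in impurity (Gini by default):
# the impurity decrease contributed by each feature's splits, weighted by the
# number of samples reaching those splits, averaged over all trees in the
# forest and normalized so the importances sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(iris.feature_names, forest.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name}: {imp:.3f}")
```

Note that impurity-based importances are computed from the training data alone and are known to favor high-cardinality features, which is one reason other notions of “importance” (such as permutation importance) coexist with this one.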

Feature/Variable importance after a PCA analysis

First of all, I assume that by “features” you mean the variables rather than the samples/observations. In that case, you could do something like the following by creating a biplot function that shows everything in one plot. In this example, I am using the iris data. Before the example, please note that the basic idea when … Read more
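The biplot function the answer goes on to build presumably overlays the PC1/PC2 scores of the observations with an arrow for each original variable's loading, so that variables with long arrows along a component are the ones driving it. Here is a minimal sketch of that idea on the iris data; the score scaling and the arrow/label styling are my own choices, not necessarily the original author's:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)  # PCA is scale-sensitive

pca = PCA(n_components=2)
scores = pca.fit_transform(X)

def biplot(scores, loadings, labels):
    """Scatter the PC1/PC2 scores and draw one loading arrow per variable."""
    xs, ys = scores[:, 0], scores[:, 1]
    # Rescale scores so they share the plot with the unit-scale loadings.
    scalex, scaley = 1.0 / (xs.max() - xs.min()), 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, c=iris.target, s=10)
    for (lx, ly), name in zip(loadings, labels):
        plt.arrow(0, 0, lx, ly, color='r', alpha=0.5)
        plt.text(lx * 1.15, ly * 1.15, name, color='g', ha='center', va='center')
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()

# Rows of components_ are the PCs; transpose so each row holds one
# variable's (PC1, PC2) loadings.
biplot(scores, pca.components_.T, iris.feature_names)
plt.show()
```

Reading the plot, each arrow's projection onto an axis is that variable's loading on the corresponding component, which is what “variable importance after PCA” usually refers to.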