TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

You need to convert the dtype object to a unicode string, as the traceback indicates:

    x = v.fit_transform(df['Review'].values.astype('U'))  # even astype(str) would work

From the doc page of TfidfVectorizer:

    fit_transform(raw_documents, y=None)
    Parameters: raw_documents : iterable
        An iterable which yields either str, unicode or file objects
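Below is a minimal runnable sketch of the fix; the DataFrame is invented for illustration, since the answer only shows the one-line conversion:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Invented data: an object-dtype column containing a NaN
    df = pd.DataFrame({"Review": ["great product", np.nan, "poor quality"]})

    v = TfidfVectorizer()
    # v.fit_transform(df["Review"])  # raises ValueError: np.nan is an invalid document
    x = v.fit_transform(df["Review"].values.astype("U"))  # NaN becomes the string "nan"
    print(x.shape)

Note that astype('U') turns NaN into the literal token "nan", so dropping or filling missing reviews first may be preferable.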

How to find the importance of the features for a logistic regression model?

One of the simplest ways to get a feel for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data. Consider this example:

    import numpy as np
    from sklearn.linear_model import …

Read more
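The example above is truncated, so here is a hedged reconstruction of the idea it describes (|coefficient| times the standard deviation of the corresponding feature), using synthetic data rather than the answer's original dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data, purely for illustration
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # Magnitude of each coefficient times the std of the corresponding feature
    importance = np.abs(clf.coef_[0]) * X.std(axis=0)
    print(importance)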

Sklearn Pipeline: Get feature names after OneHotEncoder in ColumnTransformer

You can access the feature names using the following snippet:

    clf.named_steps['preprocessor'].transformers_[1][1]\
       .named_steps['onehot'].get_feature_names(categorical_features)

With sklearn >= 0.21, we can make it even simpler:

    clf['preprocessor'].transformers_[1][1]\
       ['onehot'].get_feature_names(categorical_features)

Reproducible example:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression

… Read more
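The reproducible example is cut off above; the following is a hedged reconstruction built from those same imports, with invented column names and data. Note that sklearn >= 1.0 renamed get_feature_names to get_feature_names_out:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression

    # Invented toy data for illustration
    df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                       "city": ["London", "Paris", "London", "Berlin"],
                       "target": [1.0, 2.0, 3.0, 4.0]})
    numeric_features = ["age"]
    categorical_features = ["city"]

    numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="median")),
                                    ("scaler", StandardScaler())])
    categorical_transformer = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

    preprocessor = ColumnTransformer([("num", numeric_transformer, numeric_features),
                                      ("cat", categorical_transformer, categorical_features)])

    clf = Pipeline([("preprocessor", preprocessor),
                    ("regressor", LinearRegression())])
    clf.fit(df[numeric_features + categorical_features], df["target"])

    # transformers_[1] is the ("cat", ...) entry; index [1] is its fitted sub-pipeline
    onehot = clf.named_steps["preprocessor"].transformers_[1][1].named_steps["onehot"]
    print(onehot.get_feature_names_out(categorical_features))
    # ['city_Berlin' 'city_London' 'city_Paris']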

What are the pros and cons of get_dummies (Pandas) versus OneHotEncoder (Scikit-learn)?

For machine learning, you almost definitely want to use sklearn.OneHotEncoder. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient. Note that sklearn.OneHotEncoder has been updated in the latest version so that it does accept strings for categorical variables, as well as integers. The crux of … Read more
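The crux is truncated above; the standard illustration (invented categories here) is that pd.get_dummies encodes each frame independently, while a fitted OneHotEncoder remembers the training categories and keeps train and test columns aligned:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    train = pd.DataFrame({"color": ["red", "green"]})
    test = pd.DataFrame({"color": ["green"]})

    # get_dummies: the test frame silently ends up with fewer columns
    print(pd.get_dummies(train).columns.tolist())  # ['color_green', 'color_red']
    print(pd.get_dummies(test).columns.tolist())   # ['color_green']

    # OneHotEncoder: categories learned at fit time are reused at transform time
    enc = OneHotEncoder(handle_unknown="ignore").fit(train)
    print(enc.transform(test).toarray())  # [[1. 0.]] -- two columns, as in training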

Can anyone explain StandardScaler to me?

Intro

I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function, by the way: X.shape should be [number_of_samples, number_of_features]).

Core of method

The main idea is to normalize/standardize, i.e. μ = 0 and σ … Read more
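A small sketch of what that standardization does in practice, on an invented toy matrix:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # X.shape is [number_of_samples, number_of_features]
    X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0))  # ~[0. 0.]  (mu = 0 per feature)
    print(X_scaled.std(axis=0))   # ~[1. 1.]  (sigma = 1 per feature)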

fit-transform on training data and transform on test data [duplicate]

Let's take an example of a transform, sklearn.preprocessing.StandardScaler. From the docs, this will:

Standardize features by removing the mean and scaling to unit variance

Suppose you're working with code like the following:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # X is features, y is label
    X_train, X_test, y_train, y_test …

Read more
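The snippet is truncated above; a hedged completion of the pattern it sets up, with random placeholder data, looks like this:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # X is features, y is label (placeholder values)
    X = np.random.rand(100, 3)
    y = np.random.randint(0, 2, size=100)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # learn mean/std on the training split only
    X_test = scaler.transform(X_test)        # reuse those statistics; never refit on test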