TfidfVectorizer in scikit-learn: ValueError: np.nan is an invalid document

You need to convert the dtype object to a unicode string, as the traceback indicates:

    x = v.fit_transform(df['Review'].values.astype('U'))  # even astype(str) would work

From the doc page of TfidfVectorizer:

    fit_transform(raw_documents, y=None)
    Parameters: raw_documents : iterable
        An iterable which yields either str, unicode or file objects
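Below is a minimal runnable sketch of the fix; the DataFrame is invented for illustration, since the answer only shows the one-line conversion:

    import numpy as np
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Invented data: an object-dtype column containing a NaN
    df = pd.DataFrame({"Review": ["great product", np.nan, "poor quality"]})

    v = TfidfVectorizer()
    # v.fit_transform(df["Review"])  # raises ValueError: np.nan is an invalid document
    x = v.fit_transform(df["Review"].values.astype("U"))  # NaN becomes the string "nan"
    print(x.shape)

Note that astype('U') turns NaN into the literal token "nan", so dropping or filling missing reviews first may be preferable.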

How to find the importance of the features for a logistic regression model?

One of the simplest ways to get a feel for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data. Consider this example:

    import numpy as np
    from sklearn.linear_model import …

Read more
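The example above is truncated, so here is a hedged reconstruction of the idea it describes (|coefficient| times the standard deviation of the corresponding feature), using synthetic data rather than the answer's original dataset:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data, purely for illustration
    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X, y)

    # Magnitude of each coefficient times the std of the corresponding feature
    importance = np.abs(clf.coef_[0]) * X.std(axis=0)
    print(importance)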

Sklearn Pipeline: Get feature names after OneHotEncoder in ColumnTransformer

You can access the feature names using the following snippet:

    clf.named_steps['preprocessor'].transformers_[1][1]\
       .named_steps['onehot'].get_feature_names(categorical_features)

With sklearn >= 0.21, we can make it even simpler:

    clf['preprocessor'].transformers_[1][1]\
       ['onehot'].get_feature_names(categorical_features)

Reproducible example:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression

… Read more
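The reproducible example is cut off above; the following is a hedged reconstruction built from those same imports, with invented column names and data. Note that sklearn >= 1.0 renamed get_feature_names to get_feature_names_out:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LinearRegression

    # Invented toy data for illustration
    df = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0],
                       "city": ["London", "Paris", "London", "Berlin"],
                       "target": [1.0, 2.0, 3.0, 4.0]})
    numeric_features = ["age"]
    categorical_features = ["city"]

    numeric_transformer = Pipeline([("imputer", SimpleImputer(strategy="median")),
                                    ("scaler", StandardScaler())])
    categorical_transformer = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

    preprocessor = ColumnTransformer([("num", numeric_transformer, numeric_features),
                                      ("cat", categorical_transformer, categorical_features)])

    clf = Pipeline([("preprocessor", preprocessor),
                    ("regressor", LinearRegression())])
    clf.fit(df[numeric_features + categorical_features], df["target"])

    # transformers_[1] is the ("cat", ...) entry; index [1] is its fitted sub-pipeline
    onehot = clf.named_steps["preprocessor"].transformers_[1][1].named_steps["onehot"]
    print(onehot.get_feature_names_out(categorical_features))
    # ['city_Berlin' 'city_London' 'city_Paris']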

What are the pros and cons of get_dummies (Pandas) versus OneHotEncoder (Scikit-learn)?

For machine learning, you almost definitely want to use sklearn.OneHotEncoder. For other tasks like simple analyses, you might be able to use pd.get_dummies, which is a bit more convenient. Note that sklearn.OneHotEncoder has been updated in the latest version so that it does accept strings for categorical variables, as well as integers. The crux of … Read more
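The crux is truncated above; the standard illustration (invented categories here) is that pd.get_dummies encodes each frame independently, while a fitted OneHotEncoder remembers the training categories and keeps train and test columns aligned:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    train = pd.DataFrame({"color": ["red", "green"]})
    test = pd.DataFrame({"color": ["green"]})

    # get_dummies: the test frame silently ends up with fewer columns
    print(pd.get_dummies(train).columns.tolist())  # ['color_green', 'color_red']
    print(pd.get_dummies(test).columns.tolist())   # ['color_green']

    # OneHotEncoder: categories learned at fit time are reused at transform time
    enc = OneHotEncoder(handle_unknown="ignore").fit(train)
    print(enc.transform(test).toarray())  # [[1. 0.]] -- two columns, as in training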

Can anyone explain StandardScaler to me?

Intro

I assume that you have a matrix X where each row/line is a sample/observation and each column is a variable/feature (this is the expected input for any sklearn ML function, by the way: X.shape should be [number_of_samples, number_of_features]).

Core of method

The main idea is to normalize/standardize, i.e. μ = 0 and σ … Read more
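A small sketch of what that standardization does in practice, on an invented toy matrix:

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # X.shape is [number_of_samples, number_of_features]
    X = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

    X_scaled = StandardScaler().fit_transform(X)
    print(X_scaled.mean(axis=0))  # ~[0. 0.]  (mu = 0 per feature)
    print(X_scaled.std(axis=0))   # ~[1. 1.]  (sigma = 1 per feature)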

fit-transform on training data and transform on test data [duplicate]

Let's take an example of a transform, sklearn.preprocessing.StandardScaler. From the docs, this will:

Standardize features by removing the mean and scaling to unit variance

Suppose you're working with code like the following:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # X is features, y is label
    X_train, X_test, y_train, y_test …

Read more
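The snippet is truncated above; a hedged completion of the pattern it sets up, with random placeholder data, looks like this:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # X is features, y is label (placeholder values)
    X = np.random.rand(100, 3)
    y = np.random.randint(0, 2, size=100)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # learn mean/std on the training split only
    X_test = scaler.transform(X_test)        # reuse those statistics; never refit on test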