tf-idf – Make Me Engineer

TfidfVectorizer in scikit-learn : ValueError: np.nan is an invalid document

June 10, 2023 by Tarik

You need to convert the dtype object to unicode string as is clearly mentioned in the traceback. x = v.fit_transform(df[‘Review’].values.astype(‘U’)) ## Even astype(str) would work From the Doc page of TFIDF Vectorizer: fit_transform(raw_documents, y=None) Parameters: raw_documents : iterable an iterable which yields either str, unicode or file objects

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

May 11, 2023 by Tarik

Check out NLTK package: http://www.nltk.org it has everything what you need For the cosine_similarity: def cosine_distance(u, v): “”” Returns the cosine of the angle between vectors v and u. This is equal to u.v / |u||v|. “”” return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v))) For ngrams: def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None): “”” … Read more

How to get word details from TF Vector RDD in Spark ML Lib?

May 4, 2023 by Tarik

Well, you can’t. Since hashing is non-injective there is no inverse function. In other words infinite number of tokens can map to a single bucket so it is impossible to tell which one is actually there. If you’re using a large hash and number of unique tokens is relatively low then you can try to … Read more

get cosine similarity between two documents in lucene

October 10, 2022 by Tarik

As Julia points out Sujit Pal’s example is very useful but the Lucene 4 API has substantial changes. Here is a version rewritten for Lucene 4. import java.io.IOException; import java.util.*; import org.apache.commons.math3.linear.*; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.core.SimpleAnalyzer; import org.apache.lucene.document.*; import org.apache.lucene.document.Field.Store; import org.apache.lucene.index.*; import org.apache.lucene.store.*; import org.apache.lucene.util.*; public class CosineDocumentSimilarity { public static final String CONTENT … Read more

Python: tf-idf-cosine: to find document similarity

September 2, 2022 by Tarik

First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization you can do it in one operation with TfidfVectorizer: >>> from sklearn.feature_extraction.text import TfidfVectorizer >>> from sklearn.datasets import fetch_20newsgroups >>> twenty = fetch_20newsgroups() >>> tfidf = TfidfVectorizer().fit_transform(twenty.data) >>> tfidf <11314×130088 sparse matrix of type ‘<type ‘numpy.float64′>’ with 1787553 … Read more