information-retrieval – Make Me Engineer

How to parse the data from Google Alerts?

May 27, 2023 by Tarik

When you create the alert, set “Deliver To” to “Feed”; you can then consume the feed XML as you would any other feed. This is much easier to parse and load into a database than the default email delivery.
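As a rough illustration of consuming such a feed, here is a minimal sketch using only the Python standard library. It assumes the feed is in Atom format; the sample document, the `parse_alert_feed` name, and the chosen fields are hypothetical, and a real alert feed would be downloaded from the feed URL that Google Alerts shows for the alert.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in Atom format, standing in for a downloaded alert feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Google Alert - example</title>
  <entry>
    <id>tag:example.com,2023:1</id>
    <title>First example result</title>
    <link href="https://example.com/first"/>
    <published>2023-05-27T10:00:00Z</published>
  </entry>
  <entry>
    <id>tag:example.com,2023:2</id>
    <title>Second example result</title>
    <link href="https://example.com/second"/>
    <published>2023-05-27T11:00:00Z</published>
  </entry>
</feed>
"""

# Atom elements live in this XML namespace, so lookups must be qualified.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_alert_feed(xml_text):
    """Return one dict per <entry>, ready to insert into a database table."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall("atom:entry", ATOM):
        link = entry.find("atom:link", ATOM)
        rows.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM),
            "link": link.get("href", "") if link is not None else "",
            "published": entry.findtext("atom:published", default="", namespaces=ATOM),
        })
    return rows


if __name__ == "__main__":
    for row in parse_alert_feed(SAMPLE_FEED):
        print(row["published"], row["title"], row["link"])
```

Each row is a plain dict, so it can be passed to whatever database layer you already use; the entry `id` is a reasonable candidate for a uniqueness key when the same feed is polled repeatedly.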

Fast/Optimize N-gram implementations in python

May 16, 2023 by Tarik

Here are a few attempts, with some profiling. I expected generators to improve the speed here, but the improvement was not noticeable compared with a slight modification of the original. If you don’t need the full list at once, though, the generator functions should be faster.

import timeit
from itertools import tee, izip, islice
def … Read more
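The post's own code is cut off in this excerpt, so the following is not the author's implementation, only a minimal sketch of the list-versus-generator contrast it describes. Note that `izip` exists only in Python 2; in Python 3 the built-in `zip` is already lazy. The function names `ngrams_list` and `ngrams_iter` are hypothetical.

```python
from itertools import islice, tee


def ngrams_list(tokens, n):
    """Build every n-gram up front by slicing; needs a sequence, returns a list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngrams_iter(tokens, n):
    """Yield n-grams lazily; works on any iterable, not only on lists."""
    # tee() gives n independent iterators over the same input;
    # the i-th copy is advanced by i items so that zip() lines them up.
    copies = tee(tokens, n)
    shifted = [islice(copy, i, None) for i, copy in enumerate(copies)]
    return zip(*shifted)


if __name__ == "__main__":
    import timeit

    words = "the quick brown fox jumps over the lazy dog".split() * 1000
    # Timings vary by machine and input size, so none are quoted here.
    print(timeit.timeit(lambda: ngrams_list(words, 3), number=20))
    print(timeit.timeit(lambda: list(ngrams_iter(words, 3)), number=20))
```

The lazy version only pays off when the consumer processes n-grams one at a time; wrapping it in `list(...)`, as the timing above does, forces the whole result into memory anyway.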

Python: tf-idf-cosine: to find document similarity

September 2, 2022 by Tarik

First off, if you want to extract count features and apply TF-IDF normalization and row-wise euclidean normalization you can do it in one operation with TfidfVectorizer:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> twenty = fetch_20newsgroups()
>>> tfidf = TfidfVectorizer().fit_transform(twenty.data)
>>> tfidf
<11314x130088 sparse matrix of type '<type 'numpy.float64'>' with 1787553 … Read more
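To make the three steps named above concrete (counts, TF-IDF weighting, row-wise Euclidean normalization), here is a small standard-library sketch. It is an illustration only, not the library's implementation: the whitespace tokenizer and the function names are hypothetical, and the smoothed idf formula ln((1 + n) / (1 + df)) + 1 is an assumption chosen for the example. The point it shows is that once every row has unit length, cosine similarity reduces to a plain dot product.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, each scaled to unit length."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (assumed form): ln((1 + n) / (1 + df)) + 1
        weights = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


def cosine(row_a, row_b):
    """Dot product of two unit-length sparse rows, i.e. their cosine similarity."""
    if len(row_a) > len(row_b):
        row_a, row_b = row_b, row_a  # iterate over the shorter row
    return sum(w * row_b.get(term, 0.0) for term, w in row_a.items())


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat lay on the rug",
        "stock markets fell sharply today",
    ]
    rows = tfidf_rows(corpus)
    for i, row in enumerate(rows):
        print(i, round(cosine(rows[0], row), 3))
```

Comparing the first document against every row this way yields a similarity score per document, which can then be sorted to find the closest matches.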