word2vec – Make Me Engineer

Convert word2vec bin file to text

June 10, 2023 by Tarik

I use this code to load the binary model and then save it as a text file:

    from gensim.models.keyedvectors import KeyedVectors
    model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.

Note: The code above is for newer versions of gensim. For earlier versions, I used this code:

    from gensim.models import word2vec
    model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

How to get vector for a sentence from the word2vec of tokens in sentence

June 8, 2023 by Tarik

There are different methods to get sentence vectors. Doc2Vec: you can train on your dataset using Doc2Vec and then use the resulting sentence vectors. Average of Word2Vec vectors: you can simply take the average of all the word vectors in a sentence, and this average vector represents the sentence. Average of Word2Vec … Read more
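
The averaging approach can be sketched in a few lines. This is a minimal illustration, not the post's own code: `average_vector` and `toy_vectors` are hypothetical names, the two-dimensional vectors are made up, and skipping out-of-vocabulary tokens is an assumption. A plain dict stands in for the model here. A gensim `KeyedVectors` object supports the same `in` and `[]` lookups.

```python
def average_vector(tokens, vectors):
    """Return the element-wise mean of the vectors of the known tokens.

    Tokens missing from `vectors` are skipped (an assumption of this sketch).
    Returns None when no token is known.
    """
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]


# Made-up 2-dimensional vectors standing in for a trained model.
toy_vectors = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "mat": [1.0, 1.0]}

# "the" is not in the toy vocabulary, so only "cat" and "sat" are averaged.
sentence_vector = average_vector("the cat sat".split(), toy_vectors)
print(sentence_vector)
```

A plain average ignores word order and weights every word equally, which is why it is usually treated as a baseline.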

How to calculate the sentence similarity using word2vec model of gensim with python

November 7, 2022 by Tarik

This is actually a pretty challenging problem that you are asking. Computing sentence similarity requires building a grammatical model of the sentence, understanding equivalent structures (e.g. “he walked to the store yesterday” and “yesterday, he walked to the store”), finding similarity not just in the pronouns and verbs but also in the proper nouns, finding … Read more

How to speed up Gensim Word2vec model load time?

July 26, 2022 by Tarik

In recent gensim versions you can load a subset starting from the front of the file using the optional limit parameter to load_word2vec_format(). (The GoogleNews vectors seem to be in roughly most- to least- frequent order, so the first N are usually the N-sized subset you’d want. So use limit=500000 to get the most-frequent 500,000 … Read more
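
In gensim the call itself would look like `KeyedVectors.load_word2vec_format(path, binary=True, limit=500000)`. To show what a limit does, here is a small pure-Python sketch that reads only the first entries of a word2vec text-format stream. It is not gensim's implementation: `read_first_vectors` is a hypothetical helper and the three-word file is made up.

```python
import io


def read_first_vectors(handle, limit):
    """Read at most `limit` vectors from a word2vec text-format stream.

    The first line holds "<vocab_size> <dimensions>"; each following line
    holds a word and then its vector components, separated by spaces.
    """
    count, _dim = (int(x) for x in handle.readline().split())
    vectors = {}
    for _ in range(min(limit, count)):
        parts = handle.readline().rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up file contents, ordered from most to least frequent word.
toy_file = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\nrare 0.5 0.6\n")

# Only the first two (most frequent) entries are read; "rare" is never parsed.
subset = read_first_vectors(toy_file, limit=2)
print(list(subset))
```

Because reading stops early, both load time and memory use scale with the limit rather than with the full vocabulary.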

My Doc2Vec code, after many loops/epochs of training, isn’t giving good results. What might be wrong?

July 12, 2022 by Tarik

Do not call .train() multiple times in your own loop that tries to do alpha arithmetic. It’s unnecessary, and it’s error-prone. Specifically, in the above code, decrementing the original 0.025 alpha by 0.001 forty times results in a final alpha of 0.025 – 40*0.001 = -0.015, which means alpha would also have been negative for many of the training epochs. … Read more
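
The alpha arithmetic can be checked without gensim. The sketch below only simulates the bookkeeping of such a manual loop. It assumes the loop trains with the current alpha and then subtracts 0.001, since the original code is not shown here. `alpha_schedule` is a hypothetical helper, and `Decimal` is used to avoid floating-point noise around zero.

```python
from decimal import Decimal


def alpha_schedule(start="0.025", step="0.001", epochs=40):
    """Return the alpha used in each epoch and the value left after the loop."""
    alpha = Decimal(start)
    used = []
    for _ in range(epochs):
        used.append(alpha)      # value a manual loop would pass to train()
        alpha -= Decimal(step)  # manual decrement after each pass
    return used, alpha


used, final_alpha = alpha_schedule()

# Epochs (0-indexed) that would have trained with a negative learning rate.
negative_epochs = [i for i, a in enumerate(used) if a < 0]

print(final_alpha)
print(len(negative_epochs), "of", len(used), "epochs use a negative alpha")
```

Under these assumptions the learning rate reaches zero at the 26th pass and is negative for every pass after it. Calling `train()` once with its `epochs` argument lets gensim manage the decay itself.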