Fast/Optimized N-gram implementations in Python

Here are some attempts, with profiling. I thought generators might improve the speed here, but the improvement was not noticeable compared with a slight modification of the original. If you don't need the full list in memory at once, though, the generator functions should be faster.

    import timeit
    from itertools import tee, izip, islice

    def …
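As a rough illustration of the generator idea described above, here is a minimal sketch for Python 3, where the built-in `zip` is already lazy and stands in for `izip`. The function name `ngrams_lazy` is hypothetical and is not taken from the original answer, whose code is cut off in this excerpt.

```python
from itertools import islice, tee


def ngrams_lazy(iterable, n):
    """Return a lazy iterator of n-grams (as tuples) over any iterable.

    Hypothetical helper for illustration; not the original answer's code.
    """
    # Make n independent iterators over the same input.
    iterators = tee(iterable, n)
    # Advance the i-th iterator by i items so the copies are staggered.
    shifted = [islice(it, i, None) for i, it in enumerate(iterators)]
    # zip stops at the shortest iterator, so only complete n-grams are produced.
    return zip(*shifted)
```

Consume the result with a `for` loop, or wrap it in `list(...)` when the whole result is needed at once. Nothing is computed until the iterator is consumed, which is where any memory saving would come from.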

Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

Check out the NLTK package at http://www.nltk.org; it has everything you need.

For the cosine similarity:

    import math
    import numpy

    def cosine_distance(u, v):
        """
        Returns the cosine of the angle between vectors v and u.
        This is equal to u.v / |u||v|.
        """
        return numpy.dot(u, v) / (math.sqrt(numpy.dot(u, u)) * math.sqrt(numpy.dot(v, v)))

For ngrams:

    def ngrams(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
        """ …
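The excerpt covers cosine similarity and n-grams but stops before tf-idf, so here is a minimal pure-Python sketch of that step. It assumes raw term counts for tf and log(N / df) for idf, which is only one of several common weightings. The names `tf_idf_vectors` and `cosine_similarity` are hypothetical and are not NLTK functions.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Assumption: tf is the raw term count and idf is log(N / df);
    other weighting schemes exist and may suit your data better.
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    vocabulary = sorted(df)
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With this weighting, a term that appears in every document gets an idf of zero and so contributes nothing to the similarity. The documents can be lists of words or lists of n-gram tuples.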

Computing N Grams using Python

A short Pythonesque solution from this blog:

    def find_ngrams(input_list, n):
        # list() makes the result a list on Python 3, where zip is lazy
        return list(zip(*[input_list[i:] for i in range(n)]))

Usage:

    >>> input_list = ['all', 'this', 'happened', 'more', 'or', 'less']
    >>> find_ngrams(input_list, 1)
    [('all',), ('this',), ('happened',), ('more',), ('or',), ('less',)]
    >>> find_ngrams(input_list, 2)
    [('all', 'this'), ('this', 'happened'), ('happened', 'more'), ('more', 'or'), ('or', 'less')]
    >>> find_ngrams(input_list, 3)
    [('all', 'this', 'happened'), ('this', …
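To see why the zip-of-slices trick works, it may help to look at the intermediate slices. The sketch below reuses the word list from the usage example; the variable names `shifted`, `trigrams`, and `too_long` are only illustrative.

```python
words = ["all", "this", "happened", "more", "or", "less"]
n = 3

# Each slice drops one more leading item than the previous one.
shifted = [words[i:] for i in range(n)]
# shifted[0] starts at "all", shifted[1] at "this", shifted[2] at "happened".

# zip pairs items position by position and stops at the shortest slice,
# so incomplete n-grams at the end are never emitted.
trigrams = list(zip(*shifted))

# When n exceeds the input length, the last slice is empty and so is the result.
too_long = list(zip(*[words[i:] for i in range(len(words) + 1)]))
```

Each slice copies part of the list, so this approach trades some memory for brevity.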

N-gram generation from a sentence

I believe this would do what you want:

    import java.util.*;

    public class Test {

        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] …
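For comparison with the Python entries above, the same sliding-window idea can be sketched in Python. This assumes, since the excerpt is cut off before the body of `concat`, that the helper joins the selected words with single spaces. The name `sentence_ngrams` is hypothetical.

```python
def sentence_ngrams(n, sentence):
    """Return the word n-grams of a sentence as space-joined strings.

    Assumption: words are separated by single spaces, mirroring the
    split(" ") call in the Java excerpt.
    """
    words = sentence.split(" ")
    # The window start runs from 0 to len(words) - n inclusive; when n is
    # larger than the number of words the range is empty and so is the result.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `sentence_ngrams(2, "this is my car")` gives the three bigrams "this is", "is my", and "my car".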

Filename search with ElasticSearch

You have various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify:

    "mappings": {
        "files": {

But your type is actually file, not files. If you checked the mapping, you would see that immediately:

    curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

    # {
    #    "files" : {
    #       "files" : {
    #          "properties" : …