Finding how similar two strings are

Ok, so the standard algorithms are: 1) Hamming distance. Only good for strings of the same length, but very efficient. Basically it just counts the number of positions at which the corresponding characters differ. Not useful for fuzzy searching of natural language text. 2) Levenshtein distance. The Levenshtein distance measures distance in terms of the number of “operations” required to … Read more
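
As a quick illustration of both measures, here is a plain-Python sketch (my own, not taken from the post above; the example strings are the usual textbook ones):

def hamming(a, b):
    # Hamming distance: only defined for equal-length strings;
    # count the positions where the characters differ.
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def levenshtein(a, b):
    # Levenshtein distance via the classic dynamic-programming table:
    # the minimum number of single-character insertions, deletions,
    # and substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, c1 in enumerate(a, 1):
        curr = [i]
        for j, c2 in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # substitution
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3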

Regex for existence of some words whose order doesn’t matter

See this regex: /^(?=.*Tim)(?=.*stupid).+/ Regex explanation: ^ Asserts position at start of string. (?=.*Tim) Asserts that “Tim” is present in the string. (?=.*stupid) Asserts that “stupid” is present in the string. .+ Now that our phrases are present, this string is valid. Go ahead and use .+ (or .++) to match the entire string. To … Read more
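
To see the same lookahead pattern in action, here is a small Python sketch (my own; the test strings are made up, not from the answer above):

import re

# Both lookaheads run from the start of the string, so "Tim" and "stupid"
# can appear in any order; .+ then consumes the rest of the first line.
pattern = re.compile(r"^(?=.*Tim)(?=.*stupid).+")

print(bool(pattern.search("Tim is stupid")))        # True
print(bool(pattern.search("stupid jokes by Tim")))  # True
print(bool(pattern.search("Tim is clever")))        # False (no "stupid")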

Javascript fuzzy search that makes sense

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one which behaves basically like sublime’s search. https://github.com/farzher/fuzzysort The only typo it allows is a transpose. It’s pretty solid (1k stars, 0 issues), very fast, and handles your case easily: fuzzysort.go('int', ['international', 'splint', 'tinder']) // [{highlighted: '*int*ernational', … Read more
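
For a rough feel of that behaviour, here is my own minimal Python sketch of subsequence-style matching with a single transposition allowed in the query. This is not fuzzysort’s actual algorithm or scoring, just an illustration:

def is_subsequence(query, target):
    # True if the characters of query appear in target in order (gaps allowed).
    it = iter(target)
    return all(c in it for c in query)

def loose_match(query, target):
    # Exact subsequence match first.
    if is_subsequence(query, target):
        return True
    # Otherwise retry with one pair of adjacent query characters swapped,
    # so a single transposition typo like "itn" still finds "int".
    for i in range(len(query) - 1):
        swapped = query[:i] + query[i + 1] + query[i] + query[i + 2:]
        if is_subsequence(swapped, target):
            return True
    return False

print(loose_match("int", "international"))  # True
print(loose_match("int", "splint"))         # True
print(loose_match("itn", "splint"))         # True (one transposition)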

High performance fuzzy string comparison in Python, use Levenshtein or difflib [closed]

In case you’re interested in a quick visual comparison of Levenshtein and Difflib similarity, I calculated both for ~2.3 million book titles: import codecs, difflib, Levenshtein, distance with codecs.open("titles.tsv","r","utf-8") as f: title_list = f.read().split("\n")[:-1] for row in title_list: sr = row.lower().split("\t") diffl = difflib.SequenceMatcher(None, sr[3], sr[4]).ratio() lev = Levenshtein.ratio(sr[3], sr[4]) sor = 1 - distance.sorensen(sr[3], … Read more
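
If you just want to sanity-check the two ratios on a single pair before running them over a large file, here is a minimal sketch (my own example strings; it assumes the python-Levenshtein package is installed, while difflib ships with Python):

import difflib
import Levenshtein

a = "the great gatsby"
b = "the greatest gatsby"

# Both report similarity in [0, 1]; they broadly agree on this pair
# but can diverge on longer, noisier strings.
print(difflib.SequenceMatcher(None, a, b).ratio())  # ~0.91
print(Levenshtein.ratio(a, b))                      # ~0.91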