Apply fuzzy matching across a dataframe column and save results in a new column

I couldn’t tell what you were doing. This is how I would do it.

    import pandas as pd
    from fuzzywuzzy import fuzz
    from fuzzywuzzy import process

Create a series of tuples to compare:

    compare = pd.MultiIndex.from_product([df1['Company'], df2['FDA Company']]).to_series()

Create a special function to calculate fuzzy metrics and return a series:

    def metrics(tup):
        return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token'])

Apply metrics …
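
The pairwise comparison described in this excerpt can be sketched without pandas or fuzzywuzzy. The version below is only a stand-in: it uses `difflib.SequenceMatcher` from the standard library in place of `fuzz.ratio`, and a simplified word-sorting step in place of `fuzz.token_sort_ratio`, so its scores may differ from the library's. The function names and sample company names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale (difflib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_sort_ratio(a, b):
    """Compare the strings after lowercasing and sorting their words."""
    def normalize(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(normalize(a), normalize(b))


def compare_all(left, right):
    """Score every (left, right) pair, like the product of two columns."""
    return {
        (a, b): {"ratio": ratio(a, b), "token": token_sort_ratio(a, b)}
        for a, b in product(left, right)
    }


# Hypothetical sample values, not taken from the original question.
companies = ["Acme Corp", "Globex"]
fda_companies = ["Corp Acme", "Initech"]
scores = compare_all(companies, fda_companies)
```

Every pair is scored, so the work grows with the product of the two column lengths; for large columns some form of blocking or pre-filtering would probably be needed.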

JavaScript fuzzy search that makes sense

I tried using existing fuzzy libraries like fuse.js and also found them to be terrible, so I wrote one that behaves basically like Sublime Text’s search: https://github.com/farzher/fuzzysort The only typo it allows is a transposition. It’s pretty solid (1k stars, 0 issues), very fast, and handles your case easily: fuzzysort.go('int', ['international', 'splint', 'tinder']) // [{highlighted: '*int*ernational', …
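
To illustrate the general idea of Sublime-style matching, here is a minimal Python sketch of subsequence matching. It is not fuzzysort's algorithm: it does not handle transposed characters or produce highlighting, and the ranking rule (span of the greedy leftmost match, then start position) is an assumption made for the example. Both function names are hypothetical.

```python
def subsequence_positions(query, target):
    """Return indices where query's characters appear in order, else None."""
    positions = []
    start = 0
    lowered = target.lower()
    for ch in query.lower():
        idx = lowered.find(ch, start)
        if idx == -1:
            return None  # a query character is missing, so no match
        positions.append(idx)
        start = idx + 1
    return positions


def fuzzy_filter(query, candidates):
    """Keep candidates that contain query as a subsequence.

    Results are ordered by the span of the greedy leftmost match,
    then by where that match starts (an assumed ranking rule).
    """
    hits = []
    for cand in candidates:
        pos = subsequence_positions(query, cand)
        if pos is not None:
            span = pos[-1] - pos[0] if pos else 0
            first = pos[0] if pos else 0
            hits.append((span, first, cand))
    return [cand for _, _, cand in sorted(hits)]


matches = fuzzy_filter("int", ["international", "splint", "tinder"])
```

With this rule, "tinder" is rejected because no "t" follows its "n", while the other two candidates are kept.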

How do I do a fuzzy match of company names in MySQL with PHP for auto-complete?

You can start by using SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing). The drawbacks of SOUNDEX() are: its inability to differentiate longer strings. Only the first few characters are taken into account; longer strings that diverge at the end …
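
The drawback with longer strings is easy to see in a small Python sketch of the classic four-character Soundex code. This is an illustration only; MySQL's own SOUNDEX() differs in some details, and the function below is not its implementation.

```python
def soundex(word):
    """Classic four-character Soundex code (illustrative sketch)."""
    codes = {}
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    for letters, digit in groups:
        for ch in letters:
            codes[ch] = digit

    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""

    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes can recur
    return (result + "000")[:4]  # pad or truncate to four characters


short_code = soundex("Internal")
long_code = soundex("Internationalization")
```

Both words reduce to the same code here, because everything after the first few consonants is discarded.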

A better similarity ranking algorithm for variable length strings

Simon White of Catalysoft wrote an article about a very clever algorithm that compares adjacent character pairs, which works really well for my purposes: http://www.catalysoft.com/articles/StrikeAMatch.html Simon has a Java version of the algorithm, and below I wrote a PL/Ruby version of it (taken from the plain Ruby version done in the related forum entry comment …
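
As a rough Python sketch of the adjacent-character-pair idea (not the PL/Ruby version the answer refers to): each string is split into words, each word into overlapping two-character pairs, and the score is twice the number of shared pairs divided by the total number of pairs. The function names are hypothetical.

```python
def letter_pairs(text):
    """Adjacent character pairs of each whitespace-separated word, uppercased."""
    pairs = []
    for word in text.upper().split():
        pairs.extend(word[i:i + 2] for i in range(len(word) - 1))
    return pairs


def pair_similarity(a, b):
    """Score in [0.0, 1.0]: 2 * shared pairs / total pairs in both strings."""
    pairs_a = letter_pairs(a)
    pairs_b = letter_pairs(b)
    total = len(pairs_a) + len(pairs_b)
    if total == 0:
        return 0.0
    matches = 0
    for pair in pairs_a:
        if pair in pairs_b:
            matches += 1
            pairs_b.remove(pair)  # each pair in b can be matched only once
    return 2.0 * matches / total


score = pair_similarity("France", "French")
```

"France" and "French" share the pairs FR and NC out of five pairs each, which gives 2 * 2 / 10 = 0.4.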

Fuzzy matching using T-SQL

I’ve found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I’ve had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I’ve then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and …
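
For readers unfamiliar with Levenshtein distance, here is a minimal Python sketch of the distance and of one possible way to turn it into a score. The normalization (one minus distance divided by the longer length, giving a value from 0.0 to 1.0) is an assumption for illustration; the excerpt does not show the author's weighting or the exact range of GetSimilarityScore, and `similarity_score` is a hypothetical name.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_score(a, b):
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


distance = levenshtein("kitten", "sitting")
```

Only two rows of the edit table are kept at a time, so memory use stays proportional to the shorter string.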

How can I match fuzzy match strings from two datasets?

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching. As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example. As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then used dplyr::group_by and …

Efficient string matching in Apache Spark

I wouldn’t use Spark in the first place, but if you are really committed to the particular stack, you can combine a bunch of ml transformers to get best matches. You’ll need:

Tokenizer (or split):

    import org.apache.spark.ml.feature.RegexTokenizer
    val tokenizer = new RegexTokenizer().setPattern("").setInputCol("text").setMinTokenLength(1).setOutputCol("tokens")

NGram (for example 3-gram):

    import org.apache.spark.ml.feature.NGram
    val ngram = new NGram().setN(3).setInputCol("tokens").setOutputCol("ngrams")

Vectorizer (for …
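
Outside Spark, the character n-gram idea behind the tokenizer and NGram steps can be sketched in a few lines of plain Python. This is not the Spark pipeline from the answer: the excerpt is cut off before the vectorizing and matching stages, so the Jaccard comparison below is an assumed stand-in, and all function names are hypothetical.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_match(query, candidates, n=3):
    """Candidate whose n-gram set overlaps most with the query's."""
    query_grams = char_ngrams(query, n)
    return max(candidates, key=lambda c: jaccard(query_grams, char_ngrams(c, n)))


# Hypothetical sample strings for illustration.
winner = best_match("apache spark", ["apache spark sql", "pandas"])
```

Comparing every query against every candidate like this is quadratic, which is the cost that approximate techniques on a cluster are meant to avoid.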