Techniques for finding near-duplicate records

If you're just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, you might have to do some more tinkering. I use the jarowinkler(), levenshteinSim(), and soundex() functions in RecordLinkage to write my own …
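The excerpt above cuts off before the custom comparison code, but the general recipe it describes is blocking plus pairwise string similarity. Below is a minimal Python sketch of that recipe; it uses difflib.SequenceMatcher as a stand-in for jarowinkler()/levenshteinSim() and a crude first-letter block instead of soundex(), so treat it as an illustration of the idea rather than the answer's actual R code.

from difflib import SequenceMatcher
from collections import defaultdict

def similarity(a, b):
    # Stand-in for jarowinkler()/levenshteinSim(): a score in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def near_duplicates(records, threshold=0.85):
    # Block on the first letter so big batches are only compared within a block,
    # a rough substitute for blocking on soundex() codes.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[:1].lower()].append(rec)

    pairs = []
    for block in blocks.values():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                score = similarity(block[i], block[j])
                if score >= threshold:
                    pairs.append((block[i], block[j], score))
    return pairs

print(near_duplicates(["Jon Smith", "John Smith", "Jane Doe", "Jayne Doe"]))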

Fuzzy String Comparison

There is a package called fuzzywuzzy. Install it via pip:

pip install fuzzywuzzy

Simple usage:

>>> from fuzzywuzzy import fuzz
>>> fuzz.ratio("this is a test", "this is a test!")
96

The package is built on top of difflib. Why not just use that, you ask? Apart from being a bit simpler, it has a number of …
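The excerpt stops before the extra matching methods it alludes to. As a quick illustration (mine, not part of the original answer), fuzzywuzzy also exposes token-based scorers and a process module for picking the best match from a list of candidates:

from fuzzywuzzy import fuzz, process

choices = ["New York Jets", "New York Giants", "Dallas Cowboys", "Atlanta Falcons"]

# Token-based scorers tolerate reordered words.
print(fuzz.token_sort_ratio("jets new york", "New York Jets"))  # 100

# process.extractOne returns the best (choice, score) pair from the list.
print(process.extractOne("new york jets", choices))

# process.extract returns the top `limit` candidates with their scores.
print(process.extract("cowboys", choices, limit=2))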

How can I fuzzy match strings from two datasets?

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and stringdist as one of the possible types of fuzzy matching. As suggested by @C8H10N4O2, the stringdist method="jw" creates the best matches for your example. As suggested by @dgrtwo, the developer of fuzzyjoin, I used a large max_dist and then dplyr::group_by and …
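The truncated code follows a common pattern: do a deliberately loose fuzzy join, then group by the left-hand record and keep only its best match. A rough Python analogue of that pattern (my sketch with made-up data, not the fuzzyjoin/dplyr code from the answer) looks like this:

from difflib import SequenceMatcher

left = ["Jon Smith", "Jane Doe", "Robert Brown"]
right = ["John Smith", "Jayne Doe", "Rob Brown", "Bob Browne"]

def score(a, b):
    # Normalised similarity in [0, 1]; stands in here for stringdist's "jw" method.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

matches = []
for l in left:
    # "Large max_dist" equivalent: score every right-hand candidate...
    scored = [(r, score(l, r)) for r in right]
    # ...then keep only the top match for this left-hand record.
    best_r, best_s = max(scored, key=lambda pair: pair[1])
    matches.append((l, best_r, best_s))

for l, r, s in matches:
    print(f"{l!r} -> {r!r}  (similarity {s:.2f})")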

Good Python modules for fuzzy string comparison? [closed]

difflib can do it. Example from the docs:

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> import keyword
>>> get_close_matches('wheel', keyword.kwlist)
['while']
>>> get_close_matches('apple', keyword.kwlist)
[]
>>> get_close_matches('accept', keyword.kwlist)
['except']

Check it out. It has other functions that can help you build something custom.
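As a small illustration of those "other functions" (my example, not from the original answer), difflib.SequenceMatcher gives you a raw pairwise similarity score that you can threshold however you like:

from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    # ratio() returns a float in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(SequenceMatcher(None, "appel", "apple").ratio())  # 0.8
print(similar("colour", "color"))                       # True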
