Techniques for finding near duplicate records

If you're only processing small batches that are relatively well-formed, the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a good starting point. For big batches, you may need to do some more tinkering. I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own …
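To illustrate the kind of string-similarity scoring those functions provide, here is a minimal Python sketch. It is not the RecordLinkage API: the helper names (levenshtein, levenshtein_sim, near_duplicates), the sample names, and the 0.85 threshold are all assumptions made for the example. The similarity is taken as 1 minus the edit distance divided by the length of the longer string.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def levenshtein_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def near_duplicates(records, threshold=0.85):
    """Return (i, j, score) for every pair scoring at or above threshold.

    Compares all pairs, so this only suits small batches; larger ones
    would need some form of blocking first (an assumption, not shown).
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = levenshtein_sim(a.lower(), b.lower())
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Hypothetical sample data.
names = ["Jon Smith", "John Smith", "Jane Doe", "Jon Smyth"]
matches = near_duplicates(names)
for i, j, score in matches:
    print(names[i], "~", names[j], round(score, 3))
```

The all-pairs loop is quadratic in the number of records, which is one reason large batches need more tinkering than small ones.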

Get the distinct sum of a joined table column

To get the result without a subquery, you have to resort to advanced window function trickery:

    SELECT sum(count(*))       OVER () AS tickets_count
         , sum(min(a.revenue)) OVER () AS attendees_revenue
    FROM   tickets t
    JOIN   attendees a ON a.id = t.attendee_id
    GROUP  BY t.attendee_id
    LIMIT  1;

How does it work? The key to understanding this is the sequence …
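The two-stage evaluation the query relies on can be mimicked in plain Python. This is only a sketch of the logic, not the SQL itself. It assumes aggregates are computed per GROUP BY group first and window functions then run over those group rows, which is how PostgreSQL orders them. The sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical result of joining tickets to attendees:
# (ticket_id, attendee_id, attendee_revenue). The attendee's revenue
# repeats once per ticket, which is what inflates a naive sum.
joined = [
    (1, 10, 100),
    (2, 10, 100),
    (3, 11, 50),
]

# Stage 1: GROUP BY attendee_id, then count(*) and min(revenue) per group.
groups = defaultdict(list)
for ticket_id, attendee_id, revenue in joined:
    groups[attendee_id].append(revenue)
per_group = [(len(revs), min(revs)) for revs in groups.values()]

# Stage 2: sum(...) OVER () adds the per-group values across all groups,
# so every group row carries the same totals (hence LIMIT 1 in the query).
tickets_count = sum(count for count, _ in per_group)
attendees_revenue = sum(rev for _, rev in per_group)

# For comparison: summing revenue directly over the joined rows
# counts attendee 10 twice.
naive_revenue = sum(revenue for _, _, revenue in joined)

print(tickets_count, attendees_revenue, naive_revenue)
```

With this data the grouped approach counts each attendee's revenue once, while the naive sum over the joined rows counts it once per ticket.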

How to delete duplicate entries?

Some of these approaches seem a little complicated. I generally do it like this. Given a table named table that should be unique on (field1, field2), keeping the row with the max field3:

    DELETE FROM table
    USING  table alias
    WHERE  table.field1 = alias.field1
    AND    table.field2 = alias.field2
    AND    table.field3 < alias.field3

For example, I have a table, …
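The same idea can be tried end to end with Python's built-in sqlite3 module. SQLite has no DELETE ... USING, so this sketch swaps in an equivalent correlated EXISTS condition. The table name items and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (field1 TEXT, field2 TEXT, field3 INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("a", "x", 1),
        ("a", "x", 3),  # highest field3 for (a, x): should survive
        ("a", "x", 2),
        ("b", "y", 5),  # already unique
        ("b", "z", 4),  # already unique
    ],
)

# Delete every row for which another row with the same (field1, field2)
# has a larger field3. This mirrors the self-join condition of the
# DELETE ... USING form, expressed as a correlated subquery.
conn.execute(
    """
    DELETE FROM items
    WHERE EXISTS (
        SELECT 1
        FROM items AS other
        WHERE other.field1 = items.field1
          AND other.field2 = items.field2
          AND items.field3 < other.field3
    )
    """
)

remaining = conn.execute(
    "SELECT field1, field2, field3 FROM items ORDER BY field1, field2"
).fetchall()
print(remaining)
conn.close()
```

Because the comparison is a strict less-than, rows that tie on the maximum field3 are all kept. Removing those as well would need a tie-breaker such as a unique id column.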
