duplicate-removal – Make Me Engineer

Techniques for finding near duplicate records

November 30, 2022 by Tarik

If you’re just doing small batches that are relatively well-formed, then the compare.linkage() or compare.dedup() functions in the RecordLinkage package should be a great starting point. But if you have big batches, then you might have to do some more tinkering. I use the functions jarowinkler(), levenshteinSim(), and soundex() in RecordLinkage to write my own …
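
As a rough illustration of the pairwise string-similarity idea, here is a minimal Python sketch. It swaps in the standard library's difflib.SequenceMatcher for the RecordLinkage similarity functions; the near_duplicates helper, the 0.85 threshold, and the sample names are assumptions made for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(records, threshold=0.85):
    """Return (i, j, score) for each pair of records whose similarity meets the threshold."""
    pairs = []
    # Compare every pair once; this is O(n^2), so it only suits small batches.
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((i, j, round(score, 3)))
    return pairs


# Hypothetical sample data
names = ["Jonathan Smith", "Jonathon Smith", "Maria Garcia", "Peter Wong"]
matches = near_duplicates(names)
```

Because every pair is compared, larger batches would likely need some form of blocking (for example, only comparing records that share a phonetic code) before a loop like this becomes practical.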

Removing duplicate columns and rows from a NumPy 2D array

November 21, 2022 by Tarik

This should do the trick:

    import numpy as np

    def unique_rows(a):
        # View each row as a single structured element so np.unique compares whole rows
        a = np.ascontiguousarray(a)
        unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
        return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

Example:

    >>> a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
    >>> unique_rows(a)
    array([[1, 1],
           [2, 3],
           [5, 4]])
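
On NumPy 1.13 or later, the axis argument of np.unique offers a shorter route to the same result and also covers duplicate columns. This is a minimal sketch rather than part of the original answer; the array b is hypothetical sample data.

```python
import numpy as np

a = np.array([[1, 1], [2, 3], [1, 1], [5, 4], [2, 3]])
# axis=0 treats each row as one item; the result comes back sorted
unique_rows = np.unique(a, axis=0)

# Hypothetical array whose first and third columns are identical
b = np.array([[1, 2, 1], [3, 4, 3]])
# axis=1 treats each column as one item
unique_cols = np.unique(b, axis=1)
```

Note that np.unique sorts its output, so the original row or column order is not preserved.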

Remove duplicates keeping entry with largest absolute value

November 19, 2022 by Tarik

First, sort so that the less desired items come last within each id group:

    aa <- a[order(a$id, -abs(a$value)), ]  # sort by id, then by descending abs(value)

Then remove every item after the first within each id group:

    aa[!duplicated(aa$id), ]  # take the first row within each id

      id value
    2  1     2
    4  2    -4
    …
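
The same keep-the-best-row-per-group idea can be sketched in Python without sorting, by tracking the best value seen so far for each id. The keep_largest_abs helper and the sample rows are hypothetical and are not the data from the original question.

```python
def keep_largest_abs(rows):
    """Given (id, value) pairs, keep for each id the value with the largest absolute value."""
    best = {}
    for key, value in rows:
        # On ties the first value seen is kept
        if key not in best or abs(value) > abs(best[key]):
            best[key] = value
    return sorted(best.items())


# Hypothetical sample data
rows = [(1, 1), (1, 2), (2, 3), (2, -4), (3, -5)]
result = keep_largest_abs(rows)
```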

Get the distinct sum of a joined table column

November 5, 2022 by Tarik

To get the result without a subquery, you have to resort to advanced window function trickery:

    SELECT sum(count(*))       OVER () AS tickets_count
         , sum(min(a.revenue)) OVER () AS atendees_revenue
    FROM   tickets t
    JOIN   attendees a ON a.id = t.attendee_id
    GROUP  BY t.attendee_id
    LIMIT  1;

How does it work? The key to understanding this is the sequence …
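
To make the arithmetic concrete, here is a plain-Python sketch of what the query computes: group the joined rows by attendee, then add up the per-group ticket counts and the per-group minimum revenue. The sample tickets and attendees are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data
tickets = [(1, 10), (2, 10), (3, 11)]   # (ticket_id, attendee_id)
attendees = {10: 100, 11: 50}           # attendee_id -> revenue

# Equivalent of the JOIN followed by GROUP BY t.attendee_id
groups = defaultdict(list)
for ticket_id, attendee_id in tickets:
    groups[attendee_id].append(attendees[attendee_id])

# sum(count(*)) OVER (): add up the per-group row counts
tickets_count = sum(len(revenues) for revenues in groups.values())
# sum(min(a.revenue)) OVER (): count each attendee's revenue once
attendees_revenue = sum(min(revenues) for revenues in groups.values())

# A naive sum over the joined rows counts attendee 10 twice
naive_revenue = sum(attendees[attendee_id] for _, attendee_id in tickets)
```

With this data the naive sum is 250, while counting each attendee once gives 150.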

duplicates in multiple columns

October 10, 2022 by Tarik

It works if you use duplicated twice:

    df[!(duplicated(df[c("c", "d")]) | duplicated(df[c("c", "d")], fromLast = TRUE)), ]

      a  b c    d
    1 1  2 A 1001
    4 4  8 C 1003
    7 7 13 E 1005
    8 8 14 E 1006
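
Calling duplicated in both directions flags every member of a duplicate group, so only rows whose (c, d) combination occurs exactly once survive. A minimal Python sketch of that logic follows; the drop_all_duplicated helper and the sample rows are hypothetical.

```python
from collections import Counter


def drop_all_duplicated(rows, keys):
    """Keep only rows whose combination of key columns occurs exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


# Hypothetical sample data
rows = [
    {"a": 1, "c": "A", "d": 1001},
    {"a": 2, "c": "B", "d": 1002},
    {"a": 3, "c": "B", "d": 1002},
    {"a": 4, "c": "C", "d": 1003},
]
kept = drop_all_duplicated(rows, ["c", "d"])
```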

Delete duplicate records from a SQL table without a primary key

August 9, 2022 by Tarik

It is very simple. I tried this in SQL Server 2008:

    DELETE SUB
    FROM (SELECT ROW_NUMBER() OVER (PARTITION BY EmpId, EmpName, EmpSSN ORDER BY EmpId) cnt
          FROM Employee) SUB
    WHERE SUB.cnt > 1
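
For readers without SQL Server, the same ROW_NUMBER idea can be tried from Python with the built-in sqlite3 module. This is a sketch under two assumptions: the bundled SQLite is version 3.25 or later (needed for window functions), and SQLite's implicit rowid stands in for the missing primary key, since SQLite cannot delete through a derived table the way the statement above does. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmpId INTEGER, EmpName TEXT, EmpSSN TEXT)")
# Hypothetical sample data: three identical rows for Ann, one for Bob
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?, ?)",
    [(1, "Ann", "111"), (1, "Ann", "111"), (2, "Bob", "222"), (1, "Ann", "111")],
)

# Number the rows inside each duplicate group, then delete all but the first
conn.execute("""
    DELETE FROM Employee
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY EmpId, EmpName, EmpSSN ORDER BY EmpId
                   ) AS cnt
            FROM Employee
        )
        WHERE cnt > 1
    )
""")

remaining = conn.execute(
    "SELECT EmpId, EmpName, EmpSSN FROM Employee ORDER BY EmpId"
).fetchall()
```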

Delete duplicate rows (don’t delete all duplicate)

July 4, 2022 by Tarik

Try the steps described in this article: Removing duplicates from a PostgreSQL database. It describes a situation where you have to deal with a huge amount of data that isn’t possible to group by. A simple solution would be this:

    DELETE FROM foo
    WHERE id NOT IN (SELECT min(id)  -- or max(id)
                     FROM foo
                     GROUP BY hash)
    …
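
The simple solution can be checked quickly with Python's built-in sqlite3 module. This sketch uses SQLite in place of PostgreSQL, and the table contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, hash TEXT)")
# Hypothetical sample data: hash 'aa' appears three times
conn.executemany(
    "INSERT INTO foo (id, hash) VALUES (?, ?)",
    [(1, "aa"), (2, "bb"), (3, "aa"), (4, "aa"), (5, "cc")],
)

# Keep the row with the smallest id for each hash and delete the rest
conn.execute(
    "DELETE FROM foo WHERE id NOT IN (SELECT min(id) FROM foo GROUP BY hash)"
)

remaining = conn.execute("SELECT id, hash FROM foo ORDER BY id").fetchall()
```

Exactly one row per hash is left, so the statement removes the extra copies without removing every duplicate.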

Delete rows that exist in another data frame? [duplicate]

June 23, 2022 by Tarik

You need the %in% operator. So,

    df1[!(df1$name %in% df2$name), ]

should give you what you want. df1$name %in% df2$name tests whether the values in df1$name are in df2$name. The ! operator reverses the result.
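
For comparison, the same membership filter can be sketched in Python with a set lookup. The lists of dictionaries below are hypothetical stand-ins for the two data frames.

```python
# Hypothetical stand-ins for the two data frames
df1 = [{"name": "ann"}, {"name": "bob"}, {"name": "cy"}]
df2 = [{"name": "bob"}]

# Collect the names to exclude once, so each lookup is constant time
names_in_df2 = {row["name"] for row in df2}

# Equivalent of df1[!(df1$name %in% df2$name), ]
kept = [row for row in df1 if row["name"] not in names_in_df2]
```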

How to delete duplicate entries?

May 30, 2022 by Tarik

Some of these approaches seem a little complicated. I generally do it like this: given a table named table that should be unique on (field1, field2), keeping the row with the max field3:

    DELETE FROM table USING table alias
    WHERE table.field1 = alias.field1
      AND table.field2 = alias.field2
      AND table.field3 < alias.field3

For example, I have a table, …
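
The self-join delete can be tried with Python's built-in sqlite3 module. SQLite has no DELETE ... USING, so this sketch expresses the same condition with a correlated EXISTS subquery; the table name items and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (field1 TEXT, field2 TEXT, field3 INTEGER)")
# Hypothetical sample data: three rows share (field1, field2) = ('a', 'x')
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [("a", "x", 1), ("a", "x", 3), ("a", "x", 2), ("b", "y", 5)],
)

# Delete a row when another row with the same key has a larger field3
conn.execute("""
    DELETE FROM items
    WHERE EXISTS (
        SELECT 1 FROM items AS alias
        WHERE alias.field1 = items.field1
          AND alias.field2 = items.field2
          AND alias.field3 > items.field3
    )
""")

remaining = conn.execute(
    "SELECT field1, field2, field3 FROM items ORDER BY field1"
).fetchall()
```

Rows that tie on the maximum field3 all survive under this condition, so a further tie-breaker would be needed if the key must end up strictly unique.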