I recently came across a couple methods for listing out instances of duplicate records in a table where the table doesn't have an explicit unique identifier. Here is a post that explains how to do this when a unique identifier is present.
If the table doesn't have an explicit primary key or other uniquely identifying
value, then we'll have to get some help from PostgreSQL's internal system
columns —
namely the ctid
.
The ctid
is:
The physical location of the row version within its table.
Let's use the example of the mailing_list
table with potential duplicate
email
values.
Here is the first approach:
delete from mailing_list
where ctid not in (
select min(ctid)
from mailing_list
group by email
);
This uses a subquery to find the first occurrence of every unique email and
then deletes the rest. The ctid
is the unique value that we can call the
min
aggregate on.
delete from mailing_list ml1
using mailing_list ml2
where ml1.ctid < ml2.ctid
and ml1.email = ml2.email;
This uses delete using
to join the table against itself as a cartesian
product to compare every entry to every other entry.