Rob Rob - 2 months ago 5
SQL Question

Find and remove duplicate entries where values can be swapped between two columns

I have the following data structure which contains relationships between nodes. The

relationships
table has a
from_id
and
to_id
the values in these can be swapped.

I need to find and remove ONLY the duplicates leaving a single row of each pair.

| id | node_from_id | node_to_id |
---------------------------------
| 1 | 100 | 200 |
| 2 | 200 | 100 |
| 3 | 200 | 300 |
| 4 | 300 | 200 |
| 5 | 200 | 300 |
| 6 | 300 | 400 |
| 7 | 500 | 400 |


After removal of duplicates I'd like the following to remain in the table

| id | node_from_id | node_to_id |
---------------------------------
| 1 | 100 | 200 |
| 3 | 200 | 300 |
| 6 | 300 | 400 |
| 7 | 500 | 400 |


I am able to return all the rows that match relatively simply but to enable them to be removed I only really want to return the actual duplicates to allow one of the key pair rows to remain.

Here's my query which currently returns ALL matching rows

SELECT *
FROM relationships AS rel1
WHERE EXISTS (SELECT *
FROM relationships AS rel2
WHERE
rel1.id <> rel2.id
AND
(
(
rel1.from_id = rel2.from_id AND
rel1.to_id = rel2.to_id
)
OR
(
rel1.from_id = rel2.to_id AND
rel1.to_id = rel2.from_id
)
)
)


I thought I might be able to use a window function and select only the rows with
row_number() > 1
but I don't seem to be able to use that inside the EXISTS sub query.

I have made the actual data for
relationships
available as CSV if that helps

Answer

I'd use smth like

select  min(id),
        least (node_from_id, node_to_id) node_from_id,
        greatest(node_from_id, node_to_id) node_to_id
from    relationships
group   by         
        least (node_from_id, node_to_id) ,
        greatest(node_from_id, node_to_id)