digoal / blog

OpenSource,Database,Business,Minds. git clone --depth 1 https://github.com/digoal/blog

Home Page:https://github.com/digoal/blog/blob/master/README.md


Question: finding duplicate data in PG with a recursive query

java30kcoding opened this issue · comments

I currently have this table:
create table user_info(
user_name varchar(8) not null,
tel varchar(11) not null,
id_no varchar(11) not null
);
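I want to query all rows where user_name is the same but tel and id_no differ, plus all rows where id_no is the same but tel and user_name differ.
Problems across our subsidiaries have left a lot of dirty data, and simple SQL cannot cope with a table of hundreds of millions of rows, so I am asking here.
Thanks~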

create table tbl (
rowid tid primary key,
user_name varchar(8) not null,
tel varchar(11) not null,
id_no varchar(11) not null
);

insert into tbl select ctid,user_name, tel, id_no from (select ctid , user_name, tel, id_no , row_number() over (partition by user_name, tel order by tel<>id_no ) as rn from user_info ) t
where rn=1
on conflict (rowid) do nothing;

insert into tbl select ctid,user_name, tel, id_no from (select ctid , user_name, tel, id_no , row_number() over (partition by id_no, tel order by tel<>user_name ) as rn from user_info ) t
where rn=1
on conflict (rowid) do nothing;


1,2,3
1,2,2
1,2,4
2,2,2
2,2,4
1,2,3

Suppose the data is as above. Which records do you want returned?

SELECT T1.*
FROM user_info T1,
user_info T2
WHERE T1.tel = T2.tel
AND (T1.id_no <> T2.id_no OR t1.user_name <> t2.user_name)
union
SELECT T1.*
FROM user_info T1,
user_info T2
WHERE T1.id_no = T2.id_no
AND (T1.tel <> t2.tel OR t1.user_name <> t2.user_name);
De Ge, my description may have been unclear; the SQL above is what I mean.
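
On a table with hundreds of millions of rows, the self-join above can be expensive because it pairs each row with every other row sharing its tel or id_no. One possible alternative, sketched here on the assumption that all three columns are NOT NULL (as in the user_info definition), is to find the conflicting keys with one aggregate pass each and then fetch the rows carrying those keys. A key group contains differing rows exactly when min and max of one of the other columns disagree. The CTE names bad_tel and bad_id_no are made up for illustration.

```sql
-- Sketch: should return the same rows as the self-join + UNION, without pairing rows.
WITH bad_tel AS (
    -- tel values shared by rows that differ in user_name or id_no
    SELECT tel
    FROM user_info
    GROUP BY tel
    HAVING min(user_name) <> max(user_name)
        OR min(id_no)     <> max(id_no)
),
bad_id_no AS (
    -- id_no values shared by rows that differ in tel or user_name
    SELECT id_no
    FROM user_info
    GROUP BY id_no
    HAVING min(tel)       <> max(tel)
        OR min(user_name) <> max(user_name)
)
SELECT DISTINCT u.user_name, u.tel, u.id_no   -- DISTINCT mirrors UNION's de-duplication
FROM user_info u
WHERE u.tel   IN (SELECT tel   FROM bad_tel)
   OR u.id_no IN (SELECT id_no FROM bad_id_no);
```

Whether this is actually faster depends on the data distribution and available indexes, so it is worth comparing both plans with EXPLAIN before running either on the full table.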


Your query may have a problem:
for example, if the table contains only one row, no rows are returned. Is that the result the business wants?

postgres=# create table abc(c1 int, c2 int, c3 int);
CREATE TABLE
Time: 1.405 ms
postgres=# insert into abc values (1,2,3);
INSERT 0 1
Time: 0.854 ms
postgres=# select t1.* from abc t1 , abc t2 where t1.c1=t2.c1 and (t1.c2<>t2.c2 or t1.c3<>t2.c3);
c1 | c2 | c3
----+----+----
(0 rows)

Time: 0.371 ms

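De Ge, here is the situation: the goal is to find, across the branch companies, rows that share an id_no (or tel) but differ in the other fields, invalidate them, and have those users register again. With two rows of 1,2,3 and one row of 1,2,4, all of them are invalid; a row is valid only when all three elements are unique.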
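
The rule described here can also be written as a per-row flag using window functions. This is a minimal sketch under two assumptions the thread does not confirm: that the sample rows posted earlier are in the column order user_name, tel, id_no, and that "invalid" means another row shares the same tel or id_no while differing in a remaining field. The table user_info_demo and the column is_valid are hypothetical names.

```sql
CREATE TEMP TABLE user_info_demo (
    user_name varchar(8)  NOT NULL,
    tel       varchar(11) NOT NULL,
    id_no     varchar(11) NOT NULL
);

-- Sample rows from the thread (column order is an assumption).
INSERT INTO user_info_demo VALUES
    ('1','2','3'), ('1','2','2'), ('1','2','4'),
    ('2','2','2'), ('2','2','4'), ('1','2','3');

-- A row is flagged invalid when its tel group or its id_no group
-- contains more than one distinct value in another column.
SELECT user_name, tel, id_no,
       NOT (   min(user_name) OVER by_tel <> max(user_name) OVER by_tel
            OR min(id_no)     OVER by_tel <> max(id_no)     OVER by_tel
            OR min(user_name) OVER by_id  <> max(user_name) OVER by_id
            OR min(tel)       OVER by_id  <> max(tel)       OVER by_id
           ) AS is_valid
FROM user_info_demo
WINDOW by_tel AS (PARTITION BY tel),
       by_id  AS (PARTITION BY id_no);
```

With these sample rows every row shares tel = 2 while user_name differs, so all six come back with is_valid = false. Exact duplicates on their own are not flagged by this rule; if they should count as invalid too, a condition such as count(*) OVER (PARTITION BY user_name, tel, id_no) > 1 could be added.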

Use the ON CONFLICT syntax, with ctid as the unique identifier.
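
One possible reading of this suggestion, applied to the clarified requirement: collect the ctid of every conflicting row into a side table in two passes, one keyed on tel and one on id_no, and let ON CONFLICT DO NOTHING skip rows the first pass already collected. The table name invalid_rows is hypothetical. Because ctid is a physical row location that changes when a row is updated or the table is rewritten (for example by VACUUM FULL), the sketch assumes user_info is not modified while it runs.

```sql
CREATE TABLE invalid_rows (
    rowid tid PRIMARY KEY   -- physical location (ctid) of the row in user_info
);

-- Pass 1: rows whose tel is shared by rows with a different user_name or id_no.
INSERT INTO invalid_rows
SELECT u.ctid
FROM user_info u
WHERE u.tel IN (
    SELECT tel
    FROM user_info
    GROUP BY tel
    HAVING min(user_name) <> max(user_name) OR min(id_no) <> max(id_no)
)
ON CONFLICT (rowid) DO NOTHING;

-- Pass 2: the same check keyed on id_no; rows already collected are skipped.
INSERT INTO invalid_rows
SELECT u.ctid
FROM user_info u
WHERE u.id_no IN (
    SELECT id_no
    FROM user_info
    GROUP BY id_no
    HAVING min(tel) <> max(tel) OR min(user_name) <> max(user_name)
)
ON CONFLICT (rowid) DO NOTHING;

-- Inspect the collected rows.
SELECT u.*
FROM user_info u
JOIN invalid_rows r ON u.ctid = r.rowid;
```

Splitting the work into two inserts keeps each statement to a single grouping key, which may be easier to run and monitor on a very large table than one combined query.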