Asked by: upwork_abid  Asked: 11/13/2023  Last edited by: GSerg, upwork_abid  Updated: 11/14/2023  Views: 68
SQL Server Duplicate
Q:
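I need to find duplicates in a table named lead, which contains about 100k records. The duplicates have similar values in the company column, for example:
The goal is to keep only the latest leadid (95803 in this case). However, leadid 95803 has an issue: it has some extra characters after a space.
I tried the following script, but it did not give the desired result: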
select t1.*
FROM [dbo].[LEAD] t1
LEFT JOIN (
    SELECT
        company,
        city,
        MAX(leadid) AS keep_leadid
    FROM [dbo].[LEAD]
    GROUP BY company, city
) t2 ON t1.company = t2.company AND t1.city = t2.city
WHERE t1.leadid <> t2.keep_leadid
  AND t1.company LIKE '%Uvalde Country%'
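Any help refining the script to achieve the desired result would be appreciated.
I want to delete all of them except this one:
There are many companies with different strings, and I want to apply the same script to all of them.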
A:
0 votes
Alex
11/13/2023
#1
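Here is an attempt, with some caveats: it assumes you are interested in the shortest company name, it matches any company whose name starts with that shortest name, and it does not take city into account: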
declare @t table([sid] int not null identity(1,1), leadid int, company varchar(80));
insert into @t values(1,'company A');
insert into @t values(30,'company A INC');
insert into @t values(5,'company B');
insert into @t values(9,'company C');
insert into @t values(48,'company C INC');

--query to see join on companies that start with the same string
select *
from @t a
left join @t b on a.company = left(b.company, len(a.company))

;WITH CTE AS
(
    --get max id per company
    select
        MAX(case when a.leadid > b.leadid then a.leadid else b.leadid end) max_id,
        case when len(a.company) < len(b.company) then a.company else b.company end company,
        case when len(a.company) < len(b.company) then len(a.company) else len(b.company) end len_company
    from @t a
    inner join @t b on a.company = left(b.company, len(a.company))
    group by --group by shortest company?
        case when len(a.company) < len(b.company) then a.company else b.company end,
        case when len(a.company) < len(b.company) then len(a.company) else len(b.company) end
), CTE2 AS
(
    --row number on shortest companies to discard ones with more characters at the end
    select
        max_id,
        MAX(company) company,
        ROW_NUMBER() OVER(PARTITION BY max_id ORDER BY MIN(len_company)) rn
    from CTE
    group by max_id, len_company
)
SELECT max_id, company
FROM CTE2
where rn = 1;
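Since the question asks to delete everything except the latest row, the same matching idea could be turned into a DELETE. What follows is only a minimal sketch and is not part of the answer above. It assumes that leadid is unique, that a higher leadid means a newer row, and that two rows are duplicates when they share a city and one company name either equals the other or extends it after a space. The table variable @lead and its rows are made up for illustration and stand in for [dbo].[LEAD].

```sql
-- Hypothetical sample data; in practice the target would be [dbo].[LEAD].
declare @lead table (leadid int primary key, company varchar(80), city varchar(80));
insert into @lead values (1,  'company A',     'Uvalde');
insert into @lead values (30, 'company A INC', 'Uvalde');
insert into @lead values (5,  'company B',     'Austin');
insert into @lead values (9,  'company C',     'Dallas');
insert into @lead values (48, 'company C INC', 'Dallas');

-- Delete a row when a newer row (higher leadid) exists in the same city whose
-- company name is identical, or where one name extends the other after a space.
delete d
from @lead d
where exists (
    select 1
    from @lead k
    where k.city = d.city
      and k.leadid > d.leadid
      and (   k.company = d.company
           or k.company like d.company + ' %'
           or d.company like k.company + ' %')
);

-- Remaining rows for the sample data above: leadid 5, 30 and 48.
select leadid, company, city
from @lead
order by leadid;
```

Two things to keep in mind with this sketch: LIKE treats %, _ and [ inside a company name as wildcards, so names containing those characters would need escaping, and two longer variants of the same base name (for example 'company A X' and 'company A Y') are not considered duplicates of each other. It is probably safer to run the inner condition as a SELECT first and review the matches before deleting anything.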