Asked by: Seibar  Asked: 8/21/2008  Last edited by: TylerHSeibar  Updated: 10/8/2022  Views: 1355421
How can I remove duplicate rows?
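Q:
I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable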
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
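How can I do this?
A:
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause: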
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
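Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
The temp table solution, and two MySQL examples.
In the future, are you going to prevent it at the database level or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)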
Assuming no nulls, you GROUP BY the unique columns and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
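To try the keep-the-MIN(RowId) idea without touching a real server, here is a minimal sketch that runs it against an in-memory SQLite database from Python. SQLite stands in for SQL Server and the sample rows are invented for illustration; SQLite has no DELETE ... JOIN, so the sketch uses the equivalent NOT IN form.

```python
import sqlite3

# SQLite stands in for SQL Server; table contents are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (RowID INTEGER PRIMARY KEY, Col1 TEXT NOT NULL, "
    "Col2 TEXT NOT NULL, Col3 INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [("a", "x", 1), ("a", "x", 1), ("b", "y", 2), ("a", "x", 1), ("b", "z", 2)],
)

# Keep the lowest RowID of every (Col1, Col2, Col3) group and delete the rest.
conn.execute(
    "DELETE FROM MyTable WHERE RowID NOT IN ("
    "  SELECT MIN(RowID) FROM MyTable GROUP BY Col1, Col2, Col3)"
)
kept_min_rows = conn.execute(
    "SELECT RowID, Col1, Col2, Col3 FROM MyTable ORDER BY RowID"
).fetchall()
print(kept_min_rows)
```

Swapping MIN for MAX would keep the newest row of each group instead.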
Comments
DELETE FROM MyTable WHERE RowId NOT IN (SELECT MIN(RowId) FROM MyTable GROUP BY Col1, Col2, Col3);
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works", you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
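Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005 an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I don't know how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like: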
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
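The trigger idea can be tried out quickly in SQLite from Python. This is only a sketch of the concept, not the T-SQL trigger above: it uses SQLite's BEFORE INSERT ... WHEN syntax and RAISE(ABORT, ...), covers INSERT only, and the sample story text is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stories (story_id INTEGER PRIMARY KEY, story TEXT NOT NULL);
-- SQLite flavour of the idea: reject an insert whose text already exists.
CREATE TRIGGER prevent_plagiarism
BEFORE INSERT ON stories
WHEN EXISTS (SELECT 1 FROM stories WHERE story = NEW.story)
BEGIN
    SELECT RAISE(ABORT, 'plagiarism detected');
END;
""")

conn.execute("INSERT INTO stories (story) VALUES ('once upon a time')")
try:
    # Same text again: the trigger should abort this statement.
    conn.execute("INSERT INTO stories (story) VALUES ('once upon a time')")
    trigger_blocked = False
except sqlite3.DatabaseError:
    trigger_blocked = True

story_count = conn.execute("SELECT COUNT(*) FROM stories").fetchall()[0][0]
print(trigger_blocked, story_count)
```

Note that the existence check scans the table unless the column is indexed, which is the very thing the 900-byte limit prevents, so performance on a large table is an open question.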
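Create a new blank table with the same structure
Execute a query like this
INSERT INTO tc_category1 SELECT * FROM tc_category GROUP BY category_id, application_id HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1 SELECT * FROM tc_category GROUP BY category_id, application_id HAVING count(*) = 1
I had a table where I needed to preserve non-duplicate rows. I'm not sure about the speed or efficiency.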
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) > 1 ) -- deletes one row per duplicate group; rerun until no rows are affected
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
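I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order, for example, you could use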
ORDER BY RowID DESC
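As a rough, runnable illustration of the ROW_NUMBER approach with ORDER BY RowID DESC, the sketch below uses SQLite from Python. SQLite cannot delete through a CTE the way SQL Server can, so the numbered rows are fed to DELETE through an IN subquery instead; the data is invented, and window functions need SQLite 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (RowID INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 INTEGER)"
)
conn.executemany(
    "INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [("a", "x", 1), ("a", "x", 1), ("b", "y", 2), ("a", "x", 1)],
)

# Number the rows inside each duplicate group, newest RowID first,
# then delete everything that is not number 1.
conn.execute("""
    DELETE FROM MyTable
    WHERE RowID IN (
        SELECT RowID FROM (
            SELECT RowID,
                   ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                                      ORDER BY RowID DESC) AS RN
            FROM MyTable
        )
        WHERE RN > 1
    )
""")
kept_latest_ids = [r[0] for r in conn.execute("SELECT RowID FROM MyTable ORDER BY RowID")]
print(kept_latest_ids)
```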
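Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer, as it does not require the self join.
This is not always the case, however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan, whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
- No useful index on the partitioning columns
- Relatively few groups with relatively many duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each), one could also consider simply inserting the rows to keep into a new table, then TRUNCATE-ing the original and copying them back, to minimise logging compared to deleting a very high proportion of the rows.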
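The copy-out, empty, copy-back idea for the few-groups, many-duplicates case can be sketched as follows. This again runs on SQLite from Python with invented data; SQLite has no TRUNCATE, so an unqualified DELETE plays that role here, and the logging benefit described above is specific to SQL Server and is not demonstrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (RowID INTEGER PRIMARY KEY, Col1 TEXT, Col2 TEXT, Col3 INTEGER)"
)
# Two groups with 500 duplicates each.
conn.executemany(
    "INSERT INTO MyTable (Col1, Col2, Col3) VALUES (?, ?, ?)",
    [("a", "x", 1)] * 500 + [("b", "y", 2)] * 500,
)

# 1. Copy the rows to keep into a scratch table.
conn.execute(
    "CREATE TEMP TABLE KeepRows AS "
    "SELECT MIN(RowID) AS RowID, Col1, Col2, Col3 FROM MyTable "
    "GROUP BY Col1, Col2, Col3"
)
# 2. Empty the original (TRUNCATE TABLE on SQL Server).
conn.execute("DELETE FROM MyTable")
# 3. Copy the keepers back and drop the scratch table.
conn.execute(
    "INSERT INTO MyTable (RowID, Col1, Col2, Col3) "
    "SELECT RowID, Col1, Col2, Col3 FROM KeepRows"
)
conn.execute("DROP TABLE KeepRows")
conn.commit()

rows_after_reload = conn.execute(
    "SELECT RowID, Col1, Col2, Col3 FROM MyTable ORDER BY RowID"
).fetchall()
print(rows_after_reload)
```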
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
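The following query is useful to delete duplicate rows. The table in this example has ID as an identity column, and the columns which have duplicate data are Column1, Column2 and Column3.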
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows the usage of GROUP BY, HAVING and ORDER BY in one query, and returns the results with the duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
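For a quick look at what the duplicate-count query returns, here is a small sketch run against SQLite from Python. The table and column names are the placeholders from the query above, and the six sample values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTableName (YourColumnName TEXT)")
conn.executemany(
    "INSERT INTO YourTableName VALUES (?)",
    [("a",), ("b",), ("a",), ("c",), ("b",), ("a",)],
)

# Values that occur more than once, most frequent first.
duplicate_counts = conn.execute("""
    SELECT YourColumnName, COUNT(*) AS TotalCount
    FROM YourTableName
    GROUP BY YourColumnName
    HAVING COUNT(*) > 1
    ORDER BY COUNT(*) DESC
""").fetchall()
print(duplicate_counts)
```

Running a report like this first is a cheap way to see how many rows a delete would touch.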
By using the query below we can delete duplicate records based on a single column or on multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno, empname.
-- Number the rows within each (empno, empname) group and delete all but the first.
WITH numbered AS (SELECT ROW_NUMBER() OVER (PARTITION BY empno, empname ORDER BY empno)
AS [ItemNumber] FROM testing)
DELETE FROM numbered WHERE ItemNumber > 1
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
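I would mention this approach as well, as it can be helpful and works on all SQL servers: pretty often there are only one or two duplicates, and the IDs and the count of duplicates are known. In this case: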
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
Quick and dirty way to delete exact duplicate rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
Comments
set identity_insert t1 on
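Another easy solution can be found at the link pasted here. It is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, but the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data: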
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete that duplicate data?
First, add an identity column to the table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This will delete the duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
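Reference: (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new table. SQL Server automatically ignores the duplicate values (there is also an option for what to do when a duplicate value is found: ignore, interrupt, or something similar). So we end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you may use DTS (an SSIS package to import/export data) to transfer all the data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.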
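A minimal sketch of the unique-index idea, using SQLite from Python: the new table gets a unique index and the copy uses INSERT OR IGNORE, which is SQLite's way of skipping rows that would violate it. On SQL Server the comparable setting is the IGNORE_DUP_KEY index option; the table names and rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OldTable (Col1 TEXT, Col2 TEXT);
INSERT INTO OldTable VALUES ('a', 'x'), ('a', 'x'), ('b', 'y'), ('a', 'x');

CREATE TABLE NewTable (Col1 TEXT, Col2 TEXT);
CREATE UNIQUE INDEX ux_newtable ON NewTable (Col1, Col2);

-- OR IGNORE skips every row that would violate the unique index.
INSERT OR IGNORE INTO NewTable SELECT Col1, Col2 FROM OldTable;
""")

deduped_rows = conn.execute(
    "SELECT Col1, Col2 FROM NewTable ORDER BY Col1"
).fetchall()
print(deduped_rows)
```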
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
DELETE
FROM MyTable
WHERE RowID NOT IN (
SELECT min(RowID)
FROM Mytable
GROUP BY Col1, Col2, Col3
);
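I thought I'd share my solution since it works under special circumstances. In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another database).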
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
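-- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway, just to be sure...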
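The test-inside-a-transaction habit can be sketched like this, using SQLite from Python with manual transaction control. The table contents and the scratch-table name temp_distinct are invented; the point is only that the row count changes inside the transaction and is restored by ROLLBACK.

```python
import sqlite3

# isolation_level=None lets us issue BEGIN / ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tableName (a TEXT, b INTEGER)")
conn.executemany(
    "INSERT INTO tableName VALUES (?, ?)", [("x", 1), ("x", 1), ("y", 2)]
)

conn.execute("BEGIN")
conn.execute("CREATE TEMP TABLE temp_distinct AS SELECT DISTINCT * FROM tableName")
conn.execute("DELETE FROM tableName")
conn.execute("INSERT INTO tableName SELECT * FROM temp_distinct")
count_inside_tx = conn.execute("SELECT COUNT(*) FROM tableName").fetchall()[0][0]
conn.execute("ROLLBACK")  # switch to COMMIT once the result looks right

count_after_rollback = conn.execute("SELECT COUNT(*) FROM tableName").fetchall()[0][0]
print(count_inside_tx, count_after_rollback)
```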
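I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read, and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.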
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
Using a CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever rows you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 seconds from a table of 2M rows (50% duplicates).
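To see which rows the EXISTS form keeps, here is a tiny sketch on SQLite from Python with invented data. SQLite's DELETE is written without the leading alias, so the outer table is referenced by name; the timing quoted above is not something this toy example can reproduce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (IdUniqueValue INTEGER PRIMARY KEY, SameValue TEXT)")
conn.executemany(
    "INSERT INTO MyTable (SameValue) VALUES (?)",
    [("a",), ("a",), ("b",), ("a",), ("b",)],
)

# Delete every row for which a row with the same value and a higher id exists,
# so the highest id of each group survives.
conn.execute("""
    DELETE FROM MyTable
    WHERE EXISTS (
        SELECT 1 FROM MyTable AS tbl2
        WHERE tbl2.SameValue = MyTable.SameValue
          AND MyTable.IdUniqueValue < tbl2.IdUniqueValue
    )
""")
survivors = conn.execute(
    "SELECT IdUniqueValue, SameValue FROM MyTable ORDER BY IdUniqueValue"
).fetchall()
print(survivors)
```

Flipping the comparison to > would keep the lowest id of each group instead.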
I wanted to preview the rows to be removed and keep control over which of the duplicate rows to keep. See http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1
I would prefer a CTE for deleting duplicate rows from a SQL Server table
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
By keeping the original
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
alter table MyTable add sno int identity(1,1)
delete from MyTable where sno in
(
select sno from (
select *,
RANK() OVER ( PARTITION BY RowID,Col3 ORDER BY sno DESC )rank
From MyTable
)T
where rank>1
)
alter table MyTable
drop column sno
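Now let's look at the elasticalsearch table. This table has duplicated rows, and Id is the unique identity field. We know that if an Id exists for a given group criterion, then we can delete the other rows that fall outside that group's selection. My approach shows this criterion.
Many cases in this thread are in a similar state to mine. Just change the target group criteria according to your case for deleting repeated (duplicated) rows.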
DELETE
FROM elasticalsearch
WHERE Id NOT IN
(SELECT min(Id)
FROM elasticalsearch
GROUP BY FirmId,FilterSearchString
)
Cheers
Another way of doing this:
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
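Sometimes a soft-delete mechanism is used, where a date is recorded to indicate when the row was deleted. In this case an UPDATE statement may be used to update this field based on duplicate entries.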
UPDATE MY_TABLE
SET DELETED = getDate()
WHERE TABLE_ID IN (
SELECT x.TABLE_ID
FROM MY_TABLE x
JOIN (SELECT min(TABLE_ID) id, COL_1, COL_2, COL_3
FROM MY_TABLE d
GROUP BY d.COL_1, d.COL_2, d.COL_3
HAVING count(*) > 1) AS d ON d.COL_1 = x.COL_1
AND d.COL_2 = x.COL_2
AND d.COL_3 = x.COL_3
AND d.id <> x.TABLE_ID
/*WHERE x.COL_4 <> 'D' -- Additional filter*/)
This method has served me well for fairly moderate tables containing ~30 million rows, with both high and low amounts of duplication.
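A small sketch of the soft-delete variant, run on SQLite from Python with invented rows: instead of deleting, every duplicate except the lowest TABLE_ID in its group gets a timestamp in DELETED. It uses a simpler NOT IN form than the join above and SQLite's datetime('now') in place of getDate().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MY_TABLE (TABLE_ID INTEGER PRIMARY KEY, "
    "COL_1 TEXT, COL_2 TEXT, COL_3 TEXT, DELETED TEXT)"
)
conn.executemany(
    "INSERT INTO MY_TABLE (COL_1, COL_2, COL_3) VALUES (?, ?, ?)",
    [("a", "b", "c"), ("a", "b", "c"), ("d", "e", "f"), ("a", "b", "c")],
)

# Flag, rather than delete, every row that is not the first of its group.
conn.execute("""
    UPDATE MY_TABLE
    SET DELETED = datetime('now')
    WHERE TABLE_ID NOT IN (
        SELECT MIN(TABLE_ID) FROM MY_TABLE GROUP BY COL_1, COL_2, COL_3
    )
""")
live_ids = [r[0] for r in conn.execute(
    "SELECT TABLE_ID FROM MY_TABLE WHERE DELETED IS NULL ORDER BY TABLE_ID")]
flagged_ids = [r[0] for r in conn.execute(
    "SELECT TABLE_ID FROM MY_TABLE WHERE DELETED IS NOT NULL ORDER BY TABLE_ID")]
print(live_ids, flagged_ids)
```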
This is the easiest way to delete duplicate records
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
To fetch duplicate rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To delete the duplicate rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Comments
DECLARE @idsToKeep TABLE(rowid INT);
INSERT INTO @idsToKeep(rowid) SELECT MIN... GROUP BY ...
DELETE users WHERE rowid NOT IN (SELECT rowid FROM @idsToKeep);
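I know that this question has already been answered, but I've created a pretty useful stored procedure which will create a dynamic delete statement for a table's duplicates: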
CREATE PROCEDURE sp_DeleteDuplicate @tableName varchar(100), @DebugMode int =1
AS
BEGIN
SET NOCOUNT ON;
IF(OBJECT_ID('tempdb..#tableMatrix') is not null) DROP TABLE #tableMatrix;
SELECT ROW_NUMBER() OVER(ORDER BY name) as rn,name into #tableMatrix FROM sys.columns where [object_id] = object_id(@tableName) ORDER BY name
DECLARE @MaxRow int = (SELECT MAX(rn) from #tableMatrix)
IF(@MaxRow is null)
RAISERROR ('I wasn''t able to find any columns for this table!',16,1)
ELSE
BEGIN
DECLARE @i int =1
DECLARE @Columns Varchar(max) ='';
WHILE (@i <= @MaxRow)
BEGIN
SET @Columns=@Columns+(SELECT '['+name+'],' from #tableMatrix where rn = @i)
SET @i = @i+1;
END
---DELETE LAST comma
SET @Columns = LEFT(@Columns,LEN(@Columns)-1)
DECLARE @Sql nvarchar(max) = '
WITH cteRowsToDelte
AS (
SELECT ROW_NUMBER() OVER (PARTITION BY '+@Columns+' ORDER BY ( SELECT 0)) as rowNumber,* FROM '+@tableName
+')
DELETE FROM cteRowsToDelte
WHERE rowNumber > 1;
'
SET NOCOUNT OFF;
IF(@DebugMode = 1)
SELECT @Sql
ELSE
EXEC sp_executesql @Sql
END
END
So if you create a table like this:
IF(OBJECT_ID('MyLitleTable') is not null)
DROP TABLE MyLitleTable
CREATE TABLE MyLitleTable
(
A Varchar(10),
B money,
C int
)
---------------------------------------------------------
INSERT INTO MyLitleTable VALUES
('ABC',100,1),
('ABC',100,1), -- only this row should be deleted
('ABC',101,1),
('ABC',100,2),
('ABCD',100,1)
-----------------------------------------------------------
exec sp_DeleteDuplicate 'MyLitleTable',0
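It will delete all duplicates from your table. If you run it without the second parameter, it will return the SQL statement to run.
If you need to exclude any of the columns, just run it in debug mode, get the code and modify it however you like.
I think this would be helpful. Here, ROW_NUMBER() OVER(PARTITION BY res1.Title ORDER BY res1.Id) as num has been used to differentiate the duplicate rows.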
delete res2 FROM
(SELECT res1.*,ROW_NUMBER() OVER(PARTITION BY res1.Title ORDER BY res1.Id)as num
FROM
(select * from [dbo].[tbl_countries])as res1
)as res2
WHERE res2.num > 1
If all the columns in the duplicate rows are the same, then the query below can be used to delete the duplicate records.
SELECT DISTINCT * INTO #TemNewTable FROM #OriginalTable
TRUNCATE TABLE #OriginalTable
INSERT INTO #OriginalTable SELECT * FROM #TemNewTable
DROP TABLE #TemNewTable
For the table structure
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
Query for deleting duplicates:
DELETE t1
FROM MyTable t1
INNER JOIN MyTable t2
ON t1.RowID > t2.RowID
AND t1.Col1 = t2.Col1
AND t1.Col2=t2.Col2
AND t1.Col3=t2.Col3;
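I am assuming that RowID is a kind of auto-increment and that the rest of the columns have duplicate values.
Another way of removing duplicates based on two columns
I found this query easier to read and replace.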
DELETE
FROM
TABLE_NAME
WHERE FIRST_COLUMNS
IN(
SELECT * FROM
( SELECT MIN(FIRST_COLUMNS)
FROM TABLE_NAME
GROUP BY
FIRST_COLUMNS,
SECOND_COLUMNS
HAVING COUNT(FIRST_COLUMNS) > 1
) temp
)
Note: it is good to simulate the query before running it.
A very simple way to delete duplicate rows of a table in PostgreSQL.
DELETE FROM table1 a
USING table1 b
WHERE a.id < b.id
AND a.column1 = b.column1
AND a.column2 = b.column2;
First, you can select the minimum RowId using MIN() and GROUP BY. We will keep these rows.
SELECT MIN(RowId) as RowId
FROM MyTable
GROUP BY Col1, Col2, Col3
And delete those RowIds that are not in the selected minimum RowIds, using
DELETE FROM MyTable WHERE RowId Not IN()
Final query:
DELETE FROM MyTable WHERE RowId Not IN(
SELECT MIN(RowId) as RowId
FROM MyTable
GROUP BY Col1, Col2, Col3
)
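You can also check my answer in SQL Fiddle
Delete duplicate records
The > (greater than) operator, in this case, deletes all records except the first one
DELETE u1 FROM users u1 JOIN users u2 WHERE u1.id > u2.id AND u1.email = u2.email
The < (less than) operator, in this case, deletes all records except the last one
DELETE u1 FROM users u1 JOIN users u2 WHERE u1.id < u2.id AND u1.email = u2.email
Create another table which will contain the original values: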
CREATE TABLE table2 AS SELECT *, COUNT(*) FROM table1 GROUP BY name HAVING COUNT (*) > 0