Asked by: J S · Asked: 4/14/2023 · Last edited by: J S · Updated: 4/14/2023 · Views: 136
In SQL Server, is there a way when using GROUPING SETS to remove duplicates while aggregating, without huge cardinality expansion?
Q:
A simplified (and somewhat silly) example should make my question clearer. Person (Bob/Tim) should be treated as the unique identifier here.
DROP TABLE IF EXISTS #People;
CREATE TABLE #People
(
    Person     varchar(10),
    City       varchar(20),
    Department varchar(1),
    PersonAge  int
);
INSERT INTO #People VALUES ('Bob', 'New York', 'A', 40),
    ('Tim', 'New York', 'A', 30),
    ('Tim', 'New York', 'B', 30)
-- Query 1: dedupe to (Person, City, Department) first, then apply the grouping sets.
-- The row count stays small, but Tim contributes two rows to the New York average.
;WITH InitialGrouping AS
(
    SELECT
        Person,
        City,
        Department,
        MAX(PersonAge) AS PersonAge,
        COUNT(*) AS NumRows
    FROM
        #People
    GROUP BY
        Person, City, Department
) --SELECT * FROM InitialGrouping
SELECT
    City,
    Department,
    AVG(PersonAge) AS AveragePersonAge,
    COUNT(DISTINCT Person) AS PersonCount
FROM
    InitialGrouping
GROUP BY
    GROUPING SETS (City, Department, (City, Department))
ORDER BY
    City, Department
-- Query 2: apply the grouping sets per Person first, then aggregate per grouping set.
-- Correct results, but the CTE produces one row per person per grouping set.
;WITH InitialGrouping AS
(
    SELECT
        GROUPING_ID(Person, City, Department) AS GROUPID,
        Person,
        City,
        Department,
        MAX(PersonAge) AS PersonAge,
        COUNT(*) AS NumRows
    FROM
        #People
    GROUP BY
        Person, GROUPING SETS (City, Department, (City, Department))
) --SELECT * FROM InitialGrouping
SELECT
    City,
    Department,
    AVG(PersonAge) AS AveragePersonAge,
    COUNT(DISTINCT Person) AS PersonCount
FROM
    InitialGrouping
GROUP BY
    GROUPID, City, Department
ORDER BY
    City, Department
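These two examples are not functionally equivalent. If you look at the New York rollup row, the first query returns an incorrect average age (33), while the second returns the correct one (35). The first query averages the three (Person, City, Department) rows, so Tim is counted twice: (40 + 30 + 30) / 3 = 33 after integer truncation, whereas one row per person gives (40 + 30) / 2 = 35. The second approach is functionally correct, but for more complex cases with many grouping sets it blows up cardinality at the Concatenation step of the query plan, which leads to unacceptable performance. In this simple example you can see that the first query stays at 5 rows, while the second expands to 8. In more complex cases with many grouping sets, I have seen cardinality expand many times over, across millions of rows.
Is there a way around this that keeps the row count from expanding beyond the target granularity of the final output?
A solution was suggested in the comments below, and it does work. It may be the best option. It looks like this (I added an extra grouping column, Badge):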
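The 8-row figure for the second query can be checked directly. The following is a minimal sketch, assuming the same three sample rows inlined as a VALUES list; the CTE name Expanded and the column names GroupId and IntermediateRows are made up for illustration.

```sql
-- Count the intermediate rows that "Person x grouping sets" produces,
-- broken down by grouping set. The three sample rows are inlined so the
-- query does not depend on the #People temp table.
;WITH Expanded AS
(
    SELECT
        GROUPING_ID(Person, City, Department) AS GroupId,
        Person,
        City,
        Department
    FROM (VALUES
            ('Bob', 'New York', 'A', 40),
            ('Tim', 'New York', 'A', 30),
            ('Tim', 'New York', 'B', 30)
         ) AS v (Person, City, Department, PersonAge)
    GROUP BY
        Person, GROUPING SETS (City, Department, (City, Department))
)
SELECT
    GroupId,
    COUNT(*) AS IntermediateRows
FROM
    Expanded
GROUP BY
    GroupId
ORDER BY
    GroupId;
-- GroupId 0 (Person, City, Department): 3 rows
-- GroupId 1 (Person, City):             2 rows
-- GroupId 2 (Person, Department):       3 rows
```

The three counts add up to 8, and each additional grouping set contributes another block of per-person rows, which is where the expansion comes from.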
DROP TABLE IF EXISTS #People;
CREATE TABLE #People
(
    Person     varchar(10),
    City       varchar(20),
    Department varchar(1),
    Badge      int,
    PersonAge  int
);
INSERT INTO #People VALUES ('Bob', 'New York', 'A', 1, 40),
    ('Tim', 'New York', 'A', 1, 30),
    ('Tim', 'New York', 'B', 2, 30),
    ('Tom', 'New York', 'B', 1, 30),
    ('Jim', 'Paris', 'A', 2, 50)
-- One ROW_NUMBER per grouping set, partitioned by Person plus that set's columns.
-- A value of 1 marks the single row per person that the grouping set should aggregate.
;WITH InitialGrouping AS
(
    SELECT
        Person,
        City,
        Department,
        Badge,
        ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Person) AS OverallRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Department, Badge ORDER BY Person) AS CityDepartmentBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Department ORDER BY Person) AS CityDepartmentRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Badge ORDER BY Person) AS CityBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Badge, Department ORDER BY Person) AS DepartmentBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City ORDER BY Person) AS CityRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Department ORDER BY Person) AS DepartmentRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Badge ORDER BY Person) AS BadgeRowNum,
        MAX(PersonAge) AS PersonAge
    FROM
        #People
    GROUP BY
        Person, City, Department, Badge
) --SELECT * FROM InitialGrouping
SELECT
    GROUPING_ID(City, Department, Badge),
    City,
    Department,
    Badge,
    -- GROUPING_ID bits: City = 4, Department = 2, Badge = 1; a set bit means that column is rolled up.
    CASE WHEN GROUPING_ID(City, Department, Badge) = 7 THEN AVG(CASE WHEN OverallRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 6 THEN AVG(CASE WHEN BadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 5 THEN AVG(CASE WHEN DepartmentRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 4 THEN AVG(CASE WHEN DepartmentBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 3 THEN AVG(CASE WHEN CityRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 2 THEN AVG(CASE WHEN CityBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 1 THEN AVG(CASE WHEN CityDepartmentRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City, Department, Badge) = 0 THEN AVG(CASE WHEN CityDepartmentBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
    END AS AveragePersonAge,
    COUNT(DISTINCT Person) AS PersonCount
FROM
    InitialGrouping
GROUP BY
    GROUPING SETS (City, Department, Badge, (City, Department), (City, Badge), (Department, Badge), (City, Department, Badge), ())
ORDER BY
    City, Department
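The flag-per-grouping-set pattern may be easier to follow when reduced to the original two grouping columns. The following is a minimal sketch rather than the commenter's exact code: the CTE name Flagged is made up for illustration, Badge is simply ignored, and it assumes each person has a single age, as in the question.

```sql
-- Two-column version of the ROW_NUMBER flag technique.
-- Each flag equals 1 on exactly one row per person within the columns
-- that a given grouping set keeps.
;WITH Flagged AS
(
    SELECT
        Person,
        City,
        Department,
        ROW_NUMBER() OVER (PARTITION BY Person, City ORDER BY Department) AS CityRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Department ORDER BY City) AS DepartmentRowNum,
        MAX(PersonAge) AS PersonAge
    FROM
        #People
    GROUP BY
        Person, City, Department
)
SELECT
    City,
    Department,
    -- GROUPING_ID(City, Department): City = 2, Department = 1; a set bit means rolled up.
    CASE GROUPING_ID(City, Department)
        WHEN 1 THEN AVG(CASE WHEN CityRowNum = 1 THEN PersonAge END)        -- grouped by City only
        WHEN 2 THEN AVG(CASE WHEN DepartmentRowNum = 1 THEN PersonAge END)  -- grouped by Department only
        ELSE AVG(PersonAge)  -- (City, Department): CTE rows are already one per person
    END AS AveragePersonAge,
    COUNT(DISTINCT Person) AS PersonCount
FROM
    Flagged
GROUP BY
    GROUPING SETS (City, Department, (City, Department))
ORDER BY
    City, Department;
```

The CTE stays at one row per (Person, City, Department), so the per-person expansion of the second query in the question does not occur here. The trade-off is one window function and one CASE branch per grouping set.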
A:
0 votes
abolfazl sadeghi
4/14/2023
#1
I used the first approach, but changed the kind of aggregation and the ordering, and used window functions.
select
    City,
    Department,
    max(PersonAge) AveragePersonAge,
    max(Person) PersonCount
from (
    select
        City,
        Department,
        AVG(PersonAge) over (partition by City) as PersonAge,
        -- ascending rank + descending rank - 1 = number of distinct Person values per City
        DENSE_RANK() OVER (PARTITION BY City ORDER BY Person ASC) +
        DENSE_RANK() OVER (PARTITION BY City ORDER BY Person DESC) - 1 AS Person
    from #People
) a
GROUP BY
    GROUPING SETS (Department, City, (Department, City))
ORDER BY
    City, Department
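The two DENSE_RANK calls in this answer form a windowed distinct count, presumably because SQL Server does not accept COUNT(DISTINCT ...) with an OVER clause. The following is a small standalone sketch of that idiom on inlined sample rows; the column name DistinctPeople is made up for illustration, and it assumes Person contains no NULLs (a NULL would be ranked and counted as a value).

```sql
-- Windowed distinct count: for any row, its ascending dense rank plus its
-- descending dense rank minus 1 equals the number of distinct values
-- in the partition.
SELECT DISTINCT
    City,
    DENSE_RANK() OVER (PARTITION BY City ORDER BY Person ASC)
  + DENSE_RANK() OVER (PARTITION BY City ORDER BY Person DESC)
  - 1 AS DistinctPeople
FROM (VALUES
        ('Bob', 'New York'),
        ('Tim', 'New York'),
        ('Tim', 'New York')
     ) AS v (Person, City);
-- Returns one row: New York, 2
```

Note that the count here is tied to the PARTITION BY City window, so it describes the city as a whole and not whichever grouping set the outer query happens to be producing.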
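Comments
0 votes
J S
4/14/2023
Well, this works for my simplified example. However, as I mentioned, my goal is to scale to many more grouping sets, and unless I'm missing a way to do that, this won't carry over. Since I can't add code here: for example, add a third column, call it [Group], then add a third person, call him Tom, and make him 30. Put Tom in a different group from Bob and Tim, but otherwise put him in 'New York', 'A'. If you look at the rollup for 'A', you will see an incorrect average age of 35 and an incorrect person count of 2.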
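0 votes
J S
4/14/2023
Or better yet, without adding a group: add a row for Tom like this: ('Tom', 'New York', 'B', 30), then look at the rollup row for New York. So this doesn't work even for the simplified example.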
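0 votes
abolfazl sadeghi
4/14/2023
I think you're saying that the first query is the better one but that its results aren't right. I have now solved your problem: your first query is correct with my corrections. I can't understand your question. Could you tell me what the problem is, show me the final result you expect, and explain it to me? @JS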
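0 votes
J S
4/14/2023
The goal is that, within each grouping set, we correctly count the distinct people in that grouping set, along with the average age of those distinct people. Unfortunately, you haven't solved that problem; you solved it for exactly the three example rows. The solution should generalize to any data set, and yours does not. As I mentioned, if we add a fourth row ('Tom', 'New York', 'B', 30), your code no longer works for the New York rollup, which should now have 3 distinct people with an average age of 33.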
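The expected New York figures for the four-row case can be confirmed with a single-grouping-set baseline. This is only a sketch for checking results, not a proposed solution to the cardinality problem; the CTE names FourRows and PerPersonCity are made up for illustration, and the rows are inlined so it runs on its own.

```sql
-- Reference result for the City rollup: reduce to one row per
-- (Person, City) first, then count and average those rows.
;WITH FourRows AS
(
    SELECT Person, City, Department, PersonAge
    FROM (VALUES
            ('Bob', 'New York', 'A', 40),
            ('Tim', 'New York', 'A', 30),
            ('Tim', 'New York', 'B', 30),
            ('Tom', 'New York', 'B', 30)
         ) AS v (Person, City, Department, PersonAge)
),
PerPersonCity AS
(
    SELECT Person, City, MAX(PersonAge) AS PersonAge
    FROM FourRows
    GROUP BY Person, City
)
SELECT
    City,
    COUNT(*) AS PersonCount,
    AVG(PersonAge) AS AveragePersonAge
FROM
    PerPersonCity
GROUP BY
    City;
-- Returns one row: New York, 3, 33   ((40 + 30 + 30) / 3 with integer AVG)
```

Averaging the four raw rows instead would give (40 + 30 + 30 + 30) / 4 = 32, which is the discrepancy being discussed in this thread.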
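0 votes
abolfazl sadeghi
4/14/2023
I edited this code for "distinct people and average age". You group by City and Department, and my code also groups by City and Department. If you remove "Department" from this code, then with "AVG(PersonAge) over(partition by City) as PersonAge" the average is 130/4 = 32. @JS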