In SQL Server, is there a way when using GROUPING SETS to remove duplicates while aggregating, without huge cardinality expansion?

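Asked by: J S | Asked: 4/14/2023 | Last edited by: J S | Updated: 4/14/2023 | Views: 136

Question:

A simplified (and slightly silly) example should make my question clearer. Bob and Tim should be treated as unique identifiers here.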

DROP TABLE IF EXISTS #People

CREATE TABLE #People 
(
    Person varchar(10), 
    City varchar(20), 
    Department varchar(1), 
    PersonAge int
);

INSERT INTO #People VALUES ('Bob', 'New York', 'A', 40),
                           ('Tim', 'New York', 'A', 30),
                           ('Tim', 'New York', 'B', 30)

;WITH InitialGrouping AS 
(
    SELECT 
        Person,
        City,
        Department,
        MAX(PersonAge) PersonAge,
        COUNT(*) NumRows
    FROM
        #People
    GROUP BY 
        Person, City, Department
) --SELECT * FROM InitialGrouping
SELECT 
    City,
    Department,
    AVG(PersonAge) AveragePersonAge,
    COUNT(DISTINCT Person) PersonCount
FROM
    InitialGrouping
GROUP BY 
    GROUPING SETS (City, Department, (City, Department))
ORDER BY 
    City, Department

;WITH InitialGrouping AS 
(
    SELECT 
        GROUPING_ID(Person,City,Department) GROUPID,
        Person,
        City,
        Department,
        MAX(PersonAge) PersonAge,
        COUNT(*) NumRows
    FROM
        #People
    GROUP BY  
        Person, GROUPING SETS (City, Department, (City, Department))
) --SELECT * FROM InitialGrouping
SELECT 
    City,
    Department,
    AVG(PersonAge) AveragePersonAge,
    COUNT(DISTINCT Person) PersonCount
FROM 
    InitialGrouping
GROUP BY 
    GROUPID, City, Department
ORDER BY 
    City, Department

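These two examples are not functionally equivalent. If you look at the rollup row for New York, you will see that the average age from the first query (33) is wrong and the one from the second (35) is correct. The second approach is functionally correct, but in a more complex example with a large number of grouping sets it absolutely explodes the cardinality at the Concatenation step of the query plan, which leads to unacceptable performance. In this simple example, you can see that the first query keeps things at 5 rows, while the second expands to 8 rows. In a more complex case with many grouping sets, I am seeing the cardinality expand many times over on millions of rows.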
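To make the 33 versus 35 discrepancy and the 5 versus 8 row counts concrete, here is a minimal Python sketch (not T-SQL) that replays the arithmetic on the three sample rows. It assumes that AVG over an int column truncates to an int, and it assumes the 5 and 8 figures refer to the number of rows each query produces across its grouping sets. The helper names `int_avg` and `keys` are made up for illustration.

```python
# Rows of #People from the question: (Person, City, Department, PersonAge)
people = [
    ("Bob", "New York", "A", 40),
    ("Tim", "New York", "A", 30),
    ("Tim", "New York", "B", 30),
]

COLS = {"Person": 0, "City": 1, "Department": 2}
GROUPING_SETS = [("City",), ("Department",), ("City", "Department")]


def int_avg(values):
    # Assumption: models AVG over int, which drops the fraction
    # (equivalent to floor division for these non-negative ages).
    return sum(values) // len(values)


def keys(rows, cols):
    """Distinct combinations of the given columns."""
    return {tuple(r[COLS[c]] for c in cols) for r in rows}


# Query 1: the CTE is unique on (Person, City, Department), so Tim still
# contributes two rows to the New York rollup.
naive_ny = int_avg([age for _, city, _, age in people if city == "New York"])

# Query 2: Person is part of every grouping set, so the City rollup sees
# one row per (Person, City) before the outer AVG.
age_by_person_city = {}
for person, city, _, age in people:
    k = (person, city)
    age_by_person_city[k] = max(age, age_by_person_city.get(k, age))
dedup_ny = int_avg(
    [a for (_, c), a in age_by_person_city.items() if c == "New York"]
)

# Row counts: groups per grouping set without and with Person added.
rows_query1 = sum(len(keys(people, gs)) for gs in GROUPING_SETS)
rows_query2 = sum(len(keys(people, ("Person",) + gs)) for gs in GROUPING_SETS)

print(naive_ny, dedup_ny)        # 33 35
print(rows_query1, rows_query2)  # 5 8
```

The gap between the two row counts grows with the number of distinct people per group, which is consistent with the blow-up described for larger data.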

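Is there a way to solve this that keeps the row count from expanding beyond the target granularity of the final output?

A solution was proposed in the comments below, and it does work. It may be the best option. It looks like this (I added an extra grouping column, Badge):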

DROP TABLE IF EXISTS #People

CREATE TABLE #People 
(
    Person varchar(10), 
    City varchar(20), 
    Department varchar(1), 
    Badge int,
    PersonAge int
);

INSERT INTO #People VALUES ('Bob', 'New York', 'A', 1, 40),
                           ('Tim', 'New York', 'A', 1, 30),
                           ('Tim', 'New York', 'B', 2, 30),
                           ('Tom', 'New York', 'B', 1, 30),
                           ('Jim', 'Paris', 'A', 2, 50)


;WITH InitialGrouping AS 
(
    SELECT 
        Person,
        City,
        Department,
        Badge,
        ROW_NUMBER() OVER (PARTITION BY Person ORDER BY Person) OverallRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Department, Badge ORDER BY Person) CityDepartmentBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Department ORDER BY Person) CityDepartmentRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City, Badge ORDER BY Person) CityBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Badge, Department ORDER BY Person) DepartmentBadgeRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, City ORDER BY Person) CityRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Department ORDER BY Person) DepartmentRowNum,
        ROW_NUMBER() OVER (PARTITION BY Person, Badge ORDER BY Person) BadgeRowNum,
        MAX(PersonAge) PersonAge
    FROM
        #People
    GROUP BY 
        Person, City, Department, Badge
) --SELECT * FROM InitialGrouping


SELECT 
    GROUPING_ID(City,Department,Badge),
    City,
    Department,
    Badge,
    CASE WHEN GROUPING_ID(City,Department,Badge) = 7 THEN AVG(CASE WHEN OverallRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 6 THEN AVG(CASE WHEN BadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 5 THEN AVG(CASE WHEN DepartmentRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 4 THEN AVG(CASE WHEN DepartmentBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 3 THEN AVG(CASE WHEN CityRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 2 THEN AVG(CASE WHEN CityBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 1 THEN AVG(CASE WHEN CityDepartmentRowNum = 1 THEN PersonAge ELSE NULL END)
         WHEN GROUPING_ID(City,Department,Badge) = 0 THEN AVG(CASE WHEN CityDepartmentBadgeRowNum = 1 THEN PersonAge ELSE NULL END)
         END AS AveragePersonAge,
    COUNT(DISTINCT Person) PersonCount
FROM
    InitialGrouping
GROUP BY 
    GROUPING SETS (City, Department, Badge, (City, Department), (City,Badge), (Department,Badge), (City,Department,Badge),())
ORDER BY 
    City, Department
Tags: sql-server sql performance
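The CASE expression in the query above depends on how GROUPING_ID(City, Department, Badge) numbers the grouping sets: the leftmost argument is the highest bit, and a bit is 1 when that column is aggregated away rather than grouped. The following Python sketch reproduces that numbering so the eight WHEN branches can be checked. The function name `grouping_id` and its parameters are illustrative only.

```python
COLUMNS = ("City", "Department", "Badge")


def grouping_id(grouped, columns=COLUMNS):
    """Bitmask in argument order; a bit is 1 when the column is NOT grouped."""
    value = 0
    for col in columns:
        value = (value << 1) | (0 if col in grouped else 1)
    return value


# The eight grouping sets listed in the GROUPING SETS clause of the query.
grouping_sets = [
    ("City",), ("Department",), ("Badge",),
    ("City", "Department"), ("City", "Badge"), ("Department", "Badge"),
    ("City", "Department", "Badge"), (),
]

for gs in sorted(grouping_sets, key=grouping_id):
    label = ", ".join(gs) if gs else "(grand total)"
    print(grouping_id(gs), label)
```

Under this numbering, 7 is the grand total, 6 is Badge alone, 3 is City alone, and 0 is the full (City, Department, Badge) set, which matches the row-number column chosen in each WHEN branch.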

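Comments

0 votes | J S | 4/14/2023
@marc_s Is there a specific SQL formatting convention on SO that I should be aware of?
0 votes | lptr | 4/14/2023
If you "deduplicate" the rows for each grouping set with row_number() over(partition by grouping set, Person) and aggregate with case when grouping_id() then avg(case when groupingrn=1 then Age), the plan is the same as for the first (incorrect) example, with the addition of the sequence computations.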
0 votes | Charlieface | 4/14/2023
Do you really need grouping sets by City and by Department and by City, Department?
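0 votes | J S | 4/14/2023
As I mentioned, the idea here is that we need a solution that scales to more grouping columns with a variety of sets. So yes, in some cases you need City, Department, (City, Department), and you can imagine several other grouping columns and arbitrary sets you might want to group by. @lptr's solution does work, but it has the drawback of becoming unwieldy as the number of sets grows, because you need a separate row-number column for each grouping set and then a large CASE expression. It may be the least bad option, though.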
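The row-number idea discussed in these comments can be modeled outside SQL Server as follows. This is a sketch in Python rather than the actual T-SQL: for each grouping set it keeps only the first row seen per (Person, group key), which plays the role of "row number = 1", and averages those rows. It assumes a person's age is the same on every row, and it uses floor division to mimic integer AVG. The function name `aggregate` is hypothetical.

```python
from collections import defaultdict

# The five sample rows: (Person, City, Department, Badge, PersonAge)
rows = [
    ("Bob", "New York", "A", 1, 40),
    ("Tim", "New York", "A", 1, 30),
    ("Tim", "New York", "B", 2, 30),
    ("Tom", "New York", "B", 1, 30),
    ("Jim", "Paris", "A", 2, 50),
]
COLS = {"City": 1, "Department": 2, "Badge": 3}


def aggregate(rows, grouping_set):
    """Per group: (average age, distinct person count), each person once."""
    seen = set()              # (person, group key) pairs already counted
    ages = defaultdict(list)  # group key -> ages of first-seen rows
    for row in rows:
        person, age = row[0], row[4]
        key = tuple(row[COLS[c]] for c in grouping_set)
        if (person, key) not in seen:  # stands in for "row number = 1"
            seen.add((person, key))
            ages[key].append(age)
    return {k: (sum(v) // len(v), len(v)) for k, v in ages.items()}


print(aggregate(rows, ("City",)))
# {('New York',): (33, 3), ('Paris',): (50, 1)}
print(aggregate(rows, ("Department",)))
# {('A',): (40, 3), ('B',): (30, 2)}
print(aggregate(rows, ()))
# {(): (37, 4)}
```

The model never materializes more than one pass over the input per grouping set, which mirrors why the T-SQL version avoids the extra Person-level expansion, at the cost of one flag per grouping set.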

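Answers:

0 votes | abolfazl sadeghi | 4/14/2023 | #1

I used the first approach, but changed the method type, the ordering, and the aggregation, and used window functions: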

SELECT
    City,
    Department,
    MAX(PersonAge) AS AveragePersonAge,
    MAX(Person) AS PersonCount
FROM (
    SELECT
        City,
        Department,
        -- Average over every row in the city (rows are not deduplicated per person)
        AVG(PersonAge) OVER (PARTITION BY City) AS PersonAge,
        -- Ascending rank + descending rank - 1 = number of distinct Person values in the city
        DENSE_RANK() OVER (PARTITION BY City ORDER BY Person ASC)
            + DENSE_RANK() OVER (PARTITION BY City ORDER BY Person DESC) - 1 AS Person
    FROM #People
) a
GROUP BY
    GROUPING SETS (Department, City, (Department, City))
ORDER BY
    City, Department

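Comments

0 votes | J S | 4/14/2023
Well, this works for my simplified example. However, as I mentioned, my goal is to scale to more groups, and unless I am missing a way to do that, this does not carry over. Since I can't post code here: for example, add a third column called [Group], then add a third person called Tom and make him 30. Put Tom in a different group from Bob and Tim, but otherwise place him in 'New York', 'A'. If you look at the rollup for 'A', you will see an incorrect value of 35 for the average age and an incorrect value of 2 for the person count.
0 votes | J S | 4/14/2023
Or better yet, without adding a group: add a row for Tom like this: ('Tom', 'New York', 'B', 30), then look at the rollup row for New York. So this does not work even for the simplified example.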
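0 votes | abolfazl sadeghi | 4/14/2023
I thought you were saying that the first query is the better one but that its result is wrong, so I fixed that: your first query is now correct with my changes. I don't understand your problem. Can you describe it, show me the final result you expect, and explain it to me? @JS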
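0 votes | J S | 4/14/2023
The goal is that, within each grouping set, we correctly count the distinct people in that grouping set and compute the average age of those distinct people. Unfortunately, you have not solved that; you solved it for the exact three example rows. The solution should generalize to any data set, and yours does not. As I mentioned, if we add a fourth row ('Tom', 'New York', 'B', 30), your code no longer works for the New York rollup, which should now have 3 distinct people with an average age of 33.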
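0 votes | abolfazl sadeghi | 4/14/2023
I edited the code for "distinct people and average age". You group by City, Department, and my code groups by City, Department as well. If you remove "Department" from this code, then "AVG(PersonAge) over(partition by City) as PersonAge" gives an average of 130/4 = 32. @JS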
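The disagreement in this thread comes down to 130/4 versus 100/3 for New York once the Tom row is added. A small Python sketch of both calculations, plus the DENSE_RANK counting trick from the answer, is given here for reference. It models the arithmetic only and assumes integer AVG truncation; the names `per_row_avg`, `per_person_avg`, and `dense_rank_count` are made up for this illustration.

```python
# Sample rows plus the extra Tom row: (Person, City, Department, PersonAge)
rows = [
    ("Bob", "New York", "A", 40),
    ("Tim", "New York", "A", 30),
    ("Tim", "New York", "B", 30),
    ("Tom", "New York", "B", 30),
]
ny = [r for r in rows if r[1] == "New York"]

# AVG(PersonAge) OVER (PARTITION BY City): every row counts, so Tim counts twice.
per_row_avg = sum(r[3] for r in ny) // len(ny)  # 130 // 4

# What the question asks for: one age per distinct person.
age_by_person = {r[0]: r[3] for r in ny}
per_person_avg = sum(age_by_person.values()) // len(age_by_person)  # 100 // 3

# DENSE_RANK ascending + DENSE_RANK descending - 1 yields the number of
# distinct names in the partition, whichever row it is evaluated on.
names = sorted(age_by_person)


def dense_rank_count(person):
    asc = names.index(person) + 1
    desc = len(names) - names.index(person)
    return asc + desc - 1


print(per_row_avg, per_person_avg)              # 32 33
print({p: dense_rank_count(p) for p in names})  # {'Bob': 3, 'Tim': 3, 'Tom': 3}
```

So the distinct count from the rank trick is right (3), but the windowed average is taken over rows rather than people, which is why it yields 32 instead of the expected 33.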