在接触点表创建 (SQL) 中按问题分组

Group by Problem in Touchpoint Table Creation (SQL)

提问人:Wagner André Yamada Vieira 提问时间:1/9/2023 更新时间:1/9/2023 访问量:25

问:

我正在 Azure Databricks 环境中处理具有以下数据集的 SQL 查询:

CREATE OR REPLACE TABLE touchpoints_table
(
  List  STRING,
  Path_Lenght INT
);

INSERT INTO touchpoints_table VALUES
  ('BBB, AAA, CCC', 3),  
  ('BBB', 1),
  ('DDD, AAA', 2), 
  ('DDD, BBB, AAA, EEE, CCC', 5), 
  ('EEE, AAA, EEE, CCC', 4);
  
SELECT * FROM touchpoints_table
|     |             List           |   Path_length   | 
|  0  | BBB, AAA, CCC              |       3         |  
|  1  | CCC                        |       1         |   
|  2  | DDD, AAA                   |       2         |   
|  3  | DDD, BBB, AAA, EEE, CCC    |       5         |  
|  4  | EEE, AAA, EEE, CCC         |       4         |

该任务包括生成下表:

|     |   Content       |   Unique    |  Started  |  Middleway  |  Finished  |
|  0  |     AAA         |     0       |     0     |      3      |      1     | 
|  1  |     BBB         |     0       |     1     |      1      |      0     |
|  2  |     CCC         |     1       |     0     |      0      |      3     |
|  3  |     DDD         |     0       |     2     |      0      |      0     |
|  4  |     EEE         |     0       |     1     |      2      |      0     |

其中,列包含以下内容:

  • 内容在列表中找到的元素
  • 唯一:元素在列表中单独出现的次数
  • Started:元素在开头出现的次数
  • Finished:元素在末尾出现的次数
  • 中间:元素在开始和结束之间出现的次数。

使用以下查询,我几乎得到了结果,但不知何故,分组无法正常工作

WITH tb1 AS(
  SELECT 
    CAST(touch_array AS STRING) AS touch_list,
    EXPLODE(touch_array) AS explode_list,
    ROW_NUMBER()OVER(PARTITION BY CAST(touch_array AS STRING) ORDER BY (SELECT 1)) touch_count,
    COUNT(*)OVER(PARTITION BY touch_array) touch_lenght
  FROM (SELECT SPLIT(List, ',') AS touch_array FROM touchpoints_table) 
  )
  SELECT
     explode_list AS Content,
     SUM(CASE WHEN touch_lenght=1 THEN 1 ELSE 0 END) AS Unique,
     SUM(CASE WHEN touch_count=1 AND touch_lenght > 1 THEN 1 ELSE 0 END) AS Started,
     SUM(CASE WHEN touch_count>1 AND touch_count < touch_lenght THEN 1 ELSE 0 END) AS Middleway,
     SUM(CASE WHEN touch_count>1 AND touch_count = touch_lenght THEN 1 ELSE 0 END) AS Finished  
  FROM tb1 
  GROUP BY explode_list
  ORDER BY explode_list    
|     |   Content       |   Unique    |  Started  |  Middleway  |  Finished  |
|  0  |     AAA         |     0       |     0     |      3      |      1     | 
|  1  |     BBB         |     0       |     0     |      1      |      0     |
|  2  |     CCC         |     0       |     0     |      0      |      3     |
|  3  |     EEE         |     0       |     0     |      2      |      0     |
|  4  |     BBB         |     1       |     1     |      0      |      0     |
|  5  |     DDD         |     0       |     2     |      0      |      0     |
|  6  |     EEE         |     0       |     1     |      0      |      0     |

你能帮我建议一个解决这个任务的代码吗?

DataFrame 数据分析 Databricks-SQL

评论


答:

0赞 George Joseph 1/9/2023 #1

这是使用 sql server 执行此操作的一种方法。

with main_data
  as (
select  list
       ,ltrim(x.value) as split_val             
       ,x.ordinal 
       ,case when x.ordinal=1 and tt.path_length=1 then
                  'unique'
             when x.ordinal=1 then 
                  'start' 
             when x.ordinal=(tt.path_length+1)/2 then
                  'middle'
             when x.ordinal=tt.path_length then
                   'end'
          end as pos
  from touchpoints_table tt
CROSS APPLY STRING_SPLIT(list,',',1) x
      )
select split_val
       ,count(case when pos='unique' then 1 end) as unique_cnt
       ,count(case when pos='start' then 1 end) as start_cnt
       ,count(case when pos='middle' then 1 end) as middle_cnt
       ,count(case when pos='end' then 1 end) as end_cnt
  from main_data
group by split_val

+-----------+------------+-----------+------------+---------+
| split_val | unique_cnt | start_cnt | middle_cnt | end_cnt |
+-----------+------------+-----------+------------+---------+
| AAA       |          0 |         0 |          3 |       1 |
| BBB       |          0 |         1 |          0 |       0 |
| CCC       |          1 |         0 |          0 |       3 |
| DDD       |          0 |         2 |          0 |       0 |
| EEE       |          0 |         1 |          0 |       0 |
+-----------+------------+-----------+------------+---------+