Asked by: zeroes_ones  Asked: 9/29/2023  Last edited by: Erwin Brandstetter, zeroes_ones  Updated: 9/30/2023  Views: 122
Efficient way to return counts and averages without joining together multiple CTEs
Q:
In my PostgreSQL 14.8 database, I have a table named orders that looks like this:
CREATE TABLE orders (
user_id int
, order_id int
, order_date date
, quantity int
, revenue float
, product text
);
INSERT INTO orders VALUES
(1, 1, '2021-03-05', 1, 15, 'books'),
(1, 2, '2022-03-07', 1, 3, 'music'),
(1, 3, '2022-06-15', 1, 900, 'travel'),
(1, 4, '2021-11-17', 2, 25, 'books'),
(2, 5, '2022-08-03', 2, 32, 'books'),
(2, 6, '2021-04-12', 2, 4, 'music'),
(2, 7, '2021-06-29', 3, 9, 'books'),
(2, 8, '2022-11-03', 1, 8, 'music'),
(3, 9, '2022-11-07', 1, 575, 'food'),
(3, 10, '2022-11-20', 2, 95, 'food'),
(3, 11, '2022-11-20', 1, 95, 'food'),
(4, 12, '2022-11-20', 2, 95, 'books'),
(4, 13, '2022-11-21', 1, 95, 'food'),
(4, 14, '2022-11-23', 4, 17, 'books'),
(5, 15, '2022-11-20', 1, 95, 'food'),
(5, 16, '2022-11-25', 2, 95, 'books'),
(5, 17, '2022-11-29', 1, 95, 'food');
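See the SQL fiddle: http://sqlfiddle.com/#!17/3dc69/1
I would like to get the following:
Among customers whose first purchase (first by order_date) had books as the product (in this case, customer IDs 1 and 4), I would like to get:
A) the average quantity of books that this cohort purchased (in this case, 2.25, which is the average of 1, 2, 2, and 4), and;
B) the total revenue of each of these purchases (in this case, 152).
I tried to use multiple CTEs to achieve this. Here is my attempt: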
WITH all_orders AS (
SELECT
*
, ROW_NUMBER() OVER (PARTITION BY o.user_id ORDER BY o.order_date) AS row_num
FROM
orders o
),
first_product_is_books AS (
SELECT
DISTINCT(user_id) AS usr_id
FROM
all_orders
WHERE product = 'books' AND row_num = 1
),
temp_results AS (
SELECT
*
FROM
all_orders ao
JOIN
first_product_is_books AS fp
ON ao.user_id = fp.usr_id
)
SELECT
avg(quantity)
, sum(revenue)
FROM
temp_results tr
WHERE
tr.product = 'books'
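See the fiddle: http://sqlfiddle.com/#!17/3dc69/1
This works for the toy dataset, but not for the production dataset, where the table has about 400,000 records. I think this query is clunky and not really optimized. Is there a more efficient way?
A:
How about this?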
with order_cte as
(
SELECT
*,
ROW_NUMBER() OVER(partition by user_id order by order_date) as row_num
from orders
)
SELECT
AVG(QUANTITY) AS AVG_QUANTITY
, SUM(REVENUE) AS TOTAL_REV
FROM ORDERS
WHERE PRODUCT = 'books'
and USER_ID IN (
SELECT
USER_ID
FROM ORDER_CTE
WHERE PRODUCT = 'books'
AND ROW_NUM = 1
)
A:
This version of the query is easier to read.
WITH first_orders AS (
SELECT
DISTINCT
user_id,
FIRST_VALUE(product) OVER (PARTITION BY user_id ORDER BY order_date) as first_product
FROM orders
GROUP BY user_id, order_date, product
)
SELECT
AVG(o.quantity) AS avg_quantity,
SUM(o.revenue) AS total_revenue
FROM orders o
JOIN first_orders fo
ON o.user_id = fo.user_id
WHERE fo.first_product = 'books' AND o.product = 'books';
A:
Simpler with DISTINCT ON:
SELECT avg(o.quantity) AS avg_quantity
, sum(o.revenue) AS total_revenue
FROM (
SELECT DISTINCT ON (user_id)
user_id, product
FROM orders
ORDER BY user_id, order_date
) init
JOIN orders o USING (user_id, product)
WHERE init.product = 'books';
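Which query style is most efficient depends mainly on cardinalities: how many rows in total, how many users, how many book orders, how many initial orders for books, the average row size, and the average size of product.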
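To get a feel for those numbers, something like the following could be run against the production table. This is only a rough sketch, assuming the orders table from the question; pg_column_size() reports a size in bytes, and the whole-row figure is an approximation rather than the exact on-disk row size.

```sql
-- Rough cardinality check, assuming the orders table from the question.
SELECT count(*)                                  AS total_rows
     , count(DISTINCT user_id)                   AS users
     , count(*) FILTER (WHERE product = 'books') AS book_orders
     , round(avg(pg_column_size(o)))             AS approx_avg_row_bytes
     , round(avg(pg_column_size(product)))       AS avg_product_bytes
FROM   orders o;
```

On the sample data this reports 17 rows, 5 users and 7 book orders; the byte figures only become meaningful on the real table.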
If there is also a users table with one row per relevant user_id, it can be simpler and faster still.
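A possible shape for that variant, as a sketch only: the question does not show a users table, so the one below is a hypothetical stand-in derived from orders just to make the example runnable. The idea is to fetch each user's first order with a LATERAL subquery instead of sorting the whole orders table.

```sql
-- Hypothetical stand-in for a real users table (not part of the question).
CREATE TEMP TABLE users AS
SELECT DISTINCT user_id FROM orders;

SELECT avg(o.quantity) AS avg_quantity
     , sum(o.revenue)  AS total_revenue
FROM   users u
CROSS  JOIN LATERAL (
   SELECT product              -- product of this user's earliest order
   FROM   orders
   WHERE  user_id = u.user_id
   ORDER  BY order_date
   LIMIT  1
   ) init
JOIN   orders o ON o.user_id = u.user_id
               AND o.product = init.product
WHERE  init.product = 'books';
```

Whether this actually beats DISTINCT ON depends on the cardinalities mentioned above and on a matching index, so it is worth comparing both with EXPLAIN (ANALYZE) on real data.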
You need an index to support the query, ideally on (user_id, order_date, product). Again, the best index depends on the undisclosed details above.
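As a sketch, such an index could be created like this. The index name is made up for illustration, and whether the planner uses it should be checked on the production data rather than assumed.

```sql
-- Index name is hypothetical; the column order follows the suggestion above.
CREATE INDEX orders_user_date_product_idx
    ON orders (user_id, order_date, product);

-- Refresh planner statistics, then inspect the plan of the first-order lookup.
ANALYZE orders;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (user_id)
       user_id, product
FROM   orders
ORDER  BY user_id, order_date;
```

On a 17-row toy table the planner will most likely ignore the index and scan sequentially, so the plan only tells you something at production scale.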
And (user_id, order_date) must be unique, or there is ambiguity about which order is the first.
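A quick way to check for such ties, sketched against the orders table from the question. The second query adds order_id as a tiebreaker, which assumes that order_id reflects the sequence of orders; the question does not confirm that.

```sql
-- Users with more than one order on the same date (potential ties).
SELECT user_id, order_date, count(*) AS orders_that_day
FROM   orders
GROUP  BY user_id, order_date
HAVING count(*) > 1;

-- Deterministic first order per user, ASSUMING order_id breaks ties correctly.
SELECT DISTINCT ON (user_id)
       user_id, product
FROM   orders
ORDER  BY user_id, order_date, order_id;
```

On the sample data the first query returns one row (user 3 on 2022-11-20). Neither of those two orders is that user's first, so the result is unaffected here, but production data may differ.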
See:
A:
The logic of your question is flawed (all of the code below is also available in a fiddle)!
> A) the average quantity of books that this cohort purchased (in this
> case, 2.25, which is the average of 1, 2, 2, and 4), and;
>
> B) the total revenue of each of these purchases (in this case, 152).
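You want the average quantity of books purchased by the two users (1 and 4), but the total revenue for the four purchases isn't 152, it's 323!
This is calculated as follows (see the table population in the fiddle):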
user_1 = (1 x 15) + (2 x 25) = 15 + 50 = 65
user_4 = (2 x 95) + (4 x 17) = 190 + 68 = 258
65 + 258 = 323
323 is the correct total revenue for the two users.
By the way, 152 = 15 + 25 + 95 + 17, i.e. the quantity was not taken into account in that calculation!
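Both figures can be reproduced side by side with a query along these lines. It is a sketch that hard-codes the cohort (users 1 and 4) from the sample data. Which of the two is the intended total depends on whether revenue is stored per unit or per order, and the question does not say.

```sql
-- Book orders of the cohort (users 1 and 4, hard-coded from the sample data).
SELECT sum(o.revenue)              AS sum_revenue                 -- 152
     , sum(o.revenue * o.quantity) AS sum_revenue_times_quantity  -- 323
FROM   orders o
WHERE  o.product = 'books'
AND    o.user_id IN (1, 4);
```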
The table was built and populated from your data.
Then, I ran the following SQL:
WITH cte AS
(
SELECT
user_id, -- some of these fields are not necessary
product, -- they just help in getting the full picture.
order_date,
SUM(revenue * quantity) OVER (PARTITION BY user_id, product) AS rev,
AVG(quantity) OVER (PARTITION BY user_id, product) AS avg_qty,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY user_id, order_date) AS rn
FROM
orders
ORDER BY user_id, order_date
)
SELECT
SUM(rev) AS tot_rev,
ROUND(AVG(avg_qty), 2) AS avg_q
FROM
cte
WHERE
product = 'books' AND rn = 1;
Result:
tot_rev avg_q
323 2.25
Et voilà, the correct answer!
Comments
Among customers who first purchased a book as a product (in this case, customer IDs 1 and 4),
- I don't understand: why is customer 2 (who also has two book purchases) not included? What is the difference between 1 and 2?
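One way to see the difference is to list each user's first order. A minimal sketch, assuming the orders table from the question:

```sql
-- First order per user, by order_date.
SELECT DISTINCT ON (user_id)
       user_id, order_date, product AS first_product
FROM   orders
ORDER  BY user_id, order_date;
```

On the sample data, customer 2's earliest order (2021-04-12) is for music, while the earliest orders of customers 1 and 4 are for books. That is why only 1 and 4 fall into the cohort, even though customer 2 bought books later.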