提问人:askingquestions 提问时间:10/15/2023 最后编辑:askingquestions 更新时间:10/16/2023 访问量:123
选择每所学校的 N 个最新行,但跳过同一学生的重复行
Select N latest rows per school, but skip duplicates from the same student
问:
在Postgres db(引擎版本14.6)中,我有两个表:和.它们如下所示:school_students
id_cards
CREATE TABLE school_students (
id BIGSERIAL PRIMARY KEY,
school_id character varying NOT NULL,
student_id character varying NOT NULL
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX school_students_pkey ON school_students(id int8_ops);
CREATE UNIQUE INDEX school_students_school_student_idx ON school_students(school_id text_ops,student_id text_ops);
CREATE TABLE id_cards (
id BIGSERIAL PRIMARY KEY,
school_student_id bigint NOT NULL REFERENCES school_students(id)
);
-- Indices -------------------------------------------------------
CREATE UNIQUE INDEX id_cards_pkey ON id_cards(id int8_ops);
CREATE INDEX id_cards_school_student_id_idx ON id_cards(school_student_id int8_ops);
身份证会定期重新发给学生,因此每个学生可以有任意数量的身份证。每张卡都是独一无二的。每个学生+学校实体都是独一无二的。一个学生可能属于多所学校。
目前系统中有 11 所学校,每所学校的学生人数从 ~200 到 ~20000 不等。大多数学生只有一张身份证,但他们可以随时签发新身份证,并且他们拥有的身份证数量没有限制。
下面是示例数据的摆弄。
数据库前面是一个带有资源的 API,其工作是获取每个学校 + 学生最近创建的身份证列表并检索列表,限制为每个 .该 API 的查询频率要足够高,因此它必须具有高性能。school_ids
school_id
我有以下查询可以不受任何限制地获得我需要的东西:
SELECT id FROM id_cards
WHERE id IN (
SELECT MAX(c.id) FROM id_cards c
JOIN school_students ss ON c.school_student_id = ss.id
WHERE
school_id in (1, 2, 3) -- up to 100 ids per application code. in practice, there are only a dozen or so schools
GROUP BY
school_id,
student_id
);
但是,在我的子查询中,我想将 where 子句中列出的每个返回的学校+学生记录的数量限制为 100 条,以避免在给定学校可能有大量学生的情况下出现失控查询。例如,为学校 1 最多 50 名最近添加的学生签发的最新身份证,为学校 2 最多 50 名最近添加的学生颁发的身份证,对查询中列出的每所学校重复。school_id
是否有性能查询可以完成此操作?
答:
您可以通过构建子查询和dense_rank
查询来执行此操作。
首先,获取每个学生/学校的最新卡片。(注意:ID 是时间排序的不良替代项。添加日期时间列。)
select
*,
dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards
接下来,我们将其用作子查询,以获取学校向学生发放的最新卡片的顺序。最新的卡有 .student_card_order = 1
with ordered_student_cards as (
select
*,
dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards
)
select
*,
dense_rank() over(partition by school_id order by card_id desc) as school_card_order
from ordered_student_cards
where student_card_order = 1
最后,我们每所学校只能获取前两个。school_card_order <= 2;
with ordered_student_cards as (
select
*,
dense_rank() over(partition by school_id, student_id order by card_id desc) as student_card_order
from school_id_cards
), ordered_school_cards as (
select
*,
dense_rank() over(partition by school_id order by card_id desc) as school_card_order
from ordered_student_cards
where student_card_order = 1
)
select card_id
from ordered_school_cards
where school_card_order <= 2;
演示。
可能有一种更紧凑或更高性能的方法可以做到这一点,但窗口函数和子查询是分解复杂查询的一种方式。
评论
残缺不全的关系模型使得更有效的查询技术变得不可能。
上述视图与性能优化无关。视图基本上只是一个存储的查询。它没有自己的数据或索引。视图是一种便利功能,对性能没有帮助。(不要与 MATERIALIZED VIEW
混淆。
主表仅具有 .没有办法立即从这个数字中推断出一所学校。因此,也无法添加索引来加快手头的查询速度。索引适用于 Postgres(以及大多数其他 RDBMS)中的一个表。如果没有匹配的索引,我最初建议的智能查询是无用的。id_cards
school_student_id
也就是说,对于数据库中的几行,即使是对两个表进行(单次!)顺序扫描的“蛮力”查询(除非添加或更改关系模型,否则唯一剩下的选项)也不会那么昂贵。这应该是尽可能好的:MATERIALIZED VIEW
SELECT card_id
FROM (
SELECT card_id
, row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
FROM (
SELECT DISTINCT ON (s.school_id, s.student_id)
s.school_id, c.id AS card_id
FROM id_cards c
JOIN school_students s ON s.id = c.school_student_id
ORDER BY s.school_id, s.student_id, c.id DESC
) sub1
) sub2
WHERE rn <= 2; -- your limit
相关:
大约:DISTINCT ON
关于(可选)优化:ROWS UNBOUNDED PRECEDING
如果我们知道所有结果行都在 N 个最新的条目中,我们就可以使用它,并优化上面的查询:
SELECT card_id
FROM (
SELECT card_id
, row_number() OVER (PARTITION BY school_id ORDER BY card_id DESC ROWS UNBOUNDED PRECEDING) AS rn
FROM (
SELECT DISTINCT ON (s.school_id, s.student_id)
s.*, c.id AS card_id
FROM (
SELECT *
FROM id_cards
ORDER BY id DESC
LIMIT 1000 -- this covers the latest 2 for all schools (!)
) c
JOIN school_students s ON s.id = c.school_student_id
ORDER BY s.school_id, s.student_id, c.id DESC
) sub1
) sub2
WHERE rn <= 2;
请注意,小提琴中的数据分布(非常不切实际!)添加了单个学校的所有最新行,因此此优化将不适用!
此外,在更新的问题中,增加的限制为 100(高于 2),使此优化变得不那么多汁(甚至适用)。
仅从索引中读取前 N 行(和排序)比读取几个 100k 行要快。在这种情况下(但不是上面的一般查询),将 PK 转换为覆盖索引并从中获取仅索引扫描会有所帮助,如下所示:
ALTER TABLE id_cards
DROP CONSTRAINT id_cards_pkey
, ADD CONSTRAINT id_cards_pkey PRIMARY KEY (id) INCLUDE (school_student_id);
看:
再说一次,根据序列号选择“最新”条目一开始就不可靠。并发写入负载和其他内部结构可能会扰乱序列。此类 ID 仅保证是唯一的,不一定是连续的。添加的时间戳更合理。(加上如何打破联系的规则。
顺便说一句,Postgres 16(在某种程度上已经是第 15 页)对 Postgres 14 进行了多项改进,有助于提高这些查询的性能。比较:
除其他外,不再有帮助,因为优化现在是内置的。(也无伤大雅。ROWS UNBOUNDED PRECEDING
初始问题的答案已过时
(虽然仍然假设一个表。school_id_cards
如果您的桌子很大,您希望避免昂贵的整张桌子顺序扫描。使用智能查询从匹配的索引中选取具有 index(-only) 扫描的合格行。速度快得多。
通常,数据库中应该存在某种“学校”表,每个相关学校只有一行。使查询更简单、更快捷:
WITH RECURSIVE latest_card AS (
SELECT c.*
FROM school s
CROSS JOIN LATERAL (
SELECT c.school_id, c.card_id, ARRAY[c.student_id] AS leading_ids
FROM school_id_cards c
WHERE c.school_id = s.school_id
ORDER BY c.card_id DESC
LIMIT 1
) c
UNION ALL
SELECT c.*
FROM latest_card l
JOIN LATERAL (
SELECT l.school_id, c.card_id, l.leading_ids || student_id
FROM school_id_cards c
WHERE c.school_id = l.school_id
AND c.card_id < l.card_id
AND c.student_id <> ALL (l.leading_ids)
ORDER BY c.card_id DESC
LIMIT 1
) C ON cardinality(l.leading_ids) < 2 -- your limit per school here!
)
SELECT card_id
FROM latest_card
ORDER BY card_id;
就像你所展示的那样,这对每所学校的小限制来说是很好的。对于较大的限制,我会切换到不同的查询。
关于递归 CTE (rCTE) 的使用:
确保有一个匹配的索引,如
CREATE INDEX ON school_id_cards (school_id DESC, card_id DESC);
具有(默认)升序排序顺序的更简单的索引几乎不会更糟。Postgres 可以向后扫描 B 树索引。只有相反的排序顺序会不太理想。
如果没有表:school
WITH RECURSIVE latest_card AS (
(
SELECT DISTINCT ON (school_id)
school_id, card_id, ARRAY[student_id] AS leading_ids
FROM school_id_cards c
ORDER BY school_id DESC, card_id DESC
)
UNION ALL
SELECT c.*
FROM latest_card l
JOIN LATERAL (
SELECT l.school_id, c.card_id, l.leading_ids || student_id
FROM school_id_cards c
WHERE c.school_id = l.school_id
AND c.card_id < l.card_id
AND c.student_id <> ALL (l.leading_ids)
ORDER BY c.card_id DESC
LIMIT 1
) C ON cardinality(l.leading_ids) < 2 -- your limit per school here!
)
SELECT card_id
FROM latest_card
ORDER BY card_id;
大约:DISTINCT ON
你可以用另一个嵌套的 rCTE 替换非递归术语来生成学校列表(可能使用最新的卡片来启动事情)......
但真的应该有一张桌子。如果没有,请创建它。school
评论
schools
school_students
id_cards
schools
students
school_id_cards
school_students
id_cards
where school_id in
FOREIGN TABLE
,以避免滚动自己的表。
评论