提问人:O.rka 提问时间:9/16/2023 最后编辑:O.rka 更新时间:9/18/2023 访问量:138
在 Python 中查找重叠间隔之间的最长间隔?
Find longest interval between overlapping intervals in Python?
问:
我有一个元组形式的区间列表,如下所示:
data = [(5,10), (3,10), (13,15), (12,18), (20,29), (25,30)]
元组中的每个项目都有 2 个值(即开始、结束),不同区间之间可能存在重叠,也可能没有重叠。如果有重叠的间隔,我只想选择最长的间隔。此测试中的输出如下:
output = [(3,10), (12,18), (20,29)]
如何使用标准库、numpy
或 pandas
在 Python 中执行此操作?
我开始做一些像这样幼稚的事情,但我认为这不会很好地扩展......我也宁愿不使用NetworkX
import networkx as nx
data = [(5,10), (3,10), (13,15), (12,18), (20,29), (25,30)]
graph = nx.Graph()
n = len(data)
for i, a in enumerate(data):
a_seq = set(range(a[0], a[1] + 1))
for j in range(i+1, n):
b = data[j]
b_seq = set(range(b[0], b[1] + 1))
n_overlap = len(a_seq & b_seq)
if n_overlap:
graph.add_edge(a, b, weight=n_overlap)
output = list()
for nodes in nx.connected_components(graph):
lengths = dict()
for node in nodes:
start, end = node
lengths[node] = end - start
longest_interval, length_of_interval = sorted(lengths.items(), key=lambda x:x[1], reverse=True)[0]
output.append(longest_interval)
我假设有更好的方法,但现在它正在逃避我。
编辑:任务中可能会有一些混淆,但我无法混合和匹配间隔(例如,(20,30)无效,因为它不是开始间隔之一)。
答:
这实际上是图论的一个很好的例子,但是应该可以从邻接矩阵开始只使用 numpy。
邻接矩阵基本上将 a 分配给连接到所有其他索引的每两个列表元素索引,从而在 2 维数组中定义无向图。1
0
经典的方法是先实现深度优先搜索或以这种方式实现其他方法,但是 math.stackexchange 上的这个答案有一个非常好的方法,使用拉普拉斯矩阵并查找特征值和特征向量。(两者都是用 numpy 实现的)
from itertools import groupby
import numpy as np
data = [(5, 10), (3, 10), (13, 15), (12, 18), (20, 29), (25, 30)]
def overlap(a, b):
return int((a[0] <= b[1]) and (b[0] <= a[1]))
l = len(data)
# Create Adjecency Matrix
A = np.array([[(overlap(data[i], data[j])) for i in range(l)] for j in range(l)])
# Create Laplacian
L = np.eye(l) - A
# Get Eigenvectors
_, eig_vect = np.linalg.eig(L)
# get inidces of non-zero entries
r, c = np.nonzero(eig_vect.transpose())
# group non-zero entries and remove duplicate groups
group_ix = {
tuple(val[1] for val in group)
for _, group in groupby(list(enumerate(c)), key=lambda x: r[x[0]])
}
# find largest interval in data for every group
largest_obj = [
data[group[np.argmax([data[i][1] - data[i][0] for i in group])]]
for group in group_ix
]
这将为您提供的数据提供所需的答案。要微调您看到的“属于一个组”的内容,您可以相应地调整函数。overlap
评论
data = [(1086, 1211), (1860, 2581), (2686, 2886), (2925, 3187), (3194, 3319), (3607, 3726), (3748, 3900), (3973, 6159), (4189, 6159), (6163, 6255), (6600, 6707)]
(3973, 6159), (4189, 6159)
r, c = np.nonzero(eig_vect.transpose())
你的帖子提到不使用任何库,所以这就是我所做的。此外,您在评论中提到了许多说明,因此此代码可以处理所有边缘情况。
def find_max_length_in_overlap_groups(intervals):
#get sorted list of intervals by starting point
aux = sorted(intervals, key=lambda x: x[0])
new_group = [aux[0]]
output = []
for interval in aux[1:]:
if interval[0] > new_group[-1][-1]:
#if the interval does not overlap with the last interval we start a new group
output.append(
#find interval of maximum length
max(new_group, key=lambda interval: interval[1] - interval[0])
)
#start new group
new_group = [interval]
else:
#otherwise just append it onto the same group if it overlaps
new_group.append(interval)
#take care of the last group, doing the same as above
output.append(max(new_group, key=lambda interval: interval[1] - interval[0]))
return output
此方法基于此处发现的类似问题的解决方案,并具有更完整的解释。此方法不保留输出的原始顺序。如果希望,则必须稍微修改函数,方法是存储原始索引,然后按它们进行排序:
def find_max_length_in_overlap_groups(intervals):
#store the indices
indices={val:ind for ind, val in enumerate(intervals)}
aux = sorted(intervals, key=lambda x: x[0])
new_group = [aux[0]]
output = []
for interval in aux[1:]:
if interval[0] > new_group[-1][-1]:
output.append(
max(new_group, key=lambda interval: interval[1] - interval[0])
)
new_group = [interval]
else:
new_group.append(interval)
output.append(max(new_group, key=lambda interval: interval[1] - interval[0]))
#sort the output according to indices in input to preserve order
return sorted(output, key = lambda x: indices[x])
两种算法具有相同的时间复杂度:由于排序O(n log n)
如果性能并不重要,则可以使用间隔树轻松实现这一点。
from typing import Iterable, List, Tuple
from intervaltree import IntervalTree
def merge_overlaps_by_intervaltree(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
tree = IntervalTree()
for interval in intervals:
tree.addi(begin=interval[0], end=interval[1], data=interval)
tree.merge_overlaps(data_reducer=lambda x, y: x if x[1] - x[0] >= y[1] - y[0] else y)
return [interval.data for interval in sorted(tree)]
但是,这个库为您的目标做了一些不必要的工作,所以我实现了一个更快的库。
from typing import Iterable, List, Tuple
import operator
def merge_overlaps(intervals: List[Tuple[int, int]], merge_adjacent: bool = False) -> Iterable[Tuple[int, int]]:
# You can delete this line if intervals are guaranteed to be sorted.
intervals = sorted(intervals)
# The start/stop of the group of overlapping intervals currently being updated.
group_start, group_stop = intervals[0]
# The longest interval currently found in the current group.
longest_interval = (group_start, group_stop)
longest_length = group_stop - group_start
# Whether adjacent (e.g. [(1,3),(3,10)]) are considered overlapping or not.
is_overlapped = operator.le if merge_adjacent else operator.lt
for interval_start, interval_stop in intervals[1:]:
interval_length = interval_stop - interval_start
if is_overlapped(interval_start, group_stop): # This interval can be merged into the group.
if interval_length > longest_length: # Update the longest interval.
longest_interval = (interval_start, interval_stop)
longest_length = interval_length
if interval_stop > group_stop: # Update the group range.
group_stop = interval_stop
else: # This interval cannot be merged into the group. A new group must be created.
yield longest_interval # Before creating a new group, output the previous group.
group_start, group_stop = interval_start, interval_stop
longest_interval = (interval_start, interval_stop)
longest_length = interval_length
yield longest_interval
即使使用较大的输入,这也应该可以快速工作,因为它不会创建与输入量成比例增长或重叠的中间对象(排序间隔除外)。
请注意,关于我在评论中询问的相邻间隔,您可以使用参数更改行为。merge_adjacent
例:
data = [(5, 10), (3, 10), (13, 15), (12, 18), (20, 29), (25, 30)]
print(list(merge_overlaps(data)))
# [(3, 10), (12, 18), (20, 29)]
data = [(1, 5), (4, 7), (6, 8)]
print(list(merge_overlaps(data)))
# [(1, 5)]
data = [(1, 3), (3, 10)]
print(list(merge_overlaps(data)))
# [(1, 3), (3, 10)]
data = [(1, 3), (3, 10)]
print(list(merge_overlaps(data, merge_adjacent=True)))
# [(3, 10)]
评论
lambda
operator
typing
评论