Asked by: Keith · Asked: 12/5/2012 · Last edited by: jfs, Keith · Updated: 4/19/2014 · Views: 2924
Replace list of list with "condensed" list of list while maintaining order
Q:
I have a list of lists, as shown in the code I attached. I want to link each sublist if they have any values in common, and then replace the list of lists with a condensed list of lists. Examples: if I have [[1,2,3],[3,4]] I want [1,2,3,4]. If I have [[4,3],[1,2,3]] I want [4,3,1,2]. If I have [[1,2,3],[a,b],[3,4],[b,c]] I want [[1,2,3,4],[a,b,c]] or [[a,b,c],[1,2,3,4]], I don't care which.
I am almost there...
My problem is that when I have a case like [[1,2,3],[10,5],[3,8,5]] I want [1,2,3,10,5,8], but with my current code I get [1,2,3,8,10,5].
Here is my code:
import itertools

a = [1,2,3]
b = [3,4]
i = [21,22]
c = [88,7,8]
e = [5,4]
d = [3, 50]
f = [8,9]
g = [9,10]
h = [20,21]

lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
#I have a lot of list but not very many different lists

def any_overlap(a, b):
    sb = set(b)
    return any(itertools.imap(sb.__contains__, a))

def find_uniq(lst):
    ''' return the uniq parts of lst'''
    seen = set()
    seen_add = seen.add
    return [x for x in lst if x not in seen and not seen_add(x)]

def overlap_inlist(o_lst, lstoflst):
    '''
    Search for overlap, using "any_overlap", of a list (o_lst) in a list of lists (lstoflst).
    If there is overlap add the uniq part of the found list to the search list, and keep
    track of where that list was found
    '''
    used_lst = []
    n_lst = []
    for lst_num, each_lst in enumerate(lstoflst):
        if any_overlap(o_lst, each_lst):
            n_lst.extend(each_lst)
            used_lst.append(lst_num)
    n_lst = find_uniq(n_lst)
    return n_lst, used_lst

def comb_list(lst):
    '''
    For each list in a list of lists find all the overlaps using "overlap_inlist".
    Update the list each time to delete the found lists. Return the final combined list.
    '''
    for updated_lst in lst:
        n_lst, used_lst = overlap_inlist(updated_lst, lst)
        lst[:] = [x for i, x in enumerate(lst) if i not in used_lst]
        lst.insert(0, n_lst)
    return lst

comb_lst = comb_list(lst)
print comb_lst
The output of this script is:
[[88, 7, 8, 9, 10], [1, 2, 3, 4, 50, 5], [21, 22, 20]]
I want it so the keys are in the original order, like:
[[88, 7, 8, 9, 10], [1, 2, 3, 4, 5, 50], [21, 22, 20]]
The 5 and 50 are switched in the new lst[2].
I'm somewhat new to Python. I would appreciate any solutions to this problem or comments on my current code. I am not a computer scientist; I imagine there may be some kind of algorithm that does this quickly (maybe from set theory?). If there is such an algorithm, please point me in the right direction.
I may be making this way more complicated than it is... Thank you!!
A:
I'm sure there is a cleaner way to do this, but I started down a certain path and did what I had to do to make it work without any refactoring.
lookup = {}
out = []
index = 0
for grp in lst:
    keys = [lookup.get(num, None) for num in grp]
    keys = [key for key in keys if key is not None]
    if len(keys):
        if len(set(keys)) != 1:
            for num in grp:
                out[keys[0]].append(num)
            seen = set()
            keys = [key for key in keys if key not in seen and not seen.add(key)]
            for key in keys[1:]:
                out[keys[0]].extend(out[key])
                del out[key]
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
        else:
            for num in grp:
                lookup[num] = keys[0]
            out[keys[0]].extend(grp)
            seen = set()
            out[keys[0]] = [item for item in out[keys[0]] if item not in seen and not seen.add(item)]
    else:
        out.append(grp)
        for num in grp:
            lookup[num] = index
        index += 1
print out
Thanks to @Steven for the list reduction technique using sets.
Output
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
Here is a brute-force approach (it might be easier to understand):
from itertools import chain

def condense(*lists):
    # remember original positions
    positions = {}
    for pos, item in enumerate(chain(*lists)):
        if item not in positions:
            positions[item] = pos

    # condense disregarding order
    sets = condense_sets(map(set, lists))

    # restore order
    result = [sorted(s, key=positions.get) for s in sets]
    return result if len(result) != 1 else result[0]

def condense_sets(sets):
    result = []
    for candidate in sets:
        for current in result:
            if candidate & current:   # found overlap
                current |= candidate  # combine (merge sets)
                # new items from candidate may create an overlap
                # between current set and the remaining result sets
                result = condense_sets(result) # merge such sets
                break
        else:  # no common elements found (or result is empty)
            result.append(candidate)
    return result
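The order restoration in condense() hinges on remembering the position of each item's first occurrence and sorting each merged set by that position. In isolation, the trick looks like this (a standalone sketch, not part of the answer's code):

```python
# Remember the first position at which each item appears,
# then sort an unordered merged set by that position.
positions = {}
for pos, item in enumerate([1, 2, 3, 10, 5, 3, 8, 5]):
    positions.setdefault(item, pos)

merged = {8, 1, 10, 2, 3, 5}  # an unordered set produced by merging
print(sorted(merged, key=positions.get))  # [1, 2, 3, 10, 5, 8]
```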
Example
>>> condense([1,2,3], [10,5], [3,8,5])
[1, 2, 3, 10, 5, 8]
>>> a = [1,2,3]
>>> b = [3,4]
>>> i = [21,22]
>>> c = [88,7,8]
>>> e = [5,4]
>>> d = [3, 50]
>>> f = [8,9]
>>> g= [9,10]
>>> h = [20,21]
>>> condense(*[a,b,c,i,e,d,f,g,h,a,c,i]*1000)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
>>> condense([1], [2, 3, 2])
[[1], [2, 3]]
Performance comparison

The condense_*() functions are from answers to this question. lst_OP is the input list from the question (of different size); lst_BK is the test list from @Blckknght's answer (of different size). See the source code.

Measurements show that solutions based on the "disjoint sets" and "connected components of an undirected graph" concepts perform similarly on both types of input.
name time ratio comment
condense_hynekcer 5.79 msec 1.00 lst_OP
condense_hynekcer2 7.4 msec 1.28 lst_OP
condense_pillmuncher2 11.5 msec 1.99 lst_OP
condense_blckknght 16.8 msec 2.91 lst_OP
condense_jfs 26 msec 4.49 lst_OP
condense_BK 30.5 msec 5.28 lst_OP
condense_blckknght2 30.9 msec 5.34 lst_OP
condense_blckknght3 32.5 msec 5.62 lst_OP
name time ratio comment
condense_blckknght 964 usec 1.00 lst_BK
condense_blckknght2 1.41 msec 1.47 lst_BK
condense_blckknght3 1.46 msec 1.51 lst_BK
condense_hynekcer2 1.95 msec 2.03 lst_BK
condense_pillmuncher2 2.1 msec 2.18 lst_BK
condense_hynekcer 2.12 msec 2.20 lst_BK
condense_BK 3.39 msec 3.52 lst_BK
condense_jfs 544 msec 563.66 lst_BK
name time ratio comment
condense_hynekcer 6.86 msec 1.00 lst_rnd
condense_jfs 16.8 msec 2.44 lst_rnd
condense_blckknght 28.6 msec 4.16 lst_rnd
condense_blckknght2 56.1 msec 8.18 lst_rnd
condense_blckknght3 56.3 msec 8.21 lst_rnd
condense_BK 70.2 msec 10.23 lst_rnd
condense_pillmuncher2 324 msec 47.22 lst_rnd
condense_hynekcer2 334 msec 48.61 lst_rnd
To reproduce the results, clone the gist and run measure-performance-condense-lists.py.
This solution uses only ordered dictionaries.
deepcopy() is necessary if you want the original to remain unchanged.
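As a quick aside on why deepcopy() matters here: treat() rewrites its sublists in place (sublist[:] = short), and a shallow copy would share those inner lists with the caller's original. A minimal illustration:

```python
from copy import deepcopy

orig = [[1, 2], [3, 4]]
shallow = list(orig)   # new outer list, but the SAME inner lists
deep = deepcopy(orig)  # new outer list AND new inner lists

shallow[0].append(99)  # mutates orig[0] too
deep[1].append(77)     # leaves orig untouched
print(orig)            # [[1, 2, 99], [3, 4]]
```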
from collections import OrderedDict
from copy import deepcopy

def treat(passed_list):
    L = deepcopy(passed_list)
    dic = OrderedDict()
    for subl in L:
        for x in subl:
            if x not in dic:
                dic[x] = subl
    print 'dic at start'
    print '\n'.join('%-3s : %s' % (a, dic[a])
                    for a in dic) + '\n'
    for sublist in L:
        short = []
        short.extend(el for el in sublist
                     if el not in short)
        seen = []
        for k, val in dic.iteritems():
            if val is sublist:
                break
            if k in short:
                if val not in seen:
                    seen.append(val)
        sumseen = []
        for elseen in seen:
            for y in elseen:
                sumseen.append(y)
                dic[y] = sumseen
        if seen:
            for el in sublist:
                if el not in sumseen:
                    sumseen.append(el)
                    dic[el] = sumseen
        sublist[:] = short
    cumul = []
    cumul.extend(lu for lu in dic.itervalues()
                 if lu not in cumul)
    return cumul


plus = [[1,2,3,2,1],[10,5,5,5,10],
        [8,5,3,3,5],[45,50,12,45,40,12]]

lst = [[1,2,3], [10,5], [3,8,5]]

for one_list in (plus, lst):
    print 'one_list before == %r\n' % one_list
    print 'treat(one_list) == %r\n' % treat(one_list)
    print 'one_list after == %r\n' % one_list
    print '===================================='
Result
one_list before == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
dic at start
1 : [1, 2, 3, 2, 1]
2 : [1, 2, 3, 2, 1]
3 : [1, 2, 3, 2, 1]
10 : [10, 5, 5, 5, 10]
5 : [10, 5, 5, 5, 10]
8 : [8, 5, 3, 3, 5]
45 : [45, 50, 12, 45, 40, 12]
50 : [45, 50, 12, 45, 40, 12]
12 : [45, 50, 12, 45, 40, 12]
40 : [45, 50, 12, 45, 40, 12]
treat(one_list) == [[1, 2, 3, 10, 5, 8], [45, 50, 12, 40]]
one_list after == [[1, 2, 3, 2, 1], [10, 5, 5, 5, 10], [8, 5, 3, 3, 5], [45, 50, 12, 45, 40, 12]]
====================================
one_list before == [[1, 2, 3], [10, 5], [3, 8, 5]]
dic at start
1 : [1, 2, 3]
2 : [1, 2, 3]
3 : [1, 2, 3]
10 : [10, 5]
5 : [10, 5]
8 : [3, 8, 5]
treat(one_list) == [[1, 2, 3, 10, 5, 8]]
one_list after == [[1, 2, 3], [10, 5], [3, 8, 5]]
====================================
This solution has one inconvenience: it is 2 to 3 times slower than J.F. Sebastian's solution.
Here's my approach. It uses the concept of a "disjoint set" to first identify which sublists overlap with each other, then it joins them together, eliminating duplicates.
from collections import OrderedDict

def disjoint_set_find(djs, node): # disjoint set find, with path compression
    if node not in djs: # base case, node is a root of a set
        return node
    djs[node] = disjoint_set_find(djs, djs[node]) # recurse, caching results
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)   # find root of first set
    second = disjoint_set_find(djs, second) # and of second set
    if first < second: # make smaller root the root of the new combined set
        djs[second] = first
    elif second < first:
        djs[first] = second
    # deliberately ignore the case where first == second (same set)

def condenseBK(*master_list):
    values = OrderedDict() # maps values to the first sublist containing them
    overlaps = {}  # maps sublist indexes to each other to form a disjoint set
    for i, sublist in enumerate(master_list):
        for v in sublist:
            if v not in values: # no overlap, so just store value
                values[v] = i
            else: # overlap detected, do a disjoint set union
                disjoint_set_union(overlaps, values[v], i)
    output = [] # results
    output_map = {} # map from original indexes to output indexes
    for v, i in values.items(): # iterate over values in order
        root = disjoint_set_find(overlaps, i)
        if root not in output_map:
            output_map[i] = len(output)
            output.append([]) # create new output sublists as necessary
        output[output_map[root]].append(v)
    return output
Sample output:
>>> a = [1,2,3]
>>> b = [3,4]
>>> c = [88,7,8]
>>> d = [3, 50]
>>> e = [5,4]
>>> f = [8,9]
>>> g = [9,10]
>>> h = [20,21]
>>> i = [21,22]
>>> lst = [a,b,c,i,e,d,f,g,h,a,c,i]*1000
>>> condenseBK(*lst)
[[1, 2, 3, 4, 5, 50], [88, 7, 8, 9, 10], [21, 22, 20]]
Algorithm explanation:
By request, here is an explanation of how my code works.
The first two functions implement the find and union operations of a disjoint set. The data structure is implemented with a dictionary mapping nodes to their parent nodes. Any node that is not a key of the dictionary is the root of a set. The find operation returns the root node of the set containing a given node. To improve performance, I've implemented "path compression", which reduces the number of recursive steps needed over time. The union operation combines the sets that contain its arguments first and second.
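To make the parent-dictionary representation concrete, here are the two operations restated in isolation with a tiny driver (the driver values are my own, not from the answer):

```python
def disjoint_set_find(djs, node):
    if node not in djs:  # a node with no parent entry is a root
        return node
    djs[node] = disjoint_set_find(djs, djs[node])  # path compression
    return djs[node]

def disjoint_set_union(djs, first, second):
    first = disjoint_set_find(djs, first)
    second = disjoint_set_find(djs, second)
    if first < second:    # smaller root becomes the combined root
        djs[second] = first
    elif second < first:
        djs[first] = second

djs = {}
disjoint_set_union(djs, 0, 2)     # set {0, 2}, root 0
disjoint_set_union(djs, 2, 5)     # set {0, 2, 5}, still root 0
print(disjoint_set_find(djs, 5))  # 0
```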
The main condense function consists of two parts. First it sets up a couple of data structures, then it uses them to build the output.

values is an OrderedDict that maps from each value to the index of the first sublist containing it. The order in which each value was added is used to produce the output in the right order.

overlaps is the dictionary used for the disjoint set. Its nodes are the integer indexes of overlapping sublists.

The first loops fill up those two data structures. They loop over the sublists and their contents. If a value has not been seen before, it is added to the values dictionary. If it has been seen, the current sublist overlaps with a previous sublist containing that value.

To resolve the overlap, the code performs a union of the disjoint sets containing the two sublists.

The output is built in the output list. Because there are likely to be fewer output sublists than there were in the input, we need an extra dictionary, output_map, to map between the old indexes (from the input) and the new indexes that apply to the output list.

To fill up the output list, we iterate over the values, which happens in the order they were added thanks to the OrderedDict class. Using the disjoint set, it correctly combines the overlapping lists.
This algorithm has very good performance when there are lots of lists to be processed that don't overlap immediately. For instance, this set of 200 three-element lists ultimately all overlaps, but you only start seeing the overlaps appear after the first 100 have been processed:
lst2 = [list(range(4*i, 4*(i+1)-1)) for i in range(100)] + \
[list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]
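To see why this input delays the overlaps, here is the shape of the data the construction above produces (an illustration, not part of the answer):

```python
lst2 = [list(range(4*i, 4*(i+1)-1)) for i in range(100)] + \
       [list(range(4*i+2, 4*(i+1)+1)) for i in range(100)]

print(lst2[0])    # [0, 1, 2]  -- first half: mutually disjoint triples
print(lst2[1])    # [4, 5, 6]
print(lst2[100])  # [2, 3, 4]  -- second half: each one bridges two earlier triples
```

Each list in the second half shares one element with two lists from the first half, so every union happens only after all 100 disjoint groups already exist.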
Your problem is essentially a graph-theoretic one, the problem of connected components, with an added requirement regarding the order of the elements of each component.

In your program, the set of all lists forms an undirected graph, where each list is a node in the graph. Two lists are connected directly if they have common elements, and connected indirectly if there exists a third list to which both are connected, directly or indirectly. Given e.g. the three lists [a, b], [b, c] and [c, d], then [a, b] and [b, c] are connected directly, as are [b, c] and [c, d], but [a, b] and [c, d] are connected indirectly, since while they don't share common elements, they both share elements with the same list [b, c].

A group of nodes is a connected component if all nodes in the group are connected (directly or indirectly) and no other node of the graph is connected to any node in the group.
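Connected components of an undirected graph can be computed in linear time with a plain breadth-first search over an adjacency map. This standalone sketch (my own, simpler than the generator-based code in this answer) shows the idea on edges between lists' labels:

```python
from collections import defaultdict, deque

def components(edges):
    # build an undirected adjacency map
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, result = set(), []
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        todo, comp = deque([start]), []
        while todo:  # BFS: every node and edge is visited once
            node = todo.popleft()
            comp.append(node)
            for other in neighbors[node] - seen:
                seen.add(other)
                todo.append(other)
        result.append(sorted(comp))
    return result

print(components([('a', 'b'), ('b', 'c'), ('c', 'd'), ('x', 'y')]))
# [['a', 'b', 'c', 'd'], ['x', 'y']]
```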
There is a fairly simple linear-time algorithm that generates all connected components of an undirected graph. Using it, we can define a function that generates all condensed disjoint lists of lists while maintaining the order of their elements:
from itertools import imap, combinations_with_replacement
from collections import defaultdict

def connected_components(edges):
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield component(node)

def condense(lists):
    tuples = combinations_with_replacement(enumerate(imap(tuple, lists)), 2)
    overlapping = ((a, b) for a, b in tuples
                   if not set(a[1]).isdisjoint(b[1]))
    seen = set()
    see = seen.add
    for component in connected_components(overlapping):
        yield [item for each in sorted(component)
               for item in each[1]
               if not (item in seen or see(item))]

print list(condense([[1, 2, 3], [10, 5], [3, 8, 5], [9]]))
print list(condense([[1, 2, 3], [5, 6], [3, 4], [6, 7]]))
Result:
[[1, 2, 3, 10, 5, 8], [9]]
[[5, 6, 7], [1, 2, 3, 4]]
The time complexity of condense() is quadratic, since every list must be tested against every other list to find out if they have common elements. Hence, the performance is awful. Can we improve it somehow? Yes.
Two lists are connected directly if they have common elements. We can turn this relationship around: two elements are connected directly if they belong to the same list, and connected indirectly if there exists a third element that is connected (directly or indirectly) to both of them. Given e.g. two lists [a, b] and [b, c], then a and b are connected directly, as are b and c, and hence a and c are connected indirectly. If we also adopt J.F. Sebastian's idea of storing the position of each element's first occurrence, we can implement it like so:
def condense(lists):
    neighbors = defaultdict(set)
    positions = {}
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    def component(node, neighbors=neighbors, seen=seen, see=seen.add):
        unseen = set([node])
        next_unseen = unseen.pop
        while unseen:
            node = next_unseen()
            see(node)
            unseen |= neighbors[node] - seen
            yield node
    for node in neighbors:
        if node not in seen:
            yield sorted(component(node), key=positions.get)
It still uses the connected components algorithm, but this time we view elements as connected, not lists. The results are the same as before, but since the time complexity is now linear, it runs much faster.
I tried to write a fast and readable solution. It is never much slower than the other solutions, as far as I know, but can sometimes be much faster, because it is additionally optimized for longer sublists and for many sublists that are subsets of an already existing group. (This is motivated by the question's text: "I have a lot of list but not very many different lists.") The code uses less memory, only as much as the condensed data require, which can be much less than the original data. It can work e.g. with a generator collecting data from a real-time process. The estimated complexity is O(n log n). I think no algorithm that uses sorting can be of linear complexity.
def condense(lists):
    groups = {}      # items divided into groups {id(the_set): the_set,...}
    members = {}     # mapping from item to group
    positions = {}   # mapping from item to sequential ordering
    iposition = 0    # counter for positions
    for sublist in lists:
        if not sublist or members.get(sublist[0], set()).issuperset(sublist):
            continue # speed-up condition if all is from one group
        common = set([x for x in sublist if x in members])
        if common:   # any common group can be a destination for merge
            dst_group = members[common.pop()]
            common = common - dst_group  # are more groups common for sublist?
            while common:
                grp = members[common.pop()]
                if len(grp) > len(dst_group): # merge shorter into longer grp
                    grp, dst_group = dst_group, grp
                dst_group.update(grp)
                for item in grp:
                    members[item] = dst_group
                del groups[id(grp)]
                common = common - dst_group
        else:        # or create a new group if nothing common
            dst_group = set()
            groups[id(dst_group)] = dst_group
        newitems = []
        for item in sublist: # merge also new items
            if item not in positions:
                positions[item] = iposition
                iposition += 1
                newitems.append(item)
                members[item] = dst_group
        dst_group.update(newitems)
    return [sorted(x, key=positions.get) for x in groups.values()]
It is faster than pillmuncher2 for sublists longer than about 8 items, because it can process many items at once. It is also very fast for lists with many similar sublists, or many sublists that are subsets of an already existing group. It is 25% faster than pillmuncher2 for lst_OP, but 15% slower for lst_BK.
An example of test data with long sublists is [list(range(30)) + [-i] for i in range(100)].
I intentionally wrote "common = common - dst_group" instead of using the -= operator or "set.difference_update", because the in-place update is not effective if the set on the right side is much bigger than the one on the left.
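The asymmetry behind that choice: a - b walks only a (one membership test per element of a), while a -= b / a.difference_update(b) walks all of b and discards each element from a. A small illustration of the claim, assuming CPython's set behavior:

```python
small = set(range(10))
big = set(range(1_000_000))

# New-set difference iterates over `small`: about 10 membership tests.
result = small - big

# In-place difference iterates over `big`: about 1,000,000 discard calls,
# even though `small` only held 10 elements to begin with.
small -= big

print(result, small)  # both empty here, but at very different cost
```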
A modification of pillmuncher's solution that is easier to read. The modification is a little slower than the original because a generator was replaced by list.append. Probably the most readable fast solution.
# Modified pillmuncher's solution
from collections import defaultdict

def condense(lists):
    neighbors = defaultdict(set)  # mapping from items to sublists
    positions = {}                # mapping from items to sequential ordering
    position = 0
    for each in lists:
        for item in each:
            neighbors[item].update(each)
            if item not in positions:
                positions[item] = position
                position += 1
    seen = set()
    see = seen.add
    for node in neighbors:
        if node not in seen:
            unseen = set([node])      # this is a "todo" set
            next_unseen = unseen.pop  # method alias, not called now
            group = []                # collects the output
            while unseen:
                node = next_unseen()
                see(node)
                unseen |= neighbors[node] - seen
                group.append(node)
            yield sorted(group, key=positions.get)
class List(list): pass

rv = dict()

def condense_step():
    """find and merge one overlapping pair in rv"""
    for i, iv in rv.items():
        for j, jv in rv.items():
            if i != j and i.intersection(j):
                m = i.union(j)
                del rv[i]
                del rv[j]
                rv.setdefault(m, [])
                rv[m] += iv
                rv[m] += jv
                return True

def unique(l):
    """flatten list-of-lists excluding duplicates"""
    seen = set()
    for i in sum(l, []):
        if i not in seen:
            seen.add(i)
            yield i

def condense(inp):
    rv.clear()
    inp = map(List, inp)
    for i in range(len(inp)):
        inp[i].order = i
        rv.setdefault(frozenset(inp[i]), [])
        rv[frozenset(inp[i])].append(inp[i])
    while condense_step():
        pass
    for v in rv.values():
        v.sort(key=lambda x: x.order)
    return [list(unique(i)) for i in rv.values()]
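The class List(list): pass trick above exists so each input sublist can carry an extra order attribute; plain list instances have no __dict__ and refuse arbitrary attributes. In isolation:

```python
class List(list):
    pass  # subclass only so instances can carry extra attributes

xs = List([3, 4])
xs.order = 1  # fine on the subclass

try:
    [3, 4].order = 1  # a plain list has no __dict__
except AttributeError:
    print("plain list refuses arbitrary attributes")
```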