How to remove duplicates in list of objects without __hash__

Asked by: BioGeek  Asked: 10/10/2017  Updated: 10/10/2017  Views: 518

Q:

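I have a list of custom objects from which I want to remove the duplicates. Normally, you would do this by defining both __eq__ and __hash__ for your objects and then taking the set of the list of objects. I have defined __eq__, but I can't figure out a good way to implement __hash__ so that it returns the same value for objects that are equal.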

More specifically, I have a class that is derived from the Tree class of the ete3 toolkit. I have defined two objects to be equal if their Robinson-Foulds distance is zero.

from ete3 import Tree

class MyTree(Tree):

    def __init__(self, *args, **kwargs):
        super(MyTree, self).__init__(*args, **kwargs)

    def __eq__(self, other):
        rf = self.robinson_foulds(other, unrooted_trees=True)
        return not bool(rf[0])

newicks = ['((D, C), (A, B),(E));',
           '((D, B), (A, C),(E));',
           '((D, A), (B, C),(E));',
           '((C, D), (A, B),(E));',
           '((C, B), (A, D),(E));',
           '((C, A), (B, D),(E));',
           '((B, D), (A, C),(E));',
           '((B, C), (A, D),(E));',
           '((B, A), (C, D),(E));',
           '((A, D), (B, C),(E));',
           '((A, C), (B, D),(E));',
           '((A, B), (C, D),(E));']

trees = [MyTree(newick) for newick in newicks]

print len(trees)       # 12
print len(set(trees))  # also 12, not what I want!

Both print len(trees) and print len(set(trees)) return 12, but this is not what I want, because several of the objects are equal to each other:

from itertools import product
for t1, t2 in product(newicks, repeat=2):
    if t1 != t2:
        mt1 = MyTree(t1)
        mt2 = MyTree(t2)
        if mt1 == mt2:
            print t1, '==', t2

returns:

((D, C), (A, B),(E)); == ((C, D), (A, B),(E));
((D, C), (A, B),(E)); == ((B, A), (C, D),(E));
((D, C), (A, B),(E)); == ((A, B), (C, D),(E));
((D, B), (A, C),(E)); == ((C, A), (B, D),(E));
((D, B), (A, C),(E)); == ((B, D), (A, C),(E));
((D, B), (A, C),(E)); == ((A, C), (B, D),(E));
((D, A), (B, C),(E)); == ((C, B), (A, D),(E));
((D, A), (B, C),(E)); == ((B, C), (A, D),(E));
((D, A), (B, C),(E)); == ((A, D), (B, C),(E));
((C, D), (A, B),(E)); == ((D, C), (A, B),(E));
((C, D), (A, B),(E)); == ((B, A), (C, D),(E));
((C, D), (A, B),(E)); == ((A, B), (C, D),(E));
((C, B), (A, D),(E)); == ((D, A), (B, C),(E));
((C, B), (A, D),(E)); == ((B, C), (A, D),(E));
((C, B), (A, D),(E)); == ((A, D), (B, C),(E));
((C, A), (B, D),(E)); == ((D, B), (A, C),(E));
((C, A), (B, D),(E)); == ((B, D), (A, C),(E));
((C, A), (B, D),(E)); == ((A, C), (B, D),(E));
((B, D), (A, C),(E)); == ((D, B), (A, C),(E));
((B, D), (A, C),(E)); == ((C, A), (B, D),(E));
((B, D), (A, C),(E)); == ((A, C), (B, D),(E));
((B, C), (A, D),(E)); == ((D, A), (B, C),(E));
((B, C), (A, D),(E)); == ((C, B), (A, D),(E));
((B, C), (A, D),(E)); == ((A, D), (B, C),(E));
((B, A), (C, D),(E)); == ((D, C), (A, B),(E));
((B, A), (C, D),(E)); == ((C, D), (A, B),(E));
((B, A), (C, D),(E)); == ((A, B), (C, D),(E));
((A, D), (B, C),(E)); == ((D, A), (B, C),(E));
((A, D), (B, C),(E)); == ((C, B), (A, D),(E));
((A, D), (B, C),(E)); == ((B, C), (A, D),(E));
((A, C), (B, D),(E)); == ((D, B), (A, C),(E));
((A, C), (B, D),(E)); == ((C, A), (B, D),(E));
((A, C), (B, D),(E)); == ((B, D), (A, C),(E));
((A, B), (C, D),(E)); == ((D, C), (A, B),(E));
((A, B), (C, D),(E)); == ((C, D), (A, B),(E));
((A, B), (C, D),(E)); == ((B, A), (C, D),(E));

So my questions are:

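  • What would be a good __hash__ implementation for my case, so that set(trees) works?
  • Or how can I remove equal objects from a list without having __hash__ defined?
python hash bioinformatics equality ete3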

Comments

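4 upvotes  Leon  10/10/2017
IMO, the safest way to address this and similar problems is to define a canonical representation for your data structure and then make the problematic operations work on the canonical representation.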
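
To make the canonical-representation idea concrete, here is a minimal sketch. It assumes trees are written as nested tuples of leaf names (a toy stand-in, not ete3 objects), and the helper names leaf_set and split_key are hypothetical. The key is the set of leaf bipartitions, which is the quantity an unrooted Robinson-Foulds comparison looks at, so two trees with the same leaves and distance zero should get the same key. With ete3 one would presumably collect the leaf names under each node instead (for example with traverse() and get_leaf_names()), but that part is not shown or tested here.

```python
def leaf_set(node):
    """Return the frozenset of leaf names below a nested-tuple node."""
    if isinstance(node, str):  # a leaf is just its name
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def split_key(tree):
    """Canonical, hashable key: the set of leaf bipartitions of the tree."""
    all_leaves = leaf_set(tree)
    splits = set()

    def visit(node):
        below = leaf_set(node)
        # Store each split as an unordered pair of sides, so that
        # child order and the position of the root do not matter.
        splits.add(frozenset([below, all_leaves - below]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return frozenset(splits)


# Toy stand-ins for three of the Newick strings in the question
t1 = (("D", "C"), ("A", "B"), ("E",))
t2 = (("B", "A"), ("C", "D"), ("E",))
t3 = (("D", "B"), ("A", "C"), ("E",))

print(split_key(t1) == split_key(t2))  # True: same splits, different order
print(split_key(t1) == split_key(t3))  # False: D pairs with B instead of C
```

Because the key is a frozenset, it can be hashed directly, so __hash__ could return hash(split_key(...)) or the keys could be used as dictionary keys.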
0 upvotes  quamrana  10/10/2017
If you can guarantee that each MyTree instance is initialised from a string, why not save that string and return its hash in your own __hash__() method?
2 upvotes  Leon  10/10/2017
@quamrana Two MyTree objects initialized from different strings (which have different hash values) can compare equal (as in the example).
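
The objection can be illustrated with a toy class (Word and CanonicalWord are hypothetical names, unrelated to ete3). A set only calls __eq__ on objects whose hashes already match, so hashing the raw input string lets equal objects slip through as distinct. As far as I can tell, the same mechanism explains the 12 in the question: in Python 2 a class that defines __eq__ but not __hash__ keeps the default identity-based hash.

```python
class Word(object):
    """Toy class: equal if the letters match in any order."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        return sorted(self.text) == sorted(other.text)

    def __hash__(self):
        # Inconsistent with __eq__: equal objects usually hash differently
        return hash(self.text)


class CanonicalWord(Word):
    """Same equality, but the hash is taken from a canonical form."""

    def __hash__(self):
        return hash(tuple(sorted(self.text)))


a, b = Word("abc"), Word("cba")
print(a == b)        # True
print(len({a, b}))   # 2 in practice: the differing hashes hide the equality

c, d = CanonicalWord("abc"), CanonicalWord("cba")
print(len({c, d}))   # 1: equal objects now share a hash
```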
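2 upvotes  Alex Hall  10/10/2017
Have you tried simply looping and removing duplicates? It's not the fastest algorithm, but depending on how many trees you have, it may not matter.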
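
A minimal sketch of that loop, assuming only that the objects implement __eq__ and that equality is transitive. The function name unique_by_eq and the toy class Mod3 are hypothetical; Mod3 stands in for MyTree so the example runs without ete3.

```python
def unique_by_eq(items):
    """Keep the first of each group of equal items, using only ==."""
    kept = []
    for item in items:
        if not any(item == seen for seen in kept):
            kept.append(item)
    return kept


class Mod3(object):
    """Toy stand-in for MyTree: equal when values agree modulo 3, no __hash__."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value % 3 == other.value % 3


unique = unique_by_eq([Mod3(v) for v in (0, 3, 1, 4, 2, 6)])
print([m.value for m in unique])  # [0, 1, 2]
```

In the worst case this makes n(n-1)/2 comparisons, and for MyTree each comparison is a full robinson_foulds call, so it is likely only practical for modest list sizes.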
2 upvotes  bli  10/12/2017
In the end I did this (using frozensets directly, so no need to define a custom hash): stackoverflow.com/a/46707857/1878788
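
The linked answer is not reproduced here, so the following is only a guess at the general shape of a frozenset-based approach, not its actual code. It assumes each tree can be reduced to a frozenset key; unique_by_key and clade_key are hypothetical names, and the toy trees are plain lists of clades rather than ete3 objects.

```python
def unique_by_key(items, key):
    """Keep the first item for each distinct hashable key, preserving order."""
    seen = set()
    kept = []
    for item in items:
        k = key(item)
        if k not in seen:
            seen.add(k)
            kept.append(item)
    return kept


def clade_key(tree):
    """Order-insensitive key: a frozenset of frozensets of leaf names."""
    return frozenset(frozenset(clade) for clade in tree)


# Toy data: each "tree" is a list of clades, each clade a list of leaf names
toy_trees = [
    [["D", "C"], ["A", "B"]],
    [["B", "A"], ["C", "D"]],
    [["D", "B"], ["A", "C"]],
]

print(len(unique_by_key(toy_trees, clade_key)))  # 2
```

Since the keys themselves are hashable, the objects never need a __hash__ of their own, and the pass over the list is roughly linear in the number of trees.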

A: No answers yet