With Python 3, what is the fastest way to find the biggest digit in a string?

Question:

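I am working with SMILES, a string representation of molecules that uses matching pairs of digits to denote rings. My dataset is relatively large, 1.9 million strings, and I have a data augmentation procedure (SMILES randomization) that can find up to 30 thousand unique equivalent SMILES (= strings) for each original SMILES. I end up with a very large number of strings to test (slightly under 100 billion). I want the fastest way to find the biggest digit used in a SMILES (string). Some SMILES features can be seen as potential edge cases: the digit "0" is always excluded, and so far I have not found a digit higher than 6, but in theory it can go up to 9. Some SMILES (strings) have no rings and therefore no digits.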

I tried the following 2 functions:

def find_biggest_digits_1(smiles=str()):
    """
    find the biggest digit by iterating through the SMILES: test whether each character is a digit, cast it to an integer, and keep it if it is bigger than the current max digit
    """
    ### instantiate local variable
    max_digit=0 

    ### iterate through SMILES

    for character in smiles:
        if character.isdigit():
            digit= int(character)
            if digit > max_digit:
                max_digit= digit
            else:
                pass
        else:
            pass
    
    return max_digit

def find_biggest_digits_2(smiles=str()):
    """
    find the biggest digit by iterating over the possible max digits, from 9 down to 1, and testing whether the corresponding character is found in the string
    """
    ### instantiate local variable
    max_digit=0 

    ### iterate through max digits possible solution
    for i in range(1,10):
        digit=10-i
        if smiles.find(str(digit))!=-1:
            max_digit=digit
            break
        else:
            pass
    
    return max_digit

When I take a random SMILES from the dataset, I get the following results:

%%timeit
find_biggest_digits_1(smiles="COC(=O)c1cc(NC(=O)Nc2cnNc2)c(F)cc1F")

2.14 µs ± 64 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
find_biggest_digits_2(smiles="COC(=O)c1cc(NC(=O)Nc2cnNc2)c(F)cc1F")

1.96 µs ± 78.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
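Outside IPython, measurements like the ones above can be approximated with the standard-library timeit module. This is a minimal sketch, assuming a trimmed-down copy of find_biggest_digits_2; the repeat and number values are arbitrary choices, the lambda wrapper adds a little call overhead, and absolute timings will differ from machine to machine.

```python
import timeit


def find_biggest_digits_2(smiles=""):
    # Trimmed copy of the second function: probe for '9' down to '1'
    # and return the first digit that occurs in the string.
    for digit in range(9, 0, -1):
        if smiles.find(str(digit)) != -1:
            return digit
    return 0


SMILES = "COC(=O)c1cc(NC(=O)Nc2cnNc2)c(F)cc1F"

# 5 runs of 10000 calls each (arbitrary values, smaller than %%timeit's defaults).
N_CALLS = 10_000
runs = timeit.repeat(lambda: find_biggest_digits_2(SMILES), repeat=5, number=N_CALLS)

# Convert the total time of each run into microseconds per call.
per_call_us = [t / N_CALLS * 1e6 for t in runs]
print(f"best of {len(runs)} runs: {min(per_call_us):.2f} µs per call")
```

Taking the minimum of the runs is a common way to reduce noise from other processes, whereas %%timeit reports the mean and standard deviation.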

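Function 2 is slightly faster than function 1, and this is more or less confirmed when I use multiprocessing on the whole dataset without the data augmentation step (1.9 million strings). For the bravest souls who want to reproduce the multiprocessing run, I used 10 cores and the concatenation of the train, test and scaffold_test datasets from MOSES (https://github.com/molecularsets/moses).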

%%timeit
with mp.Pool(mp.cpu_count()-2) as pool:
    results= pool.map(find_biggest_digits_1,df["sanitized_smiles"])
    pool.close()
    pool.join()

1.37 s ± 69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
with mp.Pool(mp.cpu_count()-2) as pool:
    results= pool.map(find_biggest_digits_2,df["sanitized_smiles"])
    pool.close()
    pool.join()

1.34 s ± 50.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
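The two cells above rely on names defined elsewhere in the notebook (mp and df). A self-contained sketch of the same pattern might look like the following, assuming mp is the multiprocessing module and replacing the DataFrame column with a plain list; the names find_biggest_digit and biggest_digits_parallel are hypothetical.

```python
import multiprocessing as mp


def find_biggest_digit(smiles):
    # Stand-in worker (hypothetical name), a trimmed copy of the first
    # function: scan every character and keep the largest digit seen.
    max_digit = 0
    for character in smiles:
        if character.isdigit():
            digit = int(character)
            if digit > max_digit:
                max_digit = digit
    return max_digit


def biggest_digits_parallel(smiles_list, processes=2):
    # Pool.map splits the input into chunks automatically when no
    # chunksize is given and returns the results in input order.
    # The with-block cleans up the pool on exit.
    with mp.Pool(processes) as pool:
        return pool.map(find_biggest_digit, smiles_list)
```

When run as a script, the call to biggest_digits_parallel should sit under an if __name__ == "__main__": guard, because on platforms that start workers by spawning a new interpreter the main module is imported again in each worker.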

Edit: solutions from the answers:

def find_biggest_digits_1_b(smiles=str()):
    """
    find the biggest digit by iterating through the SMILES: test whether each character is a digit, cast it to an integer, keep it if it is bigger than the current max digit, and stop early once 9 is reached
    """
    ### instantiate local variable
    max_digit=0 

    ### iterate through SMILES

    for character in smiles:
        if character.isdigit():
            digit= int(character)
            if digit > max_digit:
                max_digit= digit
                if digit>=9:
                    break
     
    return max_digit

Test with the random SMILES:

%%timeit
find_biggest_digits_1_b(smiles="COC(=O)c1cc(NC(=O)Nc2cnNc2)c(F)cc1F")

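2.12 µs ± 30.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Multiprocessing with 1.9 million strings: 1.42 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It may be faster for edge cases, but the overall performance is similar to the original solution.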

Second proposed solution:

def find_biggest_digits_3(smiles=str()):
    """
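    find the biggest digit with a generator expression inside max(); try/except handles SMILES with no ring (no digit)
    """
    try:
        # note: max() returns the digit as a one-character str here, not an int
        return max(c for c in smiles if c.isdigit())
    except ValueError:
        # max() raises ValueError on an empty sequence, i.e. no digit in the string
        return 0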

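Tests:
Random SMILES (same as before): 2.51 µs ± 83.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Multiprocessing with 1.9 million strings: 1.5 s ± 92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This solution is more concise but a bit slower than the others.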
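A variant of the same idea avoids the try/except altogether by passing a default to max(), which is returned when the iterable is empty. This is only a sketch: the name find_biggest_digits_3_default is hypothetical, it has not been timed against the functions above, and the int() call is an addition so that the return type is always an integer.

```python
def find_biggest_digits_3_default(smiles=""):
    # max() returns `default` for an empty iterable, so a ring-free
    # SMILES yields "0" instead of raising ValueError.
    # The characters '0'..'9' compare in numeric order, so taking
    # max() over the characters picks the largest digit.
    return int(max((c for c in smiles if c.isdigit()), default="0"))


print(find_biggest_digits_3_default("COC(=O)c1cc(NC(=O)Nc2cnNc2)c(F)cc1F"))  # 2
print(find_biggest_digits_3_default("CCO"))  # 0
```

The extra parentheses around the generator expression are required because max() is given a second argument.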

Tags: string, performance, bigdata, digits

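Comments:

Solution 1 could be faster: remove the else/pass statements, and use break when max_digit >= 9 after confirming digit > max_digit. In the worst case solution 1 still iterates over the whole string, but now it stops as soon as max_digit reaches the highest possible value. Solution 2 could be faster if you do it in reverse: start at 9 and work backwards. That way solution 2 can also finish early in some cases, but you may still end up iterating over the string several times. I expect that with these changes solution 1 will be faster, since both traverse the same distance.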

0 upvotes, larsks, 6/27/2023

I think a simpler implementation of find_biggest_digit is max(c for c in smiles if c.isdigit()). You can wrap it in a try/except block to catch the ValueError when there are no digits.

0 upvotes, Etienne Reboul, 6/27/2023

I have tested both solutions: the first gives the same overall performance, and the second is more concise but a bit slower.
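The first comment's point about solution 2 scanning the string several times can be made concrete with a small sketch. The variant below (hypothetical name find_biggest_digits_2_b) probes from '9' down to '1' with the in operator, which drops the str() conversion and the range arithmetic; whether that is measurably faster on this dataset would need to be timed.

```python
def find_biggest_digits_2_b(smiles=""):
    # Probe from the largest candidate down; the first hit is the answer.
    # Each failed `in` test scans the whole string, so a SMILES whose
    # biggest digit is 6 is fully scanned three times (for 9, 8 and 7)
    # before the search for '6' succeeds.
    for candidate in "987654321":
        if candidate in smiles:
            return int(candidate)
    # No ring digits at all: nine full scans, then 0.
    return 0
```

For a ring-free SMILES this performs nine full scans, which is the worst case for this approach; the character-by-character loop of solution 1 always performs exactly one pass.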

Answers: no answers yet
