CS50 Week 6 DNA program incorrectly identifies DNA sequence


Asked by: Kaley Asked: 11/6/2023 Last edited by: Kaley Updated: 11/10/2023 Views: 60

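Q:

My code kind of works, except that it is selective about what it works on. It gives the correct name for certain sequences, but for others it gets it wrong.

For example, it correctly identifies that a strand belongs to Bob, but it matches a strand that is supposed to be "No match" to "Charlie", who isn't even in the list cs50 gave us.

It's really weird. I have cross-checked my code against other people's and they seem mostly similar. I don't know why this is happening and would appreciate some help.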

import csv
import sys

def main():

    # TODO: Check for command-line usage
    if len(sys.argv) != 3:
        sys.exit("Usage: python dna.py data.csv sequence.txt")

    # TODO: Read database file into a variable
    database = []

    with open(sys.argv[1], 'r') as file:
        reader = csv.DictReader(file)

        for row in reader:
            database.append(row)
 
    # TODO: Read DNA sequence file into a variable
    with open(sys.argv[2], 'r') as file:
        dna_sequence = file.read()

    # TODO: Find longest match of each STR in DNA sequence
    subsequences = list(database[0].keys())[1:]

    results = {}
    for subsequence in subsequences:
        match = 0
        results[subsequence] = longest_match(dna_sequence, subsequence)
        match += 1

    # TODO: Check database for matching profiles
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        
            if match == len(subsequence):
                print(person["name"])
                return 

    print("No match")
    return


def longest_match(sequence, subsequence):
    """Returns length of longest run of subsequence in sequence."""

    # Initialize variables
    longest_run = 0
    subsequence_length = len(subsequence)
    sequence_length = len(sequence)

    # Check each character in sequence for most consecutive runs of subsequence
    for i in range(sequence_length):

        # Initialize count of consecutive runs
        count = 0

        # Check for a subsequence match in a "substring" (a subset of characters) within sequence
        # If a match, move substring to next potential match in sequence
        # Continue moving substring and checking for matches until out of consecutive matches
        while True:

            # Adjust substring start and end
            start = i + count * subsequence_length
            end = start + subsequence_length

            # If there is a match in the substring
            if sequence[start:end] == subsequence:
                count += 1
        
            # If there is no match in the substring
            else:
                break
    
        # Update most consecutive matches found
        longest_run = max(longest_run, count)

    # After checking for runs at each character in sequence, return longest run found
    return longest_run

main()
python cs50 dna sequence



A:

0 votes kcw78 11/10/2023 #1

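Are you still working on this? If so, there are 2 databases and 20 sequences to test. (They are listed with the correct answers at the end of the DNA PSET.) Which one gives you the error above? I suspect it is the third test. It is shown as:
python dna.py databases/small.csv sequences/3.txt
Your program should output No match.

When I run this, your program outputs Charlie instead of No match.
The subsequences you need to check are:
['AGATC', 'AATG', 'TATC']
Your subsequence counts are:
{'AGATC': 3, 'AATG': 3, 'TATC': 5}

This does not match anyone in the small.csv file.
Charlie is close, but his DNA subsequence counts are:
('AGATC', '3'), ('AATG', '2'), ('TATC', '5')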
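To make the mismatch concrete, here is a small sketch that replays the question's matching loop on the values quoted above. It is an illustration under assumptions, not code from the problem set: the rows for Alice and Bob are assumed from the standard small.csv (only Charlie's row is quoted here), and buggy_match is a hypothetical wrapper around the loop from the question.

```python
# Rows as csv.DictReader would yield them (all values are strings).
# Assumption: Alice's and Bob's rows come from the standard small.csv;
# Charlie's row is the one quoted above.
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},
]
subsequences = ["AGATC", "AATG", "TATC"]
results = {"AGATC": 3, "AATG": 3, "TATC": 5}  # counts quoted above


def buggy_match(database, subsequences, results):
    """Replay the matching logic from the question (hypothetical wrapper)."""
    # The question's first loop leaves match at 1 when it finishes.
    match = 1
    for person in database:
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1  # never reset, so it accumulates across people
            # Compares with the length of the STR string (5 or 4),
            # not with the number of STRs (3).
            if match == len(subsequence):
                return person["name"]
    return "No match"


# People whose counts all equal the measured counts: nobody.
exact = [
    person["name"]
    for person in database
    if all(int(person[s]) == results[s] for s in subsequences)
]
print(exact)                                         # []
print(buggy_match(database, subsequences, results))  # Charlie
```

The counter picks up one hit from Bob (TATC) and two from Charlie (AGATC and TATC). Added to the leftover 1, that gives 4, which happens to equal len("TATC").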

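The error occurs when you compare each person against the subsequence counts. There are 3 things to fix:

  1. The value of match is set in the previous loop (for subsequence in subsequences:). It needs to be set in the for person in database: loop.
  2. The indentation of the line that tests match needs to change. (It is currently inside the second for subsequence in subsequences: loop.)
  3. You are testing match against len(subsequence). Think about it...

I made these changes and it works for all 4 small.csv tests and the 3 large.csv tests I tried.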
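As a rough sketch of what those three changes might look like (not necessarily the exact edits described above), the comparison can be pulled into a helper. The name find_match is hypothetical, and the sample rows are assumed from the standard small.csv.

```python
def find_match(database, subsequences, results):
    """Return the first person whose STR counts all match, else None."""
    for person in database:
        match = 0  # change 1: reset the counter for every person
        for subsequence in subsequences:
            if int(person[subsequence]) == results[subsequence]:
                match += 1
        # change 2: test once, after the inner loop has finished
        # change 3: compare with the number of STRs, not an STR's length
        if match == len(subsequences):
            return person["name"]
    return None


# Assumed sample rows in the shape csv.DictReader produces (string values).
database = [
    {"name": "Alice", "AGATC": "2", "AATG": "8", "TATC": "3"},
    {"name": "Bob", "AGATC": "4", "AATG": "1", "TATC": "5"},
    {"name": "Charlie", "AGATC": "3", "AATG": "2", "TATC": "5"},
]
subsequences = ["AGATC", "AATG", "TATC"]

# Counts quoted for the third test: nobody matches.
name = find_match(database, subsequences, {"AGATC": 3, "AATG": 3, "TATC": 5})
print(name if name is not None else "No match")  # No match

# Counts equal to Bob's row: Bob matches.
print(find_match(database, subsequences, {"AGATC": 4, "AATG": 1, "TATC": 5}))  # Bob
```

In main(), the returned name would be printed, with "No match" printed when the helper returns None.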