Asked by: Gerald Vasquez Aleman Asked: 8/27/2023 Updated: 8/28/2023 Views: 42
Filter similar DNA sequences using R
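Q:
I have two tables of DNA sequences, one from an early time point (TIMEPOINT_1) and one from a later time point (TIMEPOINT_2). I want to keep the sequences in the TIMEPOINT_2 table that match a sequence in the TIMEPOINT_1 table at a similarity threshold of 95%. I tried using the "stringdistmatrix" function to build a similarity matrix, but I did not get the expected result. Is there a way to do this in R?
Here is an example of the table structure: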
# Creating df TIMEPOINT_1
sequences <- c(
"ACCTTCAGGCAACCTTCAGGCA",
"ACCTTCGAGCAGCCATCAGGCA",
"ACCCGTCCTAGGATCGATCAGGCA",
"TCGAAGTGCATGCATGCTTACGTA",
"CGTGCAAAGCGTGACGTTAGCGT")
sequence_names <- c("time1_seq1", "time1_seq2", "time1_seq3", "time1_seq4", "time1_seq5")
TIMEPOINT_1 <- data.frame(name = sequence_names, sequence = sequences)
# Creating df TIMEPOINT_2
sequences <- c(
"ACCTTCGGGCAACCTTCAGGCA",
"ACCTTCGTGCGGGCCATCAGGCA",
"ACCCGTCCTAGGATCGATCAGGCA",
"TCGAAGTGCATGCATGCTTAAGTA",
"CGTGCAAAGCGTGACTGCACGTGGT")
sequence_names <- c("time2_seq1", "time2_seq2", "time2_seq3", "time2_seq4", "time2_seq5")
TIMEPOINT_2 <- data.frame(name = sequence_names, sequence = sequences)
Expected result: the TIMEPOINT_2 table, containing only the sequences that match a sequence in the TIMEPOINT_1 table.
A:
0 votes
Matt B
8/28/2023
#1
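If I understand your goal correctly, I would do a simple inner merge (note that this keeps only sequences that are identical in both tables):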
df <- merge(TIMEPOINT_1, TIMEPOINT_2, by = "sequence", all = FALSE) # all = FALSE gives an inner merge
df
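Since the merge matches on exact equality, a 95% threshold needs a distance-based comparison instead. Here is a minimal sketch of one way to do that with base R's adist() (Levenshtein distance), so no extra package is required. It assumes that "similarity" means 1 minus the edit distance divided by the length of the longer sequence; that definition, and the names similarity, threshold, keep and TIMEPOINT_2_filtered, are my own choices for illustration and not something the question specifies. A proper alignment-based identity could give different results.

```r
# Sample data copied from the question
TIMEPOINT_1 <- data.frame(
  name = paste0("time1_seq", 1:5),
  sequence = c("ACCTTCAGGCAACCTTCAGGCA",
               "ACCTTCGAGCAGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA",
               "TCGAAGTGCATGCATGCTTACGTA",
               "CGTGCAAAGCGTGACGTTAGCGT"),
  stringsAsFactors = FALSE
)
TIMEPOINT_2 <- data.frame(
  name = paste0("time2_seq", 1:5),
  sequence = c("ACCTTCGGGCAACCTTCAGGCA",
               "ACCTTCGTGCGGGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA",
               "TCGAAGTGCATGCATGCTTAAGTA",
               "CGTGCAAAGCGTGACTGCACGTGGT"),
  stringsAsFactors = FALSE
)

# Levenshtein distance: rows are TIMEPOINT_2 sequences, columns are TIMEPOINT_1
d <- adist(TIMEPOINT_2$sequence, TIMEPOINT_1$sequence)

# Length of the longer sequence for every pair (assumed normalisation)
max_len <- outer(nchar(TIMEPOINT_2$sequence), nchar(TIMEPOINT_1$sequence), pmax)

# Assumed similarity definition: 1 - distance / longer length
similarity <- 1 - d / max_len

# Keep TIMEPOINT_2 rows that reach the threshold against any TIMEPOINT_1 sequence
threshold <- 0.95
keep <- apply(similarity >= threshold, 1, any)
TIMEPOINT_2_filtered <- TIMEPOINT_2[keep, ]
TIMEPOINT_2_filtered
```

With sequences this short (22 to 25 bases), a 95% threshold tolerates only a single edit, so the cutoff may need adjusting for real data. The stringdist package's stringdistmatrix(a, b, method = "lv") should produce a comparable distance matrix if you prefer to stay with that function.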
Comments
stringdist
fuzzyjoin::stringdist_*