Filter similar DNA sequences using R

Asked by: Gerald Vasquez Aleman Asked: 8/27/2023 Updated: 8/28/2023 Views: 42

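Q:

I have two tables containing DNA sequences from an early time point (TIMEPOINT_1) and a later time point (TIMEPOINT_2). I want to keep the sequences in the TIMEPOINT_2 table that match a sequence in the TIMEPOINT_1 table at a similarity threshold of 95%. I tried using the "stringdistmatrix" function to build a similarity matrix, but I did not get the expected result. Is there a way to do this in R?

Here is an example of the table structure: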

# Creating df TIMEPOINT_1
sequences <- c(
  "ACCTTCAGGCAACCTTCAGGCA",
  "ACCTTCGAGCAGCCATCAGGCA",
  "ACCCGTCCTAGGATCGATCAGGCA",
  "TCGAAGTGCATGCATGCTTACGTA",
  "CGTGCAAAGCGTGACGTTAGCGT")
sequence_names <- c("time1_seq1", "time1_seq2", "time1_seq3", "time1_seq4", "time1_seq5")
TIMEPOINT_1 <- data.frame(name = sequence_names, sequence = sequences)

# Creating df TIMEPOINT_2
sequences <- c(
  "ACCTTCGGGCAACCTTCAGGCA",
  "ACCTTCGTGCGGGCCATCAGGCA",
  "ACCCGTCCTAGGATCGATCAGGCA",
  "TCGAAGTGCATGCATGCTTAAGTA",
  "CGTGCAAAGCGTGACTGCACGTGGT")
sequence_names <- c("time2_seq1", "time2_seq2", "time2_seq3", "time2_seq4", "time2_seq5")
TIMEPOINT_2 <- data.frame(name = sequence_names, sequence = sequences)

Expected result: the TIMEPOINT_2 table restricted to the sequences that match a sequence in the TIMEPOINT_1 table.

r sequence similarity

Comments

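0 votes r2evans 8/27/2023
This comes up on SO often enough; please search for stringdist and fuzzyjoin::stringdist_*
1 vote Chris 8/27/2023
which(stringdist::stringsim(sequences1, sequences2) >= 0.950) returns [1] 1 3 4. Without which(, it returns the similarity vector, [1] 0.9545455 0.8695652 1.0000000 0.9583333 0.8000000. I just renamed the two vectors to sequences1 and sequences2 to tell them apart; their order in stringsim(a, b) or stringsim(b, a) does not change the comparison.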
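
The stringsim call above compares the two vectors position by position (row 1 with row 1, row 2 with row 2, and so on). If every TIMEPOINT_2 sequence should instead be checked against every TIMEPOINT_1 sequence, something along these lines might work. This is only a sketch under a few assumptions: it swaps in base R's utils::adist (plain Levenshtein distance) for stringdist so that no extra package is needed, it normalises by the longer sequence of each pair, and the names t1, t2, sim, best and matched are made up for illustration.

```r
# Example sequences copied from the question (t1/t2 are illustrative names)
t1 <- data.frame(
  name = paste0("time1_seq", 1:5),
  sequence = c("ACCTTCAGGCAACCTTCAGGCA", "ACCTTCGAGCAGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA", "TCGAAGTGCATGCATGCTTACGTA",
               "CGTGCAAAGCGTGACGTTAGCGT"),
  stringsAsFactors = FALSE
)
t2 <- data.frame(
  name = paste0("time2_seq", 1:5),
  sequence = c("ACCTTCGGGCAACCTTCAGGCA", "ACCTTCGTGCGGGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA", "TCGAAGTGCATGCATGCTTAAGTA",
               "CGTGCAAAGCGTGACTGCACGTGGT"),
  stringsAsFactors = FALSE
)

# Levenshtein distance for every pair: rows = TIMEPOINT_2, columns = TIMEPOINT_1
d <- utils::adist(t2$sequence, t1$sequence)

# Turn distances into similarities, dividing by the longer sequence of each pair
max_len <- outer(nchar(t2$sequence), nchar(t1$sequence), pmax)
sim <- 1 - d / max_len

# Keep TIMEPOINT_2 rows whose best TIMEPOINT_1 match reaches the threshold
threshold <- 0.95
best <- apply(sim, 1, max)
keep <- best >= threshold

matched <- t2[keep, ]
matched$best_match <- t1$name[apply(sim, 1, which.max)][keep]
matched$similarity <- best[keep]
matched
```

On the example data this keeps time2_seq1, time2_seq3 and time2_seq4. For long tables the full distance matrix grows as the product of the two table sizes, so a dedicated alignment tool may be a better fit there.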

A:

0 votes Matt B 8/28/2023 #1

If I understood your goal correctly, I would do a simple inner merge:

df <- merge(TIMEPOINT_1, TIMEPOINT_2, by = "sequence", all = FALSE) ## all = FALSE gives an inner merge
df
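
One thing worth keeping in mind is that merging on sequence only pairs rows whose sequences are character-for-character identical, so it covers the 100% case rather than a 95% threshold. A short check on the question's data, rebuilt here so the block runs on its own; t1, t2, exact and the suffixes are names chosen just for this sketch:

```r
# Example sequences copied from the question (t1/t2 are illustrative names)
t1 <- data.frame(
  name = paste0("time1_seq", 1:5),
  sequence = c("ACCTTCAGGCAACCTTCAGGCA", "ACCTTCGAGCAGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA", "TCGAAGTGCATGCATGCTTACGTA",
               "CGTGCAAAGCGTGACGTTAGCGT"),
  stringsAsFactors = FALSE
)
t2 <- data.frame(
  name = paste0("time2_seq", 1:5),
  sequence = c("ACCTTCGGGCAACCTTCAGGCA", "ACCTTCGTGCGGGCCATCAGGCA",
               "ACCCGTCCTAGGATCGATCAGGCA", "TCGAAGTGCATGCATGCTTAAGTA",
               "CGTGCAAAGCGTGACTGCACGTGGT"),
  stringsAsFactors = FALSE
)

# Inner merge: only sequences present verbatim in both tables survive
exact <- merge(t1, t2, by = "sequence", all = FALSE,
               suffixes = c("_t1", "_t2"))
exact
```

Here only the third sequence is shared verbatim, so the result has a single row pairing time1_seq3 with time2_seq3; near matches such as time2_seq1 and time2_seq4 are dropped.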