How to flatten nested structure when reading delimited text file?

Asked by: Jason Grotto Asked: 10/17/2023 Last edited by: jay.sf Updated: 10/17/2023 Views: 91

Q:

I have a delimited text file that looks like this:

AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|

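I've been struggling to transform it (using for loops, etc.) so that each ABV record becomes a single row that combines the AAA record above it and all the ABP records below it, so that it looks like this: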

AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|

But I can't seem to get it to work. Any assistance would be greatly appreciated.

r for-loop nested

A:

1 upvote Sirius 10/17/2023 #1

Here is one way:


library(stringr)
library(zoo)

txt <- 
"AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
"

txt <- str_replace_all(txt,regex("\n(?=ABP)"),"")

l <- setdiff(str_split_1(txt,"\n"),"")

aaa <- str_match(l,"^AAA.*") |> na.locf()
abv <- str_match(l,"^ABV.*")

i <- !is.na(abv)

l2 <- paste( aaa[i], abv[i], sep="" )

l3 <- paste(l2,collapse="\n")
cat(l3,"\n")

# save it to a file
cat(l3,"\n", file="outfile.txt")

Output:

AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E| 

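In short, this is what it does:

  • Removes the newline before each ABP
  • Creates a vector of the same size, carrying each AAA line forward to the following lines until a new AAA appears
  • Picks out the ABV lines (which now also contain the ABPs)
  • Joins the list of AAA strings with the list of ABV strings
  • Puts them all back together into one file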
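
The carry-forward step (the second bullet) can also be sketched without zoo. This is a base-R illustration on made-up toy lines, not the code from the answer, and it assumes the first line is an AAA record:

```r
# Toy lines (made up for illustration): two AAA headers with other records between them
l <- c("AAA|h1|", "ABV|a|", "ABV|b|", "AAA|h2|", "ABV|c|")

is_aaa <- startsWith(l, "AAA")

# cumsum(is_aaa) counts how many AAA lines have been seen so far,
# so it indexes the most recent AAA line for every position
aaa <- l[is_aaa][cumsum(is_aaa)]
print(aaa)
```

Every position in aaa then holds the header that applies to the line at the same position in l, which is what the answer obtains with na.locf().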

3 upvotes Ritchie Sacramento 10/17/2023 #2

If we count the number of fields on each line, separated by |, we can observe the following pattern:

(fields <- count.fields(textConnection(txt), sep = "|"))
[1] 11  7  6  6  6  7  6  6  6  7  6  6  6 11  7  6  6  6  7  6  6  6  7  6  6  6

Assuming the pattern holds in your data, we can use the following to join the lines:

text <- readLines(textConnection(txt))

l1 <- fields == 11L
l2 <- fields == 7L
l3 <- fields == 6L

l2_id = cumsum(l2)

paste0(unlist(Map(paste, text[l1], split(text[l2], cumsum(l1)[l2]), sep = "")), tapply(text[l3], l2_id[l3], paste, collapse = ""))

[1] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|"  
[2] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
[3] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|" 
[4] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"  
[5] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
[6] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
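
Because this approach depends on the field-count pattern, it may be worth checking that assumption before joining. A minimal sketch on a shortened, made-up sample (the names sample_txt, con and ok are hypothetical, not part of the answer); it also shows what the cumsum() run ids look like:

```r
# Shortened, made-up sample standing in for the real data
sample_txt <- "AAA|x|D|t|CD|UKDC|PB|PORTAL|1|OPER|
ABV|E_A|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|"

con <- textConnection(sample_txt)
fields <- count.fields(con, sep = "|")
close(con)

# Every line should have 11 (AAA), 7 (ABV) or 6 (ABP) fields
ok <- all(fields %in% c(11L, 7L, 6L))

# Run ids: each 7-field line starts a new group, and the lines after it inherit its id
l2_id <- cumsum(fields == 7L)

print(fields)
print(ok)
print(l2_id)
```

If ok is FALSE, some line does not fit the 11/7/6 pattern and the joined output should not be trusted without further inspection.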

Data:

txt <- "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|"

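Comments

0 upvotes Chris 10/17/2023
To check the regularity of the sequence assumption, max_seq(fields) (@florian) is 13. This is a nice approach. Could you expand on l2_id = cumsum(l2)?
0 upvotes Ritchie Sacramento 10/17/2023
@Chris - Sure, it's just a way to assign a unique ID to each run of values whenever a 7 is encountered, i.e. c(11, 7, 6, 6, 6, 7, 6, 6, 6) becomes c(0, 1, 1, 1, 1, 2, 2, 2, 2).
0 upvotes Chris 10/17/2023
What indicates that choosing l2 for id'ing, rather than l1 (or l3), marks the start of max_seq(), or is it just 'deriving an index'? I wrongly saw a relationship between yours above and max_seq, since one says how long and the other says which sequence, but we often see cumsum used to derive indices... all that said, as far as I can tell.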
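
0 upvotes jay.sf 10/17/2023 #3

First, use grep to find the 'AAA' positions (this may be faster in bash), then scan while skipping to each position, count the 'ABV's, and prepend the header. Finally, unlist.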

> pos <- system('grep -n -o AAA foo.txt | cut -d: -f1', intern=TRUE)
> # pos <- grep('AAA', readLines('foo.txt'))  ## alternatively just using R
> lapply(seq_along(pos), \(i) {
+   .skip <- (as.numeric(pos) - 1L)[i]
+   .nmax <- as.numeric(pos[i + 1L]) - .skip - 1L
+   r <- scan('foo.txt', what=character(), skip=.skip, nmax=.nmax, quiet=TRUE)
+   len <- sum(grepl('ABV', r))
+   sapply((seq_len(len) - 1L)*4L, \(j) 
+          paste(r[1], paste(r[2:5 + j], collapse=''), sep=''))
+ }) |> unlist()
 [1] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|"  
 [2] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
 [3] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|" 
 [4] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"  
 [5] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
 [6] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
 [7] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-2|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
 [8] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"  
 [9] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
[10] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-2|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"            
[11] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
[12] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-2|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"

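This assumes that entries start with 'AAA' and that each sub-entry has length 4 (one ABV line plus three ABP lines). This can be adjusted as needed.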
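
One way to relax the fixed-length assumption is to group by record prefix instead of by position. The following is a base-R sketch on made-up toy lines, not the code from this answer; it assumes the input starts with an AAA line and that every ABP line belongs to the nearest preceding ABV:

```r
# Toy lines (made up): ABV groups with differing numbers of ABP lines
lines <- c("AAA|h1|", "ABV|a|", "ABP|1|", "ABP|2|",
           "ABV|b|", "ABP|1|",
           "AAA|h2|", "ABV|c|", "ABP|1|")

is_aaa <- startsWith(lines, "AAA")
is_abv <- startsWith(lines, "ABV")

# Header that applies to each line: the most recent AAA line
aaa <- lines[is_aaa][cumsum(is_aaa)]

# Group id: each ABV starts a new group, following ABP lines share it
abv_id <- cumsum(is_abv)

# Collapse every ABV line together with the ABP lines that follow it
keep <- !is_aaa
body <- tapply(lines[keep], abv_id[keep], paste, collapse = "")

# Prepend the matching AAA header to each collapsed group
flat <- paste0(aaa[is_abv], body)
print(flat)
```

For a real file, lines could come from readLines('foo.txt'); the grouping logic stays the same regardless of how many ABP lines follow each ABV.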


Data:

AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
ABV|E_ASLVW-2|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASHWW-2|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
ABV|E_ASLVW-2|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|