Asked by: Jason Grotto  Asked: 10/17/2023  Last edited by: jay.sf, Jason Grotto  Updated: 10/17/2023  Views: 91
How to flatten nested structure when reading delimited text file?
Q:
I have a delimited text file that looks like this:
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
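I've been struggling to transform it (using for loops, etc.) so that each ABV record becomes one row, prefixed by the AAA record above it and followed by all of the ABP records below it, so that it looks like this: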
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|
But I can't seem to get it to work. Any assistance would be greatly appreciated.
A:
1 upvote
Sirius
10/17/2023
#1
Here is one approach:
library(stringr)
library(zoo)
txt <-
"AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
"
txt <- str_replace_all(txt,regex("\n(?=ABP)"),"")
l <- setdiff(str_split_1(txt,"\n"),"")
aaa <- str_match(l,"^AAA.*") |> na.locf()
abv <- str_match(l,"^ABV.*")
i <- !is.na(abv)
l2 <- paste( aaa[i], abv[i], sep="" )
l3 <- paste(l2,collapse="\n")
cat(l3,"\n")
# save it to a file
cat(l3,"\n", file="outfile.txt")
Output:
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|
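In short, this is what it does:
- Removes the newline before each ABP line
- Creates a vector of the same size in which each AAA line is carried forward to the following lines until a new AAA line appears
- Picks out the ABV lines (which now also contain the ABPs)
- Joins the vector of AAA strings with the vector of ABV strings
- Puts them all back together and writes them to a file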
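If you would rather avoid the stringr and zoo dependencies, the same steps can be sketched in base R. This is only a sketch under stated assumptions: flatten_records is a hypothetical helper name (not part of the answer above), the input is assumed to start with an AAA line, and every ABP line is assumed to come after an ABV line.

```r
# Hypothetical base-R helper mirroring the steps listed above.
# Assumes: the first non-empty line is an AAA record, and every
# ABP record follows an ABV record.
flatten_records <- function(lines) {
  lines <- lines[nzchar(lines)]            # drop empty lines
  is_aaa <- startsWith(lines, "AAA")
  is_abv <- startsWith(lines, "ABV")
  stopifnot(length(lines) > 0, is_aaa[1])  # guard the stated assumption

  # Carry each AAA line forward: index of the most recent AAA line
  aaa_idx <- which(is_aaa)[cumsum(is_aaa)]

  # Glue every ABV line to the ABP lines that follow it
  grp <- cumsum(is_abv)
  body <- vapply(split(lines[!is_aaa], grp[!is_aaa]),
                 paste, character(1), collapse = "")

  # Prefix each ABV block with its AAA header
  paste0(lines[aaa_idx[is_abv]], unname(body))
}

example <- c("AAA|x|", "ABV|a|", "ABP|1|", "ABP|2|",
             "ABV|b|", "ABP|3|",
             "AAA|y|", "ABV|c|", "ABP|4|")
flatten_records(example)
# [1] "AAA|x|ABV|a|ABP|1|ABP|2|" "AAA|x|ABV|b|ABP|3|"
# [3] "AAA|y|ABV|c|ABP|4|"
```

With a real file, something like flatten_records(readLines("infile.txt")) followed by writeLines() should work, where infile.txt is a placeholder file name.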
3 upvotes
Ritchie Sacramento
10/17/2023
#2
If we count the number of fields separated by | on each line, we can observe the following pattern:
(fields <- count.fields(textConnection(txt), sep = "|"))
[1] 11 7 6 6 6 7 6 6 6 7 6 6 6 11 7 6 6 6 7 6 6 6 7 6 6 6
Assuming this pattern holds in your data, we can use the following to join the lines:
text <- readLines(textConnection(txt))
l1 <- fields == 11L
l2 <- fields == 7L
l3 <- fields == 6L
l2_id = cumsum(l2)
paste0(unlist(Map(paste, text[l1], split(text[l2], cumsum(l1)[l2]), sep = "")), tapply(text[l3], l2_id[l3], paste, collapse = ""))
[1] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|"
[2] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[3] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|"
[4] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"
[5] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[6] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
Data:
txt <- "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|"
Comments
0 upvotes
Ritchie Sacramento
10/17/2023
@Chris - Sure, it's just a way of assigning a unique ID to each run of values every time a 7 is encountered, i.e. c(11, 7, 6, 6, 6, 7, 6, 6, 6) becomes c(0, 1, 1, 1, 1, 2, 2, 2, 2).
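The run-ID idea in the comment above can be checked directly. A minimal sketch, using the field counts of the first AAA block; the names run_id and abp_groups are illustrative and do not appear in the answer.

```r
# Field counts for one AAA block: 11 = AAA, 7 = ABV, 6 = ABP
fields <- c(11, 7, 6, 6, 6, 7, 6, 6, 6)

# cumsum() over a logical vector increments by one at every TRUE,
# so each 7 (an ABV line) starts a new run ID
run_id <- cumsum(fields == 7)
print(run_id)
# [1] 0 1 1 1 1 2 2 2 2

# Keeping only the 6-field (ABP) lines yields the grouping
# that ties each ABP line to its ABV line
abp_groups <- run_id[fields == 6]
print(abp_groups)
# [1] 1 1 1 2 2 2
```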
0 upvotes
Chris
10/17/2023
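What indicated the choice of l2 for id'ing rather than l1 (or l3)? Is it the beginning of max_seq(), or is it just a "derived index" and I'm mistakenly seeing a relation between yours above and max_seq, since one says how long and the other says which sequence? We often see cumsum used to derive an index... that said, as far as I understand it.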
0 upvotes
jay.sf
10/17/2023
#3
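First, use grep to find the "AAA" positions (probably faster in bash), then scan the file while skipping to each position, count the "ABV" lines, and prepend the header. Finally, unlist.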
pos <- system('grep -n -o AAA foo.txt | cut -d: -f1', intern=TRUE)
# pos <- grep('AAA', readLines('foo.txt')) ## alternatively just using R
lapply(seq_along(pos), \(i) {
  # number of lines to skip before this 'AAA' header
  .skip <- (as.numeric(pos) - 1L)[i]
  # lines in this block; NA for the last block, in which case scan reads to the end of the file
  .nmax <- as.numeric(pos[i + 1L]) - .skip - 1L
  r <- scan('foo.txt', what=character(), skip=.skip, nmax=.nmax, quiet=TRUE)
  len <- sum(grepl('ABV', r))
  # each sub-entry spans 4 lines: one 'ABV' line plus three 'ABP' lines
  sapply((seq_len(len) - 1L)*4L, \(j)
    paste(r[1], paste(r[2:5 + j], collapse=''), sep=''))
}) |> unlist()
[1] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ABERDARE|20161113|RF|16|20171229|ABP|1|F|.051|I|ABP|2|F|.047|I|ABP|3|F|.019|I|"
[2] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASHWW-1|20161113|RF|16|20171229|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[3] "AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|ABV|E_ASLVW-1|20161113|RF|16|20171229|ABP|1|F|1.204|E|ABP|2|F|.974|E|ABP|3|F|1.025|E|"
[4] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"
[5] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[6] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
[7] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-2|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
[8] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ABERDARE|20161117|RF|16|20180105|ABP|1|F|.051|I|ABP|2|F|.048|I|ABP|3|F|.041|I|"
[9] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-1|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[10] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASHWW-2|20161117|RF|16|20180105|ABP|1|F|0|E|ABP|2|F|0|E|ABP|3|F|0|E|"
[11] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-1|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
[12] "AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|ABV|E_ASLVW-2|20161117|RF|16|20180105|ABP|1|F|5.487|E|ABP|2|F|5.485|E|ABP|3|F|5.484|E|"
This assumes that entries start with "AAA" and that each sub-entry is 4 lines long. This can be adjusted as needed.
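Since the fixed sub-entry length is the key assumption here, it may be worth checking it before relying on the result. A minimal sketch, not part of the answer: check_block_length is a hypothetical helper, and it assumes that every sub-entry begins with a line starting with "ABV".

```r
# Hypothetical check: does every ABV sub-entry span `expected` lines?
check_block_length <- function(lines, expected = 4L) {
  lines <- lines[nzchar(lines)]
  lines <- lines[!startsWith(lines, "AAA")]  # headers are not part of sub-entries
  block <- cumsum(startsWith(lines, "ABV"))  # a new block ID at each ABV line
  sizes <- tabulate(block)                   # number of lines per block
  all(sizes == expected)
}

ok  <- c("AAA|h|", "ABV|a|", "ABP|1|", "ABP|2|", "ABP|3|",
         "ABV|b|", "ABP|1|", "ABP|2|", "ABP|3|")
bad <- ok[-9]                                # drop one ABP line
check_block_length(ok)
# [1] TRUE
check_block_length(bad)
# [1] FALSE
```

On the real file this could be called as check_block_length(readLines('foo.txt')); if it returns FALSE, the hard-coded 4L and 2:5 in the code above would need adjusting.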
Data:
AAA|C0421002|D|20180102091621|CD|UKDC|PB|PORTAL|301873|OPER|
ABV|E_ABERDARE|20161113|RF|16|20171229|
ABP|1|F|.051|I|
ABP|2|F|.047|I|
ABP|3|F|.019|I|
ABV|E_ASHWW-1|20161113|RF|16|20171229|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161113|RF|16|20171229|
ABP|1|F|1.204|E|
ABP|2|F|.974|E|
ABP|3|F|1.025|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
ABV|E_ASLVW-2|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
AAA|C0421002|D|20180108092341|CD|UKDC|PB|PORTAL|302513|OPER|
ABV|E_ABERDARE|20161117|RF|16|20180105|
ABP|1|F|.051|I|
ABP|2|F|.048|I|
ABP|3|F|.041|I|
ABV|E_ASHWW-1|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASHWW-2|20161117|RF|16|20180105|
ABP|1|F|0|E|
ABP|2|F|0|E|
ABP|3|F|0|E|
ABV|E_ASLVW-1|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|
ABV|E_ASLVW-2|20161117|RF|16|20180105|
ABP|1|F|5.487|E|
ABP|2|F|5.485|E|
ABP|3|F|5.484|E|