Reading top level blocks from a text file in R

Asked by: Aku-Ville Lehtimäki Asked: 10/18/2023 Last edited by: Aku-Ville Lehtimäki Updated: 10/18/2023 Views: 35

Q:

I am working in R with files that contain blocks, for example:

block name { block contents can be anything: strings, numbers or even curly braces {} or whatever}

blockn4m3 containing numbers {
                                 Can be something junk like: 
                   ans{a{a[sf'asödfä'asdösdö'äasdö'äasdö}}}
}}

I would then like to extract them into a character vector, so that I get:

"block name { block contents can be anything strings, numbers or even brackets {} or whatever}","blockn4m3 containing numbers {
                                 Can be something junk like: 
                   ans{a{a[sf'asödfä'asdösdö'äasdö'äasdö}}}
}}"

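I assume regex won't work, because blocks can contain curly braces (and nested blocks)?

So I thought I would just read each file character by character, and I wrote the following function: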

library(magrittr)  # provides the %>% pipe used below

separateBlocksFromFile <- \(file) {
  input <- file %>% readLines %>% {paste(., collapse = "\n")}
  blocks <- c()
  blockNumber = 1 #We start from the first block
  netBracketValue = 0 #0, when reading a block name
  for(i in seq_len(nchar(input))) {
    currentCharacter = substr(input,i,i)
    
    #Did we enter a block?
    netBracketValue = netBracketValue + (currentCharacter == "{")
    
    #Write the character into its correct place.
    
    #Previous characters in the current block...
    previousCharacters <- ifelse(is.na(blocks[blockNumber]),"",blocks[blockNumber])
    #...are put before current character
    blocks[blockNumber] <- paste0(previousCharacters,currentCharacter)
    
    
    #Did we exit a block? If so, the netBracketValue becomes 0 here.
    netBracketValue = netBracketValue - (currentCharacter == "}")
    
    #Block number is updated, if needed.
    #Updated when we pass "}" character and the character ends a block i.e.
    #netBracketValue == 0
    blockNumber <- blockNumber + (netBracketValue == 0)*(currentCharacter == "}")
  }
  
  return(blocks)
}

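This works, but it is rather slow on larger files. Is there a faster way to achieve this?

Edit: block contents cannot have a closing } before an opening {. If they could, there would be no way to tell whether we have exited the block.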

r substr curly-braces

Comments

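2 upvotes Konrad Rudolph 10/18/2023
Don't concatenate the result string incrementally; that is probably by far the slowest part. Instead, note the start position of the current block and, once you have found the end of the block, extract the substring from the last start to the current end. Or, alternatively, note all start and end positions, and then call substring() once at the end to get all substrings from the vectors of start and end positions.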
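The second suggestion in the comment could look roughly like the sketch below. It is an illustration rather than a tested replacement for the function in the question: separate_blocks is a hypothetical name, the whole file is assumed to fit in memory as one string, and the per-character loop is swapped for cumsum() over brace increments, which goes a step further than the comment itself describes. Like the original function, it keeps any whitespace between blocks at the start of the following block. Unlike the original, it drops any text after the last complete block.

```r
# Hypothetical helper (not from the original post): split a string into
# top-level brace-delimited blocks with a single substring() call.
separate_blocks <- function(text) {
  # Split the input into single characters once, up front.
  chars <- strsplit(text, "")[[1]]

  # Nesting depth after each character: "{" adds 1, "}" subtracts 1.
  depth <- cumsum((chars == "{") - (chars == "}"))

  # A top-level block ends where a "}" brings the depth back to 0.
  ends <- which(chars == "}" & depth == 0)
  if (length(ends) == 0L) return(character(0))

  # Each block starts right after the previous block's end.
  starts <- c(1L, head(ends, -1L) + 1L)

  # Extract all blocks in one vectorised call.
  substring(text, starts, ends)
}

# Usage on an in-memory string; for a file, the input could be built with
# paste(readLines(file), collapse = "\n") as in the question.
example <- "a {b {c} d}\nsecond {x}"
blocks <- separate_blocks(example)
```

The main saving should come from avoiding the repeated paste0() on a growing string inside the loop, but actual timings on large files have not been measured here.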

A: No answers yet