Efficient way to remove certain lines in a text file and then convert to a table using fread()

Q:

I have a data file in the format shown below. It has 5 columns and about 2,000,000 rows.

# some text
# some more text
#
#       Column names                                            Units                         
#       ------------------------------------------------------  ------------------------ 
#@   1  "aaaa"                                                  "s"                       
#@   2  "bbbbbbb "                                              "kg"                     
#@   3  "cccccccc"                                              "m"                     
#@   4  "dddddddd"                                              "lb"                     
#@   5  "eeeeeeee"                                              "m"                     

2 4 5 6 7 
7 8 9 3 2 
...
...
...

# row 145800
# row 145801
# row 145802 
# row 145803
# row 145804

3 4 6 7 9

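The idea is to create a data frame using fread(). Before that, I need to skip the lines that contain the "#" character. One problem in this example is that "#" also appears somewhere in the middle of the text file, as in lines 145800 to 145804. So I split the data into two character vectors and then merge them, which removes the "#" lines at 145800 to 145804. The lines with "#@" are kept because they hold the column names; I will remove them later, once they have been mapped to the columns.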

library(data.table)

# path of the data file
path <-  "C:/data.txt"

# read original data file. 
# Does the same as readLines() - inspired by https://stackoverflow.com/questions/32920031/how-to-use-fread-as-readlines-without-auto-column-detection
lines_original <- fread(path, sep= "?", header = FALSE)[[1L]] 

# Read the first 100 lines of the file into a character vector
lines_subset<- fread(path, sep= "?", header = FALSE, nrows = 100)[[1L]]


# Identify the lines that contain the special character in the first 100 lines
special_lines_1 <- grep("\\#", lines_subset)

# Identify the lines that contain the special character in the entire file
special_lines_2 <- grep("\\#", lines_original)


# Subset of lines_subset containing "#" 
lines_1 <- lines_subset[special_lines_1] 


# Subset of lines_original NOT containing "#" (negative index drops the matches)
lines_2 <- lines_original[-special_lines_2]

# merging lines_1 and lines_2 so that "#" is removed everywhere apart from first 100 lines 
lines_new <- c(lines_1, lines_2)


skip <- tail(grep("\\#", readLines(textConnection(lines_new))),1)

I now want to convert lines_new into a data frame using the code below:

df <- fread(text = lines_new, skip = skip,  header = FALSE)

As you can see, I call fread() several times. Is there a way to avoid the final fread() call, given that the data is already in memory?


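@AndreWildberg I want the program to find n. So I used skip <- tail(grep("\\#", readLines(path)), 1), which should give 38. But I have these lines 145800 - 148804 where the special character was somehow introduced. So I want to remove those lines from the text file, something like grep("\\#", readLines(p_header where n>100))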

0 votes, Andre Wildberg, 4/13/2023

If you know you want at most the first 100 lines, why not read in those lines and filter with grep afterwards? That avoids reading the file twice.
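One way to read this suggestion is sketched below. It is only an illustration under assumptions: the temporary file is a made-up three-column miniature of the format in the question, and the names lines_all, header_lines and data_lines are hypothetical. The file is read once, the "#@" header is looked for in the first 100 lines only, and every line starting with "#" is dropped before parsing.

```r
library(data.table)

# Hypothetical miniature of the file described in the question
path <- tempfile(fileext = ".txt")
writeLines(c(
  "# some text",
  "#       Column names      Units",
  '#@   1  "aaaa"            "s"',
  '#@   2  "bbbbbbb "        "kg"',
  '#@   3  "cccccccc"        "m"',
  "",
  "2 4 5",
  "7 8 9",
  "# row 145800",
  "",
  "3 4 6"
), path)

# Read the file a single time
lines_all <- readLines(path)

# Search for the "#@" header lines within the first 100 lines only
head_n <- min(100L, length(lines_all))
header_lines <- grep("^#@", lines_all[seq_len(head_n)], value = TRUE)

# Keep lines that are non-empty and do not start with "#"
keep <- !grepl("^#", lines_all) & nzchar(trimws(lines_all))
data_lines <- lines_all[keep]

# Parse the remaining lines; the text is still character data, so it has to be parsed once
dt <- fread(text = data_lines, header = FALSE)
```

Anchoring the pattern with "^" means only lines that begin with "#" are removed, whereas "\\#" matches the character anywhere in a line.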

A:

0 votes, amin fathullah, 4/14/2023 #1

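Maybe you can use this approach.

Load the text file as strings,
remove the lines prefixed with "#" (I use a regular expression),
convert to a table using fread.

Here is the code: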

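library(data.table)
library(magrittr)  # provides %>%
# keep only non-empty lines that do not start with "#"
readLines('tes.txt') %>% 
  grep('^[^#]', ., value = TRUE) %>% 
  fread(text = .)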
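Note that this pipeline also discards the "#@" lines, which the question wants to keep for the column names. A minimal sketch of recovering them follows, assuming the header lines look like the sample in the question (an index followed by a quoted name). The shortened header_lines vector, the three-column data and the name col_names are made up for illustration.

```r
library(data.table)

# Hypothetical header lines in the "#@ <index> "<name>" "<unit>"" layout from the question
header_lines <- c(
  '#@   1  "aaaa"       "s"',
  '#@   2  "bbbbbbb "   "kg"',
  '#@   3  "cccccccc"   "m"'
)

# Capture the first quoted string on each line and trim stray spaces
col_names <- trimws(
  sub('^#@[[:space:]]*[0-9]+[[:space:]]+"([^"]*)".*$', "\\1", header_lines)
)

# Parse some made-up data lines and apply the recovered names by reference
dt <- fread(text = c("2 4 5", "7 8 9", "3 4 6"), header = FALSE)
setnames(dt, col_names)
```

This only works when the number of recovered names equals the number of parsed columns; setnames() raises an error otherwise.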
