Trying to parse a webpage: subscript out of bounds

Asked by: Jorge Martínez  Asked: 7/3/2020  Last edited by: Jorge Martínez  Updated: 7/5/2020  Views: 45

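Question:

I am trying to extract information from the website coches.net (a site for buying cars), but some code I found while browsing is giving me trouble. To be clear, I have no coding experience, so I am lost. I have tried several things but cannot get it to work.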

The error message R gives me is this: Error in str_split(string = titulo, pattern = " ")[[1]] : subíndice fuera de los límites. Translated, it means "subscript out of bounds".
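One plausible reading of this error (an assumption, since the failing page is not shown): for at least one card the title selector matches nothing, so `titulo` is a zero-length character vector. Splitting an empty vector returns an empty list, and `[[1]]` on an empty list raises "subscript out of bounds". The sketch below reproduces that situation with base R's `strsplit()`, used here only as a stand-in for `str_split()`.

```r
# Assumption: the selector matched no node, so html_text() returned character(0).
titulo <- character(0)

# Splitting a zero-length vector yields a zero-length list ...
partes <- strsplit(titulo, " ")
print(length(partes))

# ... so asking for its first element fails. tryCatch() captures the
# error message instead of stopping the script (the wording depends on locale).
resultado <- tryCatch(
  partes[[1]],
  error = function(e) conditionMessage(e)
)
print(resultado)
```

If this is what happens, the split itself is fine; the problem is that there is nothing to split for that particular card.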

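While looking for a solution I found this: https://stackoverrun.com/es/q/4074347. According to it, the problem is related to the number of rows/columns my table creates for the information I am downloading. However, I cannot work out a fix.

The full code is this (edit V1, after removing "Marca"):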

  start <- Sys.time()
  
  list.of.packages <- c("tidyverse", "rvest", "httr")
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages)>0) {install.packages(new.packages)}
  
  library(tidyverse)
  library(rvest)
  library(httr)
  
  
  desktop_agents <-  c('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/602.2.14 (KHTML, like Gecko) Version/10.0.1 Safari/602.2.14',
                       'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                       'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
                       'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0')
  
  
  line <- data.frame("Titulo", "Precio", "Provincia", "Motor", "Año", "Kilometros", "Fecha subida","Link")
  write.table(line, file = ruta, sep = ",", append = TRUE, quote = TRUE, col.names = FALSE, row.names = FALSE, na = "")
  
  
  for (counter in (1:paginas)) {
    url <- paste0("https://www.coches.net/segunda-mano/?pg=", as.character(counter))
    print(url)
    
    x <- GET(url, add_headers('user-agent' = desktop_agents[sample(1:10, 1)]))
    bloque <- x %>% read_html() %>% html_nodes(".mt-Card-body")
    
    for (p in (1:length(bloque))) {
      titulo <- bloque[p] %>% html_nodes(".mt-CardAd-title .mt-CardAd-titleHiglight") %>% html_text()
           
      precio <- bloque[p] %>% html_nodes(".mt-CardAd-price .mt-CardAd-titleHiglight") %>% html_text()
      precio <- str_replace(string = precio, pattern = " €", replacement = "")
      precio <- str_replace(string = precio, pattern = "\\.", replacement = "")
      precio <- as.numeric(precio)
      
      info <- bloque[p] %>% html_nodes(".mt-CardAd-attribute") %>% html_text()
      prov <- info[1]
      motor <- info[2]
      año <- info[3]
      
      km <- info[4]
      km <- str_replace(string = km, pattern = "\\.", replacement = "")
      km <- as.numeric((str_replace(string = km, pattern = " km", replacement = "")))
      
      fechasubida <- bloque[p] %>% html_nodes(".mt-CardAdDate-time") %>% html_text()
      
      link <- bloque[p] %>% html_nodes(".mt-CardAd-link") %>% html_attr(name = "href")
      link <- paste0("https://www.coches.net", link[1])
      
      print(paste(titulo, precio, prov, motor, año, km, fechasubida, link))
      line <- data.frame(titulo, precio, prov, motor, año, km, fechasubida, link)
      write.table(line, file = ruta, sep = ",", append = TRUE, quote = TRUE, col.names = FALSE, row.names = FALSE, na = "")
    }
  }
  
  end <- Sys.time()
  diff <- end - start
  print(paste("Cochisto ha descargado el 100% de los anuncios en", diff))
}

Any help would be greatly appreciated.
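The `data.frame()` call in the loop assumes that every scraped field holds exactly one value. If a selector matches nothing, that field has length 0, and `data.frame()` refuses to combine it with the length-1 fields. Below is a minimal sketch of one way around this, assuming it is acceptable to record a missing field as `NA`. The helper name `first_or_na` is hypothetical (it is not part of rvest or stringr), and the sample values are invented for illustration.

```r
# Hypothetical helper: return the first element of x, or NA if x is empty.
first_or_na <- function(x) {
  if (length(x) == 0) NA else x[[1]]
}

# Simulated scrape results (invented values): the title selector
# matched nothing for this card, the other fields were found.
titulo <- character(0)
precio <- 12500
prov   <- "Madrid"

# Without the guard, data.frame(titulo, precio, prov) would stop with
# "arguments imply differing number of rows: 0, 1".
line <- data.frame(
  titulo = first_or_na(titulo),
  precio = first_or_na(precio),
  prov   = first_or_na(prov)
)
print(line)
```

Because `write.table()` is already called with `na = ""`, a row built this way would be written with an empty cell for the missing field rather than aborting the whole run.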

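r html parsing

Comments

0 votes  QHarr  7/5/2020
Maybe test whether " " is present in the string before splitting.
0 votes  Jorge Martínez  7/5/2020
Following your suggestion, I tried a few things. What I did was remove every reference to the "Marca" variable so that I would not have to apply the split. But R now gives me these new errors: "Error in data.frame(titulo, precio, prov, motor, año, km, fechasubida, : arguments imply differing number of rows: 0, 1" and "Error durante el wrapup: regular expression is invalid UTF-8".
0 votes  QHarr  7/5/2020
What I meant is that this way you can narrow down the problem cases and then make a better decision about what needs to be done.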
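QHarr's suggestion can be sketched as a small check that runs before the split and reports which titles would have failed, so the problem cases can be inspected. This is only an illustration: `split_title` is a hypothetical helper, the sample titles are invented, and base R's `strsplit()` stands in for `str_split()`.

```r
# Hypothetical helper: split a title on spaces only when it is a single,
# non-missing string that actually contains a space. Anything else is
# reported with message() and yields character(0) instead of an error.
split_title <- function(titulo) {
  if (length(titulo) != 1 || is.na(titulo) || !grepl(" ", titulo, fixed = TRUE)) {
    message("Skipping split, unexpected title: ", deparse(titulo))
    return(character(0))
  }
  strsplit(titulo, " ", fixed = TRUE)[[1]]
}

# Invented sample inputs: a normal title, an empty scrape result,
# and a one-word title with no space to split on.
print(split_title("SEAT Ibiza 1.0 TSI"))
print(split_title(character(0)))
print(split_title("Tesla"))
```

The messages show which cards produced an unexpected title, which is the information needed to decide whether to skip those cards, fill in `NA`, or adjust the CSS selector.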

Answers: no answers yet