Webscraping html tables with variable length - How do I make sure my data ends up in the correct columns when constructing a dataframe?

Asked by: Moritz Asked: 6/20/2020 Last edited by: alistaire Updated: 6/20/2020 Views: 73

Q:

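I (a beginner-to-intermediate R user) am trying to web-scrape data on a large number (~12k) of buildings in Berlin.

The information can be found on the web pages of the Berlin heritage authority (one page per building, so 12k pages), which all look like this (the site is in German; the data I'm interested in is the table in the middle, starting with Obj.-Dok-Nr.: 09XXXXXXX and the lines that follow, containing suburb, address, etc.).

All URLs end in obj_dok_nr=09xxxxxx (09097890-09010001), which corresponds to the internal ID of the building, and each field of the table has the following CSS selector: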

.denkmal_detail_head+ .denkmal_detail_body tr:nth-child(n) td+ td

where n is an integer counting up from 1.
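
As a side note, the same selector without the tr:nth-child(n) part matches every value cell at once, so a single download per page is enough. A minimal sketch on made-up HTML (the markup, the ID and the values below are assumptions for illustration, not copied from the site):

```r
library(rvest)

# Made-up stand-in for one detail page (structure and values are assumptions)
html <- '
<div class="denkmal_detail_head">Denkmal</div>
<table class="denkmal_detail_body">
  <tr><td>Obj.-Dok.-Nr.:</td><td>09000000</td></tr>
  <tr><td>Bezirk:</td><td>Mitte</td></tr>
  <tr><td>Strasse:</td><td>Pohlstrasse</td></tr>
</table>'

page <- read_html(html) # parse once, query as often as needed

base_css <- ".denkmal_detail_head + .denkmal_detail_body"

# n = 2: the value cell of the second row only
second_value <- page %>%
  html_nodes(paste0(base_css, " tr:nth-child(2) td + td")) %>%
  html_text()

# no nth-child: the value cells of all rows, in document order
all_values <- page %>%
  html_nodes(paste0(base_css, " tr td + td")) %>%
  html_text()

# the label cells, for pairing each value with its category
all_labels <- page %>%
  html_nodes(paste0(base_css, " tr td:first-child")) %>%
  html_text()
```

Keeping the labels next to the values is what later makes it possible to sort entries into columns by category rather than by position.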

I already have all the 09XXXXXXX IDs in a separate data frame called denkmal_df, which I built from a JSON file I found elsewhere.

I wrote this code to retrieve the data:

library(rvest) # read_html(), html_nodes(), html_text() and the %>% pipe
library(purrr) # is_empty()

get_URL <- function(key) { # assemble a URL from a denkmal key
  url <- paste0("https://www.stadtentwicklung.berlin.de/denkmal/liste_karte_datenbank/de/denkmaldatenbank/daobj.php?obj_dok_nr=", key)
  return(url)
}

get_Data <- function(url, i, static_css_1, static_css_2) { # web scraper core; needs to be more flexible, but gets the job done
  css_key <- paste0(static_css_1, i, static_css_2) # assemble the CSS key to extract information
  print(css_key)
  data <- url %>% read_html() %>% html_nodes(css = css_key) %>% html_text() # extract information
  return(data)
}

# prepare the data frame; faster than creating it in the loop
# Why can't I create the tibble directly? matrix(NA, nrow = length(denkmal_df$keys), ncol = 13) %>% tibble results in a tibble with one column
web_data <- matrix(as.character(NA), nrow = 13, ncol = 13) %>% data.frame() # %>% as_tibble()
temp_data <- NA
css1 <- ".denkmal_detail_head tr:nth-child(" # define the CSS structure; possible improvement: more flexibility
css2 <- ") td+ td"

for (key in seq_along(denkmal_df$keys)) { # loop over all the denkmal keys, using an index called key
  i <- 1
  temp_data <- NA
  url <- get_URL(denkmal_df$keys[key]) # build the URL by passing the actual denkmal key to get_URL()
  print(url)
  while (!is_empty(temp_data)) { # scrape the page of the current denkmal until no more information is found (temp_data is empty)
    temp_data <- get_Data(url, i, css1, css2) # retrieve category i, which the site indexes by a CSS path + a number (1 for the first item, 2 for the second, ...)
    if (i >= ncol(web_data) && !is_empty(temp_data)) { # all columns of the data frame are full, but we still got data from a new category
      # parentheses are required: i+1:i+5 parses as i + (1:i) + 5
      web_data[key, (i + 1):(i + 5)] <- as.character(NA) # expand the data frame with a bunch of NA columns
    }

    if (is_empty(temp_data)) { # writing empty values into a data frame throws an error, so convert to NA
      temp_data <- as.character(NA)
      web_data[key, i] <- temp_data
      temp_data <- character(0) # and back to empty, to exit the while loop
    } else {
      web_data[key, i] <- temp_data # write to row = key (position of the actual key), column = i (the category)
      i <- i + 1 # move on to the next category; repeat until no categories are left (temp_data is empty)
    }
  }
}

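While it scrapes the data just fine, the resulting data frame is a mess. Because some HTML tables have more entries than others (compare this and this; multiple addresses are common), the values are all over the place (note that some have been truncated by the tibble formatting, marked with ~):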

> as_tibble(web_data)
# A tibble: 13 x 13
   X1      X2          X3      X4       X5                                           X6           X7           X8          X9      X10          X11        X12    X13  
   <chr>   <chr>       <chr>   <chr>    <chr>                                        <chr>        <chr>        <chr>       <chr>   <chr>        <chr>      <chr>  <chr>
 1 090978~ Mitte       Gesund~ Putbuss~ 12                                           Swinemünder~ Gesamtanlage Schule & B~ NA      NA           NA         NA     NA   
 2 090978~ Charlotten~ Westend Messeda~ 11 & 12                                      Hammarskjöl~ Baudenkmal   Kongressge~ NA      NA           NA         NA     NA   
 3 090978~ Mitte       Tierga~ Rauchst~ 4 & 5 & 6                                    Stülerstraße 2 & 4        Thomas-Deh~ 1 & 3 ~ Gartendenkm~ Siedlungs~ NA     NA   
 4 090978~ Reinickend~ Tegel   Am Tege~ 2 & 4 & 6 & 8 & 8A & 8B & 8C & 8D & 8E & 10~ Gartendenkm~ Siedlungsgr~ NA          NA      NA           NA         NA     NA   
 5 090978~ Charlotten~ Wilmer~ Prager ~ 4 & 5                                        Prager Stra~ 13           Prinzregen~ 97      Asschaffenb~ Gesamtanl~ Stadt~ NA   
 6 090978~ Mitte       Tierga~ Reichpi~ 48 & 50                                      Gesamtanlage Forschungse~ NA          NA      NA           NA         NA     NA   
 7 090978~ Mitte       Tierga~ Rauchst~ 4 & 5 & 6                                    Stülerstraße 2 & 4        Thomas-Deh~ 1 & 3 ~ Gesamtanlage Wohnanlage NA     NA   
 8 090978~ Mitte       Tierga~ Pohlstr~ 77                                           Baudenkmal   Wohn- und G~ NA          NA      NA           NA         NA     NA   
 9 090978~ Mitte       Tierga~ Lützowu~ 1A & 1B & 2 & 2A & 3 & 3A & 4 & 4A & 5 & 5A  Gesamtanlage Wohnanlage   NA          NA      NA           NA         NA     NA   
10 090978~ Mitte       Tierga~ Lützows~ 44 & 44A & 45 & 45A & 45B & 45C & 45D & 45E~ Gesamtanlage Wohnanlage ~ NA          NA      NA           NA         NA     NA

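I want every street name (e.g. "Pohlstr(aße)", "Stülerstraße") of every building to end up in the same column as all the other street names, all the building types (e.g. "Wohnanlage", "Schule") in the same column, and so on. How can I achieve this?

I've already tried scraping the whole HTML table into a data frame, but that gave similar results. I have no way of knowing the maximum number of entries short of running the whole loop over all 12k HTML pages. (Also, if my existing code can be improved in some way, please feel free to give hints.)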
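
One way to avoid needing the maximum number of entries up front, sketched under assumptions: store one row per table entry (long format) instead of one column per entry. The HTML snippets, the keys and the helper to_long() below are invented for illustration and only mimic a two-column label/value table.

```r
library(rvest)

# Invented HTML for two buildings; the second has two addresses, so labels repeat
pages <- c(
  "09000001" = '<table class="denkmal_detail_body">
    <tr><td>Bezirk:</td><td>Mitte</td></tr>
    <tr><td>Strasse:</td><td>Pohlstrasse</td></tr>
    <tr><td>Hausnummer:</td><td>77</td></tr>
  </table>',
  "09000002" = '<table class="denkmal_detail_body">
    <tr><td>Bezirk:</td><td>Mitte</td></tr>
    <tr><td>Strasse:</td><td>Rauchstrasse</td></tr>
    <tr><td>Hausnummer:</td><td>4 &amp; 5 &amp; 6</td></tr>
    <tr><td>Strasse:</td><td>Stuelerstrasse</td></tr>
    <tr><td>Hausnummer:</td><td>2 &amp; 4</td></tr>
  </table>'
)

# Hypothetical helper: one row per (building, table entry)
to_long <- function(html, key) {
  tbl <- read_html(html) %>%
    html_node("table.denkmal_detail_body") %>%
    html_table()
  data.frame(
    key      = key,
    position = seq_len(nrow(tbl)),   # keeps the original order of the entries
    label    = as.character(tbl$X1),
    value    = as.character(tbl$X2), # html_table() may guess numeric types
    stringsAsFactors = FALSE
  )
}

long <- do.call(rbind, Map(to_long, pages, names(pages)))
rownames(long) <- NULL

# Every street name of every building now sits in the same column
streets <- long[long$label == "Strasse:", c("key", "value")]
```

With this layout the number of rows grows with the number of entries, so no column count has to be guessed before the loop runs.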

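r html parsing data-cleaning rvest

A:

1 upvote Dave2e 6/20/2020 #1

The page is a mess, but with some tricky CSS selectors this may answer your question. The given page has 11 houses that need to be parsed, is that right?

See if this is at least partially correct.

See the comments for an explanation of the code.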

library(rvest)
library(dplyr)

url<-"https://www.stadtentwicklung.berlin.de/denkmal/liste_karte_datenbank/de/denkmaldatenbank/daobj.php?obj_dok_nr=09097874"
page <- read_html(url)
#select nodes
#find each denkmal_detail_body table that follows a table.denkmal_detail_sub with exactly one element in between
infolist<- page %>% html_nodes("table.denkmal_detail_sub + * + table.denkmal_detail_body")
houses <- infolist %>% html_table()

#convert the list of nodes into data frames
dfs<-lapply(houses, function(house){
   #transform to a single row dataframe
   df<-as.data.frame(t(house$X2))
   #rename the columns
   names(df) <-house$X1
   df
})
#bind into the answer
answer <-bind_rows(dfs)


answer

        Teil-Nr.: Sachbegriff:         Strasse:                                                  Hausnummer:
1  09097874,T,001   Stadtvilla Am Tegeler Hafen                                                            2
2  09097874,T,002   Stadtvilla Am Tegeler Hafen                                                            4
3  09097874,T,003   Stadtvilla  Am Tegelerhafen                                                            6
4  09097874,T,004   Stadtvilla Am Tegeler Hafen                                                            8
5  09097874,T,005   Wohnanlage Am Tegeler Hafen                                       8A & 8B & 8C & 8D & 8E
6  09097874,T,006   Stadtvilla Am Tegeler Hafen                                                           10
7  09097874,T,007   Stadtvilla Am Tegeler Hafen                                                           12
8  09097874,T,008   Wohnanlage Am Tegeler Hafen                             14 & 16 & 18 & 20 & 22 & 24 & 26
9  09097874,T,009   Wohnanlage Am Tegeler Hafen 28 & 28A & 28B & 28C & 28D & 28E & 28F & 28G & 28H & 30 & 32
10  09097874,T,10   Wohnanlage Am Tegeler Hafen                                       34 & 36 & 38 & 40 & 42
11 09097874,T,011   Stadtvilla Am Tegeler Hafen                                                           44
                                                                                      Entwurf:
1                              Moore, Charles Willard & Ruble, John & Yudell, Buzz (Architekt)
2                                                    Steinebach, Karl-Heinz & Weber, Friedrich
3                                                      Stern, Robert Arthur Morton (Architekt)
4                                                                           Tigermann, Stanley
5              Bangert, Dietrich & Jansen, Bernd & Scholz, Stefan & Schultes, Axel (Architekt)
6                                                                Portoghesi, Paolo (Architekt)
7                                                                            Grumbach, Antoine
8  Steinebach, Karl-Heinz & Weber, Friedrich & Poly, Regina (Architekt & Landschaftsarchitekt)
9                              Moore, Charles Willard & Ruble, John & Yudell, Buzz (Architekt)
10             Bangert, Dietrich & Jansen, Bernd & Scholz, Stefan & Schultes, Axel (Architekt)
11                                                                    Hejduk, John (Architekt)
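
The same name-based binding also helps with the original problem of values landing in the wrong columns: bind_rows() matches columns by name and fills missing ones with NA. A minimal sketch, assuming each page's table has already been read into a two-column data frame (X1 = label, X2 = value) like the ones html_table() returns in the code above. The sample data, the keys and the helper to_row() are invented, and collapsing repeated labels into one string is a design choice of this sketch, not part of the answer.

```r
library(dplyr)

# Invented label/value tables standing in for two scraped pages.
# Labels repeat when a building has several addresses.
page_a <- data.frame(
  X1 = c("Bezirk:", "Strasse:", "Hausnummer:", "Denkmalart:"),
  X2 = c("Mitte", "Pohlstrasse", "77", "Baudenkmal"),
  stringsAsFactors = FALSE
)
page_b <- data.frame(
  X1 = c("Bezirk:", "Strasse:", "Hausnummer:", "Strasse:", "Hausnummer:",
         "Denkmalart:", "Sachbegriff:"),
  X2 = c("Mitte", "Rauchstrasse", "4 & 5 & 6", "Stuelerstrasse", "2 & 4",
         "Gesamtanlage", "Wohnanlage"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: collapse repeated labels into one string and
# return a one-row data frame whose column names are the labels
to_row <- function(tbl, sep = " | ") {
  labels <- unique(tbl$X1)
  values <- vapply(
    labels,
    function(l) paste(tbl$X2[tbl$X1 == l], collapse = sep),
    character(1)
  )
  row <- as.data.frame(t(values), stringsAsFactors = FALSE)
  names(row) <- labels
  row
}

pages <- list("09000001" = page_a, "09000002" = page_b)

# Columns are matched by name; labels missing on a page become NA
result <- bind_rows(lapply(pages, to_row), .id = "obj_dok_nr")
```

Here the first building gets NA under "Sachbegriff:" because its table has no such row, while both street names of the second building share the single "Strasse:" column.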

1 upvote QHarr 6/21/2020

Nice. A more robust CSS selector than my first attempt, .denkmal_detail_body ~ div + .denkmal_detail_body
