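Q:

Problem statement

I am trying to scrape technical data for many phones from Kimovil.
I know of this answer, but it only gives access to the phones' prices. Is there a similar JSON endpoint for collecting the technical data, i.e. everything shown on, for example, this Kimovil web page?

I have found an alternative (summarized below), but it is rather long and complicated... Any other solution is welcome!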

My solution

What I have done so far:

1. Kimovil organization

Define how the data is organized on Kimovil in a JSON file:

{
    "Technical sheet": {
        "Release Date": {
            "Release date": "Release date"
        }
    },
    "Design & Materials": {
        "Structure": {
            "Size": "Dimensions [mm]",
            "Weight": "Weight [g]",
            "Usable surface": "Screen ratio [%]",
            "Materials": "Materials"
        },
        "Screen": {
            "Diagonal": "Screen size [in]"
        }
    },
...

This file is then read into data_orga
and "adapted" to the specific phone (on Kimovil, the sections are named e.g. "Technical sheet of Google Pixel 7").
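
A minimal sketch of this loading and "adapting" step, using only the standard library. The function name adapt_sections, the choice of which sections get renamed, and the label template "{section} of {phone}" are all assumptions for illustration; the actual wording of the section labels has to be checked against the page.

```python
import json

# A shortened copy of the organization file shown above.
DATA_ORGA_JSON = """
{
    "Technical sheet": {
        "Release Date": {"Release date": "Release date"}
    },
    "Design & Materials": {
        "Structure": {"Weight": "Weight [g]"}
    }
}
"""


def adapt_sections(data_orga, phone, sections_to_adapt,
                   template="{section} of {phone}"):
    """Return a copy of data_orga in which the chosen top-level keys
    carry the phone name. The template is an assumption about how the
    page builds its section labels."""
    adapted = {}
    for section, subsections in data_orga.items():
        if section in sections_to_adapt:
            section = template.format(section=section, phone=phone)
        adapted[section] = subsections
    return adapted


data_orga = json.loads(DATA_ORGA_JSON)
adapted = adapt_sections(data_orga, "Google Pixel 7", {"Technical sheet"})
print(list(adapted))
```

Keeping the template as a parameter means the same organization file can be reused for every phone, with only the phone name changing.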

2. Fetch the web page's HTML content with Python requests:

import requests
response = requests.get("https://www.kimovil.com/en/where-to-buy-google-pixel-7")
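
A slightly more defensive variant of the request above, as a sketch. The helper name fetch_html, the timeout of 10 seconds, and the User-Agent value are assumptions; whether Kimovil requires a particular User-Agent is not established here.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML of url; raise requests.HTTPError on a 4xx/5xx reply."""
    if session is None:
        session = requests.Session()
    # Placeholder User-Agent: some sites reject the library's default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


# Usage (performs a real network request, so it is left commented out):
# html = fetch_html("https://www.kimovil.com/en/where-to-buy-google-pixel-7")
```

Passing a shared requests.Session when scraping many phones lets the connection be reused between requests.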

3. Search and save the data

Use BeautifulSoup to iterate through the sections, subsections, etc. and extract the relevant data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Dict where I store the information (in the end it will hold e.g. data["Weight [g]"] = "197 g")
data = {}

# Iterate through the sections
sections = soup.find_all("section", class_="kc-container")
for section in sections:
  section_name = section.get("aria-label")

  if section_name not in data_orga:
    continue  # next section

  # Iterate through the subsections
  subsections = section.find_all("h3", class_="k-h4")
  for subsection in subsections:
    subsection_name = subsection.get_text(strip=True)

    if subsection_name not in data_orga[section_name]:
      continue  # next subsection

    # Find the table directly after the subsection
    table = subsection.find_next("table", class_="k-dltable")
    if table is None:
      continue  # no table follows this subsection

    # Iterate through the tags
    # (delim1 / delim2: tag names of the label and value elements, defined elsewhere)
    tags = table.find_all(delim1)
    for tag in tags:
      tag_name = tag.get_text(strip=True)

      if tag_name not in data_orga[section_name][subsection_name]:
        continue  # next tag

      # Get the information
      content = tag.find_next(delim2)
      info = content.get_text(strip=True)

      # Save the description in the data dict
      info_name = data_orga[section_name][subsection_name][tag_name]
      data[info_name] = info
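
The traversal above can be wrapped in a function and tried on a small snippet, as sketched below. The HTML is made up and only mimics the structure described in this question; the tag names "th" and "td" stand in for delim1 and delim2 and are assumptions, as is the function name extract_specs.

```python
from bs4 import BeautifulSoup

# Made-up HTML that only mimics the structure described above.
SAMPLE_HTML = """
<section class="kc-container" aria-label="Design &amp; Materials">
  <h3 class="k-h4">Structure</h3>
  <table class="k-dltable">
    <tr><th>Weight</th><td>197 g</td></tr>
    <tr><th>Colors</th><td>Black</td></tr>
  </table>
</section>
"""


def extract_specs(html, data_orga, label_tag="th", value_tag="td"):
    """Collect the values listed in data_orga from the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    for section in soup.find_all("section", class_="kc-container"):
        wanted_subsections = data_orga.get(section.get("aria-label"))
        if wanted_subsections is None:
            continue  # section not listed in data_orga
        for heading in section.find_all("h3", class_="k-h4"):
            wanted_tags = wanted_subsections.get(heading.get_text(strip=True))
            if wanted_tags is None:
                continue  # subsection not listed in data_orga
            table = heading.find_next("table", class_="k-dltable")
            if table is None:
                continue  # no table follows this heading
            for label in table.find_all(label_tag):
                info_name = wanted_tags.get(label.get_text(strip=True))
                value = label.find_next(value_tag)
                if info_name is not None and value is not None:
                    data[info_name] = value.get_text(strip=True)
    return data


orga = {"Design & Materials": {"Structure": {"Weight": "Weight [g]"}}}
specs = extract_specs(SAMPLE_HTML, orga)
print(specs)
```

Using dict.get instead of nested membership tests keeps each level to a single lookup and skips anything not listed in the organization file.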

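Challenges

The fact that some subsections are delimited by <h3> and others by <div> makes it more complicated. Likewise, the subsection about the main camera is called "Dual/Triple/Quad rear camera", which also complicates the search.

This is manageable, but a JSON file would make this much easier and more straightforward...

My whole code is much longer and split across several classes and files. Let me know if a part is missing or unclear :)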
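
One way to handle the varying camera subsection name is to normalize it before the lookup, as in the sketch below. The list of count words and the canonical key "Rear camera" are assumptions based only on the names quoted above.

```python
import re

# Matches an optional count word in front of "rear camera".
# The count words are an assumption based on the names quoted above.
_REAR_CAMERA = re.compile(
    r"^(?:(?:dual|triple|quad)\s+)?rear\s+camera$", re.IGNORECASE
)


def canonical_subsection(name):
    """Map "Dual/Triple/Quad rear camera" variants to a single key."""
    cleaned = " ".join(name.split())  # collapse runs of whitespace
    if _REAR_CAMERA.match(cleaned):
        return "Rear camera"  # hypothetical key for the organization file
    return cleaned


print(canonical_subsection("Triple rear camera"))
```

For the <h3>/<div> difference, BeautifulSoup's find_all also accepts a list of tag names, e.g. section.find_all(["h3", "div"]), although the right class filter for the <div> headings is not given in this question.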

python html json web-scraping

Scrape phone specs from Kimovil.com
