Scraping multiple websites and saving the outputs in different text files

Question:

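I have an Excel sheet with two columns (Url_id and Url). I iterate over the Excel file and use BeautifulSoup to get the article title and article details from each website.

Now I want to create a text file that uses the url_id as its file name, and store the output of the website corresponding to that url_id in the file.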


The code scrapes all the data I need.
The text files are created, but the code writes the same data to every text file.


Everything works fine, except that it loops over the text files multiple times and then writes the same post title and content to all of them.

import os

import requests
import pandas as pd
from bs4 import BeautifulSoup

#Reading the excel file
data_ex = pd.read_excel('input.xlsx')
#Getting the url and url_id columns from the excel file
urls = data_ex.URL
url_id = data_ex.URL_ID

# print(url_id)

for url in urls:
    #Connecting to each url
    res = requests.get(url)

    page = res.text
    soup = BeautifulSoup(page, 'html.parser')
    #Title of each url post
    article_title = soup.find_all(name="h1", class_="entry-title")

    article_texts = []
    article_details = []


    for details in article_title:
        #print(id)
        text = details.getText()
        article_texts.append(text)
        #Post content corresponding to the url title
        article_writeup = soup.find(class_="td-post-content tagdiv-type").getText()


        for id in url_id:
            for story in article_texts:
                #specify folder to create files
                folder = 'files_folder'
                #Create the folder if it doesn't exist
                if not os.path.exists(folder):
                    os.makedirs(folder)
                # List of filenames
            filenames = [f"{id}.txt"]

           # Loop through the filenames and create text files
            for filename in filenames:
                file_path = os.path.join(folder, filename)
            with open(file_path, 'w',  encoding="utf-8") as file:
                # Perform any operations you want with the file
                file.write((f"{story}\n {article_writeup}"))
            #print(f"File '{filename}' created.")

Tags: python, pandas, excel, beautifulsoup

Answer:

for details in article_title:
    for story in article_texts:
        for id in url_id:
            filenames = [f"{id}.txt"]
            for filename in filenames:
                file_path = os.path.join(folder, filename)
                with open(file_path, 'w',  encoding="utf-8") as file:
                    file.write((f"{story}\n {article_writeup}"))

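Read it this way: for each story in article_texts, you loop through every id in url_id and write it to a file. So for one story, it loops through all the URL ids and writes the same story to every f"{id}.txt" file. Because each file is opened in 'w' mode, every pass overwrites the previous content, so all the files end up holding the last article that was scraped.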
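The effect can be reproduced without any scraping. The sketch below is only an illustration: the ids and story strings are made up, and a plain dict stands in for the files on disk (assigning to a key replaces its value, much like opening a file in 'w' mode). The helper names write_nested and write_paired are hypothetical.

```python
# Hypothetical stand-ins for the scraped data; no network access needed.
url_ids = [1, 2, 3]
stories = ["first article", "second article", "third article"]


def write_nested(url_ids, stories):
    """Mimic the nested loops: every story is written to every id."""
    files = {}
    for story in stories:
        for url_id in url_ids:
            # Like open(..., "w"), assigning replaces the earlier content.
            files[f"{url_id}.txt"] = story
    return files


def write_paired(url_ids, stories):
    """Pair each id with its own story, one file per pair."""
    files = {}
    for url_id, story in zip(url_ids, stories):
        files[f"{url_id}.txt"] = story
    return files


nested = write_nested(url_ids, stories)
paired = write_paired(url_ids, stories)
print(nested)  # every file holds "third article"
print(paired)  # each file holds its own article
```

With the nested loops, the last story wins in every file. Pairing the two sequences with zip writes each story exactly once, to the file that belongs to it.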

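Edit:

After reading your code, I had to guess what you need, because you did not show what the data looks like (input) or how you want it formatted (output). Here is an updated version based on your code.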

from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup

data_ex = pd.read_excel("input.xlsx")
urls = data_ex.URL
url_ids = data_ex.URL_ID

folder = "files_folder"
Path(folder).mkdir(parents=True, exist_ok=True)

for url, url_id in zip(urls, url_ids):
    res = requests.get(url, timeout=60)

    page = res.text
    soup = BeautifulSoup(page, "html.parser")
    article_title = soup.find_all(name="h1", class_="entry-title")

    for details in article_title:
        text = details.getText()
        article_writeup = soup.find(class_="td-post-content tagdiv-type").getText()

        file_path = Path(folder, f"{url_id}.txt")
        # file_path is already a Path; only Path is imported, not the pathlib module
        with file_path.open("w", encoding="utf-8") as f:
            f.write(f"{text}\n {article_writeup}")
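To check the pairing and file-writing logic without hitting the network, the writing step can be pulled into a small helper. This is a minimal sketch, not part of the code above: save_articles is a hypothetical helper, the ids and texts are made-up sample data, and a temporary directory stands in for files_folder.

```python
import tempfile
from pathlib import Path


def save_articles(folder, url_ids, texts):
    """Write each text to <folder>/<url_id>.txt and return the paths written."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # zip pairs each id with its own text, so each file is written once.
    for url_id, text in zip(url_ids, texts):
        file_path = out_dir / f"{url_id}.txt"
        file_path.write_text(text, encoding="utf-8")
        written.append(file_path)
    return written


# Made-up sample data standing in for the scraped titles and bodies.
sample_ids = [37, 38]
sample_texts = ["Title A\n Body A", "Title B\n Body B"]

with tempfile.TemporaryDirectory() as tmp:
    paths = save_articles(tmp, sample_ids, sample_texts)
    # Read the files back while the temporary directory still exists.
    contents = {p.name: p.read_text(encoding="utf-8") for p in paths}

print(contents)
```

Keeping the writing step separate from the requests and BeautifulSoup calls makes it possible to confirm that each url_id gets its own content before running the scraper against the real sites.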
