Refining Python Script for Efficient YouTube Channel Data Extraction and Organization

Asked by: Scrupptor Asked: 11/10/2023 Updated: 11/10/2023 Views: 30

Q:

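I am working on a personal project focused on organizing YouTube channels, and I have run into some specific challenges. YouTube has three main types of content: VIDEOS, SHORTS, and LIVE.

Context on YouTube content types:
VIDEOS: the traditional YouTube format, with no duration limit.
SHORTS: short videos of up to 60 seconds, in vertical format.
LIVE: live streams that encourage real-time interaction with the audience.

Challenges in the VIDEOS section: the "Videos" section of a channel is a mixed bag. It includes scheduled live streams as well as short videos that do not qualify as SHORTS.
Scheduled live videos: creators schedule a stream for a specific date and time. These videos appear under VIDEOS rather than LIVE.
Uncategorized short videos: videos of 60 seconds or less that do not meet the specific requirements for being classified as SHORTS (for example, the vertical format).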
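To make the three categories concrete, here is a minimal sketch of how a single video resource might be bucketed. It assumes the item was fetched with videos().list using part="snippet,contentDetails,liveStreamingDetails"; the helper names duration_seconds and classify_video are hypothetical. As far as I can tell the API exposes no explicit SHORTS flag, so a duration of 60 seconds or less only marks a candidate, not a confirmed Short.

```python
import re

# ISO 8601 duration as returned in contentDetails.duration, e.g. "PT1M5S" or "P1DT2H"
_DURATION_RE = re.compile(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?")


def duration_seconds(iso_duration):
    """Convert an ISO 8601 duration string to total seconds (0 if unparsable)."""
    match = _DURATION_RE.fullmatch(iso_duration or "")
    if not match:
        return 0
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def classify_video(item):
    """Return 'live', 'short_candidate', or 'video' for one video resource dict."""
    # liveStreamingDetails is documented as present only for upcoming,
    # active, or completed live broadcasts
    if "liveStreamingDetails" in item:
        return "live"
    if item.get("snippet", {}).get("liveBroadcastContent") in ("live", "upcoming"):
        return "live"
    seconds = duration_seconds(item.get("contentDetails", {}).get("duration"))
    # Assumption: duration alone cannot confirm a Short (the vertical format
    # is not checked here), so this is only a candidate label
    if 0 < seconds <= 60:
        return "short_candidate"
    return "video"
```

With this split, a scheduled stream listed under VIDEOS would still be labeled 'live', and a 45-second horizontal clip would be flagged as a candidate that needs a further check.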

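I should mention that I am using the YouTube Data API v3 to extract the information.

I am new to this field and still learning, so thank you for your patience and for any guidance you can offer. If anything in my approach looks clumsy, I would be glad to hear suggestions for improvement.

Here is the relevant part of my Python script, YouTube Channel Scraper: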

import os
import re
import requests
from googleapiclient.discovery import build
from datetime import datetime, timedelta

# Function to extract YouTube channel ID from the provided link
def get_channel_id(channel_link):
    try:
        # Make a request to the provided YouTube channel link
        response = requests.get(channel_link)

        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Define a regex pattern to extract the channel ID from the XML feed link
            pattern = r"https://www\.youtube\.com/feeds/videos\.xml\?channel_id=([A-Za-z0-9_-]+)"
            # Search for the pattern in the response text
            match = re.search(pattern, response.text)

            # If a match is found, return the extracted channel ID
            if match:
                return match.group(1)
            else:
                print("No channel found in the provided link.")
        else:
            print("Could not access the link. Make sure the link is valid.")
    except requests.exceptions.RequestException as e:
        print("An error occurred while making the request:", str(e))
    except Exception as e:
        print("An error occurred:", str(e))

    return None

# Function to parse an ISO 8601 duration (e.g. "PT1H2M3S") into hours, minutes, and seconds
def parse_duration(duration):
    # Very long videos include a day part ("P1DT2H"), which the plain
    # string-splitting approach cannot handle; days are folded into hours here
    match = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", duration)
    if not match:
        return 0, 0, 0

    days, hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())

    return days * 24 + hours, minutes, seconds

# Function to save video information to a file
def save_video_info_to_file(output_file, video_info):
    title = video_info["snippet"]["title"]
    views = video_info["statistics"].get("viewCount", "N/A")
    likes = video_info["statistics"].get("likeCount", "N/A")
    upload_date = video_info["snippet"]["publishedAt"]
    hours, minutes, seconds = parse_duration(video_info["contentDetails"]["duration"])

    # Convert the UTC upload timestamp to GMT-5, this is my time zone.
    # Subtracting the offset already rolls the date back when needed,
    # so no extra one-day adjustment is applied.
    upload_datetime = datetime.fromisoformat(upload_date[:-1])
    upload_datetime_gmt5 = upload_datetime - timedelta(hours=5)

    formatted_upload_date = upload_datetime_gmt5.strftime("%d/%m/%Y")
    formatted_upload_time = upload_datetime_gmt5.strftime("%H:%M:%S")

    duration_str = ""
    if hours > 0:
        duration_str += f"{hours} hour{'s' if hours > 1 else ''}"
    if minutes > 0:
        if duration_str:
            duration_str += ", "
        duration_str += f"{minutes} minute{'s' if minutes > 1 else ''}"
    if seconds > 0:
        if duration_str:
            duration_str += " and "
        duration_str += f"{seconds} second{'s' if seconds > 1 else ''}"

    # Write video information to the output file
    with open(output_file, "a", encoding="utf-8") as file:
        file.write("Title: " + title + "\n")
        file.write("Upload Date: " + formatted_upload_date + "\n")
        file.write("Upload Time: " + formatted_upload_time + "\n")
        file.write("Duration: " + duration_str + "\n")
        file.write("Views: " + str(views) + "\n")
        file.write("Likes: " + str(likes) + "\n\n\n")

# Function to get channel name and save video information to a file
def get_channel_name(channel_id):
    api_key = "[YOUR API HERE]"
    gmt_offset = -5

    # Build the YouTube API service
    youtube = build("youtube", "v3", developerKey=api_key)

    videos_info = []

    # Fetch videos information from the channel
    next_page_token = None
    while True:
        videos_response = youtube.search().list(
            part="id",
            channelId=channel_id,
            maxResults=50,
            pageToken=next_page_token
        ).execute()

        video_ids = [item["id"]["videoId"] for item in videos_response.get("items", []) if "videoId" in item.get("id", {})]

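        # Only request details when this page contains videos; this avoids
        # sending a request with an empty id filter
        if video_ids:
            videos_details_response = youtube.videos().list(
                part="snippet,statistics,contentDetails",
                id=",".join(video_ids)
            ).execute()

            videos_info.extend(videos_details_response["items"])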

        next_page_token = videos_response.get("nextPageToken")
        if not next_page_token:
            break

    videos_info.sort(key=lambda x: x["snippet"]["publishedAt"], reverse=True)

    # Fetch channel information
    channel_info = youtube.channels().list(
        part="snippet",
        id=channel_id
    ).execute()

    # Get the channel name or use a default if not available
    if channel_info.get("items"):
        channel_name = channel_info["items"][0]["snippet"]["title"]
    else:
        channel_name = "Unknown Channel"

    # Modify the channel name for file naming
    channel_name = re.sub(r'[^\w\s]', '', channel_name)
    channel_name = channel_name.replace(" ", "_")

    # Set the output file name
    output_file = f"{channel_name}.txt"

    # Save video information to the output file
    for video_info in videos_info:
        save_video_info_to_file(output_file, video_info)

    print("Information has been saved to the file:", output_file)

# Entry point of the script
if __name__ == "__main__":
    # Prompt user to input the YouTube channel link
    channel_link = input("Enter the YouTube channel link: ")
    # Get the channel ID from the provided link
    channel_id = get_channel_id(channel_link)
    # If a valid channel ID is obtained, get the channel name and save video information
    if channel_id:
        get_channel_name(channel_id)

Given these complications, my goal is to optimize the script so that it classifies videos more accurately and more efficiently.
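On the efficiency side, one option worth considering is to walk the channel's uploads playlist instead of calling search().list. The sketch below assumes `youtube` is a service object from build("youtube", "v3", ...); the helper names iter_upload_video_ids and batched are hypothetical. Under the quota costs documented at the time of writing, playlistItems().list is far cheaper per call than search().list, but please verify this against the current quota table.

```python
def iter_upload_video_ids(youtube, channel_id):
    """Yield every video ID in the channel's uploads playlist, page by page."""
    channel_response = youtube.channels().list(
        part="contentDetails", id=channel_id
    ).execute()
    items = channel_response.get("items", [])
    if not items:
        return  # unknown channel: nothing to yield
    uploads_id = items[0]["contentDetails"]["relatedPlaylists"]["uploads"]

    page_token = None
    while True:
        page = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=uploads_id,
            maxResults=50,
            pageToken=page_token,
        ).execute()
        for entry in page.get("items", []):
            yield entry["contentDetails"]["videoId"]
        page_token = page.get("nextPageToken")
        if not page_token:
            break


def batched(ids, size=50):
    """Group IDs into lists of at most `size`, matching the 50-ID page size used above."""
    batch = []
    for video_id in ids:
        batch.append(video_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch can then be joined with "," and passed as the id argument of videos().list, as the script already does for search results.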

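I have been working on refining this script for efficient YouTube channel data extraction and organization. So far, I have tried optimizing the regular expression pattern to extract the channel ID more reliably, and fine-tuning the duration parsing logic.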
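For the channel ID step, a small shortcut may avoid the HTTP request entirely when the link already contains the ID. This is only a sketch: the helper name channel_id_from_url is hypothetical, and the "UC" plus 22 characters shape is an observed convention that I am assuming here, not a documented guarantee. Links in other forms (handles, custom URLs) would still need the page-fetching approach.

```python
import re

# Assumption: channel IDs look like "UC" followed by 22 URL-safe characters
CHANNEL_ID_IN_URL = re.compile(r"youtube\.com/channel/(UC[A-Za-z0-9_-]{22})")


def channel_id_from_url(channel_link):
    """Return the channel ID embedded in a /channel/ URL, or None if absent."""
    match = CHANNEL_ID_IN_URL.search(channel_link)
    return match.group(1) if match else None
```

A caller could try this first and fall back to get_channel_id(channel_link) only when it returns None.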

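I hoped these adjustments would improve how accurately the script classifies videos, especially those in the "Videos" section. However, the results were not what I expected, so I am asking for your expertise, fresh insights, and suggestions.

Thank you in advance for your contributions and patience!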

python-3.x optimization extract youtube-data-api

Comments

0 upvotes Benjamin Loison 11/10/2023
Do these answers about live streams and Shorts solve your problem?

A: No answers yet