Asked by: Scrupptor  Asked: 11/10/2023  Updated: 11/10/2023  Views: 30
Refining Python Script for Efficient YouTube Channel Data Extraction and Organization
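Q:
I am working on a personal project focused on organizing YouTube channels, and I have run into some specific challenges. YouTube has three main types of content: videos, Shorts, and live streams.
Context on YouTube content types:
VIDEOS: the traditional YouTube format, with no duration limit.
SHORTS: short videos of up to 60 seconds, in vertical format.
LIVE: live broadcasts that encourage real-time interaction with the audience.
Challenge in the Videos section: the content in YouTube's Videos section is visibly mixed. It includes scheduled live streams as well as short videos that do not qualify as Shorts.
Scheduled live streams: creators schedule a broadcast for a specific date and time. These appear under VIDEOS rather than LIVE.
Unclassified short videos: videos of 60 seconds or less that do not meet the specific requirements (vertical format, for example) for being classified as Shorts.
It is worth mentioning that I am using YouTube Data API v3 to extract the information efficiently.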
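On the efficiency side, listing a channel's videos through search.list costs 100 quota units per call, while channels.list and playlistItems.list cost 1 unit each. A minimal sketch of the lower-cost route, assuming `youtube` is a service object created with `build("youtube", "v3", developerKey=...)`; the function names `iter_upload_video_ids` and `chunked` are hypothetical helpers, not part of the original script:

```python
def iter_upload_video_ids(youtube, channel_id):
    """Yield every video ID found in the channel's uploads playlist."""
    # Each channel exposes an "uploads" playlist under contentDetails.relatedPlaylists
    channel_response = youtube.channels().list(
        part="contentDetails",
        id=channel_id,
    ).execute()
    items = channel_response.get("items", [])
    if not items:
        return
    uploads_playlist_id = items[0]["contentDetails"]["relatedPlaylists"]["uploads"]

    page_token = None
    while True:
        page = youtube.playlistItems().list(
            part="contentDetails",
            playlistId=uploads_playlist_id,
            maxResults=50,
            pageToken=page_token,
        ).execute()
        for item in page.get("items", []):
            yield item["contentDetails"]["videoId"]
        page_token = page.get("nextPageToken")
        if not page_token:
            break


def chunked(values, size=50):
    """Split a list into consecutive slices of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]
```

Each slice returned by `chunked(list(iter_upload_video_ids(youtube, channel_id)))` can then be joined with commas and passed as the `id` argument of `youtube.videos().list(...)`, which accepts up to 50 IDs per request.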
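I should add that I am new to this area and still learning. Thank you for your patience and for any guidance you can offer. If you see anything clumsy in my approach, I would be glad to receive suggestions for improvement.
Here is the relevant part of my Python script, YouTube Channel Scraper: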
import os
import re
import requests
from googleapiclient.discovery import build
from datetime import datetime, timedelta
# Function to extract YouTube channel ID from the provided link
def get_channel_id(channel_link):
    try:
        # Make a request to the provided YouTube channel link (with a timeout so it cannot hang)
        response = requests.get(channel_link, timeout=10)
        # Check if the request was successful (status code 200)
        if response.status_code == 200:
            # Define a regex pattern to extract the channel ID from the XML feed link
            # (dots are escaped so they only match literal dots)
            pattern = r"https://www\.youtube\.com/feeds/videos\.xml\?channel_id=([A-Za-z0-9_-]+)"
            # Search for the pattern in the response text
            match = re.search(pattern, response.text)
            # If a match is found, return the extracted channel ID
            if match:
                return match.group(1)
            else:
                print("No channel found in the provided link.")
        else:
            print("Could not access the link. Make sure the link is valid.")
    except requests.exceptions.RequestException as e:
        print("An error occurred while making the request:", str(e))
    except Exception as e:
        print("An error occurred:", str(e))
    return None
# Function to parse an ISO 8601 duration string (e.g. "PT1H2M3S") into hours, minutes, and seconds
def parse_duration(duration):
    # Durations of a day or more are formatted as P#DT#H#M#S, which a fixed
    # duration[2:] slice would misparse, so match each field explicitly
    match = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?", duration)
    if not match:
        return 0, 0, 0
    days, hours, minutes, seconds = (int(value) if value else 0 for value in match.groups())
    # Fold days into hours so the return shape stays (hours, minutes, seconds)
    return days * 24 + hours, minutes, seconds
# Function to save video information to a file
def save_video_info_to_file(output_file, video_info):
    title = video_info["snippet"]["title"]
    views = video_info["statistics"].get("viewCount", "N/A")
    likes = video_info["statistics"].get("likeCount", "N/A")
    upload_date = video_info["snippet"]["publishedAt"]
    hours, minutes, seconds = parse_duration(video_info["contentDetails"]["duration"])
    # Convert upload date from UTC to GMT-5, this is my time zone
    upload_datetime = datetime.fromisoformat(upload_date[:-1])
    # Subtracting the offset already rolls the date back when needed, so no
    # extra one-day adjustment is applied for uploads before 5 AM GMT-5
    upload_datetime_gmt5 = upload_datetime - timedelta(hours=5)
    formatted_upload_date = upload_datetime_gmt5.strftime("%d/%m/%Y")
    formatted_upload_time = upload_datetime_gmt5.strftime("%H:%M:%S")
    duration_str = ""
    if hours > 0:
        duration_str += f"{hours} hour{'s' if hours > 1 else ''}"
    if minutes > 0:
        if duration_str:
            duration_str += ", "
        duration_str += f"{minutes} minute{'s' if minutes > 1 else ''}"
    if seconds > 0:
        if duration_str:
            duration_str += " and "
        duration_str += f"{seconds} second{'s' if seconds > 1 else ''}"
    # Fall back to an explicit zero so the Duration field is never blank
    if not duration_str:
        duration_str = "0 seconds"
    # Write video information to the output file
    with open(output_file, "a", encoding="utf-8") as file:
        file.write("Title: " + title + "\n")
        file.write("Upload Date: " + formatted_upload_date + "\n")
        file.write("Upload Time: " + formatted_upload_time + "\n")
        file.write("Duration: " + duration_str + "\n")
        file.write("Views: " + str(views) + "\n")
        file.write("Likes: " + str(likes) + "\n\n\n")
# Function to get channel name and save video information to a file
def get_channel_name(channel_id):
    api_key = "[YOUR API HERE]"
    # Build the YouTube API service
    youtube = build("youtube", "v3", developerKey=api_key)
    videos_info = []
    # Fetch videos information from the channel
    next_page_token = None
    while True:
        videos_response = youtube.search().list(
            part="id",
            channelId=channel_id,
            type="video",  # skip playlist and channel results
            maxResults=50,
            pageToken=next_page_token
        ).execute()
        video_ids = [item["id"]["videoId"] for item in videos_response.get("items", []) if "videoId" in item.get("id", {})]
        # Only ask for details when the page actually contained videos
        if video_ids:
            videos_details_response = youtube.videos().list(
                part="snippet,statistics,contentDetails",
                id=",".join(video_ids)
            ).execute()
            videos_info.extend(videos_details_response.get("items", []))
        next_page_token = videos_response.get("nextPageToken")
        if not next_page_token:
            break
    # Newest videos first
    videos_info.sort(key=lambda x: x["snippet"]["publishedAt"], reverse=True)
    # Fetch channel information
    channel_info = youtube.channels().list(
        part="snippet",
        id=channel_id
    ).execute()
    # Get the channel name or use a default if not available
    if channel_info.get("items"):
        channel_name = channel_info["items"][0]["snippet"]["title"]
    else:
        channel_name = "Unknown Channel"
    # Modify the channel name for file naming
    channel_name = re.sub(r'[^\w\s]', '', channel_name)
    channel_name = channel_name.replace(" ", "_")
    # Set the output file name
    output_file = f"{channel_name}.txt"
    # Save video information to the output file
    for video_info in videos_info:
        save_video_info_to_file(output_file, video_info)
    print("Information has been saved to the file:", output_file)
# Entry point of the script
if __name__ == "__main__":
    # Prompt user to input the YouTube channel link
    channel_link = input("Enter the YouTube channel link: ")
    # Get the channel ID from the provided link
    channel_id = get_channel_id(channel_link)
    # If a valid channel ID is obtained, get the channel name and save video information
    if channel_id:
        get_channel_name(channel_id)
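Given these complexities, my goal is to optimize the script for more accurate and efficient classification.
I have been working on refining this Python script for efficient YouTube channel data extraction and organization. So far, I have tried optimizing the regex pattern to extract the channel ID more reliably and fine-tuning the duration-parsing logic.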
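For the channel ID step, links of the form /channel/<id> already contain the ID, so no page download is needed for them. A minimal sketch under two assumptions that are not stated in the post: that channel IDs consist of "UC" followed by 22 URL-safe characters, and that other link styles (handles, custom URLs) still need a page fetch or an API lookup. The name `channel_id_from_url` is hypothetical:

```python
import re
from urllib.parse import urlparse

# Assumption: a channel ID is "UC" followed by 22 URL-safe characters
_CHANNEL_ID_RE = re.compile(r"UC[A-Za-z0-9_-]{22}")


def channel_id_from_url(channel_link):
    """Return the channel ID embedded in a /channel/<id> link, or None."""
    path_parts = [part for part in urlparse(channel_link).path.split("/") if part]
    if len(path_parts) >= 2 and path_parts[0] == "channel":
        if _CHANNEL_ID_RE.fullmatch(path_parts[1]):
            return path_parts[1]
    # Handles (/@name) and custom URLs carry no ID in the path
    return None
```

A caller could try this first and only fall back to downloading the page when it returns None.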
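I hoped these adjustments would improve the script's accuracy in classifying videos, especially in the Videos section. However, the results were not as expected. I am looking for your expertise to gain fresh insights and suggestions.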
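One way the classification might be approached is to use the live-stream fields that videos.list returns instead of duration alone. The sketch below assumes the request includes part="snippet,contentDetails,liveStreamingDetails"; it relies on snippet.liveBroadcastContent ("none", "upcoming", or "live") and on the presence of liveStreamingDetails. The API has no field that marks a video as a Short, so the 60-second cutoff is only a heuristic and the label "short_candidate" reflects that. The function names and category labels are hypothetical:

```python
import re

_DURATION_RE = re.compile(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?")


def duration_seconds(iso_duration):
    """Convert an ISO 8601 duration such as "PT1H2M3S" to total seconds."""
    match = _DURATION_RE.fullmatch(iso_duration or "")
    if not match:
        return 0
    days, hours, minutes, seconds = (int(v) if v else 0 for v in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def classify_video(video, short_max_seconds=60):
    """Return a rough category label for one videos.list item."""
    live_state = video.get("snippet", {}).get("liveBroadcastContent", "none")
    if live_state == "upcoming":
        return "scheduled_live"
    if live_state == "live":
        return "live_now"
    # Finished broadcasts keep their liveStreamingDetails block
    if "liveStreamingDetails" in video:
        return "past_live"
    seconds = duration_seconds(video.get("contentDetails", {}).get("duration"))
    # Heuristic only: a short duration does not prove the video is a Short
    if 0 < seconds <= short_max_seconds:
        return "short_candidate"
    return "video"
```

Telling a true Short apart from a short horizontal video would need extra information, such as the video's aspect ratio, which this sketch does not attempt to obtain.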
Thank you in advance for your valuable contributions and patience!
A: No answers yet