Asked by: Eric Zarycki | Asked: 11/16/2023 | Updated: 11/16/2023 | Views: 42
Gathering all top-level comments from r/worldnews live thread
Q:
I'm a student trying to collect all top-level comments from this r/worldnews live thread for a school research project: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/. I'm currently coding in Python with the PRAW API and the pandas library. Here is the code I have written so far:
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []
def process_comment(comment):
if isinstance(comment, praw.models.Comment) and comment.is_root:
comments_list.append({
'author': comment.author.name if comment.author else '[deleted]',
'body': comment.body,
'score': comment.score,
'edited': comment.edited,
'created_utc': comment.created_utc,
'permalink': f"https://www.reddit.com{comment.permalink}"
})
submission.comments.replace_more(limit=None, threshold=0)
for top_level_comment in submission.comments.list():
process_comment(top_level_comment)
comments_df = pd.DataFrame(comments_list)
But with limit=None the code times out, and other limits (100, 300, 500) return only ~700 comments. Any help collecting the top-level comments from this Reddit thread would be greatly appreciated.
I've gone through what must be hundreds of pages of documentation and Reddit threads, and tried the following techniques:
- writing retry/"timeout" handling around the Reddit API calls, then resuming the comment collection after a pause
- collecting comments in batches and then calling replace_more again, but to no avail (a rough sketch of this batch-and-pause idea follows the list). I also went through the Reddit API rate-limit documentation hoping there was a way to work around those limits.
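For reference, the batch-and-pause idea can be sketched roughly as follows. It assumes reddit is an already-authenticated praw.Reddit instance; the batch size, pause length, and round cap are illustrative values, not tested recommendations:

import time

url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)

for round_number in range(200):  # cap the number of batches as a safety net
    try:
        # Resolve up to 64 MoreComments placeholders per round; replace_more
        # returns the placeholders it did not replace, so an empty list means done.
        remaining = submission.comments.replace_more(limit=64)
        if not remaining:
            break
    except Exception as exc:  # e.g. a timeout or rate-limit error
        print(f"Batch {round_number} failed, pausing before retrying: {exc}")
    time.sleep(10)  # rest between batches to stay within Reddit's rate limits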
A:
0 votes
jeffreyohene
11/16/2023
#1
I was able to get around the timeout issue and extract 190k+ comments by using a recursive function instead of the replace_more method. Maybe this will help:
# Same setup as in the question: assumes praw, pandas (as pd), and an
# authenticated reddit instance are already available.
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []
def process_comment(comment):
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })
def gather_comments(comment_list):
    # Process regular comments directly; expand each MoreComments placeholder
    # and recurse on the children it yields, so nothing is processed twice.
    pending = []
    for comment in comment_list:
        if isinstance(comment, praw.models.MoreComments):
            try:
                pending.extend(comment.comments())
            except Exception as e:
                print(f"Error replacing MoreComments: {e}")
        else:
            process_comment(comment)
    if pending:
        gather_comments(pending)
top_level_comments = submission.comments
gather_comments(top_level_comments)
# Create DataFrame
comments_df = pd.DataFrame(comments_list)
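As an optional follow-up (not part of the answer above), once comments_df is built you could drop any duplicate rows and save the result for the research project; the de-duplication column and the filename are just examples:

comments_df = comments_df.drop_duplicates(subset="permalink")
comments_df.to_csv("worldnews_live_thread_top_level_comments.csv", index=False)
print(f"Collected {len(comments_df)} top-level comments")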
Comments