Gathering all top-level comments from r/worldnews live thread

Asked by: Eric Zarycki | Asked: 11/16/2023 | Updated: 11/16/2023 | Views: 42

Q:

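I am a student trying to pull all top-level comments from this r/worldnews live thread for a school research project: https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/ I am currently coding in Python using the PRAW API and the pandas library. This is the code I have written so far: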

import praw
import pandas as pd

# `reddit` is assumed to be an authenticated praw.Reddit instance created elsewhere
url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    # Keep only top-level (root) comments
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

# Expand every MoreComments placeholder (this is the call that times out)
submission.comments.replace_more(limit=None, threshold=0)
for top_level_comment in submission.comments.list():
    process_comment(top_level_comment)
comments_df = pd.DataFrame(comments_list)

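However, with limit=None the code times out. Using other limits (100, 300, 500) returns only ~700 comments. Any help with collecting the top-level comments from this Reddit thread would be greatly appreciated.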

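I have looked through what feels like hundreds of pages of documentation and Reddit threads, and have tried the following techniques:

  • Coding a "timeout" for the Reddit API, then continuing to collect comments after the pause
  • Collecting comments in batches, then calling replace_more again

Neither worked. I have also looked through the Reddit API rate limit documentation, hoping there is a way around these limitations.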
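
The first technique in the list (pause after a failure, then resume) can be sketched as a small retry helper with exponential backoff. This is only an illustration under stated assumptions: `call_with_retries`, `max_attempts`, `base_delay` and `sleep` are hypothetical names, and `func` stands in for whatever single fetch step is timing out (for example, one replace_more call with a modest limit).

```python
import time


def call_with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt seconds, then retry.

    Hypothetical helper: `func` represents one batch of comment fetching.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as exc:  # in real code, catch the specific timeout error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
```

The `sleep` parameter exists only so the waiting can be swapped out when trying the helper offline. Note that retrying does not raise Reddit's rate limits; it only spaces the requests out.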
Tags: python pandas data-science reddit praw


A:

0 votes | jeffreyohene | 11/16/2023 | #1

I was able to get around the timeout problem and pull 190k+ comments by using a recursive function instead of the replace_more method. Maybe this will help:

url = "https://www.reddit.com/r/worldnews/comments/1735w17/rworldnews_live_thread_for_2023_israelhamas/"
submission = reddit.submission(url=url)
comments_list = []

def process_comment(comment):
    if isinstance(comment, praw.models.Comment) and comment.is_root:
        comments_list.append({
            'author': comment.author.name if comment.author else '[deleted]',
            'body': comment.body,
            'score': comment.score,
            'edited': comment.edited,
            'created_utc': comment.created_utc,
            'permalink': f"https://www.reddit.com{comment.permalink}"
        })

def gather_comments(comment_list):
    for comment in comment_list:
        if isinstance(comment, praw.models.MoreComments):
            try:
                # Fetch the comments hidden behind this placeholder and recurse
                # into them, rather than rebuilding the list being iterated.
                gather_comments(comment.comments())
            except Exception as e:
                print(f"Error replacing MoreComments: {e}")
        else:
            process_comment(comment)


top_level_comments = submission.comments
gather_comments(top_level_comments)

# Create DataFrame
comments_df = pd.DataFrame(comments_list)
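
On a thread this large, placeholders can chain one after another, so a recursive walk may nest deeply. As a hedged alternative, the same traversal can be written iteratively with an explicit queue. The sketch below is library-agnostic so it can be tried offline: `collect_root_comments` and its three callable parameters are hypothetical names, not part of PRAW.

```python
from collections import deque


def collect_root_comments(items, is_placeholder, expand, is_root):
    """Walk `items`, expanding placeholders as they appear, and return the
    root comments in thread order.

    is_placeholder(item) -> bool, expand(item) -> iterable of items, and
    is_root(item) -> bool are supplied by the caller (hypothetical names).
    """
    queue = deque(items)
    roots = []
    while queue:
        item = queue.popleft()
        if is_placeholder(item):
            # Put the fetched items at the front so thread order is preserved
            queue.extendleft(reversed(list(expand(item))))
        elif is_root(item):
            roots.append(item)
    return roots
```

With PRAW, one plausible wiring (an assumption, not tested against the live thread) would pass `submission.comments` as `items`, `lambda c: isinstance(c, praw.models.MoreComments)` as `is_placeholder`, `lambda m: m.comments()` as `expand`, and `lambda c: c.is_root` as `is_root`, then feed each returned comment to process_comment.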