Scrape join-dates/user info from a list (csv) of Twitter-users

Asked by: yardworksimulator  Asked: 4/1/2021  Updated: 10/25/2021  Views: 731

Q:

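I am looking for a solution to what is probably a very simple problem and would really appreciate some help or hints. I have basic knowledge of Python and web scraping.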

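I want to explore a certain hashtag on Twitter and the community behind it. Using twint, I downloaded all tweets mentioning the hashtag into a .csv file. I then cleaned the .csv so that each user appears only once (many users had several tweets with the same hashtag) and saved the result as a .txt. Now I would like more information about the roughly 1,500 users on that list: mainly the date they joined Twitter; the number of tweets would be a bonus.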
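For reference, the deduplication step described above could look roughly like the sketch below. It is not the asker's actual script: the `username` column name, the helper `unique_usernames`, and the sample rows are all assumptions made for illustration.

```python
import csv
import io


def unique_usernames(csv_file, column="username"):
    """Return usernames from a CSV in first-seen order, without duplicates."""
    seen = set()
    ordered = []
    for row in csv.DictReader(csv_file):
        name = (row.get(column) or "").strip()
        key = name.lower()  # Twitter handles are case-insensitive
        if name and key not in seen:
            seen.add(key)
            ordered.append(name)
    return ordered


# Made-up sample rows standing in for the exported tweets (assumed column names).
sample = io.StringIO(
    "id,username,tweet\n"
    "1,alice,hello #tag\n"
    "2,bob,hi #tag\n"
    "3,Alice,again #tag\n"
)
usernames = unique_usernames(sample)

# One username per line, the layout a plain-text user list would have.
userlist_text = "\n".join(usernames) + "\n"
```

With a real export, the `io.StringIO` sample would be replaced by `open("tweets.csv", newline="")` (a hypothetical file name) and `userlist_text` written to the .txt file.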

What I tried: Twint should be able to do this, but it doesn't work (I am using the Docker image provided on their GitHub). I tried to get the user info with:

twint --userlist /bin/userlist.txt --user-full -o userlistfull.csv --csv

Twint printed a very long error message which, if I understand correctly, is related to an unresolved bug in twint:

CRITICAL:root:twint.get:User:'url'
ERROR:root:twint.run:Twint:Lookup:Unexpected exception occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 307, in Lookup
    await get.User(self.config.Username, self.config, db.Conn(self.config.Database))
  File "/usr/local/lib/python3.6/site-packages/twint/get.py", line 228, in User
    await Users(j_r, config, conn)
  File "/usr/local/lib/python3.6/site-packages/twint/output.py", line 177, in Users
    user = User(u)
  File "/usr/local/lib/python3.6/site-packages/twint/user.py", line 31, in User
    _usr.url = ur['data']['user']['legacy']['url']
KeyError: 'url'
Traceback (most recent call last):
  File "/usr/local/bin/twint", line 8, in <module>
    sys.exit(run_as_command())
  File "/usr/local/lib/python3.6/site-packages/twint/cli.py", line 339, in run_as_command
    main()
  File "/usr/local/lib/python3.6/site-packages/twint/cli.py", line 324, in main
    run.Lookup(c)
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 386, in Lookup
    run(config)
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 329, in run
    get_event_loop().run_until_complete(Twint(config).main(callback))
  File "/usr/local/lib/python3.6/asyncio/base_events.py", line 488, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 235, in main
    await task
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 270, in run
    await self.Lookup()
  File "/usr/local/lib/python3.6/site-packages/twint/run.py", line 307, in Lookup
    await get.User(self.config.Username, self.config, db.Conn(self.config.Database))
  File "/usr/local/lib/python3.6/site-packages/twint/get.py", line 228, in User
    await Users(j_r, config, conn)
  File "/usr/local/lib/python3.6/site-packages/twint/output.py", line 177, in Users
    user = User(u)
  File "/usr/local/lib/python3.6/site-packages/twint/user.py", line 31, in User
    _usr.url = ur['data']['user']['legacy']['url']
KeyError: 'url'

I also tried looping over the list and having twint look up each username individually, but that didn't work either:

import twint 
import os
import sys
import nest_asyncio 
nest_asyncio.apply()

c = twint.Config()

with open("userlist.txt", "r") as a_file:

  for line in a_file:

    stripped_line = line.strip()
    stripped_line = c.Username
    twint.run.Search(c)

Running it in Google Colab gives me:

CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 1.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 8.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 27.0 secs
CRITICAL:root:twint.run:Twint:Feed:noDataExpecting value: line 1 column 1 (char 0)
sleeping for 64.0 secs
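A note on the loop above: `stripped_line = c.Username` assigns in the wrong direction, so `c.Username` is never set and every call runs with an empty config, which may be why no data comes back. In addition, `twint.run.Search` fetches tweets, whereas the traceback earlier shows the `--user-full` command going through `twint.run.Lookup`. Below is a minimal sketch of a corrected loop, under stated assumptions: `lookup_each` and `twint_lookup` are hypothetical helper names, and `Lookup` may still raise the `KeyError: 'url'` shown earlier unless twint itself is patched.

```python
def lookup_each(lines, lookup):
    """Call lookup(username) once per non-empty line; return the failures."""
    failed = []
    for line in lines:
        username = line.strip()
        if not username:
            continue  # skip blank lines
        try:
            lookup(username)
        except Exception as exc:  # keep going when a single profile fails
            failed.append((username, repr(exc)))
    return failed


def twint_lookup(username):
    """Hypothetical adapter: look up one profile with twint."""
    import twint  # imported lazily so lookup_each works without twint installed

    c = twint.Config()
    c.Username = username  # set the config field FROM the line, not the reverse
    twint.run.Lookup(c)    # the call that appears in the traceback above


# Real use (requires twint and network access):
#     with open("userlist.txt") as a_file:
#         failed = lookup_each(a_file, twint_lookup)

# Offline demonstration with a stand-in lookup function.
seen = []
failed = lookup_each(["alice\n", "\n", " bob \n"], seen.append)
```

Passing the lookup function in as an argument keeps the loop independent of twint, so one failing profile is recorded in `failed` instead of aborting the whole run.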

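What I'm looking for: What is the easiest way to get the join dates of the users on this list? Should I use a different library? Could I loop over the list with something like BeautifulSoup and scrape the join dates? How would I do that?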

Any help would be much appreciated. Thanks in advance!

python web-scraping twitter web-mining twint


A:

0 votes Dery Sudrajat 5/21/2021 #1

Try installing it with:

pip3 install --user --upgrade git+https://github.com/twintproject/twint.git@origin/master#egg=twint

Also make sure your Python version is 3.6 or later (source).

0 votes Mohammad Zarchi 10/25/2021 #2

Just replace this line in twint/user.py:

_usr.url = ur['data']['user']['legacy']['url']

with this:

try:
    _usr.url = ur['data']['user']['legacy']['url']
except KeyError:
    # Some profiles have no 'url' field; fall back to an empty string.
    _usr.url = ''
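The same fallback can also be written with `dict.get`, which handles only the missing `'url'` key and lets any other problem surface. This is a sketch under assumptions: the nesting mirrors the line from the traceback, while `profile_url` and the sample payloads are made up for illustration.

```python
def profile_url(ur):
    """Return the profile URL from a user payload, or '' when it is missing."""
    legacy = ur['data']['user']['legacy']
    return legacy.get('url', '')


# Made-up payloads mirroring the nesting seen in the traceback.
with_url = {'data': {'user': {'legacy': {'url': 'https://example.com'}}}}
without_url = {'data': {'user': {'legacy': {}}}}

print(profile_url(with_url))
print(repr(profile_url(without_url)))
```

Applied to twint/user.py, the equivalent one-line change would presumably be `_usr.url = ur['data']['user']['legacy'].get('url', '')`.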

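Comments

1 vote Community 10/25/2021
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.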