Asked by: Ashuwathama Asked: 7/26/2022 Last edited by: Ashuwathama Updated: 6/4/2023 Views: 694
Trying To Scrape A Spotify Playlist But It Only Gets The First 20 Results Out Of 100
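Q:
I'm trying to learn Selenium, and for fun I decided to scrape a Spotify playlist (which is why I'm not using the Spotify API for this). However, it doesn't get the full list, only the songs that have already loaded. I tried the scroll-and-wait solutions I found on the web, but nothing seemed to work. I also tried zooming out, which helped, but it only found 20 to 30 more results. In addition, when I scroll down manually and then try to scrape, it skips the first few songs and starts scraping from the part that is currently loaded. Here is my code: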
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
website= "https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu"
path= "C:/Users/ashut/Downloads/Misc Docs/chromedriver_win32/chromedriver.exe"
service=Service(executable_path=path)
driver=webdriver.Chrome(service=service)
driver.get(website)
containers=driver.find_elements(by="xpath",value='//div[@data-testid="tracklist-row"]/div[@aria-colindex="2"]/div')
titles = []
artists = []
links = []
for container in containers:
    title=container.find_element(by="xpath", value='./a/div').text
    artist=container.find_element(by="xpath", value='./span/a').text
    link=container.find_element(by="xpath", value='./span/a').get_attribute("href")
    titles.append(title)
    artists.append(artist)
    links.append(link)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(2)
mydict={'titles':titles,'artists':artists,'links':links}
artistslist= pd.DataFrame(mydict)
artistslist.to_csv('list_of_artist.csv')
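A:
The page loads its content dynamically in response to user actions, in this case scrolling and reaching the bottom. So you need to scroll the page to the bottom (several times) until all the songs are loaded and available in the page. You can adapt the following snippet to your code: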
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time as t
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
song_list = []
url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu'
browser.get(url)
try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')
bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))
for x in range(5):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        print(song.text)
        song_list.append(song.text)
    t.sleep(2)
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
print(list(set(song_list)))
print('Total songs:', len(list(set(song_list))))
This prints quite a few duplicate songs, and at the end a list of the unique songs together with their count:
[...]
Total songs: 105
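The fixed `range(5)` above assumes that five scrolls are enough for the playlist at hand. Below is a minimal sketch of a different stopping rule, assuming the page only keeps a window of rows in the DOM: keep scrolling until a few consecutive rounds yield no new rows. `collect_until_stable`, `read_rows`, `scroll_down` and `FakeVirtualList` are hypothetical names made up for this illustration, not Selenium APIs; the fake list stands in for the browser so the loop can be run without one.

```python
from typing import Callable, Iterable, List


def collect_until_stable(
    read_rows: Callable[[], Iterable[str]],
    scroll_down: Callable[[], None],
    max_rounds: int = 50,
    patience: int = 2,
) -> List[str]:
    """Collect unique rows in first-seen order until scrolling yields nothing new."""
    seen = {}  # a dict keeps insertion order, unlike set()
    idle_rounds = 0
    for _ in range(max_rounds):
        before = len(seen)
        for row in read_rows():
            seen.setdefault(row, None)
        if len(seen) == before:
            # Nothing new this round; stop after `patience` such rounds in a row.
            idle_rounds += 1
            if idle_rounds >= patience:
                break
        else:
            idle_rounds = 0
        scroll_down()
    return list(seen)


class FakeVirtualList:
    """Stand-in for a lazily rendered playlist: only a window of rows is visible."""

    def __init__(self, total: int, window: int = 30, step: int = 20):
        self.rows = [f"song {i}" for i in range(1, total + 1)]
        self.window = window
        self.step = step
        self.top = 0

    def read_rows(self):
        return self.rows[self.top:self.top + self.window]

    def scroll_down(self):
        last_top = max(len(self.rows) - self.window, 0)
        self.top = min(self.top + self.step, last_top)


page = FakeVirtualList(total=105)
songs = collect_until_stable(page.read_rows, page.scroll_down)
print("Total songs:", len(songs))
```

With Selenium, `read_rows` could return the `.text` of every `tracklist-row` element and `scroll_down` could scroll the `bottom-sentinel` element into view, as the snippet above does. Whether two idle rounds are enough depends on how quickly the page loads, so treat `patience` as a tunable guess.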
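EDIT
It seems the OP is still confused, so I'll update this answer with the full code (Selenium/Chrome setup for Debian/Ubuntu). The following code, tested on several playlists, accepts cookies (if a cookie popup appears), scrolls the Spotify playlist to the bottom, scrapes the songs, and builds a dataframe (also saved to a CSV file) containing the songs, albums, artists, and links to each of them.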
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
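import time as t
import pandas as pd  # used below for the DataFrame and CSV export
from bs4 import BeautifulSoup  # used below to parse each row's innerHTML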
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
song_list = []
df_song_list = []
# url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu'
# url='https://open.spotify.com/playlist/37i9dQZF1DX9u7XXOp0l5L'
url='https://open.spotify.com/playlist/37i9dQZF1DXbITWG1ZJKYt'
browser.get(url)
try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')
bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))
# Scroll in steps, saving the HTML of every row that is currently rendered
for x in range(7):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        song_list.append(song.get_attribute('innerHTML'))
    t.sleep(0.5)
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
# Deduplicate the rows, then parse each one (set() does not preserve playlist order)
for song in list(set(song_list)):
    soup = BeautifulSoup(song, 'html.parser')
    position_in_playlist = soup.select_one('span.VrRwdIZO0sRX1lsWxJBe').text.strip()
    artist = soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').text.strip()
    # The hrefs already start with '/', so these links contain '//' after the host
    artist_link = 'https://open.spotify.com/' + soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').select_one('a').get('href')
    song = soup.select_one('div.t_yrXoUO3qGsJS4Y6iXX').text.strip()
    song_link = 'https://open.spotify.com/' + soup.select_one('a.t_yrXoUO3qGsJS4Y6iXX').get('href')
    album = soup.select_one('span.cPwEdQ').text.strip()
    album_link = 'https://open.spotify.com/' + soup.select_one('div.bfQ2S9bMXr_kJjqEfcwA').select_one('a').get('href')
    df_song_list.append((position_in_playlist, artist, artist_link, song, song_link, album, album_link))
print('Total songs:', len(list(set(song_list))))
df = pd.DataFrame(df_song_list, columns = ['Position in Playlist', 'Artist', 'Artist Link', 'Song', 'Song Link', 'Album', 'Album Link'])
df.to_csv('spotty_songs.csv')
print(df.head())  # a bare df.head() only displays in a notebook, not in a script
t.sleep(2)
browser.quit()
This will produce a CSV file and print the following in the terminal:
accepted cookies
Total songs: 250
  | Position in Playlist | Artist | Artist Link | Song | Song Link | Album | Album Link
0 | 226 | Sonny Rollins | https://open.spotify.com//artist/1VEzN9lxvG6KPR3QQGsebR | He's Younger Than You Are - From "Alfie" Score | https://open.spotify.com//track/11vaRXRIFXJTRr3BuzNbk5 | Alfie | https://open.spotify.com//album/5vU75tE3FqpzFnbCXZuRE5
1 | 145 | Phil Woods | https://open.spotify.com//artist/6G4hVmXKJ9NW5JecncK89f | In Your Own Sweet Way | https://open.spotify.com//track/3YiuJ3OstUEa93UBqb1vcn | Warm Woods | https://open.spotify.com//album/4lj7s0K81qfLbXdLaDt2Ba
2 | 10 | Ella Fitzgerald | https://open.spotify.com//artist/5V0MlUE1Bft0mbLlND7FJz | How Long Has This Been Going On? | https://open.spotify.com//track/0HEU3berJ5OBojU8XmEk1c | Ella Sings Gershwin | https://open.spotify.com//album/3DJYxksYYP018jgpOTVXqO
3 | 81 | Joe Henderson | https://open.spotify.com//artist/3BG0nwVh3Gc7cuT4XdsLtt | Blue Bossa - Remastered | https://open.spotify.com//track/6qqK0oeBRapZn8f9hJJENw | Page One | https://open.spotify.com//album/7mQGTuvmdp56DNz0AmMwWi
4 | 5 | Billie Holiday | https://open.spotify.com//artist/1YzCsTRb22dQkh9lghPIrp | Blue Moon | https://open.spotify.com//track/1pZn8AX1WulW8IO338hE5D | Solitude | https://open.spotify.com//album/4izD3SCRElbkO06i8yf4Zp
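The class names used in the selectors above (for example `span.rq2VQ5mb9SDAFWbBIUIn`) look build-generated and may stop matching when the site is redeployed, and `select_one` keeps only the first artist of a track. Here is a minimal sketch of an alternative, assuming each row contains relative links of the form `/track/...`, `/artist/...` and `/album/...`: classify the links by their path rather than by class name. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `RowLinkParser`, `parse_row` and the sample markup are hypothetical and made up for illustration; the real row markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://open.spotify.com/"


class RowLinkParser(HTMLParser):
    """Collect (text, absolute URL) pairs for the <a> tags in one row, by link type."""

    KINDS = ("track", "artist", "album")

    def __init__(self):
        super().__init__()
        self.links = {kind: [] for kind in self.KINDS}
        self._kind = None
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        first_segment = href.strip("/").split("/")[0]
        if first_segment in self.KINDS:
            self._kind, self._href, self._text = first_segment, href, []

    def handle_data(self, data):
        # Only keep text that sits inside a link we are tracking.
        if self._kind:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._kind:
            text = "".join(self._text).strip()
            # urljoin avoids the doubled slash produced by plain string concatenation.
            self.links[self._kind].append((text, urljoin(BASE_URL, self._href)))
            self._kind = None


def parse_row(row_html):
    parser = RowLinkParser()
    parser.feed(row_html)
    return parser.links


# Made-up row markup and IDs, for illustration only.
sample_row = (
    '<a href="/track/TRACK_ID"><div>Example Song</div></a>'
    '<span><a href="/artist/ARTIST_1">First Artist</a>, '
    '<a href="/artist/ARTIST_2">Second Artist</a></span>'
    '<span><a href="/album/ALBUM_ID">Example Album</a></span>'
)
row = parse_row(sample_row)
print(row["track"])
print(row["artist"])
print(row["album"])
```

In the scraper, `parse_row` would be fed each `innerHTML` string collected in `song_list`. Links whose `href` is already absolute are ignored by this sketch, which would need adjusting if the page emits them.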
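Comments
The data is loaded dynamically, and one item may have several artists. I wrote a sample using the VS Code extension clicknium; you can see my sample on GitHub.
I appreciate all the answers and everyone who contributed to this. The simplest solution I found was to zoom the browser out to something like 0.1, using driver.execute_script("document.body.style.zoom = '0.1'"). Other than that, u/platipus_on_fire's solution is there if you don't want to do things like zooming out.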