Trying To Scrape A Spotify Playlist But It Only Gets The First 20 Results Out Of 100

Asked by: Ashuwathama  Asked: 7/26/2022  Last edited by: Ashuwathama  Updated: 6/4/2023  Views: 694

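Q:

I'm trying to learn Selenium, and for fun I decided to scrape a Spotify playlist (which is why I'm not using the Spotify API for this). However, it doesn't get the complete list, only the songs that are already loaded. I tried the scrolling and waiting solutions I found online, but nothing seems to work. I also tried zooming out, which helped, but it only found 20 to 30 more results. Also, when I manually scroll down and then scrape, it ignores the first few songs and starts scraping from the loaded part. Here is my code: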

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

website= "https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu"
path= "C:/Users/ashut/Downloads/Misc Docs/chromedriver_win32/chromedriver.exe"

service=Service(executable_path=path)
driver=webdriver.Chrome(service=service)

driver.get(website) 
containers=driver.find_elements(by="xpath",value='//div[@data-testid="tracklist-row"]/div[@aria-colindex="2"]/div')

titles = []
artists = []
links = []

for container in containers:
    title=container.find_element(by="xpath", value='./a/div').text
    artist=container.find_element(by="xpath", value='./span/a').text
    link=container.find_element(by="xpath", value='./span/a').get_attribute("href")
    titles.append(title)
    artists.append(artist)
    links.append(link)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    
mydict={'titles':titles,'artists':artists,'links':links}
artistslist= pd.DataFrame(mydict)
artistslist.to_csv('list_of_artist.csv')
pandas web-scraping spotify chrome-web-driver


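A:

0 votes  Barry the Platipus  7/26/2022  #1

The page loads content dynamically in response to user actions, in this case scrolling and reaching the bottom. So you need to scroll the page to the bottom (several times) until all the songs have loaded and are available in the page. You can adapt the following snippet to your code: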

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time as t



chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

song_list = []
url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu'
browser.get(url)

try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')


bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))

for x in range(5):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        print(song.text)
        song_list.append(song.text)
    t.sleep(2)
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
print(list(set(song_list)))
print('Total songs:', len(list(set(song_list))))

This will print out quite a few duplicate songs and, at the end, a list of the unique songs along with their count:

[...]
Total songs: 105
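
One caveat with the snippet above: `set()` does not preserve insertion order, so the de-duplicated list comes out in arbitrary order rather than playlist order. Below is a minimal sketch of an order-preserving alternative, assuming the rows were collected top to bottom. The helper name `unique_in_order` and the sample strings are made up for illustration.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# Hypothetical row texts, as they might be collected over two scroll passes
collected = ["1 Song A", "2 Song B", "2 Song B", "3 Song C", "1 Song A"]
unique_songs = unique_in_order(collected)
print(unique_songs)
print("Total songs:", len(unique_songs))
```

In the snippet above this would replace the two `list(set(song_list))` calls.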

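Edit

It seems OP is still confused, so I will update this answer with the full code (Selenium/Chrome setup on Debian/Ubuntu). The following code, tested on multiple playlists, will accept cookies (if there is a cookie pop-up), scroll the Spotify playlist to the bottom, scrape the songs, and build a dataframe (also saved to a CSV file) containing the songs, albums, artists, and links to each of them.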

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
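from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup  # used below to parse each row's innerHTML
import pandas as pd  # used below to build the dataframe and write the CSV
import time as t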



chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

song_list = []
df_song_list = []

# url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu' 
# url='https://open.spotify.com/playlist/37i9dQZF1DX9u7XXOp0l5L'
url='https://open.spotify.com/playlist/37i9dQZF1DXbITWG1ZJKYt'
browser.get(url)

try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')


bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))

for x in range(7):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        song_list.append(song.get_attribute('innerHTML'))
    t.sleep(0.5)
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
for song in list(set(song_list)):
    soup = BeautifulSoup(song, 'html.parser')
    position_in_playlist = soup.select_one('span.VrRwdIZO0sRX1lsWxJBe').text.strip() 
    artist = soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').text.strip() 
    artist_link = 'https://open.spotify.com/' + soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').select_one('a').get('href')
    song = soup.select_one('div.t_yrXoUO3qGsJS4Y6iXX').text.strip()
    song_link = 'https://open.spotify.com/' + soup.select_one('a.t_yrXoUO3qGsJS4Y6iXX').get('href')
    album = soup.select_one('span.cPwEdQ').text.strip()
    album_link = 'https://open.spotify.com/' + soup.select_one('div.bfQ2S9bMXr_kJjqEfcwA').select_one('a').get('href')
    df_song_list.append((position_in_playlist, artist, artist_link, song, song_link, album, album_link))
    
print('Total songs:', len(list(set(song_list))))
df = pd.DataFrame(df_song_list, columns = ['Position in Playlist', 'Artist', 'Artist Link', 'Song', 'Song Link', 'Album', 'Album Link'])
df.to_csv('spotty_songs.csv')
print(df.head())  # a bare df.head() only displays in a notebook, not in a script
t.sleep(2)
browser.quit()

This will generate a CSV file and print the following in the terminal:

accepted cookies
Total songs: 250
Position in Playlist    Artist  Artist Link Song    Song Link   Album   Album Link
0   226 Sonny Rollins   https://open.spotify.com//artist/1VEzN9lxvG6KPR3QQGsebR He's Younger Than You Are - From "Alfie" Score  https://open.spotify.com//track/11vaRXRIFXJTRr3BuzNbk5  Alfie   https://open.spotify.com//album/5vU75tE3FqpzFnbCXZuRE5
1   145 Phil Woods  https://open.spotify.com//artist/6G4hVmXKJ9NW5JecncK89f In Your Own Sweet Way   https://open.spotify.com//track/3YiuJ3OstUEa93UBqb1vcn  Warm Woods  https://open.spotify.com//album/4lj7s0K81qfLbXdLaDt2Ba
2   10  Ella Fitzgerald https://open.spotify.com//artist/5V0MlUE1Bft0mbLlND7FJz How Long Has This Been Going On?    https://open.spotify.com//track/0HEU3berJ5OBojU8XmEk1c  Ella Sings Gershwin https://open.spotify.com//album/3DJYxksYYP018jgpOTVXqO
3   81  Joe Henderson   https://open.spotify.com//artist/3BG0nwVh3Gc7cuT4XdsLtt Blue Bossa - Remastered https://open.spotify.com//track/6qqK0oeBRapZn8f9hJJENw  Page One    https://open.spotify.com//album/7mQGTuvmdp56DNz0AmMwWi
4   5   Billie Holiday  https://open.spotify.com//artist/1YzCsTRb22dQkh9lghPIrp Blue Moon   https://open.spotify.com//track/1pZn8AX1WulW8IO338hE5D  Solitude    https://open.spotify.com//album/4izD3SCRElbkO06i8yf4Zp
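
Two things stand out in the output above. The links contain a double slash (`https://open.spotify.com//artist/...`), presumably because the scraped `href` values already start with `/`. The rows also come out in arbitrary order, since they pass through a `set`. Here is a sketch of how both might be tidied up with `urllib.parse.urljoin` and a sort on the numeric position. The `rows` tuples are shortened from the sample output, and the helper names are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://open.spotify.com/"


def absolute_link(href):
    """Join a site-relative href onto the base URL without doubling the slash."""
    return urljoin(BASE_URL, href)


def sort_by_position(rows):
    """Sort (position, artist, artist_href) tuples by numeric playlist position."""
    return sorted(rows, key=lambda row: int(row[0]))


# Shortened sample rows; positions are strings, as scraped from the page
rows = [
    ("226", "Sonny Rollins", "/artist/1VEzN9lxvG6KPR3QQGsebR"),
    ("10", "Ella Fitzgerald", "/artist/5V0MlUE1Bft0mbLlND7FJz"),
    ("145", "Phil Woods", "/artist/6G4hVmXKJ9NW5JecncK89f"),
]

for position, artist, href in sort_by_position(rows):
    print(position, artist, absolute_link(href))
```

With a dataframe, the same ordering could be done after converting the position column to integers and sorting on it.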

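Comments

0 votes  Ashuwathama  7/26/2022
Thank you, but it still only printed 50 songs, and not from the beginning but from somewhere in the middle.
0 votes  Barry the Platipus  7/26/2022
Updated my code; it will now get all 105 songs from that page.
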
0 votes  justin  7/26/2022  #2

The data is loaded dynamically, and a single track may have more than one artist. I wrote a sample using the VS Code extension clicknium; you can see my sample on GitHub.
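
To illustrate the multiple-artists point without any extra tooling, here is a rough sketch that collects every artist link from one row's HTML using only the standard library. It assumes artist links are `<a>` tags whose `href` starts with `/artist/`, which matches the links in the sample output earlier in this thread. The `row_html` string is a made-up fragment, not real Spotify markup.

```python
from html.parser import HTMLParser


class ArtistLinkParser(HTMLParser):
    """Collect (name, href) for every <a> whose href starts with /artist/."""

    def __init__(self):
        super().__init__()
        self.artists = []
        self._current_href = None  # set while inside an artist link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("/artist/"):
                self._current_href = href

    def handle_data(self, data):
        # Only text directly inside an artist link is recorded
        if self._current_href is not None and data.strip():
            self.artists.append((data.strip(), self._current_href))
            self._current_href = None

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


# Hypothetical row fragment with two artists and one album link
row_html = (
    '<span><a href="/artist/aaa">Artist One</a>, '
    '<a href="/artist/bbb">Artist Two</a></span>'
    '<a href="/album/xxx">Some Album</a>'
)

parser = ArtistLinkParser()
parser.feed(row_html)
print(parser.artists)
```

With Selenium or BeautifulSoup the equivalent idea is to use the plural lookup (`find_elements` or `select`) on the artist links instead of taking only the first match.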

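0 votes  Ashuwathama  7/27/2022  #3

I appreciate all the answers and everyone who contributed. The simplest solution I found is to zoom the browser out to something like 0.1. Other than that, u/platipus_on_fire's solution works if you don't want to do things like zooming out.

driver.execute_script("document.body.style.zoom = '0.1'")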
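
For anyone who would rather not hard-code the number of scroll passes (`range(5)` and `range(7)` in the answer above), here is a minimal sketch of a scroll-until-nothing-new loop. It assumes, as the symptoms in the question suggest, that the page only keeps rows near the viewport in the DOM. `read_visible_rows` and `scroll_down` are hypothetical callables that you would back with Selenium calls (for example `find_elements` on the `tracklist-row` test id, and scrolling the bottom sentinel into view). The fake page below only stands in for the browser so the loop can be run on its own.

```python
def collect_all_rows(read_visible_rows, scroll_down, max_idle_rounds=3):
    """Collect rows from a lazily rendered list until scrolling yields nothing new.

    read_visible_rows() returns (position, text) pairs currently in the DOM.
    scroll_down() advances the page. Both are supplied by the caller.
    """
    seen = {}  # position -> text; keyed by position so re-reads do not duplicate
    idle_rounds = 0
    while idle_rounds < max_idle_rounds:
        before = len(seen)
        for position, text in read_visible_rows():
            seen.setdefault(position, text)
        # Count consecutive passes that added nothing; reset on any new row
        idle_rounds = idle_rounds + 1 if len(seen) == before else 0
        scroll_down()
    return [seen[pos] for pos in sorted(seen)]


class FakeVirtualPage:
    """Stand-in for the browser: exposes only a window of rows at a time."""

    def __init__(self, total, window):
        self.rows = [(i, f"Song {i}") for i in range(1, total + 1)]
        self.window = window
        self.offset = 0

    def read_visible_rows(self):
        return self.rows[self.offset:self.offset + self.window]

    def scroll_down(self):
        last_start = max(0, len(self.rows) - self.window)
        self.offset = min(self.offset + self.window // 2, last_start)


page = FakeVirtualPage(total=105, window=20)
songs = collect_all_rows(page.read_visible_rows, page.scroll_down)
print("Total songs:", len(songs))
```

Against a real page you would likely also need a short wait after each scroll so newly loaded rows have time to render before the next read.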