Trying to scrape a Spotify playlist, but it only gets the first 20 results out of 100

Question:

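I'm trying to learn Selenium, and for fun I decided to scrape a Spotify playlist (which is why I'm not using the Spotify API for this). However, it doesn't get the complete list, only the songs that are currently loaded. I tried the scroll-and-wait solutions I found online, but nothing seems to work. I also tried zooming out, which helped, but it only found 20 to 30 more results. In addition, when I scroll down manually and then try to scrape, it ignores the first few songs and starts scraping from the loaded part. Here is my code: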

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import pandas as pd
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

website= "https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu"
path= "C:/Users/ashut/Downloads/Misc Docs/chromedriver_win32/chromedriver.exe"

service=Service(executable_path=path)
driver=webdriver.Chrome(service=service)

driver.get(website) 
containers=driver.find_elements(by="xpath",value='//div[@data-testid="tracklist-row"]/div[@aria-colindex="2"]/div')

titles = []
artists = []
links = []

for container in containers:
    title=container.find_element(by="xpath", value='./a/div').text
    artist=container.find_element(by="xpath", value='./span/a').text
    link=container.find_element(by="xpath", value='./span/a').get_attribute("href")
    titles.append(title)
    artists.append(artist)
    links.append(link)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    
mydict={'titles':titles,'artists':artists,'links':links}
artistslist= pd.DataFrame(mydict)
artistslist.to_csv('list_of_artist.csv')

Tags: pandas, selenium, web-scraping, spotify, chrome-web-driver

Answer:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import time as t



chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

song_list = []
url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu'
browser.get(url)

try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')


bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))

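# Repeat the scrape-then-scroll cycle a fixed number of times.
for x in range(5):
    # Only the rows currently rendered in the DOM are returned here.
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        print(song.text)
        song_list.append(song.text)
    t.sleep(2)
    # Accessing this property scrolls the sentinel into view, which makes the page load more rows.
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
# Rows scraped in more than one round are duplicates; set() keeps one copy of each.
print(list(set(song_list)))
print('Total songs:', len(list(set(song_list))))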

This prints quite a few duplicate songs along the way, followed by a list of the unique songs and their count:

[...]
Total songs: 105
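
The fixed `range(5)` above works for this playlist, but a longer one needs more rounds, and `set()` discards the playlist order. A minimal sketch of an alternative, assuming the page only ever renders a window of rows: keep scrolling until several rounds in a row add nothing new, and deduplicate with a dict so first-seen order is kept. The names `collect_until_stable`, `read_visible`, and `scroll_down` are hypothetical, and the fake page below only stands in for a browser so the loop can be run without one.

```python
from typing import Callable, Dict, Iterable, List


def collect_until_stable(
    read_visible: Callable[[], Iterable[str]],
    scroll_down: Callable[[], None],
    max_idle_rounds: int = 3,
    max_rounds: int = 200,
) -> List[str]:
    """Collect rows from a lazily rendered list until scrolling adds nothing new."""
    seen: Dict[str, None] = {}  # dicts keep insertion order, so first-seen order survives
    idle_rounds = 0
    for _ in range(max_rounds):
        count_before = len(seen)
        for row in read_visible():
            seen.setdefault(row, None)  # rows already collected are ignored
        # Count consecutive rounds that produced no new rows.
        idle_rounds = idle_rounds + 1 if len(seen) == count_before else 0
        if idle_rounds >= max_idle_rounds:
            break
        scroll_down()
    return list(seen)


# Fake "page" (hypothetical): 10 rows, of which only 4 are rendered at a time.
FAKE_ROWS = [f"song {i}" for i in range(1, 11)]
window = {"start": 0}


def fake_read_visible() -> List[str]:
    return FAKE_ROWS[window["start"]:window["start"] + 4]


def fake_scroll_down() -> None:
    window["start"] = min(window["start"] + 3, len(FAKE_ROWS) - 4)


collected = collect_until_stable(fake_read_visible, fake_scroll_down)
print(len(collected), collected[0], collected[-1])
```

With Selenium, `read_visible` could return the `text` of each element matched by the `tracklist-row` XPath above, and `scroll_down` could scroll the `bottom-sentinel` element into view as the answer does; that wiring is an assumption and is not tested here.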

EDIT

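It seems the OP is still confused, so I will update this answer with the full code (Selenium/Chrome setup on Debian/Ubuntu). The following code, tested on several playlists, will accept cookies (if there is a cookie popup), scroll the Spotify playlist to the bottom, scrape the songs, and produce a dataframe (also saved to a csv file) containing the song, album, and artist, along with links to each of them.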

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
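import time as t

import pandas as pd  # used below to build the dataframe and write the csv
from bs4 import BeautifulSoup  # used below to parse each row's innerHTML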



chrome_options = Options()
chrome_options.add_argument("--no-sandbox")

webdriver_service = Service("chromedriver/chromedriver") ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)

song_list = []
df_song_list = []

# url='https://open.spotify.com/playlist/6iwz7yurUKaILuykiyeztu' 
# url='https://open.spotify.com/playlist/37i9dQZF1DX9u7XXOp0l5L'
url='https://open.spotify.com/playlist/37i9dQZF1DXbITWG1ZJKYt'
browser.get(url)

try:
    WebDriverWait(browser, 20).until(EC.element_to_be_clickable((By.ID, "onetrust-accept-btn-handler"))).click()
    print("accepted cookies")
except Exception as e:
    print('no cookie button')


bottom_sentinel = WebDriverWait(browser, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@data-testid='bottom-sentinel']")))

for x in range(7):
    songs = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@data-testid='tracklist-row']")))
    for song in songs:
        song_list.append(song.get_attribute('innerHTML'))
    t.sleep(0.5)
    bottom_sentinel.location_once_scrolled_into_view
    browser.implicitly_wait(15)
for song in list(set(song_list)):
    soup = BeautifulSoup(song, 'html.parser')
    position_in_playlist = soup.select_one('span.VrRwdIZO0sRX1lsWxJBe').text.strip() 
    artist = soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').text.strip() 
    artist_link = 'https://open.spotify.com/' + soup.select_one('span.rq2VQ5mb9SDAFWbBIUIn').select_one('a').get('href')
    song = soup.select_one('div.t_yrXoUO3qGsJS4Y6iXX').text.strip()
    song_link = 'https://open.spotify.com/' + soup.select_one('a.t_yrXoUO3qGsJS4Y6iXX').get('href')
    album = soup.select_one('span.cPwEdQ').text.strip()
    album_link = 'https://open.spotify.com/' + soup.select_one('div.bfQ2S9bMXr_kJjqEfcwA').select_one('a').get('href')
    df_song_list.append((position_in_playlist, artist, artist_link, song, song_link, album, album_link))
    
print('Total songs:', len(list(set(song_list))))
df = pd.DataFrame(df_song_list, columns = ['Position in Playlist', 'Artist', 'Artist Link', 'Song', 'Song Link', 'Album', 'Album Link'])
df.to_csv('spotty_songs.csv')
print(df.head())  # a bare df.head() only displays output in a notebook
t.sleep(2)
browser.quit()

This produces a csv file and prints the following in the terminal:

accepted cookies
Total songs: 250
  | Position in Playlist | Artist | Artist Link | Song | Song Link | Album | Album Link
0 | 226 | Sonny Rollins | https://open.spotify.com//artist/1VEzN9lxvG6KPR3QQGsebR | He's Younger Than You Are - From "Alfie" Score | https://open.spotify.com//track/11vaRXRIFXJTRr3BuzNbk5 | Alfie | https://open.spotify.com//album/5vU75tE3FqpzFnbCXZuRE5
1 | 145 | Phil Woods | https://open.spotify.com//artist/6G4hVmXKJ9NW5JecncK89f | In Your Own Sweet Way | https://open.spotify.com//track/3YiuJ3OstUEa93UBqb1vcn | Warm Woods | https://open.spotify.com//album/4lj7s0K81qfLbXdLaDt2Ba
2 | 10 | Ella Fitzgerald | https://open.spotify.com//artist/5V0MlUE1Bft0mbLlND7FJz | How Long Has This Been Going On? | https://open.spotify.com//track/0HEU3berJ5OBojU8XmEk1c | Ella Sings Gershwin | https://open.spotify.com//album/3DJYxksYYP018jgpOTVXqO
3 | 81 | Joe Henderson | https://open.spotify.com//artist/3BG0nwVh3Gc7cuT4XdsLtt | Blue Bossa - Remastered | https://open.spotify.com//track/6qqK0oeBRapZn8f9hJJENw | Page One | https://open.spotify.com//album/7mQGTuvmdp56DNz0AmMwWi
4 | 5 | Billie Holiday | https://open.spotify.com//artist/1YzCsTRb22dQkh9lghPIrp | Blue Moon | https://open.spotify.com//track/1pZn8AX1WulW8IO338hE5D | Solitude | https://open.spotify.com//album/4izD3SCRElbkO06i8yf4Zp
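
Two details of the output above are worth noting: the links contain a double slash after the host, because the scraped `href` values already start with `/`, and the rows are not in playlist order, because `set()` does not preserve order. A minimal sketch of one way to handle both, assuming the `href` values are site-relative paths such as `/artist/...`: join links with `urllib.parse.urljoin` and sort the tuples by their numeric position. The helper names `absolute_link` and `sort_by_position` are hypothetical; the sample rows reuse values from the output above.

```python
from typing import List, Tuple
from urllib.parse import urljoin

BASE_URL = "https://open.spotify.com/"


def absolute_link(href: str) -> str:
    """Join a site-relative href onto the base URL without doubling the slash."""
    return urljoin(BASE_URL, href)


def sort_by_position(rows: List[Tuple[str, ...]]) -> List[Tuple[str, ...]]:
    """Sort rows whose first field is the playlist position scraped as text."""
    return sorted(rows, key=lambda row: int(row[0]))


# Sample rows: (position, artist, artist href).
sample_rows = [
    ("226", "Sonny Rollins", "/artist/1VEzN9lxvG6KPR3QQGsebR"),
    ("10", "Ella Fitzgerald", "/artist/5V0MlUE1Bft0mbLlND7FJz"),
    ("5", "Billie Holiday", "/artist/1YzCsTRb22dQkh9lghPIrp"),
]

for position, artist, href in sort_by_position(sample_rows):
    print(position, artist, absolute_link(href))
```

Sorting on `int(row[0])` rather than on the string matters here: as text, "10" would sort before "5".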
