How to get the URLs of every image from this product URL?

Asked by: Danny_webb  Asked: 11/16/2023  Last edited by: Andrej Kesely, Danny_webb  Updated: 11/16/2023  Views: 35

Q:

Problem description:

Every product on this site https://www.asos.com/us/women/dresses/cat/?cid=8799 has several images. For example, here is the product URL for a black dress: https://www.asos.com/us/asos-design/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/prd/204910824#colourWayId-204910828. If you click it, you can see that this black dress has 4 images. The dress also comes in 2 other colours (camel and pink), and each of those colours has another 3-4 images. I want to collect all of these images (every image of the black, camel, and pink versions of this product).

What I have tried (code below): So far I have managed to collect all the product URLs from the main page, i.e. one product URL = the second link provided above. However, once I visit each product URL, I cannot figure out how to access all of the images within it. I would appreciate any guidance on implementing this next step.

Code from Google Colab:


# Upload google drive files
from google.colab import drive
drive.mount('/content/drive')
# Import libraries
import urllib
import urllib.request
from bs4 import BeautifulSoup
import re
import requests
import matplotlib.pyplot as plt
from io import BytesIO 
# Make Soup function
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'

headers={'User-Agent':user_agent,} 

def make_soup(url):
    # fetch the page with a browser-like User-Agent and parse it with BeautifulSoup
    request = urllib.request.Request(url, None, headers)
    thepage = urllib.request.urlopen(request)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata
# Find total page #s
site = 'https://www.asos.com/us/women/dresses/cat/?cid=8799'
soup = make_soup(site)
element = soup.find('p', class_='label_Ph1fi')  # the results-count label on the listing page
element = element.text

# extract the numbers (which may contain thousands separators) from the label text
numbers = re.findall(r'\d{1,3}(?:,\d{3})*', element)

if len(numbers) >= 2:
    offset = int(numbers[0].replace(',', ''))      # results shown per page
    num_images = int(numbers[1].replace(',', ''))  # total results
    num_pages = -(-num_images // offset)           # ceiling division so the last partial page is included
    print(f"Images Per Page: {offset}")
    print(f"Total Images: {num_images}")
    print(f"Total Pages: {num_pages}")
else:
    print("Numbers not found")
#num_images = int(element.replace(',', '').split(' ')[0])
# Get all product urls 

product_urls = []
for i in range(1, num_pages + 1):  # listing pages are 1-indexed
    site = 'https://www.asos.com/us/women/dresses/cat/?cid=8799&page=' + str(i)
    soup = make_soup(site)
    for link in soup.find_all('a', class_='productLink_E9Lfb', href=True):
        href = link.get('href')
        if href:
            product_urls.append(href)
    print('Page ', i, ' done')

print(product_urls)

# Get all images per product url

python  web-scraping  beautifulsoup  requests

Comments


A:

1 vote  Andrej Kesely  11/16/2023  #1

You can try:

import json
import re

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/119.0"
}


def get_data(html_source):
    # the product page embeds its data as JSON in `window.asos.pdp.config.product = {...};`
    data = re.search(r"window\.asos\.pdp\.config\.product = (.*);", html_source)
    data = json.loads(data.group(1))
    return data


def get_images(url):
    data = get_data(requests.get(url, headers=headers).text)

    # each entry in "images" carries the colour name and the image URL
    for i in data["images"]:
        print(f'{i["colour"]:<15} {i["url"]}')


base_url = "https://www.asos.com/us/asos-design/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/prd/204910824"
data = get_data(requests.get(base_url, headers=headers).text)

u = "https://www.asos.com/us/prd/"
for p in data["facetGroup"]["facets"][0]["products"]:
    get_images(u + str(p["productId"]))

Prints:

PINK            https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-1-pink
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-light-pink/204910765-4
CAMEL           https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-1-camel
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-camel/204910786-4
BLACK           https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-1-black
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-2
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-3
                https://images.asos-media.com/products/asos-design-super-soft-volume-sleeve-turtle-neck-mini-sweater-dress-in-black/204910824-4
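
If you also want to save the image files themselves (the question asks to collect every image), the helpers above can be extended along the following lines. This is only a sketch: it reuses headers, get_data, data and u from the snippet above, the output directory name is just an example, and it assumes the bare images.asos-media.com URLs return a default-size image without any extra query parameters (not verified).

import os
import requests

def download_images(url, out_dir="asos_images"):
    # fetch one colourway's product page and save each of its images to out_dir
    os.makedirs(out_dir, exist_ok=True)
    product = get_data(requests.get(url, headers=headers).text)
    for img in product["images"]:
        img_url = img["url"]                           # e.g. ".../204910824-1-black"
        name = img_url.rsplit("/", 1)[-1] + ".jpg"     # file name taken from the URL
        resp = requests.get(img_url, headers=headers)  # assumes the bare URL serves the image
        if resp.ok:
            with open(os.path.join(out_dir, name), "wb") as f:
                f.write(resp.content)

# save every image of every colourway of the example product
for p in data["facetGroup"]["facets"][0]["products"]:
    download_images(u + str(p["productId"]))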

Comments

0 votes  Danny_webb  11/16/2023
How did you learn how to do this? Did you take a course, or do you have some online references you could share? I would like to learn. Thank you.
0 votes  Danny_webb  11/16/2023
Also, is there a way to implement the code I included in my post with requests.get instead of beautifulsoup? I would like to keep my code consistent (just one approach).
0 votes  Andrej Kesely  11/16/2023
@Danny_webb You can safely use beautifulsoup to get all the product URLs, and my code to get all the image URLs from the product links. Whatever it takes to get the job done :)
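
For the comment above about sticking to requests: the question's make_soup helper can fetch with requests.get and keep BeautifulSoup only for parsing. A minimal sketch (untested against the live site; the User-Agent string is just a placeholder):

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # any browser-like User-Agent, as in the question

def make_soup(url):
    # fetch with requests instead of urllib.request, then parse as before
    response = requests.get(url, headers=headers)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")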
0 votes  Danny_webb  11/16/2023
Are there any courses or references you would recommend for learning how to do what you just did in your answer?