Downloading and renaming images from multiple URLs with the same file name

Asked by: Katie · Asked: 11/7/2023 · Last edited by: Sam Mason · Updated: 11/7/2023 · Views: 51

Q:

I'm trying to download images from an archive. I have the image URLs and can successfully download each file with the code below. However, some of the images share the same name (e.g. compressed.jpg), so when I run the script only a single compressed.jpg file is created.

I'd like to rename the files as they are downloaded, so I end up with compressed1.jpg, compressed2.jpg, and so on. I'm very new to Python, so I got myself into a mess trying to append an incremental number to the end of the file name.

Thanks

import requests

image_url = [
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for img in image_url:
    file_name = img.split('/')[-1]
    print("Downloading file: %s" % file_name)
    r = requests.get(img, stream=True)
    with open(file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)

I've tried renaming with os and glob but had no luck - how can I rename the files as part of the download?

python archive image-download

Comments

0 votes · Sam Mason · 11/7/2023
What do you want them to be called?

A:

0 votes · Ovski · 11/7/2023 · #1

You can simply add an index to the file name. To get an index from the for loop, use enumerate over the image_url list. Then split the file name into its name and extension, and insert the index number between them.

import requests
import os.path

image_url = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for index, img in enumerate(image_url):
    file_name_string = img.split('/')[-1]
    file_name_list = os.path.splitext(file_name_string)
    target_file = f"{file_name_list[0]}{index + 1}{file_name_list[1]}"
    print("Downloading file:%s" % target_file)
    r = requests.get(img, stream=True)
    with open(target_file, 'wb') as f:
        for chunk in r:
            f.write(chunk)
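As a side note, the same name/extension split can be done with pathlib, whose stem and suffix attributes avoid indexing into the tuple returned by os.path.splitext. A minimal sketch (not part of the answer's code, using one of the question's URLs):

```python
from pathlib import Path

# Take the last path segment of the URL as the file name
url = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg'
name = Path(url.rsplit('/', 1)[-1])

# stem is the name without its extension, suffix is the extension (dot included)
target = f"{name.stem}1{name.suffix}"
print(target)  # compressed1.jpg
```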
0 votes · Marco Parola · 11/7/2023 · #2

You can maintain a counter for the images and append it to the file name:

import requests
import os

image_url = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]

for i, img in enumerate(image_url, start=1):
    file_name = img.split('/')[-1]

    # Split into base name and extension
    base_name, file_extension = os.path.splitext(file_name)

    # Rename the file with an incremental number
    new_file_name = base_name + str(i) + file_extension
    
    print("Downloading file: %s" % new_file_name)
    r = requests.get(img, stream=True)
    
    with open(new_file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)
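If the archive contains many files, a zero-padded counter keeps the resulting names sorting in download order. A minimal sketch of just the naming step (the file name is taken from the question's URLs, the padding width is an arbitrary choice):

```python
import os.path

file_name = 'compressed.jpg'
base, ext = os.path.splitext(file_name)

for i in range(1, 4):
    # :03d pads the counter to three digits with leading zeros
    print(f"{base}{i:03d}{ext}")  # compressed001.jpg, compressed002.jpg, compressed003.jpg
```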
0 votes · Sam Mason · 11/7/2023 · #3

If all these URLs share a common prefix, I'd be tempted to just use the suffix, with the slashes turned into underscores. I'd also add some error checking to make sure the request succeeded.

The following code saves the files under names like: 000_103_975_thumbnail_compressed.jpg

import requests
import pathlib

image_urls = [
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
  'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
prefix = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/'

for url in image_urls:
    # turn the url into something suitable for local use
    out = pathlib.Path(url.removeprefix(prefix).replace('/', '_'))

    # no point fetching something we've already got
    # you can delete the file to retry if you really want that
    if out.exists():
        print(f"already saved {url} as {out}")
        continue

    # open the file early, failures will result in an empty file and hence won't be retried
    with open(out, 'wb') as fd, requests.get(url, stream=True) as resp:
        # don't want to save HTTP 404 or 501, leave these empty
        if not resp.ok:
            print(f"HTTP server error while fetching {url}:", resp)
            continue
        for chunk in resp.iter_content(2**18):
            fd.write(chunk)
        print(f"{url} saved to {out}")
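If you'd rather not hard-code the prefix, os.path.commonprefix can derive it from the list itself. A sketch of just the naming step under that assumption, trimming the character-wise prefix back to the last '/' so a path segment isn't cut in half (note the derived prefix ends after 000/, so the names come out slightly shorter than the hand-written version above):

```python
import os.path

image_urls = [
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
    'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]

# character-wise common prefix, trimmed back to the last '/'
prefix = os.path.commonprefix(image_urls)
prefix = prefix[:prefix.rfind('/') + 1]

names = [url.removeprefix(prefix).replace('/', '_') for url in image_urls]
print(names[0])  # 103_975_thumbnail_compressed.jpg
```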