Asked by: Katie · Asked: 11/7/2023 · Last edited by: Sam Mason · Updated: 11/7/2023 · Views: 51
Downloading and renaming images from multiple URLs with the same file name
Q:
I'm trying to download images from an archive. I have the image URLs and can successfully download each file with the code below. However, some of the images share the same file name (e.g. compressed.jpg), so when I run the script only a single compressed.jpg ends up on disk.
I'd like to rename the files as they're downloaded, so I end up with compressed1.jpg, compressed2.jpg, and so on. I'm very new to Python, so I got myself into a mess trying to append an incrementing number to the file name.
Thanks
import requests
image_url = [
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for img in image_url:
    file_name = img.split('/')[-1]
    print("Downloading file:%s" % file_name)
    r = requests.get(img, stream=True)
    with open(file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)
I've tried renaming with os and glob with no luck - how can I rename the files before (or while) downloading?
A:
0 votes
Ovski
11/7/2023
#1
You just need to add an index to the file name. To get an index from the for loop, use enumerate on the image_url list. Then split the file name with os.path.splitext to get its root and extension, so you can insert the index between them.
import requests
import os.path
image_url = [
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for index, img in enumerate(image_url):
    file_name_string = img.split('/')[-1]
    file_name_list = os.path.splitext(file_name_string)
    target_file = f"{file_name_list[0]}{index + 1}{file_name_list[1]}"
    print("Downloading file:%s" % target_file)
    r = requests.get(img, stream=True)
    with open(target_file, 'wb') as f:
        for chunk in r:
            f.write(chunk)
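For reference, this is what os.path.splitext returns for the name in question, and the numbered names the loop above builds (a standalone sketch, separate from the answer's code):

```python
import os.path

# splitext returns a (root, extension) tuple
root, ext = os.path.splitext('compressed.jpg')
print(root, ext)  # compressed .jpg

# the numbered names produced for four URLs
names = [f"{root}{i + 1}{ext}" for i in range(4)]
print(names)  # ['compressed1.jpg', 'compressed2.jpg', 'compressed3.jpg', 'compressed4.jpg']
```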
0 votes
Marco Parola
11/7/2023
#2
You can keep a counter for the images and append it to the file name:
import requests
import os
image_url = [
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
for i, img in enumerate(image_url, start=1):
    file_name = img.split('/')[-1]
    # Split into root and extension so the number goes before ".jpg"
    file_root, file_extension = os.path.splitext(file_name)
    # Rename the file with an incremental number
    new_file_name = file_root + str(i) + file_extension
    print("Downloading file: %s" % new_file_name)
    r = requests.get(img, stream=True)
    with open(new_file_name, 'wb') as f:
        for chunk in r:
            f.write(chunk)
0 votes
Sam Mason
11/7/2023
#3
If all of these URLs share a common prefix, I'd be tempted to just use the suffix, with the slashes replaced. I'd also add some error checking to make sure the request succeeded.
The following code saves the files under names like: 000_103_975_thumbnail_compressed.jpg
import requests
import pathlib
image_urls = [
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/105/093/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/984/thumbnail/compressed.jpg',
'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/107/697/thumbnail/compressed.jpg'
]
prefix = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/'
for url in image_urls:
    # turn the url into something suitable for local use
    out = pathlib.Path(url.removeprefix(prefix).replace('/', '_'))
    # no point fetching something we've already got
    # you can delete the file to retry if you really want that
    if out.exists():
        print(f"already saved {url} as {out}")
        continue
    # open the file early, failures will result in an empty file and hence won't be retried
    with open(out, 'wb') as fd, requests.get(url, stream=True) as resp:
        # don't want to save HTTP 404 or 501, leave these empty
        if not resp.ok:
            print(f"HTTP server error while fetching {url}:", resp)
            continue
        for chunk in resp.iter_content(2**18):
            fd.write(chunk)
    print(f"{url} saved to {out}")
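The filename transformation can be checked on its own, without any network access (note that str.removeprefix requires Python 3.9+):

```python
import pathlib

prefix = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/'
url = 'https://s3-eu-west-1.amazonaws.com/sheffdocfest.com/attachments/data/000/103/975/thumbnail/compressed.jpg'

# strip the common prefix and flatten the remaining path into a single name
out = pathlib.Path(url.removeprefix(prefix).replace('/', '_'))
print(out)  # 000_103_975_thumbnail_compressed.jpg
```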