How to have one progress bar with multiple downloads using tqdm/python?

Asked by: Irfan Asked: 9/6/2023 Updated: 9/6/2023 Views: 62

Q:

This is my working Python script for downloading FASTA sequences from UniProt (many thanks to the community).

'''
UniProt fasta downloader using accession ids from a text file,
show the download progress for each downloading sequence,
and make a list of inaccessible sequences
'''
import functools
import pathlib
import shutil
import requests
from tqdm.auto import tqdm
#Part I: Read the file with IDs and make a list of urls to download the respective sequences
with open ('errtest.txt', 'r') as infile:
    lines = infile.readlines()

listfile_name = infile.name
file_name = listfile_name.split('.', 1)[0]

downloaded = 0 #sequences downloaded

URL_list = []
for line in lines:
    access_id = line.strip()
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1+access_id+url_part2          
    URL_list.append(URL)

not_found = []
for url in URL_list:
    r = requests.get(url, stream=True, allow_redirects=True)
    file_size = int(r.headers.get('Content-Length', 0))
    if r.status_code != 200:
        Apart = url.removeprefix('https://rest.uniprot.org/uniprotkb/')
        short_id = Apart.removesuffix('.fasta')
        not_found.append (short_id)
        print (short_id, '-- not found')
    elif r.status_code == 200:
        path = pathlib.Path((file_name)+'seqs.fa').expanduser().resolve()
        path.parent.mkdir(parents=True, exist_ok=True)

        desc = "(Unknown total file size)" if file_size == 0 else ""
        r.raw.read = functools.partial(r.raw.read, decode_content=True)  # Decompress if needed
        with tqdm.wrapattr(r.raw, "read", total=file_size, desc=desc) as r_raw:
            with path.open("ab") as f:
                shutil.copyfileobj(r_raw, f)
        downloaded += 1
print ('Sequences with these accesion ids were not found:\n', not_found)
print (downloaded, 'sequences downloaded')

These are the contents of the errtest.txt file (some invalid IDs, which should be counted, and some valid IDs):

wrong1
D3VN13
B9W4V6
wrong2
A0A8S0XZH6
wrong3

This is the typical output:

wrong1 -- not found

  0%|          | 0/477 [00:00<?, ?it/s]
100%|██████████| 477/477 [00:00<00:00, 239kB/s]

  0%|          | 0/473 [00:00<?, ?it/s]
100%|██████████| 473/473 [00:00<00:00, 42.4kB/s]
wrong2 -- not found

  0%|          | 0/534 [00:00<?, ?it/s]
100%|██████████| 534/534 [00:00<00:00, 268kB/s]
wrong3 -- not found
Sequences with these accesion ids were not found:
 ['wrong1', 'wrong2', 'wrong3']
3 sequences downloaded

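So far, so good. Next, I would like a single progress bar for all the downloads. This text file contains only 3 valid IDs and 3 invalid ones (which happens sometimes), so showing three progress bars one after another is manageable. In practice, however, the list file will contain thousands of IDs, and therefore thousands of URLs and corresponding sequence downloads. It would be better to have one progress bar showing the overall download progress.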

python download progress tqdm

A:

0 votes curt 9/6/2023 #1

I think you can compute the total size before starting the download loop and then use a single progress bar, like this:

import pathlib
import requests
from tqdm.auto import tqdm

BASE_URL = 'https://rest.uniprot.org/uniprotkb/'

# Part I: Read the file with IDs and make a list of URLs to download the respective sequences
with open('errtest.txt', 'r') as infile:
    lines = infile.readlines()

listfile_name = infile.name
file_name = listfile_name.split('.', 1)[0]

downloaded = 0
URL_list = []  # only the URLs that answered with status 200
total_file_size = 0  # Initialize total file size
not_found = []

for line in lines:
    access_id = line.strip()
    if not access_id:
        continue  # skip blank lines
    URL = BASE_URL + access_id + '.fasta'
    # classify IDs; the with-block closes the streamed response without reading the body
    with requests.get(URL, stream=True, allow_redirects=True) as r:
        if r.status_code != 200:
            not_found.append(access_id)
            print(access_id, '-- not found')
        else:
            URL_list.append(URL)
            # Add current file size to total file size
            total_file_size += int(r.headers.get('Content-Length', 0))

path = pathlib.Path(file_name + 'seqs.fa').expanduser().resolve()
path.parent.mkdir(parents=True, exist_ok=True)

# Create a single progress bar shared by all downloads
with tqdm(total=total_file_size, unit='B', unit_scale=True, unit_divisor=1024, desc='Downloading') as pbar:
    for URL in URL_list:
        with requests.get(URL, stream=True, allow_redirects=True) as r:
            if r.status_code != 200:
                continue  # the URL stopped responding since the first pass
            with path.open("ab") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
                    pbar.update(len(chunk))  # advance by the bytes actually written
        downloaded += 1

print('Sequences with these accession IDs were not found:\n', not_found)
print(downloaded, 'sequences downloaded')
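
If an exact byte total is not essential, another option is to advance the bar once per accession ID rather than per byte, which avoids requesting every URL twice just to learn the sizes. The sketch below only illustrates that idea: download_all and the fetch callable are hypothetical names, and fake_fetch stands in for the requests.get call so the example runs offline.

```python
import io
from tqdm.auto import tqdm


def download_all(access_ids, fetch, out):
    """Write every sequence that fetch() returns to out; one bar tick per ID.

    fetch is a hypothetical callable: it takes an accession ID and returns
    the FASTA bytes, or None when the ID is not found.
    """
    not_found = []
    downloaded = 0
    # The total is known up front: it is simply the number of IDs.
    for access_id in tqdm(access_ids, desc='Downloading', unit='seq'):
        data = fetch(access_id)
        if data is None:
            not_found.append(access_id)
            continue
        out.write(data)
        downloaded += 1
    return downloaded, not_found


# Stand-in for the network call (assumption: real code would use requests.get).
FAKE_DB = {'D3VN13': b'>D3VN13\nMKT\n', 'B9W4V6': b'>B9W4V6\nMSA\n'}


def fake_fetch(access_id):
    return FAKE_DB.get(access_id)


buffer = io.BytesIO()
ids = ['wrong1', 'D3VN13', 'B9W4V6', 'wrong2']
downloaded, not_found = download_all(ids, fake_fetch, buffer)
print(downloaded, 'sequences downloaded; not found:', not_found)
```

The missing IDs are collected and reported after the loop so that nothing is printed while the bar is being drawn.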

Comments

0 votes Irfan 9/7/2023
I got this in IDLE:
wrong1 -- not found
wrong2 -- not found
wrong3 -- not found
Downloading:   0%|          | 0.00/1.45k [00:00<?, ?B/s]
Downloading:  36%|███▌      | 534/1.45k [00:00<00:01, 677B/s]
Downloading:  72%|███████▏  | 1.04k/1.45k [00:01<00:00, 677B/s]
Downloading: 1.56kB [00:02, 698B/s]
Downloading: 2.09kB [00:03, 665B/s]
Downloading: 2.61kB [00:04, 609B/s]
Downloading: 3.13kB [00:05, 603B/s]
Downloading: 3.13kB [00:05, 586B/s]
Sequences with these accession IDs were not found:
 ['wrong1', 'wrong2', 'wrong3']
3 sequences downloaded
0 votes Irfan 9/7/2023
It looks like the progress bar is being updated, but in IDLE each update is displayed on a new line.
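
A note on the numbers in that output: the three valid files are 477, 473 and 534 bytes, which gives the 1.45k total, but the bar ends at 3.13kB. That is what one would see if the bar were advanced by the last file's size (534 bytes) once for each of the six URLs, since 6 × 534 = 3204 bytes. Advancing the bar by the bytes actually read keeps it within the total. The following is a minimal offline sketch of that, assuming in-memory streams in place of real responses; copy_with_progress and the sample bodies are hypothetical.

```python
import io
from tqdm.auto import tqdm

CHUNK = 4  # tiny chunk size so the sketch performs several updates


def copy_with_progress(src, dst, pbar, chunk_size=CHUNK):
    """Copy src to dst, advancing the shared bar by the bytes actually read."""
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        pbar.update(len(chunk))
        copied += len(chunk)
    return copied


# In-memory stand-ins for three response bodies (assumption: real code
# would read from a streamed requests response instead).
bodies = [b'A' * 10, b'B' * 7, b'C' * 5]
total = sum(len(b) for b in bodies)

out = io.BytesIO()
with tqdm(total=total, unit='B', unit_scale=True, unit_divisor=1024,
          desc='Downloading') as pbar:
    for body in bodies:
        copy_with_progress(io.BytesIO(body), out, pbar)
    final_count = pbar.n  # bytes the bar has counted so far
```

The new-line display is probably a separate matter. As far as I know, tqdm redraws the bar with a carriage return, and the IDLE shell does not fully support that, so running the script from a regular terminal is the usual workaround.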