How to solve "MemoryError" when downloading a dataset with kaggle?

Asked by: forestbat  Asked: 9/17/2023  Last edited by: forestbat  Updated: 9/17/2023  Views: 46

Q:

I want to download a dataset from kaggle, but when I run the code on my local machine it crashes. Here is my code:

api = kaggle.KaggleApi(json_str)
api.authenticate()
api.datasets_download(owner_slug='headwater', dataset_slug='Camels')

Here is the crash report:

test_dload_archive.py:8: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
..\venv\lib\site-packages\kaggle\api\kaggle_api.py:1494: in datasets_download
    (data) = self.datasets_download_with_http_info(owner_slug, dataset_slug, **kwargs)  # noqa: E501
..\venv\lib\site-packages\kaggle\api\kaggle_api.py:1563: in datasets_download_with_http_info
    return self.api_client.call_api(
..\venv\lib\site-packages\kaggle\api_client.py:329: in call_api
    return self.__call_api(resource_path, method,
..\venv\lib\site-packages\kaggle\api_client.py:161: in __call_api
    response_data = self.request(
..\venv\lib\site-packages\kaggle\api_client.py:351: in request
    return self.rest_client.GET(url,
..\venv\lib\site-packages\kaggle\rest.py:247: in GET
    return self.request("GET", url,
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <kaggle.rest.RESTClientObject object at 0x000001B1FAE01D80>
method = 'GET'
url = 'https://www.kaggle.com/api/v1/datasets/download/headwater/Camels'
query_params = []
headers = {'Accept': 'file', 'User-Agent': 'Swagger-Codegen/1/python'}
body = None, post_params = {}, _preload_content = True, _request_timeout = None
……
            if six.PY3:
>               r.data = r.data.decode('utf8')
E               MemoryError

..\venv\lib\site-packages\kaggle\rest.py:235: MemoryError

I think this is caused by the memory cost of decompressing a large file, but how can I solve it?

Update: when I run it on Linux, the crash looks like this:

            if six.PY3:
>               r.data = r.data.decode('utf8')
E               UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcb in position 14: invalid continuation byte
python io request out-of-memory kaggle

Comments

0 upvotes Jesse Sealand 9/17/2023
It looks like the compressed file is about 3.3 GB, which decompresses to about 14 GB. How much RAM do you have?

A:

1 upvote Codist 9/17/2023 #1

Note this line in rest.py:

r.data = r.data.decode('utf8')

This is very naive: it assumes every response body is UTF-8 text, which is simply wrong for this particular dataset.
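To illustrate why that line breaks, here is a minimal sketch using made-up bytes (not the real archive) that start like a ZIP file and have 0xcb at offset 14, as in the Linux traceback. The helper name `try_decode` and the payload are purely illustrative. It also shows that a single-byte codec such as cp037 does not raise, although the result is not meaningful text.

```python
def try_decode(payload: bytes, encoding: str):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return payload.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Hypothetical payload: ZIP magic bytes, padding, then 0xcb followed by a
# byte that is not a valid UTF-8 continuation byte.
payload = b"PK\x03\x04" + b"\x00" * 10 + b"\xcb\x28"

text_utf8, err_utf8 = try_decode(payload, "utf8")
print("utf8 :", err_utf8)

# cp037 assigns a character to every byte value, so decoding does not raise,
# but the resulting string is just a second, meaningless copy of the bytes.
text_cp037, err_cp037 = try_decode(payload, "cp037")
print("cp037:", err_cp037, len(text_cp037))
```

Either way, `decode` builds a complete second copy of the body in memory, which is probably what triggers the MemoryError on a multi-gigabyte download.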

You could decode this dataset with cp037, but to do that you would need to edit rest.py accordingly.
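Since the download is a binary archive, an alternative to changing the codec is to avoid decoding altogether and write the bytes to disk in chunks. The sketch below shows that idea with a hypothetical helper, `stream_to_file`, that works on any object with a `read(n)` method. The commented usage assumes that passing `_preload_content=False` (the flag shown as `True` in the traceback) makes `datasets_download` return the raw, undecoded response, as is usual for Swagger-generated clients; that has not been verified against this kaggle version.

```python
import io
import os
import tempfile


def stream_to_file(response, dest_path, chunk_size=1 << 20):
    """Copy a file-like response to dest_path in fixed-size chunks.

    `response` is any object whose read(n) returns bytes. Only one chunk is
    held in memory at a time and nothing is decoded. Returns bytes written.
    """
    written = 0
    with open(dest_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written


# Assumed usage with the client from the question (not verified):
#   raw = api.datasets_download(owner_slug='headwater', dataset_slug='Camels',
#                               _preload_content=False)
#   stream_to_file(raw, 'Camels.zip')

if __name__ == "__main__":
    # Stand-in for a network response: an in-memory stream of binary bytes.
    fake_response = io.BytesIO(b"PK\x03\x04" + b"\xcb\x28" * 1000)
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "demo.zip")
        print(stream_to_file(fake_response, dest, chunk_size=256))
```

With this approach, peak memory use is bounded by `chunk_size` rather than by the size of the archive, and the archive can be unzipped from disk afterwards.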