How to turn a dask scalar of type boolean into a boolean expression

Asked by: Mika Bell Asked: 10/17/2023 Last edited by: Mika Bell Updated: 10/31/2023 Views: 36

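Q:

I have a long piece of code with many data operations. At the end, I want to get a boolean expression by comparing two dask series. This is the last part of my code: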

    scores_for_test_data[f"Healthy Sample Score {exclude_name}"] = np.power(
        (pairwise_test_excluded_sample - healthy_sample_mean_ratio), 2)

    scores_for_test_data[f"Lung Sample Score {exclude_name}"] = np.power(
        (pairwise_test_excluded_sample - lung_sample_mean_ratio), 2)

    scores_for_test_data[f"Sum Healthy Sample Score {exclude_name}"] = scores_for_test_data[
        f"Healthy Sample Score {exclude_name}"].sum()

    scores_for_test_data[f"Sum Lung Sample Score {exclude_name}"] = scores_for_test_data[
        f"Lung Sample Score {exclude_name}"].sum()

    res = scores_for_test_data[f"Sum Healthy Sample Score {exclude_name}"][0] < \
        scores_for_test_data[f"Sum Lung Sample Score {exclude_name}"][0]

    if res.any().compute():
        check += 1

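(All variables are dask objects.)

My problem is that I am working with a very large dataset, so when I reach compute() it produces a memory error. res should be True/False, but its type is "dask.dataframe.core.Series".

If possible, how can I get a boolean expression from this comparison?

scores_for_test_data[f"Sum Healthy Sample Score {exclude_name}"] and scores_for_test_data[f"Sum Lung Sample Score {exclude_name}"] are both dask series in which every row holds the same value, for example: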

0     2375.861075
1     2375.861075
2     2375.861075
3     2375.861075
4     2375.861075
5     2375.861075

That is why I only take the first row of each series in the comparison.

This is the log output I get when I run my code:

2023-10-22 14:12:56,258 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.70 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:56,716 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.40 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:56,985 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.75 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,109 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.65 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,139 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.69 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,329 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.72 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,329 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.74 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,500 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.47 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,505 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:55007 (pid=14100) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:57,613 - distributed.worker.memory - WARNING - Worker is at 80% memory usage. Pausing worker.  Process memory: 6.39 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,681 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.49 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,686 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.69 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,784 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.78 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,872 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.52 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:57,877 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.47 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:58,034 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:12:58,218 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.48 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:58,276 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:54993 (pid=3008) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,357 - distributed.worker.memory - WARNING - Worker is at 81% memory usage. Pausing worker.  Process memory: 6.52 GiB -- Worker memory limit: 7.98 GiB
2023-10-22 14:12:58,405 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:54986 (pid=9212) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,443 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:55004 (pid=4788) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,568 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:55001 (pid=7332) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,682 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:54989 (pid=12900) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,956 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:54992 (pid=8280) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:58,983 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:12:59,139 - distributed.nanny.memory - WARNING - Worker tcp://127.0.0.1:54998 (pid=3868) exceeded 95% memory budget. Restarting...
2023-10-22 14:12:59,239 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:12:59,348 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:12:59,515 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:12:59,718 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:13:00,020 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:13:00,363 - distributed.nanny - WARNING - Restarting worker
2023-10-22 14:13:19,717 - distributed.worker.memory - WARNING - Unmanaged memory use is high. This may indicate a memory leak or the memory may not be released to the OS; see https://distributed.dask.org/en/latest/worker-memory.html#memory-not-released-back-to-the-os for more information. -- Unmanaged memory: 5.78 GiB -- Worker memory limit: 7.98 GiB

Can someone please help?

python dataframe boolean dask delayed

Comments

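0 votes Guillaume EB 10/20/2023
I don't think the problem comes from this comparison and the final res object. res will be a series with only one item, so it should be tiny. The memory error probably comes from the operations before the comparison.
0 votes Mika Bell 10/22/2023
@GuillaumeEB Yes, I perform many operations beforehand. How can I get the final result after all of them? As I understand it, once you end up with a small final result, dask can run the compute command and return it, but I can't seem to do that in my code...
0 votes Guillaume EB 10/28/2023
Could you show us the stack trace you get? Dask is lazy and tries to optimize the computation so that it only computes the results it needs, but that is not always possible. If at some point you need to compute a large collection, you can easily run into memory errors.
0 votes Mika Bell 10/31/2023
@GuillaumeEB Thanks for your reply, I have added it to my question.
0 votes Guillaume EB 11/4/2023
You don't have a Python error stack trace? The logs tell you that one of your operations is loading too much data into memory, but they cannot tell you which one. Maybe you should try executing the operations part by part to see where it fails, or use the Dask Dashboard to monitor the execution.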

A: No answers yet