Data type issues with pandas data structures for (really) big numbers

Q:

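I'm running into a problem with Python pandas: my dataset is causing some precision issues. This is the set of integers I'm using as a random case here: link

What I'm trying to do is compute some statistics and cut the Series into bins. Below is the driver code.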

import pandas as pd
import numpy as np
with open('dataset.txt') as source:
    parsed_data = [ int(x) for x in source.readlines() ]

sr = pd.Series(parsed_data)

q1 = sr.quantile(0.25)
q2 = sr.quantile(0.50)
q3 = sr.quantile(0.75)
  
bins   = np.unique([sr.min(), q1/2, q1, q2/2, q2, q3/3, q3, sr.max()])
labels = np.arange(1, len(bins))

binned_data, bins = pd.cut(sr, bins = bins, labels=labels, retbins=True)

#### ? ####
print(sr)
print(f"Q1 = {q1} | Q2 (median) = {q2} | Q3 = {q3}")
print(bins)
### ?? ###
for entry in binned_data: 
    print(entry)

This produces:

Q1 = 0.0 | Q2 (median) = 0.0 | Q3 = 1589492404757.0
[0 529830801585.6667 1589492404757.0
 21267647932558653966460912964485513216]
.
.. bin numbers here BUT 
.
nan
nan
.
..
.

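So my problem here is that quantiles 1 and 2 (the median) are 0.0. Obviously this is wrong. It also yields incorrect bin numbers (and if it weren't for np.unique, cut would raise an error telling me that the bins must be unique!). The dtype of the Series is object, and I cannot cast it to some sort of np.int.

How can I solve this?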

pandas numpy statistics precision

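"The dtype of the Series is object and I cannot cast it to some sort of np.int." You've run into the fact that integers in NumPy have a limited width. np.uint64 can only store values up to 18446744073709551615, and you have numbers larger than that. Alternatives: 1) Convert to float. Usually this means a loss of precision, but all of your numbers can be represented as floats without rounding. 2) Leave it as object: Python ints are arbitrary precision.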
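The two alternatives in this comment can be sketched as follows. This is only an illustration: the values are hypothetical stand-ins for the linked dataset (small integers plus 2**124, which matches the maximum shown in the output above), and the names sr_obj and sr_flt are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in values; the real dataset is not reproduced here.
values = [0, 1589492404757, 2**124]

# Upper limit of NumPy's widest unsigned integer type.
print(np.iinfo(np.uint64).max)  # 18446744073709551615

# Alternative 2: keep arbitrary-precision Python ints in an object Series.
sr_obj = pd.Series(values, dtype=object)

# Alternative 1: convert to float64.
sr_flt = sr_obj.astype(float)

# For these particular values the conversion is lossless, because each one
# is either below 2**53 or an exact power of two.
exact = all(int(f) == v for f, v in zip(sr_flt, sr_obj))
print(sr_flt.dtype, exact)
```

Whether the conversion is lossless depends on the data: an integer such as 2**124 + 1 would be rounded by float64, so the round-trip check is worth running on the real values before relying on alternative 1.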

0 votes ex1led 10/25/2023

@NickODell In that case, how can I use pandas utilities when the data type is int (Python's int)?

0 votes Nick ODell 10/25/2023

Many utilities will still work without doing anything special. For example, sr - 1 works. Can you be more specific about which utility doesn't work?
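As a small check of the sr - 1 example from this comment, assuming a made-up object Series rather than the asker's data:

```python
import pandas as pd

# Hypothetical object-dtype Series holding Python ints, one beyond uint64.
sr = pd.Series([0, 5, 2**124], dtype=object)

# Arithmetic is applied element by element with Python int semantics,
# so no precision is lost and the result stays object dtype.
shifted = sr - 1
doubled = sr * 2

print(shifted.tolist())
print(doubled.tolist())
```

Operations on object dtype run element by element in Python, so they are slower than on native NumPy dtypes, but they keep every digit.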

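0 votes ex1led 10/26/2023

@NickODell No, never mind, they do work. My problem was that the binning method returned a lot of NaN values that I couldn't account for. I solved it by adding -np.inf and +np.inf as the lower and upper bounds, respectively.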
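A sketch of that fix, using hypothetical data converted to float64. One plausible explanation for the NaNs, which the thread itself does not confirm, is that pd.cut uses right-closed intervals by default, so values equal to the lowest edge (here sr.min()) fall outside every bin unless include_lowest=True is passed or the lowest edge is moved below the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with repeated minimum values, stored as float64.
sr = pd.Series([0, 0, 3, 10, 2**124], dtype=object).astype(float)
inner_edges = [sr.quantile(0.25), sr.quantile(0.50), sr.quantile(0.75)]

# Edges taken from the data: values equal to sr.min() land in no bin.
naive_bins = np.unique([sr.min(), *inner_edges, sr.max()])
naive = pd.cut(sr, bins=naive_bins, labels=np.arange(1, len(naive_bins)))

# Open-ended edges, as in the comment: every value falls into some bin.
bins = np.unique([-np.inf, *inner_edges, np.inf])
binned = pd.cut(sr, bins=bins, labels=np.arange(1, len(bins)))

print(naive.isna().tolist())
print(binned.tolist())
```

Passing include_lowest=True to pd.cut is another way to keep the minimum value in the first bin while leaving the edges finite.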

A: No answers yet
