How to compare two Pandas Dataframes with text, numerical, and None values

Asked by: Pro Q  Asked: 4/30/2023  Updated: 4/30/2023  Views: 72

Q:

I have two dataframes, df1 and df2, that both contain text and numerical data, along with Nones. However, df1 has integers, and df2 has floats.

I tried comparing them for equality with df1.equals(df2), but this fails because of the type difference (integers vs. floats). I also tried np.allclose(df1, df2, equal_nan=True), but this fails with TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'' (I assume this is because of the text data).
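
Both failures can be reproduced with a small pair of frames. The data below is made up for illustration and is an assumption, not the asker's actual data; the exact TypeError message may differ between NumPy versions.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: same values, but int64 vs. float64 in "num"
df1 = pd.DataFrame({"text": ["hello", "world"], "num": [1, 2]})
df2 = pd.DataFrame({"text": ["hello", "world"], "num": [1.0, 2.0]})

# DataFrame.equals also requires matching dtypes, so the int/float
# difference alone makes the comparison fail
equals_result = df1.equals(df2)
print(equals_result)

# np.allclose cannot apply its numeric operations to the text column
try:
    np.allclose(df1, df2, equal_nan=True)
    allclose_raised = False
except TypeError as exc:
    allclose_raised = True
    print(type(exc).__name__)
```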

How can I check whether df1 and df2 contain the same data?

Tags: python, pandas, dataframe, equality

A:

0 votes  Pro Q  4/30/2023  #1

Unfortunately, there doesn't seem to be a simple function that checks for equality in this case, so we have to build our own.

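To do the check, we split the columns by whether they hold text data ("object" dtype) or numerical data. Then we compare the numerical columns with each other and the text columns with each other. We also convert the Nones to numpy NaN so that numpy can handle them properly.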

Here is the code:

def compare_mixed_dataframes(df1, df2) -> bool:
    # (This code was written by GPT-4, but I've tested it and it works)
    # Get the column names of numerical columns
    num_cols = df1.select_dtypes(include=[np.number]).columns
    
    # Convert numerical columns to float and replace None with NaN
    df1_num = df1[num_cols].astype(float).fillna(np.nan)
    df2_num = df2[num_cols].astype(float).fillna(np.nan)

    # Compare numerical columns with a tolerance value using numpy.allclose()
    num_comparison = np.allclose(df1_num, df2_num, rtol=1e-05, atol=1e-08, equal_nan=True)

    # Compare sentence columns using pandas.DataFrame.equals()
    string_cols = df1.select_dtypes(include=['object']).columns
    str_comparison = df1[string_cols].equals(df2[string_cols])

    # Combine the results of numerical and sentence columns comparisons
    return num_comparison and str_comparison

If you want to try the code yourself, here is a quick script that tests it:

# Also written by GPT-4, but edited by me to contain a more advanced test case
# I have also checked to make sure that this works
import numpy as np
import pandas as pd

def compare_mixed_dataframes(df1, df2):
    # Get the column names of numerical columns
    num_cols = df1.select_dtypes(include=[np.number]).columns
    
    # Convert numerical columns to float and replace None with NaN
    df1_num = df1[num_cols].astype(float).fillna(np.nan)
    df2_num = df2[num_cols].astype(float).fillna(np.nan)

    # Compare numerical columns with a tolerance value using numpy.allclose()
    num_comparison = np.allclose(df1_num, df2_num, rtol=1e-05, atol=1e-08, equal_nan=True)

    # Compare sentence columns using pandas.DataFrame.equals()
    string_cols = df1.select_dtypes(include=['object']).columns
    str_comparison = df1[string_cols].equals(df2[string_cols])

    # Combine the results of numerical and sentence columns comparisons
    return num_comparison and str_comparison

# Create example DataFrames with mixed types (ints, floats, text, and Nones)
data1 = {'text': ['hello', 'world', None],
         'num': [None, 2, 3]}
df1 = pd.DataFrame(data1)

data2 = {'text': ['hello', 'world', None],
         'num': [None, 2.0, 3.0]}
df2 = pd.DataFrame(data2)

# DataFrames with different numbers
data3 = {'text': ['hello', 'world', None],
         'num': [None, 2, 4]}
df3 = pd.DataFrame(data3)

# Test the custom function with same and different DataFrames
print(compare_mixed_dataframes(df1, df2))  # True
print(compare_mixed_dataframes(df1, df3))  # False
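
For comparison, pandas also ships a testing helper, pandas.testing.assert_frame_equal, whose check_dtype=False option skips the dtype check. The sketch below wraps it so it returns a bool instead of raising; the wrapper name frames_match and the example frames are assumptions made for illustration, and this is an alternative to the function above, not a replacement the answer's author proposed.

```python
import pandas as pd


def frames_match(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Return True if both frames hold the same data, ignoring dtype differences.

    Hypothetical helper: assert_frame_equal raises AssertionError on any
    mismatch, so the exception is translated into a boolean here.
    """
    try:
        pd.testing.assert_frame_equal(left, right, check_dtype=False)
        return True
    except AssertionError:
        return False


# Made-up frames: df1 has an int64 column, df2 the same values as float64,
# and df3 differs in the last number
df1 = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 3]})
df2 = pd.DataFrame({"text": ["hello", "world", None], "num": [1.0, 2.0, 3.0]})
df3 = pd.DataFrame({"text": ["hello", "world", None], "num": [1, 2, 4]})

print(frames_match(df1, df2))
print(frames_match(df1, df3))
```

Note that assert_frame_equal also compares the index and column labels, so frames with the same values but different labels are reported as different.
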

1 vote  Panda Kim  4/30/2023  #2

data1 = {'text': ['hello', 'world', None],
         'num': [None, 2, 3]}
df1 = pd.DataFrame(data1)

data2 = {'text': ['hello', 'world', None],
         'num': [None, 2.0, 3.0]}
df2 = pd.DataFrame(data2)

Code:

df1.equals(df2.astype(df1.dtypes))

Output:

True

If you are worried about an error being raised while converting the dtypes, use the code below instead.

df1.equals(df2.astype(df1.dtypes, errors='ignore'))

If the dtypes cannot be converted to match (so the cast is skipped under errors='ignore'), the two frames are not equal anyway.
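
One caveat to check before relying on the cast: converting a float column to a NumPy integer dtype with astype truncates the fractional part instead of raising, so the direction of the cast matters. A minimal sketch with made-up data (the names ints and floats are hypothetical):

```python
import pandas as pd

# Hypothetical frames: the second "num" value really differs (2 vs. 2.5)
ints = pd.DataFrame({"text": ["a", "b"], "num": [1, 2]})        # int64
floats = pd.DataFrame({"text": ["a", "b"], "num": [1.0, 2.5]})  # float64

# Casting the float frame to the int frame's dtypes turns 2.5 into 2,
# so the frames are reported as equal even though they are not
truncating = ints.equals(floats.astype(ints.dtypes))

# Casting the int frame to the float frame's dtypes keeps 2.5 intact,
# so the difference is detected
widening = ints.astype(floats.dtypes).equals(floats)

print(truncating, widening)
```

When one frame holds integers and the other floats, casting toward the float side avoids this false positive.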