Guessing data type for dataframe

Asked by: M__ Asked: 7/14/2022 Updated: 7/17/2022 Views: 280

Q:

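Is there an algorithm that detects the data type of each column of a file or dataframe? The challenge is to suggest a data type when the data is wrong, missing, or not normalized. I want to detect the data types named in the keys below. My first attempt used messytables, but the results are really bad when the data has not been normalized beforehand. So maybe there is an algorithm that gives better type suggestions, or a way to normalize the data without knowing anything about it. The result should match the keys of the dataframe.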

import io

import pandas as pd
from messytables import CSVTableSet, type_guess

data = {
  "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
  "dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
  "dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
  "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
  "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
  "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
  "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
  "datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
  "boolean": ["N.A.", "", "True", "False", "False", "False"]
}
df = pd.DataFrame(data)

towrite = io.BytesIO()
df.to_csv(towrite)  # write to BytesIO buffer
towrite.seek(0)

rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types) # [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
Tags: python, pandas, dataframe, conversion


A:

0 votes Laurent 7/17/2022 #1

Here is my take on your interesting question.

With the dataframe you provided, here is one way to do it:

# For each main type, define a lambda helper function which returns the number of values in the given column of said type
helpers = {
    "float": lambda df, col: df[col]
    .apply(lambda x: x.replace(".", "").isdigit() and "." in x)
    .sum(),
    "integer": lambda df, col: df[col].apply(lambda x: x.isdigit()).sum(),
    "datetime": lambda df, col: pd.to_datetime(
        # Note: infer_datetime_format is deprecated since pandas 2.0 and has no effect there
        df[col], errors="coerce", infer_datetime_format=True
    )
    .notna()
    .sum(),
    "bool": lambda df, col: df[col].apply(lambda x: x == "True" or x == "False").sum(),
}

# Iterate on each column of the dataframe and get the type with maximum number of values
df_dtypes = {}
for col in df.columns:
    results = {key: helper(df, col) for key, helper in helpers.items()}
    best_result = max(results, key=results.get)
    df_dtypes[col] = best_result if max(results.values()) else "string"
print(df_dtypes)
# Output
{
    "decimals": "float",
    "dates1": "datetime",
    "dates2": "datetime",
    "dates3": "datetime",
    "strings": "string",
    "integers": "integer",
    "time": "datetime",
    "datetime": "datetime",
    "boolean": "bool",
}
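
The counting idea above treats "N.A." and empty strings like any other value, and it does not recognize numbers such as "111,12" or "11,111.34". Below is a rough sketch of one way to normalize first and count afterwards, in plain Python so it works on a list or on `df[col]`. It is not part of the original answer. The names `MISSING`, `DATETIME_FORMATS`, `CHECKS`, and `guess_type` are hypothetical, the list of missing markers and date formats is an assumption that only covers the sample data, and the 0.5 threshold is an arbitrary choice.

```python
import re
from datetime import datetime

# Assumed markers for missing values (compared case-insensitively); extend as needed.
MISSING = {"", "n.a", "n.a.", "na", "nan", "null", "none"}

# Assumed date/time formats; these only cover the sample data from the question.
DATETIME_FORMATS = (
    "%m/%d/%Y", "%Y/%m/%d", "%B %d, %Y", "%Y, %b %d",
    "%H:%M:%S", "%m/%d/%Y %H:%M:%S",
)

INTEGER_RE = re.compile(r"[+-]?\d+")
# Digits (optionally with "," thousands groups), then "." or "," and more digits.
# Ambiguous input such as "1,234" is counted as a float by this pattern.
FLOAT_RE = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)[.,]\d+")


def is_datetime(value):
    """Return True if the value parses with any of the assumed formats."""
    for fmt in DATETIME_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


CHECKS = {
    "bool": lambda v: v.lower() in ("true", "false"),
    "integer": lambda v: INTEGER_RE.fullmatch(v) is not None,
    "float": lambda v: FLOAT_RE.fullmatch(v) is not None,
    "datetime": is_datetime,
}


def guess_type(values, threshold=0.5):
    """Guess the type shared by most non-missing string values."""
    # Normalization step: trim whitespace and drop missing-value markers
    present = [v.strip() for v in values if v.strip().lower() not in MISSING]
    if not present:
        return "string"
    counts = {name: sum(check(v) for v in present) for name, check in CHECKS.items()}
    # In a column that already contains floats, integers count as floats too
    if counts["float"]:
        counts["float"] += counts["integer"]
    best = max(counts, key=counts.get)
    # Fall back to "string" unless the best type covers more than the threshold
    return best if counts[best] / len(present) > threshold else "string"


print(guess_type(["N.A", "", "111", "111.00", "111,12", "11,111.34"]))  # float
```

With a dataframe of strings, something like `{col: guess_type(df[col]) for col in df.columns}` should give a mapping keyed by column name, as in the output above. Using a share of the non-missing values instead of a raw count means a single stray number in a text column does not flip the guess.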