Asked by: M__ Asked: 7/14/2022 Updated: 7/17/2022 Views: 280
Guessing data type for dataframe
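Q:
Is there an algorithm that can detect the data type of each column of a file or dataframe? The challenge is to suggest a data type even when the data is wrong, missing or not normalized. I want to detect the data types named in the keys. My first attempt was with messytables, but the results are really bad when the data has not been normalized beforehand. So maybe there is an algorithm that gives better results for type suggestion, or a way to normalize the data without knowing anything about it. The result should match the keys in the dataframe.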
import io  # needed for io.BytesIO below

import pandas as pd
from messytables import CSVTableSet, type_guess
data = {
    "decimals": ["N.A", "", "111", "111.00", "111,12", "11,111.34"],
    "dates1": ["N.A.", "", "02/17/2009", "2009/02/17", "February 17, 2009", "2014, Feb 17"],
    "dates2": ["N.A.", "", "02/17/2009", "2009/02/17", "02/17/2009", "02/17/2009"],
    "dates3": ["N.A.", "", "2009/02/17", "2009/02/17", "2009/02/17", "2009/02/17"],
    "strings": ["N.A.", "", "N.A.", "N.A.", "test", "abc"],
    "integers": ["N.A.", "", "1234", "123123", "2222", "0"],
    "time": ["N.A.", "", "05:41:12", "05:40:12", "05:41:30", "06:41:12"],
    "datetime": ["N.A.", "", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24", "10/02/2021 10:39:24"],
    "boolean": ["N.A.", "", "True", "False", "False", "False"],
}
df = pd.DataFrame(data)
towrite = io.BytesIO()
df.to_csv(towrite) # write to BytesIO buffer
towrite.seek(0)
rows = CSVTableSet(towrite).tables[0]
types = type_guess(rows.sample)
print(types) # [Integer, Integer, String, String, Date(%Y/%m/%d), String, Integer, String, Date(%d/%m/%Y %H:%M:%S), Bool]
A:
0 votes
Laurent
7/17/2022
#1
Here is my take on your interesting question.
With the dataframe you provided, here is one way to do it:
# For each main type, define a lambda helper function which returns
# the number of values in the given column of said type
helpers = {
    "float": lambda df, col: df[col]
    .apply(lambda x: x.replace(".", "").isdigit() and "." in x)
    .sum(),
    "integer": lambda df, col: df[col].apply(lambda x: x.isdigit()).sum(),
    "datetime": lambda df, col: pd.to_datetime(
        df[col], errors="coerce", infer_datetime_format=True
    )
    .notna()
    .sum(),
    "bool": lambda df, col: df[col].apply(lambda x: x == "True" or x == "False").sum(),
}
# Iterate on each column of the dataframe and get the type with maximum number of values
df_dtypes = {}
for col in df.columns:
    results = {key: helper(df, col) for key, helper in helpers.items()}
    best_result = max(results, key=results.get)
    # Fall back to "string" when no helper matched any value
    df_dtypes[col] = best_result if max(results.values()) else "string"
print(df_dtypes)
# Output
{
    "decimals": "float",
    "dates1": "datetime",
    "dates2": "datetime",
    "dates3": "datetime",
    "strings": "string",
    "integers": "integer",
    "time": "datetime",
    "datetime": "datetime",
    "boolean": "bool",
}
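The counting idea above can be combined with a small normalization step, which is what the question asks about. What follows is only a minimal sketch under stated assumptions, not a definitive implementation: the names `MISSING_TOKENS`, `DATETIME_FORMATS`, `CHECKS` and `guess_type` are hypothetical, the list of missing-value placeholders and the candidate date formats are guessed from the sample data, and the 0.5 threshold is arbitrary. Placeholders such as "N.A." are dropped first, then each type is scored by the share of remaining values it matches. It uses only the standard library, so it accepts a plain list as well as a pandas column.

```python
import re
from datetime import datetime

# Assumption: placeholders that mean "missing" in this data set (compared in lower case).
MISSING_TOKENS = {"", "n.a.", "n.a", "na", "nan", "null", "none"}

# Assumption: candidate formats taken from the sample data in the question.
# Note that %B and %b (month names) depend on the active locale.
DATETIME_FORMATS = (
    "%m/%d/%Y",
    "%Y/%m/%d",
    "%B %d, %Y",
    "%Y, %b %d",
    "%H:%M:%S",
    "%m/%d/%Y %H:%M:%S",
)

INTEGER_RE = re.compile(r"[+-]?\d+")
# Digits (optionally with comma thousands groups) followed by a "." or "," decimal part.
# A value such as "111,123" is ambiguous and is counted as a float here.
FLOAT_RE = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)[.,]\d+")


def is_datetime(value):
    """Return True if the value parses with any of the candidate formats."""
    for fmt in DATETIME_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


CHECKS = {
    "bool": lambda v: v.lower() in {"true", "false"},
    "integer": lambda v: INTEGER_RE.fullmatch(v) is not None,
    "float": lambda v: FLOAT_RE.fullmatch(v) is not None,
    "datetime": is_datetime,
}


def guess_type(values, threshold=0.5):
    """Return the type matched by the largest share of non-missing values.

    Falls back to "string" when nothing is present or when the best
    share does not exceed the threshold.
    """
    cleaned = [str(v).strip() for v in values]
    present = [v for v in cleaned if v.lower() not in MISSING_TOKENS]
    if not present:
        return "string"
    shares = {
        name: sum(check(v) for v in present) / len(present)
        for name, check in CHECKS.items()
    }
    best = max(shares, key=shares.get)
    return best if shares[best] > threshold else "string"
```

With a dataframe, this could be applied per column, for example `{col: guess_type(df[col]) for col in df.columns}`. Scoring against the non-missing values only means a column that is mostly "N.A." can still be typed from the few real values it contains; whether that is desirable depends on the data.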