Asked by: Minura Punchihewa · Asked: 7/11/2023 · Updated: 7/11/2023 · Views: 39
Bucket Data Names in PySpark with Meaningful Bucket Names
Q:
I have a function in PySpark that buckets data using the Bucketizer. Since the Bucketizer returns numerical values for the buckets, I have another function, create_bin_col_names(), which is called from the main function to create bucket names that are meaningful.

This is what my functions look like:
from itertools import chain

from pyspark.sql import DataFrame
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType
from pyspark.ml.feature import Bucketizer


def binning_data(df: DataFrame, bin_data_list: list) -> DataFrame:
    """This method creates bins for numerical values based on the column and the bins
    defined in the parameter.yml file.

    Args:
        :param df: Data frame which has the columns to be binned.
        :param bin_data_list: A list which defines columns and bin ranges.
    Returns:
        :rtype: df: Output data frame
    """
    for col_data in bin_data_list:
        column_name = col_data.upper()
        bin_list = bin_data_list[col_data]
        bin_list.insert(0, -float("inf"))
        bin_list.append(float("inf"))
        splits = bin_list
        df = df.withColumn(column_name, df[column_name].cast(DoubleType()))
        bucketizer = Bucketizer(
            splits=splits, inputCol=column_name, outputCol=column_name + "_bin123"
        )
        df = bucketizer.setHandleInvalid("keep").transform(df)
        d = create_bin_col_names(bin_data_list[col_data])
        mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
        df = df.withColumn(
            column_name, mapping_expr[f.col(column_name + "_bin123")]
        ).drop(f.col(column_name + "_bin123"))
    return df
def create_bin_col_names(bin_list: list) -> dict:
    """This function creates names for bins.

    Args:
        :param bin_list: A list which contains bins
    Returns:
        :rtype: dict: Output data dictionary with bin names
    """
    d = {}
    x = 0
    for item in bin_list:
        x += 1
        if x == 1:
            d.update({x - 1: "<" + str(item).strip()})
        else:
            d.update({x - 1: str(prev_name).strip() + "-" + str(item).strip()})
        prev_name = item
    d.update({x: ">" + str(item).strip()})
    return d
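As background for readers, Spark's Bucketizer assigns each value the index of the half-open split interval [splits[i], splits[i+1]) that contains it. A minimal pure-Python sketch of that indexing behavior (the split edges here are hypothetical, not from the question):

```python
import bisect

# Pure-Python sketch of how Bucketizer maps a value to a bucket index:
# splits define half-open intervals [splits[i], splits[i+1]).
def bucket_index(value: float, splits: list) -> int:
    # bisect_right finds the first split strictly greater than value,
    # so subtracting 1 yields the index of the containing interval.
    return bisect.bisect_right(splits, value) - 1

splits = [-float("inf"), 10, 20, float("inf")]  # hypothetical edges
print([bucket_index(v, splits) for v in (5, 10, 15, 25)])
# [0, 1, 1, 2]
```

With N + 1 split edges there are N buckets, indexed 0 through N - 1, which is the range of numeric codes the mapping dictionary has to cover.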
This runs without errors, but the assigned bucket names don't seem accurate. There always seems to be one fewer bucket class than there ideally should be.

What am I doing wrong here? If there is another, less convoluted way of doing this, I'm open to that too.
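For anyone reproducing this locally, the naming helper from the question can be exercised on its own, without Spark. With a hypothetical bin list of [10, 20, 30] it yields:

```python
# Standalone copy of the naming helper from the question (no Spark needed).
def create_bin_col_names(bin_list: list) -> dict:
    d = {}
    x = 0
    for item in bin_list:
        x += 1
        if x == 1:
            d.update({x - 1: "<" + str(item).strip()})
        else:
            d.update({x - 1: str(prev_name).strip() + "-" + str(item).strip()})
        prev_name = item
    d.update({x: ">" + str(item).strip()})
    return d

# Hypothetical bin edges, e.g. one entry from the parameter.yml file.
print(create_bin_col_names([10, 20, 30]))
# {0: '<10', 1: '10-20', 2: '20-30', 3: '>30'}
```

Note the output depends on exactly which list the helper receives at call time, since binning_data mutates the bin list in place before calling it.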
A: No answers yet