Bucket Data Names in PySpark with Meaningful Bucket Names

Asked by Minura Punchihewa · Asked 7/11/2023 · Modified 7/11/2023 · Viewed 39 times

Q:

I have a function in PySpark that bins data using `Bucketizer`. Since `Bucketizer` returns numeric bucket values, I also have another function, `create_bin_col_names()`, which is called inside the main function to create bucket names that are meaningful.

This is what my functions look like:

from itertools import chain

from pyspark.ml.feature import Bucketizer
from pyspark.sql import DataFrame
from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType


def binning_data(df: DataFrame, bin_data_list: dict) -> DataFrame:
    """ This method creates bins for numerical values based on the columns and the bins
    defined in the parameter.yml file.

    Args:
        :param df: Data frame which has the columns to be binned.
        :param bin_data_list: A dict mapping column names to their bin ranges.
    Returns:
        :rtype: df: Output data frame
    """

    for col_data in bin_data_list:
        column_name = col_data.upper()

        bin_list = bin_data_list[col_data]
        bin_list.insert(0, -float("inf"))
        bin_list.append(float("inf"))
        splits = bin_list

        df = df.withColumn(column_name, df[column_name].cast(DoubleType()))

        bucketizer = Bucketizer(
            splits=splits, inputCol=column_name, outputCol=column_name + "_bin123"
        )
        df = bucketizer.setHandleInvalid("keep").transform(df)

        d = create_bin_col_names(bin_data_list[col_data])

        mapping_expr = f.create_map([f.lit(x) for x in chain(*d.items())])
        df = df.withColumn(
            column_name, mapping_expr[f.col(column_name + "_bin123")]
        ).drop(f.col(column_name + "_bin123"))

    return df

def create_bin_col_names(bin_list: list) -> dict:
    """ This function creates names for bins.

        Args:
            :param bin_list: A list which contains bins
        Returns:
            :rtype: dict: Output data dictionary with bin names
    """

    d = {}
    x = 0
    for item in bin_list:
        x += 1
        if x == 1:
            d.update({x - 1: "<" + str(item).strip()})
        else:
            d.update({x - 1: str(prev_name).strip() + "-" + str(item).strip()})
        prev_name = item
    d.update({x: ">" + str(item).strip()})

    return d
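For reference, here is what `create_bin_col_names` produces for a small sample bin list; the function body is copied so the snippet runs on its own. For splits `[-inf, 10, 20, 30, inf]`, `Bucketizer` can emit bucket indices 0 through 3, and the helper produces exactly one label per index:

```python
def create_bin_col_names(bin_list: list) -> dict:
    # Same logic as the helper above, copied so this snippet is self-contained.
    d = {}
    x = 0
    for item in bin_list:
        x += 1
        if x == 1:
            d.update({x - 1: "<" + str(item).strip()})
        else:
            d.update({x - 1: str(prev_name).strip() + "-" + str(item).strip()})
        prev_name = item
    d.update({x: ">" + str(item).strip()})
    return d


print(create_bin_col_names([10, 20, 30]))
# {0: '<10', 1: '10-20', 2: '20-30', 3: '>30'}
```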

This runs without errors, but the assigned bucket names do not seem accurate. It always seems to assign one fewer bucket class than it ideally should.

What am I doing wrong here? If there is another, less convoluted way of doing this, I am open to trying that too.
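One thing worth checking (a sketch of a possible cause, not a confirmed diagnosis): in `binning_data`, `bin_list` is the same object as `bin_data_list[col_data]`, so the `insert`/`append` of ±infinity mutate the stored list before it is passed to `create_bin_col_names`. The helper then sees the infinities as extra boundaries and every label shifts by one bucket. The snippet below reproduces this with a hypothetical `bin_ranges` dict standing in for `bin_data_list`, using a copy of the helper so it runs standalone:

```python
from math import inf

def create_bin_col_names(bin_list):
    # Copy of the helper above so this snippet is self-contained.
    d = {}
    x = 0
    for item in bin_list:
        x += 1
        if x == 1:
            d.update({x - 1: "<" + str(item).strip()})
        else:
            d.update({x - 1: str(prev_name).strip() + "-" + str(item).strip()})
        prev_name = item
    d.update({x: ">" + str(item).strip()})
    return d


bin_ranges = {"AGE": [10, 20, 30]}  # hypothetical stand-in for bin_data_list
bin_list = bin_ranges["AGE"]        # same list object, not a copy
bin_list.insert(0, -inf)            # mutates bin_ranges["AGE"] too
bin_list.append(inf)

# The helper now sees the infinities as boundaries, shifting every label:
print(create_bin_col_names(bin_ranges["AGE"]))
# {0: '<-inf', 1: '-inf-10', 2: '10-20', 3: '20-30', 4: '30-inf', 5: '>inf'}
# Bucketizer's bucket 3 (values >= 30) is labelled '20-30' instead of '>30'.

# Building the splits without mutating the stored list avoids the shift:
clean = {"AGE": [10, 20, 30]}
splits = [-inf] + clean["AGE"] + [inf]
print(create_bin_col_names(clean["AGE"]))
# {0: '<10', 1: '10-20', 2: '20-30', 3: '>30'}
```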

python · pyspark · binning

Comments


A: No answers yet