PySpark: Replace null values with empty list

Question:

I outer joined the results of two groupBy and collect_set operations and ended up with this DataFrame (foo):

>>> foo.show(3)
+---+------+------+
| id|    c1|    c2|
+---+------+------+
|  0|  null|   [1]|
|  7|   [6]|  null|
|  6|   [6]|[7, 8]|
+---+------+------+

I want to concatenate c1 and c2 together to get this result:

+---+------+------+---------+
| id|    c1|    c2|      res|
+---+------+------+---------+
|  0|  null|   [1]|      [1]|
|  7|   [6]|  null|      [6]|
|  6|   [6]|[7, 8]|[6, 7, 8]|
+---+------+------+---------+

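To do this, I need to coalesce the null values in c1 and c2. However, I don't even know what the data types of c1 and c2 are. How can I replace the null values with [] so that the concatenation of c1 and c2 yields res as shown above?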

This is how I'm currently concatenating the two columns:

# Concat returns null for rows where either column is null
foo.selectExpr(
    'id',
    'c1',
    'c2',
    'concat(c1, c2) as res'
)

Tags: python apache-spark pyspark null

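Answer:

from pyspark.sql.functions import expr

# The sample data models the missing values as [None] (an array holding one null element).
df = spark.createDataFrame([(0,[None],[1]),(7,[6],[None]),(6,[6],[7,8])],['id','c1','c2'])
# array_union merges the arrays without duplicates; array_except then drops the null element.
df.withColumn("res",expr("""array_except(array_union(c1,c2),array(null))""")).show()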
#+---+------+------+---------+
#| id|    c1|    c2|      res|
#+---+------+------+---------+
#|  0|[null]|   [1]|      [1]|
#|  7|   [6]|[null]|      [6]|
#|  6|   [6]|[7, 8]|[6, 7, 8]|
#+---+------+------+---------+
