Asked by: Andy_101 Asked: 11/17/2023 Updated: 11/17/2023 Views: 16
Spark write working with memory errors in Shell but not with spark-submit
Q:
I am trying to read 70 GB of data, apply a filter, and write the output to another S3 location (I have added a coalesce(1000) before the write). However, this simple job fails with the following error when run with spark-submit:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 5782, ip-10-70-21-40.ap-south-1.compute.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 17.5 GB of 16 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
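For reference, here is a minimal Scala sketch of the kind of read/filter/coalesce/write job described above; the input/output paths, file format, and filter predicate are placeholders I have assumed, since the question does not show the actual code:

import org.apache.spark.sql.SparkSession

object FilterAndWrite {
  def main(args: Array[String]): Unit = {
    // Assumed session setup; the real job relies on the spark-submit / spark-shell configs shown below.
    val spark = SparkSession.builder().appName("filter-and-write").getOrCreate()

    // Hypothetical S3 locations and predicate, not taken from the question.
    val input  = "s3://source-bucket/input/"
    val output = "s3://target-bucket/output/"

    spark.read
      .parquet(input)                        // roughly 70 GB of source data
      .filter("event_date >= '2023-11-01'")  // placeholder filter condition
      .coalesce(1000)                        // coalesce(1000) before the write, as in the question
      .write
      .mode("overwrite")
      .parquet(output)

    spark.stop()
  }
}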
But when the same job is run in spark-shell, it runs and generates the data with a success file, while still logging the same errors continuously.
Submit configuration:
spark-submit --deploy-mode cluster \
  --driver-memory 8g \
  --executor-memory 26g \
  --conf spark.executor.cores=4 \
  --conf spark.executor.instances=10 \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.default.parallelism=400 \
  --conf spark.hadoop.fs.s3a.multipart.threshold=2097152000 \
  --conf spark.hadoop.fs.s3a.multipart.size=104857600 \
  --conf spark.hadoop.fs.s3a.maxRetries=4 \
  --conf spark.hadoop.fs.s3a.connection.maximum=500 \
  --conf spark.hadoop.fs.s3a.connection.timeout=600000 \
  --conf spark.executor.memoryOverhead=3g \
  --conf spark.sql.caseSensitive=true \
  --conf spark.task.maxFailures=4 \
  --conf spark.network.timeout=600s \
  --conf spark.sql.files.maxPartitionBytes=67108864 \
  --conf spark.yarn.maxAppAttempts=1
Shell configuration:
spark-shell --master yarn \
  --executor-memory 24g \
  --executor-cores 3 \
  --driver-memory 8g \
  --name shell \
  --jars s3://xyz \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
I have tried changing and increasing spark.executor.memoryOverhead, but even with absurdly large values I still hit similar errors.
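One simple way to compare the two environments is to print the configuration that is actually in effect inside each session; this is only a diagnostic sketch, not part of the original question, and the keys listed are standard Spark properties:

// Runs in spark-shell, where `spark` (a SparkSession) is predefined; in a submitted
// application, use the SparkSession you create instead.
val keysToCheck = Seq(
  "spark.executor.memory",
  "spark.executor.memoryOverhead",
  "spark.executor.cores",
  "spark.sql.shuffle.partitions",
  "spark.default.parallelism"
)
keysToCheck.foreach { key =>
  // Unset keys print the "<not set>" fallback, meaning Spark's built-in default applies.
  println(s"$key = ${spark.sparkContext.getConf.get(key, "<not set>")}")
}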
Can someone help me understand why this is happening?
A: No answers yet
Comments