Airflow tasks getting killed with "Scheduler heartbeat got an exception" error

Asked by: Vipul Pandey  Asked: 4/2/2019  Last edited by: Hussein Awala  Updated: 3/2/2023  Views: 4785

Q:

I have a DAG that runs 4 tasks, all of which are Bash operators. I recently moved to Airflow version 1.10.2, and I frequently see the following error:

ERROR - Scheduler heartbeat got an exception: (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') (Background on this error at: http://sqlalche.me/e/e3q8)

I am using MySQL as the metadata backend. I checked the value of the innodb_lock_wait_timeout variable in MySQL:

mysql> show variables like 'innodb_lock_wait_timeout';
+--------------------------+-------+
| Variable_name            | Value |
+--------------------------+-------+
| innodb_lock_wait_timeout | 50    |
+--------------------------+-------+

That is not a very high value, yet I still get this issue. Does anyone know why this happens?

mysql airflow

A:

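0 votes  joebeeson  7/22/2020  #1

Your exception doesn't refer to a lock timeout; it indicates that a deadlock occurred. The innodb_lock_wait_timeout variable is not consulted for deadlocks when innodb_deadlock_detect is enabled, which it is by default.

Try setting innodb_deadlock_detect=OFF in your database configuration.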
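
To make the distinction above concrete, here is a minimal sketch that tells the two InnoDB locking failures apart by their MySQL error code. The codes 1205 and 1213 are MySQL's documented server error codes; the function name classify_lock_error is hypothetical and is not part of Airflow or any MySQL driver.

```python
# MySQL server error codes for the two locking failures discussed above.
ER_LOCK_WAIT_TIMEOUT = 1205  # "Lock wait timeout exceeded; try restarting transaction"
ER_LOCK_DEADLOCK = 1213      # "Deadlock found when trying to get lock; try restarting transaction"


def classify_lock_error(code):
    """Return which InnoDB locking failure a MySQL error code denotes.

    Hypothetical helper for illustration only.
    """
    if code == ER_LOCK_DEADLOCK:
        return "deadlock"
    if code == ER_LOCK_WAIT_TIMEOUT:
        return "lock_wait_timeout"
    return "other"


# The error in the question carries code 1213, so it is a deadlock,
# not a lock wait timeout.
print(classify_lock_error(1213))
```

Only code 1205 is governed by innodb_lock_wait_timeout while deadlock detection is on, which is why the value of 50 seconds has no bearing on the 1213 error in the question.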

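-1 votes  Md Alamgir Hossain  3/2/2023  #2

The error message you are seeing indicates that a scheduler operation against the MySQL database that Airflow uses for metadata storage failed. The error specifically mentions a deadlock, which can occur when two or more processes each wait for the other to release a lock on a resource.

You can try the following approaches to resolve this issue: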

Increase the innodb_lock_wait_timeout variable in MySQL to a higher value, such as 120 seconds or more. This will give the scheduler process more time to wait for the lock to be released before giving up and throwing an error.

Check if there are any other processes running on the same MySQL database that could be causing contention for locks. This could include other Airflow instances or other applications that are accessing the same database. If possible, try to isolate the Airflow instance to its own database or server.

Check the scheduler logs for any other errors or warnings that could be related to the issue. There may be other issues that are causing the scheduler to fail, such as network connectivity problems or database configuration issues.

Upgrade to a more recent version of Airflow. The version you are using (1.10.2) is quite old, and there have been many bug fixes and performance improvements since then. Upgrading to a newer version may resolve the issue you are seeing.

Consider switching to a different metadata backend, such as PostgreSQL or SQLite. These databases may have better performance and reliability characteristics than MySQL in certain situations. However, switching to a different database will require some additional setup and configuration work.

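In general, deadlocks can be difficult to diagnose and resolve because they can be caused by many factors. If none of the suggestions above work, you may need to ask the Airflow community or a database expert for help in identifying and resolving the root cause of the deadlock.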
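
The MySQL error text itself says "try restarting transaction", so one generic mitigation in application code is to retry a transaction that was rolled back by a deadlock. The sketch below illustrates that pattern only; it is not Airflow's scheduler code. FakeOperationalError, is_deadlock and run_with_deadlock_retry are hypothetical names, and the sketch assumes the driver exception carries the MySQL error code as its first argument (with SQLAlchemy, the driver error would typically be wrapped and need unwrapping first).

```python
import random
import time

ER_LOCK_DEADLOCK = 1213  # MySQL error code for a detected deadlock


class FakeOperationalError(Exception):
    """Hypothetical stand-in for a driver error; args[0] holds the MySQL error code."""


def is_deadlock(exc):
    """Assumption: the MySQL error code is the first exception argument."""
    return bool(exc.args) and exc.args[0] == ER_LOCK_DEADLOCK


def run_with_deadlock_retry(txn, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call txn(); if it fails with a deadlock, back off and run it again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            # Re-raise anything that is not a deadlock, or the final failure.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing writers desynchronize.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))


# Usage: a simulated transaction that deadlocks twice, then succeeds.
calls = {"n": 0}


def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeOperationalError(
            ER_LOCK_DEADLOCK,
            "Deadlock found when trying to get lock; try restarting transaction",
        )
    return "committed"


result = run_with_deadlock_retry(flaky_txn, sleep=lambda seconds: None)
print(result, calls["n"])
```

The whole transaction is passed in as a callable because InnoDB rolls back the victim transaction on a deadlock, so every statement in it has to be re-executed, not only the one that failed.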