AWS pcluster fails with MasterServerWaitCondition "Received FAILURE signal" and an iptables/chef version error

I am trying to create an AMI for parallelcluster. I started from Amazon's stock AMI (ami-0436692c7b452bae4 for us-west-2, my region, and alinux) and modified it slightly by adding a few packages.

However, when I run pcluster create foo --norollback I get this error:

Beginning cluster creation for cluster: stockAWS
Creating stack named: parallelcluster-stockAWS
Status: parallelcluster-stockAWS - ROLLBACK_IN_PROGRESS                         
Cluster creation failed.  Failed events:
  - AWS::AutoScaling::AutoScalingGroup ComputeFleet Resource creation cancelled
  - AWS::CloudFormation::WaitCondition MasterServerWaitCondition Received FAILURE signal with UniqueId i-booyaa

I then ran ssh foo and looked at the logs. /var/log/cfncluster-init.log contains a very long error log; the tail end of it is reproduced here:

2021-07-28 23:16:49,659 [ERROR] Command chef (chef-client --local-mode --config /etc/chef/client.rb --log_level auto --force-formatter --no-color --chef-zero-port 8889 --json-attributes /etc/chef/dna.json --override-runlist aws-parallelcluster::_prep_env) failed
2021-07-28 23:16:49,659 [DEBUG] Command chef output: Starting Chef Client, version 14.2.0
[2021-07-28T23:16:47+00:00] WARN: Run List override has been provided.
[2021-07-28T23:16:47+00:00] WARN: Run List override has been provided.
[2021-07-28T23:16:47+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2021-07-28T23:16:47+00:00] WARN: Original Run List: [recipe[aws-parallelcluster::slurm_config]]
[2021-07-28T23:16:47+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
[2021-07-28T23:16:47+00:00] WARN: Overridden Run List: [recipe[aws-parallelcluster::_prep_env]]
resolving cookbooks for run list: ["aws-parallelcluster::_prep_env"]
Synchronizing Cookbooks:
  - aws-parallelcluster (2.5.1)
  - poise-python (1.7.0)
  - tar (2.1.1)
  - selinux (2.1.1)
  - nfs (2.6.4)
  - yum (5.1.0)
  - yum-epel (3.1.0)
  - openssh (2.6.3)
  - apt (7.0.0)
  - hostname (0.4.2)
  - line (2.4.1)
  - ulimit (1.0.0)
  - pyenv (3.1.1)
  - kernel_module (1.1.2)
  - poise (2.8.2)
  - poise-languages (2.1.2)
  - iptables (8.0.0)
  - hostsfile (3.0.1)
  - poise-archive (1.5.0)

Running handlers:
[2021-07-28T23:16:49+00:00] ERROR: Running exception handlers
[2021-07-28T23:16:49+00:00] ERROR: Running exception handlers
Running handlers complete
[2021-07-28T23:16:49+00:00] ERROR: Exception handlers complete
[2021-07-28T23:16:49+00:00] ERROR: Exception handlers complete
Chef Client failed. 0 resources updated in 11 seconds
[2021-07-28T23:16:49+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2021-07-28T23:16:49+00:00] FATAL: Stacktrace dumped to /etc/chef/local-mode-cache/cache/chef-stacktrace.out
[2021-07-28T23:16:49+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-07-28T23:16:49+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2021-07-28T23:16:49+00:00] FATAL: Chef::Exceptions::CookbookChefVersionMismatch: Cookbook 'iptables' version '8.0.0' depends on chef version [">= 15.3"], but the running chef version is 14.2.0
[2021-07-28T23:16:49+00:00] FATAL: Chef::Exceptions::CookbookChefVersionMismatch: Cookbook 'iptables' version '8.0.0' depends on chef version [">= 15.3"], but the running chef version is 14.2.0

2021-07-28 23:16:49,659 [ERROR] Error encountered during build of chefPrepEnv: Command chef failed
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py", line 573, in run_config
    CloudFormationCarpenter(config, self._auth_config).build(worklog)
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/construction.py", line 273, in build
    self._config.commands)
  File "/usr/lib/python3.7/site-packages/cfnbootstrap/command_tool.py", line 127, in apply
    raise ToolError(u"Command %s failed" % name)
cfnbootstrap.construction_errors.ToolError: Command chef failed
2021-07-28 23:16:49,661 [ERROR] -----------------------BUILD FAILED!------------------------

If I run iptables --version I get v1.8.4. Running it with sudo gives the same result. chef is 14.2.0.
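For reference, the FATAL line in the log names a cookbook ('iptables' version '8.0.0') and a Chef version requirement, so the version it reports is the cookbook's. A minimal Python sketch for pulling those values out of the log text and checking the requirement is below. The function names `parse_version` and `find_mismatch` are made up for illustration, and the pattern assumes the message has exactly the shape shown in this log, with a single ">=" constraint.

```python
import re

# Matches the CookbookChefVersionMismatch message as it appears in this log.
# Assumption: a single ">=" constraint; other operators are not handled.
MISMATCH_RE = re.compile(
    r"Cookbook '(?P<cookbook>[^']+)' version '(?P<cookbook_version>[^']+)' "
    r'depends on chef version \[">= (?P<required>\d+(?:\.\d+)*)"\], '
    r"but the running chef version is (?P<running>\d+(?:\.\d+)*)"
)


def parse_version(text, width=3):
    """Turn '14.2.0' into (14, 2, 0), padding short versions with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def find_mismatch(log_text):
    """Return details of the first version mismatch in log_text, or None."""
    match = MISMATCH_RE.search(log_text)
    if match is None:
        return None
    info = match.groupdict()
    # True only if the running chef meets the cookbook's minimum version.
    info["satisfied"] = parse_version(info["running"]) >= parse_version(info["required"])
    return info
```

Feeding it the contents of /var/log/cfncluster-init.log (for example `find_mismatch(open("/var/log/cfncluster-init.log").read())`) should return a dictionary with the cookbook name, the required chef version, the running chef version, and whether the requirement is met.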

Frustratingly, if I create the parallelcluster stack with the unmodified AWS AMI, I get exactly the same behavior. What is going on?
