Spark YARN Capacity Scheduler

2024-6-1

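I am trying to set up the Capacity Scheduler in Amazon EMR with 2 queues in addition to the default queue. I have successfully created the queues user1 and user2, but when I run a script on one of the new queues with spark-submit, it gets stuck in the ACCEPTED state. Oddly, I can still submit applications to the default queue, and everything runs as expected there.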

I am currently using the DefaultResourceCalculator, but I also tried the DominantResourceCalculator, with the same result.

I looked through the logs, and they mostly look fine. Occasionally I see this warning:

2019-12-04 19:18:28,888 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start

This happens even though I have set maximum-am-resource-percent to 100% (I think). I also tried setting this property explicitly for each queue, without success.

This is how I submit the job:

spark-submit --master yarn --queue user1 test.py

The number of running (healthy) task/core nodes does not seem to make any difference.

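I am new to Spark administration, so any guidance is appreciated. Apart from replies saying that no resources are available, I could not find much information online. It is quite possible that my configuration does not allocate enough resources, but I would not expect that to be the problem given that the default queue still works. Here is my capacity-scheduler.xml for reference: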

<?xml version="1.0" encoding="UTF-8" ?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>

  <property>
    <name>yarn.scheduler.capacity.maximum-applications</name>
    <value>10000</value>
    <description>
      Maximum number of applications that can be pending and running.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>1.0</value>
    <description>
      Maximum percent of resources in the cluster which can be used to run
      application masters i.e. controls number of concurrent running
      applications.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
    <description>
      The ResourceCalculator implementation to be used to compare
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare
      multi-dimensional resources such as Memory, CPU etc.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>user1,user2,default</value>
    <description>
      The queues at this level (root is the root queue).
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>10</value>
    <description>Default queue target capacity.</description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
    <value>2</value>
    <description>
      Default queue user limit a percentage from 0.0 to 1.0.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
    <value>40</value>
    <description>
      The maximum capacity of the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.state</name>
    <value>RUNNING</value>
    <description>
      The state of the default queue. State can be one of RUNNING or STOPPED.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
    <value>*</value>
    <description>
      The ACL of who can submit jobs to the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
    <value>*</value>
    <description>
      The ACL of who can administer jobs on the default queue.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.node-locality-delay</name>
    <value>40</value>
    <description>
      Number of missed scheduling opportunities after which the CapacityScheduler
      attempts to schedule rack-local containers.
      Typically this should be set to the number of nodes in the cluster. By default it is set
      to approximately the number of nodes in one rack, which is 40.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings</name>
    <value />
    <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
    <value>false</value>
    <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
    <value>1</value>
    <description>
      Controls the number of OFF_SWITCH assignments allowed
      during a node's heartbeat. Increasing this value can improve
      scheduling rate for OFF_SWITCH containers. Lower values reduce
      "clumping" of applications on particular nodes. The default is 1.
      Legal values are 1-MAX_INT. This config is refreshable.
    </description>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
    <value>*</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.accessible-node-labels.CORE.capacity</name>
    <value>100</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
    <value>*</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity</name>
    <value>100</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.user1.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user1.user-limit-factor</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user1.maximum-capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.capacity</name>
    <value>40</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.user-limit-factor</name>
    <value>2</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.maximum-capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user1.state</name>
    <value>RUNNING</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.state</name>
    <value>RUNNING</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user1.acl_submit_applications</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.acl_submit_applications</name>
    <value>*</value>
  </property>
</configuration>

Thanks, Seth

Answer 1

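Starting with release 5.19.0, EMR uses YARN node labels for the different node groups (MASTER/CORE/TASK) [1]. For custom queues to work in this setup, you need to explicitly configure the mapping between queues and node labels.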

If you look at capacity-scheduler.xml, you will notice that EMR has already configured this for you for the root and default queues:

  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels</name>
    <value>*</value>
  </property>

  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity</name>
    <value>100</value>
  </property>

That is why running applications on the default queue works.

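So, for the custom queues to work, you have to set accessible-node-labels for each of them, together with the corresponding per-label capacity. You should be able to use the same capacity values as in the rest of your queue configuration.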
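
A minimal sketch of what this could look like for the queues in the question. It assumes the only label involved is the CORE label shown in the snippet above, and it simply mirrors the regular queue capacities from the question (50 for user1, 40 for user2, 10 for default). These values are illustrative rather than prescribed. As far as I know, the per-label capacities of sibling queues have to add up to 100, so the sketch also assumes that the CORE capacity of the default queue is lowered from 100 to 10:

```xml
<!-- Sketch only: merge these properties into the existing <configuration>
     element of capacity-scheduler.xml; do not replace the whole file. -->
<configuration>

  <!-- Let user1 run on labeled nodes and give it a share of the CORE label. -->
  <property>
    <name>yarn.scheduler.capacity.root.user1.accessible-node-labels</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user1.accessible-node-labels.CORE.capacity</name>
    <value>50</value>
  </property>

  <!-- Same for user2. -->
  <property>
    <name>yarn.scheduler.capacity.root.user2.accessible-node-labels</name>
    <value>*</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.user2.accessible-node-labels.CORE.capacity</name>
    <value>40</value>
  </property>

  <!-- Assumption: the existing value of 100 is reduced so that the CORE
       capacities of user1, user2 and default sum to 100. -->
  <property>
    <name>yarn.scheduler.capacity.root.default.accessible-node-labels.CORE.capacity</name>
    <value>10</value>
  </property>

</configuration>
```

The ResourceManager has to reload the queue configuration before the change takes effect, for example with yarn rmadmin -refreshQueues or by restarting the ResourceManager.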

[1] https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-master-core-task-nodes.html
