We have a Hadoop cluster with 487 DataNode machines (each DataNode machine also runs the NodeManager service). All of them are physical Dell machines running RHEL 7.9.
Each DataNode machine has 12 disks, each 12 TB in size.
The cluster was installed from HDP packages (formerly under Hortonworks, now under Cloudera).
Users are complaining that Spark applications running on the DataNode machines are slow.
After investigating, we see the following warnings in the DataNode logs:
2024-03-18 17:41:30,230 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 401ms (threshold=300ms), downstream DNs=[172.87.171.24:50010, 172.87.171.23:50010]
2024-03-18 17:41:49,795 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 410ms (threshold=300ms), downstream DNs=[172.87.171.26:50010, 172.87.171.31:50010]
2024-03-18 18:06:29,585 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 303ms (threshold=300ms), downstream DNs=[172.87.171.34:50010, 172.87.171.22:50010]
2024-03-18 18:18:55,931 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 729ms (threshold=300ms), downstream DNs=[172.87.11.27:50010]
From the logs above we can see the warning Slow BlockReceiver write packet to mirror took xxms, together with the downstream DataNode machines involved, e.g. 172.87.171.23 and 172.87.171.24.
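To see whether the slowness clusters around a few specific downstream DataNodes or is spread across the whole cluster, the warnings can be aggregated by peer IP. A small sketch (the DataNode log location is an assumption; adjust it to your HDP layout):

```shell
# Count how often each downstream DataNode IP appears in slow-mirror warnings read on stdin.
top_slow_mirrors() {
  grep 'Slow BlockReceiver write packet to mirror' \
    | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | sort | uniq -c | sort -rn
}

# usage (log path is an assumption; adjust to your layout):
# cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | top_slow_mirrors | head
```

If one or two IPs dominate the output, the problem is more likely on (or on the path to) those specific nodes than cluster-wide.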
As I understand it, the Slow BlockReceiver write packet to mirror warning indicates a delay forwarding the packet to the next DataNode in the write pipeline; this can mean network latency to the mirror, or backpressure from a slow downstream node writing the block to its OS cache or disk.
So I tried to collect the possible causes of this warning. Here is what I have:
- delays writing blocks to the OS cache or disks
- the cluster has reached or is close to its resource limits (memory, CPU, or disk)
- network problems between the machines
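For reference, the 300 ms value shown as threshold=300ms in the warnings comes from the DataNode setting dfs.datanode.slow.io.warning.threshold.ms (default 300). Raising it only hides the symptom, but knowing the knob confirms where the message comes from; a hypothetical hdfs-site.xml override would look like:

```xml
<!-- hdfs-site.xml: threshold (in ms) above which the DataNode logs "Slow ..." I/O warnings; default is 300 -->
<property>
  <name>dfs.datanode.slow.io.warning.threshold.ms</name>
  <value>300</value>
</property>
```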
Based on my checks, I did not see any disk, CPU, or memory problems; we checked all the machines.
On the network side, I did not see any particular problem related to the machines themselves.
We also used iperf3 to check the bandwidth between machines.
Below is an example between data-node01 and data-node03 (as I understand it, and please correct me if I am wrong, the bandwidth looks fine).
From data-node01 (server side):
iperf3 -i 10 -s
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-10.00 sec 7.90 GBytes 6.78 Gbits/sec
[ 5] 10.00-20.00 sec 8.21 GBytes 7.05 Gbits/sec
[ 5] 20.00-30.00 sec 7.25 GBytes 6.23 Gbits/sec
[ 5] 30.00-40.00 sec 7.16 GBytes 6.15 Gbits/sec
[ 5] 40.00-50.00 sec 7.08 GBytes 6.08 Gbits/sec
[ 5] 50.00-60.00 sec 6.27 GBytes 5.39 Gbits/sec
[ 5] 60.00-60.04 sec 35.4 MBytes 7.51 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-60.04 sec 0.00 Bytes 0.00 bits/sec sender
[ 5] 0.00-60.04 sec 43.9 GBytes 6.28 Gbits/sec receiver
From data-node03 (client side):
iperf3 -i 1 -t 60 -c 172.87.171.84
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 792 MBytes 6.64 Gbits/sec 0 3.02 MBytes
[ 4] 1.00-2.00 sec 834 MBytes 6.99 Gbits/sec 54 2.26 MBytes
[ 4] 2.00-3.00 sec 960 MBytes 8.05 Gbits/sec 0 2.49 MBytes
[ 4] 3.00-4.00 sec 896 MBytes 7.52 Gbits/sec 0 2.62 MBytes
[ 4] 4.00-5.00 sec 790 MBytes 6.63 Gbits/sec 0 2.70 MBytes
[ 4] 5.00-6.00 sec 838 MBytes 7.03 Gbits/sec 4 1.97 MBytes
[ 4] 6.00-7.00 sec 816 MBytes 6.85 Gbits/sec 0 2.17 MBytes
[ 4] 7.00-8.00 sec 728 MBytes 6.10 Gbits/sec 0 2.37 MBytes
[ 4] 8.00-9.00 sec 692 MBytes 5.81 Gbits/sec 47 1.74 MBytes
[ 4] 9.00-10.00 sec 778 MBytes 6.52 Gbits/sec 0 1.91 MBytes
[ 4] 10.00-11.00 sec 785 MBytes 6.58 Gbits/sec 48 1.57 MBytes
[ 4] 11.00-12.00 sec 861 MBytes 7.23 Gbits/sec 0 1.84 MBytes
[ 4] 12.00-13.00 sec 844 MBytes 7.08 Gbits/sec 0 1.96 MBytes
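One thing that stands out in the client output above is the Retr column: bursts of up to ~50 TCP retransmissions per second while pushing ~7 Gbit/s. That can be benign congestion, but it is worth watching the kernel's cumulative retransmission counters while the Spark jobs are actually running. A sketch of a small filter (counter wording varies slightly between kernel versions):

```shell
# Print TCP retransmission/timeout-related counter lines from `netstat -s` output on stdin.
tcp_retrans() {
  grep -iE 'retrans|timeout'
}

# usage: run twice a few minutes apart while jobs run, and compare the deltas:
# netstat -s | tcp_retrans
```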
Note: the NICs are 10 Gb/s (we verified this with ethtool).
We also checked the NIC firmware version:
ethtool -i p1p1
driver: i40e
version: 2.8.20-k
firmware-version: 8.40 0x8000af82 20.5.13
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
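ethtool -i only shows the driver and firmware; for intermittent latency, the per-queue drop and error counters from ethtool -S are more telling. A helper that keeps only the non-zero drop/error counters (the interface name p1p1 is taken from your output; the exact counter names depend on the i40e driver):

```shell
# From `ethtool -S` output on stdin, keep only drop/error/discard counters with a non-zero value.
nic_errors() {
  grep -iE 'drop|err|discard' | awk -F: '$2 + 0 > 0'
}

# usage:
# ethtool -S p1p1 | nic_errors
# ip -s link show p1p1    # RX/TX errors and drops at the interface level
```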
We also checked the kernel messages (dmesg) but did not find anything unusual.
Here are some more samples from one of the machines; they are typical of all the machines.
From dmesg:
[ 8910.032800] perf: interrupt took too long (3264 > 3131), lowering kernel.perf_event_max_sample_rate to 61000
[10472.999301] perf: interrupt took too long (4112 > 4080), lowering kernel.perf_event_max_sample_rate to 48000
[13881.302673] perf: interrupt took too long (5231 > 5140), lowering kernel.perf_event_max_sample_rate to 38000
[19118.612768] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
[24899.873110] perf: interrupt took too long (6539 > 6538), lowering kernel.perf_event_max_sample_rate to 30000
[241682.630383] perf: interrupt took too long (8182 > 8173), lowering kernel.perf_event_max_sample_rate to 24000
From ethtool:
Settings for p5p1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseX/Full
10000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseX/Full
10000baseSR/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
From smartctl:
smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST2000NM0155
Revision: DT31
Compliance: SPC-4
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5009484f99f
Serial number: ZC21D83H
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Mar 19 15:57:43 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 29 C
Drive Trip Temperature: 60 C
Manufactured in week 33 of year 2017
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 93
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 1820
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 186667170
Blocks received from initiator = 1275068102
Blocks read from cache and sent to initiator = 1770226642
Number of read and write commands whose size <= segment size = 2229185660
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 41721.87
number of minutes until next internal SMART test = 53
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 3668927007 9 0 3668927016 9 1197190.371 0
write: 0 0 386 386 386 849663.857 0
verify: 8968 0 0 8968 0 0.000 0
Non-medium error count: 662
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed 96 3 - [- - -]
Long (extended) Self-test duration: 13740 seconds [229.0 minutes]
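SMART Health Status: OK and an empty grown defect list look fine for this particular drive, but with 12 disks per node across 487 nodes it is worth sweeping all of them; read errors corrected by delayed rereads (9 on this drive) are the kind that surface as latency rather than outright failures. A sketch (the /dev/sd[a-l] device naming is an assumption):

```shell
# Extract the SMART health verdict line from `smartctl -H` output on stdin
# (matches both the SAS "Health Status" and SATA "overall-health self-assessment" wording).
smart_health() {
  grep -iE 'health (status|self-assessment)'
}

# usage:
# for d in /dev/sd[a-l]; do printf '%s: ' "$d"; smartctl -H "$d" | smart_health; done
```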
From iostat on one of the DataNodes:
avg-cpu: %user %nice %system %iowait %steal %idle
45.81 0.00 3.92 4.09 0.00 46.18
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdh 4.75 26.19 155.21 22283500 132053386
sde 126.62 13137.47 3707.88 11177781537 3154781160
sdf 96.12 12452.01 1084.80 10594563333 922986408
sdg 96.70 13152.62 1084.87 11190668113 923042672
sdb 116.82 13834.69 3637.37 11770994845 3094793700
sda 116.94 13900.15 3659.90 11826688565 3113955968
sdc 120.05 13680.79 4497.98 11640055245 3827023720
sdd 84.66 12973.68 1044.57 11038424341 888755400
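The plain iostat output above only shows throughput, while the warning is about latency, so await and %util from iostat -x are the more relevant columns. A small filter that flags devices whose average wait exceeds a threshold (the await column is located from the header, so it should work across sysstat versions; the 50 ms default is an arbitrary assumption):

```shell
# From `iostat -x` output on stdin, print devices whose await exceeds a threshold in ms (default 50).
slow_disks() {
  awk -v thr="${1:-50}" '
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    col && NF >= col && $col + 0 > thr { print $1, $col }
  '
}

# usage: take three 5-second samples and flag anything averaging above 50 ms:
# iostat -x 5 3 | slow_disks 50
```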
From free -g:
free -g
total used free shared buff/cache available
Mem: 314 234 0 2 79 76
Swap: 15 1 14
From dmesg, regarding the CPUs:
dmesg | grep CPU
[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
[ 0.000000] smpboot: Ignoring 160 unusable CPUs in ACPI table
[ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:2
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=2
[ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=32.
[ 0.184771] CPU0: Thermal monitoring enabled (TM1)
[ 0.184943] TAA: Vulnerable: Clear CPU buffers attempted, no microcode
[ 0.184944] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 0.324340] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (fam: 06, model: 4f, stepping: 01)
[ 0.327772] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 0.408126] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 0.436824] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.436828] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[ 0.464933] Brought up 32 CPUs
[ 3.223989] acpi LNXCPU:7e: hash matches
[ 49.145592] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.