We have a Hadoop cluster with 487 DataNode machines (each DataNode machine also runs the NodeManager service). All of them are physical Dell machines running RHEL 7.9.
Each DataNode machine has 12 disks, each 12 TB in size.
The cluster was installed from HDP packages (formerly under Hortonworks, now under Cloudera).
Users are complaining that Spark applications running on the DataNode machines are slow.
After investigating, we see the following warnings in the DataNode logs:
2024-03-18 17:41:30,230 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 401ms (threshold=300ms), downstream DNs=[172.87.171.24:50010, 172.87.171.23:50010]
2024-03-18 17:41:49,795 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 410ms (threshold=300ms), downstream DNs=[172.87.171.26:50010, 172.87.171.31:50010]
2024-03-18 18:06:29,585 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 303ms (threshold=300ms), downstream DNs=[172.87.171.34:50010, 172.87.171.22:50010]
2024-03-18 18:18:55,931 WARN datanode.DataNode (BlockReceiver.java:receivePacket(567)) - Slow BlockReceiver write packet to mirror took 729ms (threshold=300ms), downstream DNs=[172.87.11.27:50010]
From the logs above we can see the warning Slow BlockReceiver write packet to mirror took xxms, together with the downstream DataNode machines involved, e.g. 172.87.171.23 and 172.87.171.24.
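To see whether the slowness clusters around a few specific downstream DataNodes or is spread across the whole cluster, the warnings can be aggregated by peer IP. A small sketch (the DataNode log location is an assumption; adjust it to your HDP layout):

```shell
# Count how often each downstream DataNode IP appears in slow-mirror warnings read on stdin.
top_slow_mirrors() {
  grep 'Slow BlockReceiver write packet to mirror' \
    | grep -oE '[0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | sort | uniq -c | sort -rn
}

# usage (log path is an assumption; adjust to your layout):
# cat /var/log/hadoop/hdfs/hadoop-hdfs-datanode-*.log | top_slow_mirrors | head
```

If one or two IPs dominate the output, the problem is more likely on (or on the path to) those specific nodes than cluster-wide.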
As I understand it, the Slow BlockReceiver write packet to mirror warning indicates a delay forwarding the packet to the next DataNode in the write pipeline; this can mean network latency to the mirror, or backpressure from a slow downstream node writing the block to its OS cache or disk.
So I tried to collect the possible causes of this warning. Here is what I have:
- delays writing blocks to the OS cache or disks
- the cluster has reached or is close to its resource limits (memory, CPU, or disk)
- network problems between the machines
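For reference, the 300 ms value shown as threshold=300ms in the warnings comes from the DataNode setting dfs.datanode.slow.io.warning.threshold.ms (default 300). Raising it only hides the symptom, but knowing the knob confirms where the message comes from; a hypothetical hdfs-site.xml override would look like:

```xml
<!-- hdfs-site.xml: threshold (in ms) above which the DataNode logs "Slow ..." I/O warnings; default is 300 -->
<property>
  <name>dfs.datanode.slow.io.warning.threshold.ms</name>
  <value>300</value>
</property>
```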
Based on my checks, I did not see any disk, CPU, or memory problems; we checked all the machines.
On the network side, I did not see any particular problem related to the machines themselves.
We also used iperf3 to check the bandwidth between machines.
Below is an example between data-node01 and data-node03 (as I understand it, and please correct me if I am wrong, the bandwidth looks fine).
From data-node01 (server side):
iperf3 -i 10 -s
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-10.00 sec 7.90 GBytes 6.78 Gbits/sec
[ 5] 10.00-20.00 sec 8.21 GBytes 7.05 Gbits/sec
[ 5] 20.00-30.00 sec 7.25 GBytes 6.23 Gbits/sec
[ 5] 30.00-40.00 sec 7.16 GBytes 6.15 Gbits/sec
[ 5] 40.00-50.00 sec 7.08 GBytes 6.08 Gbits/sec
[ 5] 50.00-60.00 sec 6.27 GBytes 5.39 Gbits/sec
[ 5] 60.00-60.04 sec 35.4 MBytes 7.51 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-60.04 sec 0.00 Bytes 0.00 bits/sec sender
[ 5] 0.00-60.04 sec 43.9 GBytes 6.28 Gbits/sec receiver
From data-node03 (client side):
iperf3 -i 1 -t 60 -c 172.87.171.84
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 792 MBytes 6.64 Gbits/sec 0 3.02 MBytes
[ 4] 1.00-2.00 sec 834 MBytes 6.99 Gbits/sec 54 2.26 MBytes
[ 4] 2.00-3.00 sec 960 MBytes 8.05 Gbits/sec 0 2.49 MBytes
[ 4] 3.00-4.00 sec 896 MBytes 7.52 Gbits/sec 0 2.62 MBytes
[ 4] 4.00-5.00 sec 790 MBytes 6.63 Gbits/sec 0 2.70 MBytes
[ 4] 5.00-6.00 sec 838 MBytes 7.03 Gbits/sec 4 1.97 MBytes
[ 4] 6.00-7.00 sec 816 MBytes 6.85 Gbits/sec 0 2.17 MBytes
[ 4] 7.00-8.00 sec 728 MBytes 6.10 Gbits/sec 0 2.37 MBytes
[ 4] 8.00-9.00 sec 692 MBytes 5.81 Gbits/sec 47 1.74 MBytes
[ 4] 9.00-10.00 sec 778 MBytes 6.52 Gbits/sec 0 1.91 MBytes
[ 4] 10.00-11.00 sec 785 MBytes 6.58 Gbits/sec 48 1.57 MBytes
[ 4] 11.00-12.00 sec 861 MBytes 7.23 Gbits/sec 0 1.84 MBytes
[ 4] 12.00-13.00 sec 844 MBytes 7.08 Gbits/sec 0 1.96 MBytes
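One thing that stands out in the client output above is the Retr column: bursts of up to ~50 TCP retransmissions per second while pushing ~7 Gbit/s. That can be benign congestion, but it is worth watching the kernel's cumulative retransmission counters while the Spark jobs are actually running. A sketch of a small filter (counter wording varies slightly between kernel versions):

```shell
# Print TCP retransmission/timeout-related counter lines from `netstat -s` output on stdin.
tcp_retrans() {
  grep -iE 'retrans|timeout'
}

# usage: run twice a few minutes apart while jobs run, and compare the deltas:
# netstat -s | tcp_retrans
```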
Note: the NICs are 10 Gb/s (we verified this with ethtool).
We also checked the NIC firmware version:
ethtool -i p1p1
driver: i40e
version: 2.8.20-k
firmware-version: 8.40 0x8000af82 20.5.13
expansion-rom-version:
bus-info: 0000:3b:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
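ethtool -i only shows the driver and firmware; for intermittent latency, the per-queue drop and error counters from ethtool -S are more telling. A helper that keeps only the non-zero drop/error counters (the interface name p1p1 is taken from your output; the exact counter names depend on the i40e driver):

```shell
# From `ethtool -S` output on stdin, keep only drop/error/discard counters with a non-zero value.
nic_errors() {
  grep -iE 'drop|err|discard' | awk -F: '$2 + 0 > 0'
}

# usage:
# ethtool -S p1p1 | nic_errors
# ip -s link show p1p1    # RX/TX errors and drops at the interface level
```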
We also checked the kernel messages (dmesg) but did not find anything unusual.
Here are some more samples from one of the machines; they are typical of all the machines.
From dmesg:
[ 8910.032800] perf: interrupt took too long (3264 > 3131), lowering kernel.perf_event_max_sample_rate to 61000
[10472.999301] perf: interrupt took too long (4112 > 4080), lowering kernel.perf_event_max_sample_rate to 48000
[13881.302673] perf: interrupt took too long (5231 > 5140), lowering kernel.perf_event_max_sample_rate to 38000
[19118.612768] warning: `lshw' uses legacy ethtool link settings API, link modes are only partially reported
[24899.873110] perf: interrupt took too long (6539 > 6538), lowering kernel.perf_event_max_sample_rate to 30000
[241682.630383] perf: interrupt took too long (8182 > 8173), lowering kernel.perf_event_max_sample_rate to 24000
From ethtool:
Settings for p5p1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseX/Full
10000baseSR/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes: 1000baseX/Full
10000baseSR/Full
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: internal
Auto-negotiation: off
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000007 (7)
drv probe link
Link detected: yes
From smartctl:
smartctl -a /dev/sdb
smartctl 7.0 2018-12-30 r4883 [x86_64-linux-3.10.0-1160.el7.x86_64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST2000NM0155
Revision: DT31
Compliance: SPC-4
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5009484f99f
Serial number: ZC21D83H
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Tue Mar 19 15:57:43 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature: 29 C
Drive Trip Temperature: 60 C
Manufactured in week 33 of year 2017
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 93
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 1820
Elements in grown defect list: 0
Vendor (Seagate Cache) information
Blocks sent to initiator = 186667170
Blocks received from initiator = 1275068102
Blocks read from cache and sent to initiator = 1770226642
Number of read and write commands whose size <= segment size = 2229185660
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 41721.87
number of minutes until next internal SMART test = 53
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 3668927007 9 0 3668927016 9 1197190.371 0
write: 0 0 386 386 386 849663.857 0
verify: 8968 0 0 8968 0 0.000 0
Non-medium error count: 662
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed 96 3 - [- - -]
Long (extended) Self-test duration: 13740 seconds [229.0 minutes]
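SMART Health Status: OK and an empty grown defect list look fine for this particular drive, but with 12 disks per node across 487 nodes it is worth sweeping all of them; read errors corrected by delayed rereads (9 on this drive) are the kind that surface as latency rather than outright failures. A sketch (the /dev/sd[a-l] device naming is an assumption):

```shell
# Extract the SMART health verdict line from `smartctl -H` output on stdin
# (matches both the SAS "Health Status" and SATA "overall-health self-assessment" wording).
smart_health() {
  grep -iE 'health (status|self-assessment)'
}

# usage:
# for d in /dev/sd[a-l]; do printf '%s: ' "$d"; smartctl -H "$d" | smart_health; done
```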
From iostat on one of the DataNodes:
avg-cpu: %user %nice %system %iowait %steal %idle
45.81 0.00 3.92 4.09 0.00 46.18
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sdh 4.75 26.19 155.21 22283500 132053386
sde 126.62 13137.47 3707.88 11177781537 3154781160
sdf 96.12 12452.01 1084.80 10594563333 922986408
sdg 96.70 13152.62 1084.87 11190668113 923042672
sdb 116.82 13834.69 3637.37 11770994845 3094793700
sda 116.94 13900.15 3659.90 11826688565 3113955968
sdc 120.05 13680.79 4497.98 11640055245 3827023720
sdd 84.66 12973.68 1044.57 11038424341 888755400
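The plain iostat output above only shows throughput, while the warning is about latency, so await and %util from iostat -x are the more relevant columns. A small filter that flags devices whose average wait exceeds a threshold (the await column is located from the header, so it should work across sysstat versions; the 50 ms default is an arbitrary assumption):

```shell
# From `iostat -x` output on stdin, print devices whose await exceeds a threshold in ms (default 50).
slow_disks() {
  awk -v thr="${1:-50}" '
    /^Device/ { for (i = 1; i <= NF; i++) if ($i == "await") col = i; next }
    col && NF >= col && $col + 0 > thr { print $1, $col }
  '
}

# usage: take three 5-second samples and flag anything averaging above 50 ms:
# iostat -x 5 3 | slow_disks 50
```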
From free -g:
free -g
total used free shared buff/cache available
Mem: 314 234 0 2 79 76
Swap: 15 1 14
From dmesg, regarding the CPUs:
dmesg | grep CPU
[ 0.000000] smpboot: Allowing 32 CPUs, 0 hotplug CPUs
[ 0.000000] smpboot: Ignoring 160 unusable CPUs in ACPI table
[ 0.000000] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:32 nr_cpu_ids:32 nr_node_ids:2
[ 0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=2
[ 0.000000] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=32.
[ 0.184771] CPU0: Thermal monitoring enabled (TM1)
[ 0.184943] TAA: Vulnerable: Clear CPU buffers attempted, no microcode
[ 0.184944] MDS: Vulnerable: Clear CPU buffers attempted, no microcode
[ 0.324340] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz (fam: 06, model: 4f, stepping: 01)
[ 0.327772] smpboot: CPU 1 Converting physical 0 to logical die 1
[ 0.408126] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 0.436824] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.436828] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[ 0.464933] Brought up 32 CPUs
[ 3.223989] acpi LNXCPU:7e: hash matches
[ 49.145592] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.