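We have a Hadoop cluster with active/standby Resource Manager services: the active Resource Manager runs on the master1 machine and the standby Resource Manager runs on the master2 machine.
In our cluster, the YARN service, which includes the Resource Manager service, manages 276 NodeManager components on the worker machines.
In the Ambari web UI alerts (the Resource Manager alert) we noticed the following: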
Resource Manager Web UI
Connection failed to http://master2.jupiter.com:8088 (timed out)
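We started debugging the problem with wget against port 8088 and found that the process hangs at "HTTP request sent, awaiting response..." with no data received.
Example, run from the Resource Manager machine: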
wget --debug http://master2.jupiter.com:8088
DEBUG output created by Wget 1.14 on Linux-gnu.
URI encoding = ‘UTF-8’
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master2.jupiter.com:8088/
Resolving master2.jupiter.com (master2.jupiter.com)... 192.9.201.169
Caching master2.jupiter.com => 192.9.201.169
Connecting to master2.jupiter.com (master2.jupiter.com)|192.9.201.169|:8088... connected.
Created socket 3.
Releasing 0x0000000000a0da00 (new refcount 1).
---request begin---
GET / HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master2.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 307 TEMPORARY_REDIRECT
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:13:42 GMT
Date: Wed, 21 Feb 2024 10:13:42 GMT
Pragma: no-cache
Content-Type: text/plain; charset=UTF-8
X-Frame-Options: SAMEORIGIN
Location: http://master1.jupiter.com:8088/
Content-Length: 43
Server: Jetty(6.1.26.hwx)
---response end---
307 TEMPORARY_REDIRECT
Registered socket 3 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/ [following]
Skipping 43 bytes of body: [This is standby RM. The redirect url is: /
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:13:42-- http://master1.jupiter.com:8088/
conaddr is: 192.9.201.169
Resolving master1.jupiter.com (master1.jupiter.com)... 192.9.66.14
Caching master1.jupiter.com => 192.9.66.14
Releasing 0x0000000000a0f320 (new refcount 1).
Found master1.jupiter.com in host_name_addresses_map (0xa0f320)
Connecting to master1.jupiter.com (master1.jupiter.com)|192.9.66.14|:8088... connected.
Created socket 4.
Releasing 0x0000000000a0f320 (new refcount 1).
.
.
.
---response end---
302 Found
Disabling further reuse of socket 3.
Closed fd 3
Registered socket 4 for persistent reuse.
URI content encoding = ‘UTF-8’
Location: http://master1.jupiter.com:8088/cluster [following]
] done.
URI content encoding = None
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
Converted file name 'index.html' (UTF-8) -> 'index.html' (UTF-8)
--2024-02-21 10:27:07-- http://master1.jupiter.com:8088/cluster
Reusing existing connection to master1.jupiter.com:8088.
Reusing fd 4.
---request begin---
GET /cluster HTTP/1.1
User-Agent: Wget/1.14 (linux-gnu)
Accept: */*
Host: master1.jupiter.com:8088
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Cache-Control: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Expires: Wed, 21 Feb 2024 10:30:23 GMT
Date: Wed, 21 Feb 2024 10:30:23 GMT
Pragma: no-cache
Content-Type: text/html; charset=utf-8
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked
Server: Jetty(6.1.26.hwx)
---response end---
200 OK
URI content encoding = ‘utf-8’
Length: unspecified [text/html]
Saving to: ‘index.html’
[ <=> ] 1,018,917 --.-K/s in 0.04s
2024-02-21 10:31:31 (24.0 MB/s) - ‘index.html’ saved [1018917]
As the output above shows, wget takes a very long time to complete, about 20 minutes, instead of finishing within a second or two.
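The size of the delay at each step can be read off the timestamps wget printed in the log above. The following is only a minimal sketch, assuming GNU date (for the -d option) and a helper named elapsed that is introduced here for illustration; the three timestamps are copied from the log.

```shell
#!/bin/sh
# Hypothetical helper: seconds elapsed between two timestamps.
# Assumes GNU date; -u avoids any local time zone or DST effects.
elapsed() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

start="2024-02-21 10:13:42"     # first request, sent to master2 (standby)
redirect="2024-02-21 10:27:07"  # wget starts following the 302 from master1 to /cluster
saved="2024-02-21 10:31:31"     # index.html saved

echo "waiting for the 302 from master1: $(elapsed "$start" "$redirect") s"
echo "fetching /cluster:                $(elapsed "$redirect" "$saved") s"
echo "total:                            $(elapsed "$start" "$saved") s"
```

With these timestamps the 307 from the standby on master2 came back immediately, and most of the time was spent waiting for the active Resource Manager on master1 to answer.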
We could capture the traffic with tcpdump, for example:
tcpdump -vv -s0 tcp port 8088 -w /tmp/why_8088_hang.pcap
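But I would like to know whether there is a simpler, better way to understand why the request sits at "HTTP request sent, awaiting response..."; perhaps it is related to the Resource Manager service itself.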