Input/output error when reading from the HDFS NFS gateway

2024-6-1

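I enabled the HDFS NFSv3 gateway on our HDFS cluster by following the official documentation. Everything works fine except on one Ubuntu 16.04 server. Below are the kernel version, the mount output, and the sysctl -a output for that machine.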

root@Linux:~$ uname -a
Linux xxx-server-001 4.15.0-46-generic #49~16.04.1-Ubuntu SMP Tue Feb 12 17:45:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

root@Linux:~$ mount | grep hdfs
10.30.200.100:/ on /hdfs type nfs (rw,relatime,sync,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,nolock,noacl,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.30.200.100,mountvers=3,mountport=4242,mountproto=tcp,local_lock=all,addr=10.30.200.100)

root@Linux:~$ sysctl -a | grep nfs
fs.nfs.idmap_cache_timeout = 2
fs.nfs.nfs_callback_tcpport = 0
fs.nfs.nfs_congestion_kb = 259136
fs.nfs.nfs_mountpoint_timeout = 500
fs.nfs.nlm_grace_period = 0
fs.nfs.nlm_tcpport = 0
fs.nfs.nlm_timeout = 10
fs.nfs.nlm_udpport = 0
fs.nfs.nsm_local_state = 3
fs.nfs.nsm_use_hostnames = 0
sunrpc.nfs_debug = 0xffff
sunrpc.nfsd_debug = 0x0000
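
Because the mount options in effect can differ from one client to another, it may help to dump them in a form that is easy to diff. The following is a minimal sketch, not part of the original report; list_mount_opts is a hypothetical helper name, and it assumes the line has the format printed by mount, as shown above.

```shell
# Print the options of one mount entry, one per line and sorted,
# so that the output from two clients can be compared with diff.
# list_mount_opts is a hypothetical helper name (an assumption).
list_mount_opts() {
    # $1: a single line in the format printed by `mount`
    printf '%s\n' "$1" \
        | sed -n 's/.*(\(.*\))$/\1/p' \
        | tr ',' '\n' \
        | sort
}

# Example (run on each client, then diff the two files):
#   list_mount_opts "$(mount | grep ' /hdfs ')" > "$(hostname).opts"
```

Running this on the failing server and on a working client would show at a glance whether values such as rsize or wsize differ between them.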

The symptoms are as follows:

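Running ls on a folder under /hdfs that contains only a few files works without error, but it fails with Input/output error when the folder contains many files (more than about 100).
After enabling NFS debug messages on the machine with sudo rpcdebug -m nfs -c all, I found the following error log in dmesg whenever ls hit the Input/output error. I checked the source code, and it looks like some kind of buffer overflow issue. Does this mean it is a kernel bug in NFS?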

[2538707.003904] NFS: dentry_delete(1232344325/sss.123.txt, 4808cc)
[2538707.003907] NFS: decode_fattr3 prematurely hit the end of our receive buffer. Remaining buffer length is 0 words.
[2538707.003914] NFS: readdir(b200/095900) returns -5

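When other laptops or servers mount the HDFS NFS gateway with sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync 10.30.200.100:/ /hdfs, there is no problem at all. This suggests that the NFS gateway server itself is probably not at fault. However, I also installed the 4.15.0-46-generic kernel on my own laptop and could not reproduce the issue there.
The problem is not always reproducible: sometimes, right after mounting the gateway, a second or third retry succeeds. But the failure rate is over 90%, so the mount is still unusable.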
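
To put a number on the intermittent failures described above, a small loop can count how often ls fails on a given directory. This is a minimal sketch under stated assumptions: count_ls_failures is a hypothetical helper name, the default of 20 attempts is arbitrary, and client-side caching after a successful listing may make later attempts succeed more often than a fresh mount would.

```shell
# Run `ls` on a directory several times and report how many attempts failed.
# count_ls_failures is a hypothetical helper name (an assumption).
count_ls_failures() {
    dir=$1
    attempts=${2:-20}   # number of attempts; defaults to 20
    failures=0
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Discard the listing itself; only the exit status matters here.
        if ! ls "$dir" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures/$attempts failed"
}

# Example: count_ls_failures /hdfs/path/to/large/folder 50
```

Comparing the count for a small folder and a large one, or before and after a remount, gives a more concrete picture than a single ls.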

Please let me know if there is any direction for debugging this odd situation. Thanks in advance!
