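We are finding that applications such as nginx and php-fpm occasionally (and temporarily) get errors when opening good files from an attached NFS mount.
Example php-fpm error: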
2017/05/20 22:53:09 [error] 55#0: *6575 FastCGI sent in stderr: "PHP message: PHP Warning: getimagesize(/www/newspaperfoundation.org/html/wp-content/blogs.dir/22/files/2017/05/19-highest-honors-1.jpg): failed to open stream: Input/output error in /www/newspaperfoundation.org/html/wp-content/plugins/mashsharer/includes/header-meta-tags.php on line 271" while reading response header from upstream, client: 192.168.255.34, server: www.dailyrepublic.com, request: "GET /solano-news/fairfield/highest-honors-commends-students-with-4-0-and-higher-grade-point-average/ HTTP/1.1", upstream: "fastcgi://172.17.0.3:9001", host: "www.dailyrepublic.com"
Example nginx error:
2017/05/20 23:22:32 [crit] 56#0: *712 open() "/www/newspaperfoundation.org/html/wp-content/blogs.dir/24/files/2017/05/Tandem1W-550x550.jpg" failed (5: Input/output error), client: 192.168.255.34, server: www.davisenterprise.com, request: "GET /files/2017/05/Tandem1W-550x550.jpg HTTP/1.1", host: "www.davisenterprise.com", referrer: "http://www.davisenterprise.com/"
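While the temporary error is occurring, I can run ls and see that the file exists and has the correct permissions. After a long while, the image eventually starts working again. Other files are served normally, with no input/output error.
I can't find much logging for this problem. After enabling rpcdebug, though, I see many messages like these when the errors occur: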
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:07 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:07 tomentella kernel: nfsv4 compound op #1/5: 22 (OP_PUTFH)
May 20 16:10:07 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #1: 22: status 0
May 20 16:10:07 tomentella kernel: nfsv4 compound op #2/5: 18 (OP_OPEN)
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
May 20 16:10:07 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
May 20 16:10:07 tomentella kernel: nfsv4 compound returned 10011
May 20 16:10:08 tomentella kernel: nfsd_dispatch: vers 4 proc 1
May 20 16:10:08 tomentella kernel: nfsv4 compound op #1/4: 22 (OP_PUTFH)
May 20 16:10:08 tomentella kernel: nfsd: fh_verify(36: 01070001 008c0312 00000000 3c639297 604b0f25 ce691899)
May 20 16:10:08 tomentella kernel: nfsv4 compound op ffff8806239e5080 opcnt 4 #1: 22: status 0
May 20 16:10:08 tomentella kernel: nfsv4 compound op #2/4: 15 (OP_LOOKUP)
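The numeric status values in the rpcdebug output are NFSv4 status codes. In RFC 7530, 10011 is NFS4ERR_EXPIRED, which suggests the server considers the client's lease (and the state tied to it) to have expired. Below is a minimal Python sketch for decoding these lines. The function name `decode_status_lines` and the small lookup table are my own illustration, not part of any NFS tooling, and the regular expression assumes the log format shown above.

```python
import re

# Small, illustrative subset of NFSv4 status codes (values from RFC 7530).
NFS4_STATUS = {
    0: "NFS4_OK",
    5: "NFS4ERR_IO",
    13: "NFS4ERR_ACCES",
    70: "NFS4ERR_STALE",
    10008: "NFS4ERR_DELAY",
    10011: "NFS4ERR_EXPIRED",
    10013: "NFS4ERR_GRACE",
    10022: "NFS4ERR_STALE_CLIENTID",
    10023: "NFS4ERR_STALE_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
}

# Matches lines such as:
#   nfsv4 compound op ffff8806239e5080 opcnt 5 #2: 18: status 10011
STATUS_RE = re.compile(
    r"nfsv4 compound op \S+ opcnt \d+ #(\d+): (\d+): status (\d+)"
)


def decode_status_lines(lines):
    """Return (op_index, op_number, status, status_name) for each status line."""
    results = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            index, opnum, status = (int(g) for g in match.groups())
            name = NFS4_STATUS.get(status, "UNKNOWN")
            results.append((index, opnum, status, name))
    return results


if __name__ == "__main__":
    sample = [
        "May 20 16:10:07 tomentella kernel: nfsv4 compound op "
        "ffff8806239e5080 opcnt 5 #2: 18: status 10011",
    ]
    for row in decode_status_lines(sample):
        print(row)  # (2, 18, 10011, 'NFS4ERR_EXPIRED')
```

Running it over the kernel log filtered for "status" gives a quick view of which operations fail and with which error.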
In particular, I believe I only see this message while a file is erroring:
May 20 16:10:07 tomentella kernel: NFSD: nfsd4_open filename 19tommeyerW.jpg op_openowner (null)
Do you have any ideas about what might be causing the input/output error?
The clients mount the share with the following:
mount.nfs4 -v -o proto=tcp $NFSMASTERHOST:/srv/data /srv/data
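This is CentOS 7 with updated packages. The error is "new", and there have been very few recent changes to the server. I suspect that my recent system package updates may have caused this change.
Because the problem comes and goes for some images, I was able to go through the logs and compare. Here is an example, from grepping for a specific image name, of a file going from good to bad: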
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_open_confirm on file Ron-Thomas-web-150x150.jpg
May 20 18:38:37 tomentella kernel: NFSD: nfsd4_close on file Ron-Thomas-web-150x150.jpg
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:08 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:10 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
May 20 18:39:11 tomentella kernel: NFSD: nfsd4_open filename Ron-Thomas-web-150x150.jpg op_openowner (null)
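To spot this good-to-bad transition across many files, one rough heuristic is to look for long runs of nfsd4_open with no intervening nfsd4_close for the same file name. The sketch below assumes the NFSD debug line format shown above. The function name `longest_open_run` is hypothetical, and concurrent clients opening the same file can also produce such runs, so treat the result as a hint rather than proof.

```python
import re
from collections import defaultdict

# Matches the three NFSD debug messages seen in the excerpt above.
EVENT_RE = re.compile(
    r"NFSD: nfsd4_(open|open_confirm|close) (?:filename|on file) (\S+)"
)


def longest_open_run(lines):
    """Map each file name to its longest run of opens without a close."""
    current = defaultdict(int)   # opens seen since the last close, per file
    longest = defaultdict(int)   # longest such run seen so far, per file
    for line in lines:
        match = EVENT_RE.search(line)
        if not match:
            continue
        event, name = match.groups()
        if event == "open":
            current[name] += 1
            longest[name] = max(longest[name], current[name])
        elif event == "close":
            current[name] = 0
    return dict(longest)
```

On the excerpt above this returns 6 for Ron-Thomas-web-150x150.jpg: the first open is followed by a close, and the next six are not.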
Here is the nfsstat output:
tomentella ★ ~ $ nfsstat
Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
94437487   6          6          0          0

Server nfs v4:
null         compound
503       0% 94436978 99%

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit
0         0% 0         0% 0         0% 11213689  3% 2631554   0% 3377      0%
create       delegpurge   delegreturn  getattr      getfh        link
579       0% 0         0% 0         0% 88581315 31% 32460559 11% 0         0%
lock         lockt        locku        lookup       lookup_root  nverify
365       0% 0         0% 365       0% 30058556 10% 0         0% 0         0%
open         openattr     open_conf    open_dgrd    putfh        putpubfh
2771686   0% 0         0% 74326     0% 0         0% 92969992 32% 0         0%
putrootfh    read         readdir      readlink     remove       rename
2435      0% 1999675   0% 1917567   0% 350       0% 12404     0% 5072      0%
renew        restorefh    savefh       secinfo      setattr      setcltid
1226801   0% 0         0% 5072      0% 0         0% 18315216  6% 121025    0%
setcltidconf verify       write        rellockowner bc_ctl       bind_conn
121105    0% 0         0% 115189    0% 365       0% 0         0% 0         0%
exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp
0         0% 0         0% 0         0% 0         0% 0         0%

Client rpc stats:
calls      retrans    authrefrsh
0          0          0
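Answer 1
The problem appears to be related to duplicate local IPs behind the Docker hosts. Docker had assigned the same internal IP (e.g. 172.17.0.4) to two containers, so the NFS server could not determine which client to respond to and, in some cases, dropped both clients. This is apparently a long-standing issue in the RHEL implementation: I was able to find a bug report documenting it for CentOS 6, and it still affects me on CentOS 7.3.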
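To check for this condition, you can collect the internal IP of every container that mounts the export and look for addresses that appear more than once. The following is only a sketch: the function name `find_duplicate_ips` and the `(docker_host, container_name, ip)` inventory format are assumptions of mine, and how you gather that inventory depends on your setup.

```python
from collections import defaultdict


def find_duplicate_ips(containers):
    """Return {ip: [(docker_host, container_name), ...]} for shared IPs.

    `containers` is an iterable of (docker_host, container_name, ip)
    tuples. This inventory format is hypothetical.
    """
    by_ip = defaultdict(list)
    for host, name, ip in containers:
        by_ip[ip].append((host, name))
    # Keep only the addresses used by more than one container.
    return {ip: owners for ip, owners in by_ip.items() if len(owners) > 1}


if __name__ == "__main__":
    # Made-up inventory: two Docker hosts that each handed out 172.17.0.4.
    inventory = [
        ("dockerhost-a", "php-fpm", "172.17.0.3"),
        ("dockerhost-a", "nginx", "172.17.0.4"),
        ("dockerhost-b", "nginx", "172.17.0.4"),
    ]
    for ip, owners in find_duplicate_ips(inventory).items():
        print(ip, owners)
```

Any address reported here is a candidate for the client confusion described above.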
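Answer 2
I found this question while searching for a solution to input/output errors on a shared NFS mount. I have a shared NFS drive mounted on several machines that read from and write to it using PHP. I would hit this error sporadically, but fairly often. I don't know whether what I did actually fixed the problem, but in case it helps someone else with the same issue...
I had created the worker servers by cloning, which left them all with the same hostname. I hadn't given it much thought because, as far as I knew, the hostname did not affect what I was doing. I changed the hostnames so that each one is unique and made sure each machine's /etc/hosts file maps its hostname to 127.0.0.1. The NFS errors have not reappeared since.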
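For anyone who wants to verify the same two conditions on a set of cloned machines, here is a small Python sketch. Both helper names (`duplicate_hostnames`, `hosts_maps_to_loopback`) are hypothetical, and collecting each machine's hostname and /etc/hosts contents is left to whatever tooling you already use.

```python
from collections import Counter


def duplicate_hostnames(hostnames):
    """Return the sorted hostnames that appear more than once."""
    counts = Counter(h.strip().lower() for h in hostnames)
    return sorted(h for h, n in counts.items() if n > 1)


def hosts_maps_to_loopback(hosts_text, hostname):
    """Return True if the hosts-file text maps `hostname` to 127.0.0.1."""
    for raw in hosts_text.splitlines():
        # Drop comments, then split into the address and its names.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "127.0.0.1":
            if hostname in fields[1:]:
                return True
    return False


if __name__ == "__main__":
    # Made-up example: two clones still share the name "worker".
    print(duplicate_hostnames(["worker", "worker", "worker-3"]))
    sample_hosts = "127.0.0.1 localhost worker-3\n"
    print(hosts_maps_to_loopback(sample_hosts, "worker-3"))
```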