在我的 Ubuntu 14.04.2 服务器上,IPv4 每小时会离线几次(我见过一到四次,但每小时没有特定的分钟左右)。
我的托管商坚持认为问题出在服务器端,而且基于 Debian 的救援系统没有表现出相同的症状,这一事实让我认为他们是对的。但是,救援系统不会像已安装的 Ubuntu 系统那样在任何接口上配置全局 IPv6 地址。
通常每小时(基于 IPv4)SSH 连接会由于太多超时数据包而断开一到四次。
当从另一个远程服务器监控服务器时,ICMPv4 ping 要么超时,要么路由器将响应目标主机不可用(我经常看到这两种情况!)。同时 ICMPv6 ping 完全不受影响。
另外,当我使用 IPv6 通过 SSH 从其他远程主机进行连接时,该连接不会停止,系统也不会出现冻结等情况(正如我最初所怀疑的那样)。
系统和内核日志也表明没有问题,无论我禁用所有防火墙规则还是打开防火墙都没有区别。我还让它运行并启用了所有丢弃的数据包的日志记录,以查看是否可以将其中的某些内容关联起来。
在这些离线时间没有cron
作业正在运行,并且它也不会在同一分钟左右发生,这表明有一些常规cron
作业。
我还缩小了另一个方面的范围。当我 ping (ICMPv4) 时从显示症状的主机,环回不受影响eth0
。对我来说,这表明它与一般的 IPv4 无关,而是特定于与系统中的一个网卡相对应的接口。
我如何从这里继续进行故障排除?鉴于我到目前为止所做的事情,下一步是什么?是否有一个已知的错误与我看到的症状相对应?
注意:我已经花了一个多月的时间来诊断这个问题。所以在这里问对我来说是最后的手段。请根据需要索取更多详细信息,我将添加它们。
到目前为止我所做的:
ping
与ping6
mtr
从服务器到服务器,我的托管服务商并不认为丢失的几个数据包有任何异常- 分别通过 IPv4 和 IPv6 进行 SSH 连接
tail
-ed/var/log/kern.log
,/var/log/syslog
看看/var/log/auth.log
离线期间是否会出现任何内容- 分别刷新了 IPv4 和 IPv6 的所有防火墙规则
- 还简单地启用了丢弃数据包的日志记录
- 删除了几个我怀疑是潜在罪魁祸首的软件包
以下是手动安装的软件包列表:
# echo $(apt-mark showmanual)
acl adduser aggregate apparmor apparmor-profiles apparmor-utils apt apt-cacher-ng apt-file apt-rdepends apt-utils base-files base-passwd bash bash-completion bash-static bridge-utils bsdutils btrfs-tools busybox-initramfs busybox-static bzip2 bzr ca-certificates cgmanager cgroup-bin cifs-utils colordiff coreutils cpio crda cron cron-apt cryptmount cryptsetup dash debconf debianutils debootstrap debsums dh-python dialog diffutils dnsutils dpkg dpkg-dev duplicity e2fslibs e2fsprogs ed etckeeper fakechroot fakeroot file findutils gcc-4.8-base gcc-4.9-base gdisk-noicu git git-svn gnupg gnutls-bin gpgv grep gzip haveged heirloom-mailx hostname htop ifupdown init-system-helpers initramfs-tools initramfs-tools-bin initscripts insserv iproute2 ipset iptables iputils-ping klibc-utils kmod kpartx less libacl1 libapt-inst1.5 libapt-pkg4.12 libattr1 libaudit-common libaudit1 libblkid1 libbz2-1.0 libc-bin libc6 libcap2 libcgmanager0 libck-connector0 libcomerr2 libdb5.3 libdbus-1-3 libdebconfclient0 libdrm2 libedit2 libevent-2.0-5 libexpat1 libffi6 libgcc1 libgdbm3 libgssapi-krb5-2 libjson-c2 libjson0 libk5crypto3 libkeyutils1 libklibc libkmod2 libkrb5-3 libkrb5support0 liblzma5 libmount1 libmpdec2 libncurses5 libncursesw5 libnih-dbus1 libnih1 libnl-3-200 libnl-genl-3-200 libpam-modules libpam-modules-bin libpam-mount libpam-runtime libpam-systemd libpam0g libpci3 libpcre3 libplymouth2 libpng12-0 libprocps3 libpython-stdlib libpython2.7-minimal libpython2.7-stdlib libpython3-stdlib libpython3.4-minimal libpython3.4-stdlib libreadline6 libselinux1 libsemanage-common libsemanage1 libsepol1 libslang2 libsqlite3-0 libss2 libssl1.0.0 libstdc++6 libtinfo5 libudev1 libui-dialog-perl libusb-0.1-4 libusb-1.0-0 libustr-1.0-1 libuuid1 libwrap0 linux-firmware linux-image-3.13.0-24-generic linux-image-extra-3.13.0-24-generic linux-image-generic localepurge locales logcheck logcheck-database login logrotate lsb-base lsb-release lshw lsof lxc lxc-templates make makedev man-db manpages manpages-dev mawk mc md5deep mdadm mercurial mime-support mlocate module-init-tools molly-guard mount mountall mtr-tiny multiarch-support ncurses-base ncurses-bin ndisc6 net-tools netcat-openbsd netsniff-ng nmap openntpd openssh-client openssh-server openssh-sftp-server p7zip-full p7zip-rar passwd pax pciutils perl perl-base perl-modules plymouth postfix procps psmisc pv python python-apt-common python-mako python-mechanize python-minimal python2.7 python2.7-minimal python3 python3-apt python3-minimal python3.4 python3.4-minimal readline-common reprepro resolvconf rsyslog sed sensible-utils sharutils smartmontools subversion sudo sysv-rc sysvinit-utils tar tcpdump tcptraceroute tmux traceroute tree tzdata ubuntu-keyring ucf udev uidmap unattended-upgrades unbound-host unrar unzip upstart usbutils util-linux vim-nox vnstat wget whois wireless-regdb xz-utils zerofree zip zlib1g zsh-doc zsh-static
debootstrap
(当然,其中一些来自过程。)
要求提供的信息:
$ uname -a|sed 's/'$(hostname -f)'/foobar/g'
Linux foobar 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
我更新到了较新的内核(包裹linux-image-generic-lts-utopic
):
$ uname -a|sed 's/'$(hostname -f)'/foobar/g'
Linux foobar 3.16.0-33-generic #44~14.04.1-Ubuntu SMP Fri Mar 13 10:33:29 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
输出sysctl -a
已被匿名化并放置这里。
该命令是(减去 1 将sed
接口名称替换为_bridge
):
sudo sysctl -a|sed 's/'$(hostname -f)'/foobar/g;s/'$(hostname -s)'/foobar/g'|grep -Ev '^net\.ipv[46]\.(neigh|conf)\._[s]'|grep -v nf_log
总体有三与所有配置为 IPv4 和 IPv6 的接口类似_bridge
,仅 IP 地址不同。但是,它们目前尚未使用。每间客房预计供一名 LXC 客人使用。
# lspci -s 06:00.0 -vv
06:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 02)
Subsystem: Micro-Star International Co., Ltd. [MSI] X58 Pro-E
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 256 bytes
Interrupt: pin A routed to IRQ 42
Region 0: I/O ports at e800 [size=256]
Region 2: Memory at fbeff000 (64-bit, non-prefetchable) [size=4K]
Region 4: Memory at f6ff0000 (64-bit, prefetchable) [size=64K]
[virtual] Expansion ROM at fbe00000 [disabled] [size=128K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=375mA PME(D0+,D1+,D2+,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee00000 Data: 40c1
Capabilities: [70] Express (v1) Endpoint, MSI 01
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Capabilities: [b0] MSI-X: Enable- Count=2 Masked-
Vector table: BAR=4 offset=00000000
PBA: BAR=4 offset=00000800
Capabilities: [d0] Vital Product Data
Unknown small resource type 05, will not decode more.
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr+ BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [160 v1] Device Serial Number 01-00-00-00-68-4c-e0-00
Kernel driver in use: r8169
# modinfo r8169
filename: /lib/modules/3.16.0-33-generic/kernel/drivers/net/ethernet/realtek/r8169.ko
firmware: rtl_nic/rtl8168g-3.fw
firmware: rtl_nic/rtl8168g-2.fw
firmware: rtl_nic/rtl8106e-2.fw
firmware: rtl_nic/rtl8106e-1.fw
firmware: rtl_nic/rtl8411-2.fw
firmware: rtl_nic/rtl8411-1.fw
firmware: rtl_nic/rtl8402-1.fw
firmware: rtl_nic/rtl8168f-2.fw
firmware: rtl_nic/rtl8168f-1.fw
firmware: rtl_nic/rtl8105e-1.fw
firmware: rtl_nic/rtl8168e-3.fw
firmware: rtl_nic/rtl8168e-2.fw
firmware: rtl_nic/rtl8168e-1.fw
firmware: rtl_nic/rtl8168d-2.fw
firmware: rtl_nic/rtl8168d-1.fw
version: 2.3LK-NAPI
license: GPL
description: RealTek RTL-8169 Gigabit Ethernet driver
author: Realtek and the Linux r8169 crew <[email protected]>
srcversion: D0E1934D763B6927E0CB4A4
alias: pci:v00000001d00008168sv*sd00002410bc*sc*i*
alias: pci:v00001737d00001032sv*sd00000024bc*sc*i*
alias: pci:v000016ECd00000116sv*sd*bc*sc*i*
alias: pci:v00001259d0000C107sv*sd*bc*sc*i*
alias: pci:v00001186d00004302sv*sd*bc*sc*i*
alias: pci:v00001186d00004300sv*sd*bc*sc*i*
alias: pci:v00001186d00004300sv00001186sd00004B10bc*sc*i*
alias: pci:v000010ECd00008169sv*sd*bc*sc*i*
alias: pci:v000010ECd00008168sv*sd*bc*sc*i*
alias: pci:v000010ECd00008167sv*sd*bc*sc*i*
alias: pci:v000010ECd00008136sv*sd*bc*sc*i*
alias: pci:v000010ECd00008129sv*sd*bc*sc*i*
depends: mii
intree: Y
vermagic: 3.16.0-33-generic SMP mod_unload modversions
signer: Magrathea: Glacier signing key
sig_key: 25:26:EE:FE:32:C9:58:B4:CD:85:CA:5F:BF:EB:ED:A1:75:D1:B2:18
sig_hashalgo: sha512
parm: use_dac:Enable PCI DAC. Unsafe on 32 bit PCI slot. (int)
parm: debug:Debug verbosity level (0=none, ..., 16=all) (int)
答案1
您肯定遇到了网络堆栈问题,因此我确实建议您使用冗长但经过验证的路线图来解决此问题。我自己在类似的情况下使用过它,甚至在真实的硬件上也发生过。安装 Ncurses 并与编译器一起构建必要的开发包。首先,制作 Linux 内核的 git 快照,并在 git 快照目录中执行此操作:
git checkout-index -a -f --prefix=/usr/src/linux-build/ <--- trailing slash is MUST-HAVE!
cd /usr/src/linux-build
cp /boot/config-\`uname -r\` .config
make menuconfig
检查您需要的所有 IP 选项并禁用 IPv6。之后构建并安装您的内核。其次,在你的/etc/sysctl.conf
:
net.ipv6.conf.default.autoconf=0
net.ipv6.conf.default.accept_dad=0
net.ipv6.conf.default.accept_ra=0
net.ipv6.conf.default.accept_ra_defrtr=0
net.ipv6.conf.default.accept_ra_rtr_pref=0
net.ipv6.conf.default.accept_ra_pinfo=0
net.ipv6.conf.default.accept_source_route=0
net.ipv6.conf.default.accept_redirects=0
net.ipv6.conf.default.forwarding=0
net.ipv6.conf.all.autoconf=0
net.ipv6.conf.all.accept_dad=0
net.ipv6.conf.all.accept_ra=0
net.ipv6.conf.all.accept_ra_defrtr=0
net.ipv6.conf.all.accept_ra_rtr_pref=0
net.ipv6.conf.all.accept_ra_pinfo=0
net.ipv6.conf.all.accept_source_route=0
net.ipv6.conf.all.accept_redirects=0
net.ipv6.conf.all.forwarding=0
net.ipv6.conf.default.disable_ipv6=1
net.ipv6.conf.all.disable_ipv6=1
net.ipv6.conf.lo.disable_ipv6=1
然后重新启动,并在 /etc/ssh/ssh 中检查这一点d_配置:
KeyRegenerationInterval 3600
ServerKeyBits 768
Compression yes <---- PAY ATTENTION TO THIS : see below
压缩指令必须设置为“yes”或“no”,默认情况下它是“delayed”——这是一个引发问题的值。