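Last Friday I upgraded my Ubuntu server to 11.10, which now runs a 3.0.0-12-server kernel. Since then, overall performance has dropped dramatically. Before the upgrade the system load was about 0.3; now it is 22-30 on an 8-core CPU system with 16GB of RAM (10GB free, no swap in use).
I was going to blame the BTRFS file system driver and the underlying MD array, because [md1_raid1] and [btrfs-transacti] consumed a lot of resources. But all the [kworker/*:*] threads together consume far more.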
sar has been outputting something like this continuously since Friday:
11:25:01 CPU %user %nice %system %iowait %steal %idle
11:35:01 all 1,55 0,00 70,98 8,99 0,00 18,48
11:45:01 all 1,51 0,00 68,29 10,67 0,00 19,53
11:55:01 all 1,40 0,00 65,52 13,53 0,00 19,55
12:05:01 all 0,95 0,00 66,23 10,73 0,00 22,10
And iostat confirms a very low write rate:
sda 129,26 3059,12 614,31 258226022 51855269
sdb 98,78 24,28 3495,05 2049471 295023077
md1 191,96 202,63 611,95 17104003 51656068
md0 0,01 0,02 0,00 1980 109
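The question is: how can I find out why the kworker threads consume so many resources (and which one is responsible)? Or better: is this a known issue with the 3.0 kernel, and can I tune it with kernel parameters?
Edit:
I updated the kernel to the brand-new 3.1 release, as recommended by the BTRFS developers. Unfortunately, this did not change anything.
Answer 1
I found this thread on lkml that answers your question a little. (It seems even Linus himself was puzzled about how to find out where these threads come from.)
Basically, there are two ways of doing this: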
$ echo workqueue:workqueue_queue_work > /sys/kernel/debug/tracing/set_event
$ cat /sys/kernel/debug/tracing/trace_pipe > out.txt
(wait a few secs)
For this you need ftrace compiled into your kernel, and you need to enable it beforehand by mounting debugfs:
mount -t debugfs nodev /sys/kernel/debug
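More information on the Linux function tracer facilities is available in the ftrace.txt documentation.
This will output what all the threads are doing, and is useful for tracing many small jobs.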
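The capture in out.txt can get long, so it may help to count the queued work items per function. This is only a minimal sketch: it assumes each workqueue_queue_work line carries a function=<symbol> field (the exact layout can differ between kernel versions), and summarize_work is a made-up helper name, not part of ftrace.

```shell
# Count queued work items per work function in an ftrace capture.
# Assumption: each event line contains a "function=<symbol>" field.
summarize_work() {
    awk 'match($0, /function=[^ ]+/) {
             # strip the 9-character "function=" prefix, keep the symbol
             n[substr($0, RSTART + 9, RLENGTH - 9)]++
         }
         END { for (f in n) print n[f], f }' "$1" | sort -rn
}

# Example: summarize_work out.txt | head
```

The functions at the top of the list are the ones being queued most often, which usually points at the responsible driver or subsystem.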
cat /proc/THE_OFFENDING_KWORKER/stack
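This will output the stack of a single thread that is doing a lot of work. It may let you find out what caused this specific thread to hog the CPU (for example). THE_OFFENDING_KWORKER is the pid of the kworker in the process list.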
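To get a pid to use in place of THE_OFFENDING_KWORKER, the process list can be filtered for kworker threads. A rough sketch follows; busiest_kworker is a hypothetical helper name. Note that the %CPU figure from ps is averaged over the lifetime of the thread, so for a short burst top may be the better guide.

```shell
# Read "PID %CPU COMMAND" rows on stdin and print the PID of the
# kworker thread with the highest %CPU value.
busiest_kworker() {
    awk '$3 ~ /^kworker/ && (pid == "" || $2 + 0 > max) {
             max = $2 + 0
             pid = $1
         }
         END { if (pid != "") print pid }'
}

# Example (the stack file is normally readable by root only):
#   pid=$(ps -eo pid=,pcpu=,comm= | busiest_kworker)
#   cat "/proc/$pid/stack"
```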
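Answer 2
The solution is: no idea how to find the cause. Nobody has told me so far.
But after talking to the BTRFS developers, it turned out that there is a bug in the btrfs driver that shows up when many small files are written within a very short time. It affects kernels 3.0 through 3.1. Maybe it will be fixed in 3.2.
In the meantime I got a patch for the current kernel that fixes the problem.
Answer 3
I had a similar problem; here is the kworker's thread stack: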
while true ; do clear ; grep -n ^ /proc/24910/stack | sort -rn | cut -d: -f 2- ; sleep 1 ; done
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff810908f0>] kthread+0x0/0xe0
[<ffffffff81576432>] ret_from_fork+0x42/0x70
[<ffffffff810908f0>] kthread+0x0/0xe0
[<ffffffff810909b1>] kthread+0xc1/0xe0
[<ffffffff8108b520>] worker_thread+0x0/0x550
[<ffffffff8108b573>] worker_thread+0x53/0x550
[<ffffffff8108aa4b>] process_one_work+0x14b/0x420
[<ffffffff81405a3d>] rpm_idle+0x1ad/0x220
[<ffffffff8140509d>] __rpm_callback+0x2d/0xb0
[<ffffffffa01aef16>] usb_runtime_idle+0x26/0x30 [usbcore]
[<ffffffffa01aeef0>] usb_runtime_idle+0x0/0x30 [usbcore]
[<ffffffff8140686c>] __pm_runtime_suspend+0x5c/0x90
[<ffffffff81405b19>] __pm_runtime_idle+0x69/0x90
[<ffffffff81405295>] rpm_suspend+0x105/0x620
[<ffffffff8140513f>] rpm_callback+0x1f/0x70
[<ffffffff8140509d>] __rpm_callback+0x2d/0xb0
[<ffffffffa01aee50>] usb_runtime_suspend+0x0/0x80 [usbcore]
[<ffffffffa01aee7e>] usb_runtime_suspend+0x2e/0x80 [usbcore]
[<ffffffffa01adc4f>] usb_suspend_both+0xef/0x1f0 [usbcore]
[<ffffffffa01adb06>] usb_resume_interface.isra.6+0xa6/0x100 [usbcore]
[<ffffffffa01a0c63>] hub_resume+0x23/0x60 [usbcore]
[<ffffffffa01a0636>] hub_activate+0xc6/0x5c0 [usbcore]
[<ffffffffa01a9d3f>] usb_kill_urb+0x3f/0xa0 [usbcore]
[<ffffffffa019d249>] hub_port_status+0xd9/0x120 [usbcore]
[<ffffffff81088a4f>] __queue_work+0x12f/0x340
[<ffffffff810888b6>] insert_work+0x46/0xb0
[<ffffffffa01aa6d4>] usb_control_msg+0xc4/0x110 [usbcore]
[<ffffffffa01aa55a>] usb_start_wait_urb+0x9a/0x150 [usbcore]
[<ffffffff810a36f7>] update_curr+0xd7/0x120
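A single snapshot like the one above can be misleading, because the thread may be somewhere else a moment later. One option is to sample the stack several times and count how often each symbol shows up. This is only a sketch; sample_stack and count_frames are hypothetical helper names, and it assumes stack lines of the form "[<address>] symbol+0xOFF/0xLEN".

```shell
# Print the stack of a thread SAMPLES times, one second apart.
# usage: sample_stack PID [SAMPLES]   (needs root to read the stack)
sample_stack() {
    pid=$1
    samples=${2:-10}
    i=0
    while [ "$i" -lt "$samples" ]; do
        cat "/proc/$pid/stack" 2>/dev/null
        sleep 1
        i=$((i + 1))
    done
}

# Read stack lines on stdin and print "count symbol", most frequent first.
count_frames() {
    awk '{
             sub(/\+0x.*/, "", $2)       # drop the "+0xOFF/0xLEN" suffix
             if ($2 != "") n[$2]++
         }
         END { for (s in n) print n[s], s }' | sort -rn
}

# Example: sample_stack 24910 10 | count_frames | head
```

Symbols that appear in nearly every sample (apart from the generic kthread and worker_thread frames) are the likely culprits.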
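It looks to me like this is in the USB module. On a different machine I once happily rmmod'd all the USB and [uex]hci modules before realizing that I had just switched off my keyboard (not even ctrl-shift-sysrq-U worked!), so this time I unloaded the modules and reloaded them three seconds later: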
MODS="uvcvideo ohci_hcd ehci_hcd xhci_hcd ohci_pci ehci_pci xhci_pci usbcore"
( echo $MODS $MODS | xargs -n 1 rmmod -v ; sleep 3 ; echo $MODS | xargs -n 1 modprobe -v ; )
root@hp:~# ( echo $MODS $MODS | xargs -n 1 rmmod -v ; sleep 3 ; echo $MODS | xargs -n 1 modprobe -v ; )
rmmod: ERROR: Module ohci_hcd is in use by: ohci_pci
rmmod: ERROR: Module ehci_hcd is in use by: ehci_pci
rmmod: ERROR: Module xhci_hcd is in use by: xhci_pci
rmmod: ERROR: Module uvcvideo is not currently loaded
rmmod: ERROR: Module ohci_pci is not currently loaded
rmmod: ERROR: Module ehci_pci is not currently loaded
rmmod: ERROR: Module xhci_pci is not currently loaded
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/media/usb/uvc/uvcvideo.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/ehci-hcd.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/ohci-hcd.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/xhci-hcd.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/ehci-pci.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/ohci-pci.ko
insmod /lib/modules/4.1.0-2-amd64/kernel/drivers/usb/host/xhci-pci.ko
That worked:
grep -n ^ /proc/24910/stack | sort -rn | cut -d: -f 2-
[<ffffffffffffffff>] 0xffffffffffffffff
[<ffffffff810908f0>] kthread+0x0/0xe0
[<ffffffff81576432>] ret_from_fork+0x42/0x70
[<ffffffff810908f0>] kthread+0x0/0xe0
[<ffffffff810909b1>] kthread+0xc1/0xe0
[<ffffffff8108b520>] worker_thread+0x0/0x550
[<ffffffff8108b5ec>] worker_thread+0xcc/0x550
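So my prime suspect is this little gadget: the RTL8723B* WiFi+Bluetooth module. I now wonder whether the power management code, when it tries to shut down the unused BT adapter, realizes that it is the same device.
Context: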
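Since the looping stack sits in the USB runtime power management path (usb_runtime_idle, usb_runtime_suspend), one way to check this suspicion is to look at the runtime PM state of each USB device in sysfs. This is a sketch, assuming the usual power/control and power/runtime_status attributes are present; usb_pm_status is a hypothetical helper name.

```shell
# Print "device control runtime_status" for every USB device that
# exposes runtime power management attributes.
# usage: usb_pm_status [SYSFS_DIR]
usb_pm_status() {
    root=${1:-/sys/bus/usb/devices}
    for d in "$root"/*; do
        [ -r "$d/power/control" ] || continue
        printf '%s %s %s\n' "${d##*/}" \
            "$(cat "$d/power/control")" \
            "$(cat "$d/power/runtime_status" 2>/dev/null)"
    done
}

# Example: usb_pm_status
```

A device whose control value is auto may be autosuspended by the kernel; writing on to its power/control file (as root) keeps it awake. Whether that stops the busy kworker in this particular case is untested.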
root@hp:~# lsusb
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 002 Device 002: ID 0c45:651b Microdia
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 002: ID 0bda:b001 Realtek Semiconductor Corp.
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 009 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 008 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 007 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 006 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
root@hp:~# lsmod | grep usb
btusb 45056 0
btbcm 16384 1 btusb
btintel 16384 1 btusb
bluetooth 438272 5 bnep,btbcm,btusb,btintel
usbcore 200704 8 btusb,uvcvideo,ohci_hcd,ohci_pci,ehci_hcd,ehci_pci,xhci_hcd,xhci_pci
usb_common 16384 1 usbcore
root@hp:~# lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Debian GNU/Linux stable-updates (sid)
Release: stable-updates
Codename: sid
root@hp:~# uname -a
Linux hp 4.1.0-2-amd64 #1 SMP Debian 4.1.6-1 (2015-08-23) x86_64 GNU/Linux
root@hp:~# dmesg | tail -n 20
[97865.088740] usb 2-4: SerialNumber: HP Webcam
[97865.091557] uvcvideo: Found UVC 1.00 device HP Webcam (0c45:651b)
[97865.105948] input: HP Webcam as /devices/pci0000:00/0000:00:13.2/usb2/2-4/2-4:1.0/input/input17
[97865.189817] usb 3-3: new full-speed USB device number 2 using ohci-pci
[97865.350981] usb 3-3: No LPM exit latency info found, disabling LPM.
[97865.368958] usb 3-3: New USB device found, idVendor=0bda, idProduct=b001
[97865.368969] usb 3-3: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[97865.368976] usb 3-3: Product: Bluetooth Radio
[97865.368981] usb 3-3: Manufacturer: Realtek
[97865.368985] usb 3-3: SerialNumber: 00e04c000001
[97865.375859] Bluetooth: hci0: rtl: examining hci_ver=06 hci_rev=000b lmp_ver=06 lmp_subver=8723
[97865.375867] Bluetooth: hci0: rtl: loading rtl_bt/rtl8723b_fw.bin
[97865.375896] usb 3-3: firmware: failed to load rtl_bt/rtl8723b_fw.bin (-2)
[97865.375902] usb 3-3: Direct firmware load for rtl_bt/rtl8723b_fw.bin failed with error -2
[97865.375907] Bluetooth: hci0: Failed to load rtl_bt/rtl8723b_fw.bin
[97865.397812] Bluetooth: hci0: rtl: examining hci_ver=06 hci_rev=000b lmp_ver=06 lmp_subver=8723
[97865.397821] Bluetooth: hci0: rtl: loading rtl_bt/rtl8723b_fw.bin
[97865.397850] usb 3-3: firmware: failed to load rtl_bt/rtl8723b_fw.bin (-2)
[97865.397856] usb 3-3: Direct firmware load for rtl_bt/rtl8723b_fw.bin failed with error -2
[97865.397861] Bluetooth: hci0: Failed to load rtl_bt/rtl8723b_fw.bin
Answer 4
echo N >/sys/module/drm_kms_helper/parameters/poll
(as root)
The problem is with Intel graphics.
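A slightly more defensive version of the same command is sketched below. It checks that the parameter file is writable first (it only exists while drm_kms_helper is loaded) and prints the value afterwards; set_kms_poll is a hypothetical helper name.

```shell
# Set the drm_kms_helper "poll" parameter and print the resulting value.
# usage: set_kms_poll Y|N [PARAM_FILE]
set_kms_poll() {
    param=${2:-/sys/module/drm_kms_helper/parameters/poll}
    if [ ! -w "$param" ]; then
        echo "cannot write $param (not root, or module not loaded?)" >&2
        return 1
    fi
    printf '%s\n' "$1" > "$param" && cat "$param"
}

# Example (as root): set_kms_poll N
```

The setting is lost on reboot. A line such as "options drm_kms_helper poll=N" in a file under /etc/modprobe.d should make it persistent, though that has not been verified here.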