I have one machine (test-server) running a rabbitmq server and 4 celery workers, and another machine (test-worker) running 240 celery workers, which connect to the rabbitmq server on test-server.
All queues are currently empty.
With this setup, beam.smp (which I gather is a rabbitmq-related process) uses 200-250% CPU and consumes a few hundred MB of RAM (the RAM is probably fine, but I'm not sure).
If I stop the workers on the remote machine, it goes back to normal. If I start only 40 workers instead of 240, it is more or less OK - it still uses CPU, but only about 50%.
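The main beam.smp thread is blocked in select, which I think is fine because it is just listening on the child threads. Below is an strace of the child threads. There are some epoll_wait calls with a zero timeout, and a lot of futex calls.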
I also found this bug, filed against oslo (no idea what that is): https://bugs.launchpad.net/oslo.messaging/+bug/1518430. It also mentions zero-timeout epoll_wait calls, and it mentions rabbitmq.
Do you know whether this is expected behavior for rabbit in this situation? Where should I look for the cause?
Thanks
test-server$ sudo strace -p 26866 2>&1 | head -n 50
Process 26866 attached
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 785829269}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
writev(473, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\5\326\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 786592082}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 787427449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 788308663}) = 0
writev(201, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\35\245\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 789017598}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 789278489}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
writev(392, [{NULL, 0}, {"\1\0\3\0\0\0-\0<\0<\5None3\0\0\0\0\0\0\16\270\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\3\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 6) = 445
clock_gettime(CLOCK_MONOTONIC, {87999, 792374556}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792553480}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 792796024}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793154206}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793493003}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 793842449}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {87999, 794054061}) = 0
writev(318, [{NULL, 0}, {"\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\370\0\10celer"..., 72}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316\1\0\2\0\0\0-\0<\0<\5None2\0\0\0\0\0\0\25\371\0\10cele"..., 73}, {"\370\0\20application/json\5utf-8\0\0\0*\10ho"..., 73}, {"\316\3\0\2\0\0\1#", 8}, {"{\"sw_sys\": \"Linux\", \"clock\": 136"..., 291}, {"\316", 1}], 10) = 890
clock_gettime(CLOCK_MONOTONIC, {87999, 794411001}) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 795090977}) = 0
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {87999, 796129182}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
Another excerpt:
Process 26867 attached
clock_gettime(CLOCK_MONOTONIC, {88350, 863599878}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
clock_gettime(CLOCK_MONOTONIC, {88350, 865231792}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865436250}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 865776903}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872757864}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 872984686}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873209787}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873382297}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 873578979}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
epoll_wait(3, {}, 256, 0) = 0
clock_gettime(CLOCK_MONOTONIC, {88350, 875428570}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875624976}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 875847357}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
clock_gettime(CLOCK_MONOTONIC, {88350, 876478262}) = 0
futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x82e500, FUTEX_WAIT_PRIVATE, 2, NULL) = 0
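To make excerpts like the two above easier to compare, here is a minimal Python sketch that tallies syscall names from strace output and counts the epoll_wait calls whose last argument (the timeout) is 0. The function name `tally_syscalls` and the sample lines are illustrative only, and the regular expressions assume the plain one-call-per-line format shown in the excerpts. strace's own `-c` option prints a similar per-syscall summary.

```python
import re
from collections import Counter

# Matches the syscall name at the start of a line such as "futex(0x82e500, ...) = 1".
SYSCALL_RE = re.compile(r"^(\w+)\(")
# Matches a call whose final argument is 0, e.g. "epoll_wait(3, {}, 256, 0) = 0".
ZERO_LAST_ARG_RE = re.compile(r",\s*0\)\s*=")


def tally_syscalls(lines):
    """Return (Counter of syscall names, number of zero-timeout epoll_wait calls)."""
    counts = Counter()
    zero_timeout_polls = 0
    for line in lines:
        match = SYSCALL_RE.match(line.strip())
        if not match:
            continue  # skip banner lines such as "Process 26866 attached"
        name = match.group(1)
        counts[name] += 1
        if name == "epoll_wait" and ZERO_LAST_ARG_RE.search(line):
            zero_timeout_polls += 1
    return counts, zero_timeout_polls


# A few lines copied from the first excerpt above.
sample = [
    "Process 26866 attached",
    "futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1",
    "epoll_wait(3, {}, 256, 0) = 0",
    "clock_gettime(CLOCK_MONOTONIC, {87999, 785829269}) = 0",
    "futex(0x82e500, FUTEX_WAKE_PRIVATE, 1) = 1",
]

counts, zero_polls = tally_syscalls(sample)
print(counts.most_common())
print("zero-timeout epoll_wait calls:", zero_polls)
```

Feeding it the full output of each traced thread would show whether one kind of call clearly dominates.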
Answer 1
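I never really solved this, but I worked around it by reducing the number of workers and increasing the concurrency of each one. It seems Rabbit has some per-worker overhead...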
So instead of
celery multi start -A proj 240 -c2
I now run
celery multi start -A proj 20 -c24
FYI
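For anyone comparing the two layouts, here is a rough Python sketch of the arithmetic. It assumes the first number passed to celery multi is the number of worker nodes and -c is the per-node concurrency. The `node_pairs` figure is purely hypothetical: it shows how traffic would grow if every node exchanged events with every other node, which nobody in this thread has verified. The `layout` helper is made up for illustration.

```python
def layout(nodes, concurrency):
    """Summarize a worker layout: node count, total processes, ordered node pairs."""
    return {
        "nodes": nodes,
        "processes": nodes * concurrency,
        # Hypothetical: ordered (sender, receiver) pairs if all nodes talked to each other.
        "node_pairs": nodes * (nodes - 1),
    }


before = layout(240, 2)   # celery multi start -A proj 240 -c2
after = layout(20, 24)    # celery multi start -A proj 20 -c24

print(before)
print(after)
```

Both layouts give the same number of worker processes (480), but the second has 12 times fewer nodes connected to the broker, which fits the guess that the overhead is per worker node rather than per process.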