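I have an ansible playbook, shown below, that works fine most of the time. Recently, though, I noticed that it gets stuck on some servers in the ALL group and just sits there. It does not even move on to the other servers in the ALL list.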
# This will copy files
---
- hosts: ALL
  serial: "{{ num_serial }}"
  tasks:
    - name: copy files
      shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"

    - name: sleep for 5 sec
      pause: seconds=5
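When I started debugging, I found that the server itself was stuck: I can ssh in (log in) fine, but when I run the ps command it hangs and I never get my cursor back. That means ansible is also stuck on that server while executing the scp command above.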
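So my question is: even if one of my servers is in this state, why doesn't Ansible time out and move on to the other servers? Is there anything we can do so that ansible does not stall everything while waiting for that server to respond?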
Note that the server is up and running and I can ssh to it fine, but the ps command hangs when we run it, and that is why ansible hangs too.
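Is there a way to run ps aux | grep app on all servers in the ALL group and build a list of the servers where this command completes (if a server hangs, time out and move on to the other servers in the ALL list), and then pass that list to my ansible playbook above? Can we do all of this in one playbook?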
Update:
I am getting the following error:
ERROR! The 'pause' module bypasses the host loop, which is currently not supported in the free strategy and would instead execute for every host in the inventory list.
The error appears to have been in '/var/lib/jenkins/workspace/process/check.yml': line 10, column 9, but may
be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
- name: sleep for 5 sec
^ here
Build step 'Execute shell' marked build as failure
Answer 1
Q: "Ansible is getting stuck executing the scp command on that server. If I have some server in that state, why doesn't Ansible just time out and move on to the other servers?"
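Most likely because of the linear strategy:
By default, plays run with the linear strategy, in which all hosts will run each task before any host starts the next task.
Use the free strategy instead,
which allows each host to run until the end of the play as fast as it can.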
- hosts: all
  serial: "{{ num_serial }}"
  strategy: free
  tasks:
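The error quoted in the question's update says that pause is not supported with the free strategy. One possible workaround is sketched below, assuming all that is needed is a short delay per host: wait_for with only timeout set just sleeps for that many seconds and runs inside the normal host loop. The task-level timeout keyword (available from Ansible 2.10) and its value of 300 seconds are assumptions and not part of the original playbook.

```yaml
- hosts: ALL
  serial: "{{ num_serial }}"
  strategy: free
  tasks:
    - name: copy files
      shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
      # Assumption: fail this task on a host if it runs longer than 300 seconds
      timeout: 300

    - name: sleep for 5 sec
      # wait_for with only a timeout sleeps and then continues without error
      wait_for:
        timeout: 5
      # Sleep on the controller so a hung remote host cannot block the delay
      delegate_to: localhost
```

With the free strategy, a host that fails the copy task no longer holds up the others; each remaining host proceeds at its own pace.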
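Try async. The playbook below, with 3 hosts in the group test, asynchronously puts each host to sleep for a random number (1-10) of seconds and waits 5 seconds for the hosts to complete. One retry then collects the async status and the play ends. The next play, on localhost, runs as long as at least one of the hosts succeeded and prints the status of each host.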
- hosts: test
  strategy: free
  vars:
    max_sleep_time: 10
    max_wait_time: 5
  tasks:
    # Pick a random sleep time for this host
    - set_fact:
        my_time: "{{ max_sleep_time|random(start=1) }}"
    - debug:
        msg: "Sleep {{ my_time }} seconds"
    # Start the command in the background; poll: 0 returns immediately
    - command: "sleep {{ my_time }}"
      register: play_status
      async: "{{ max_wait_time }}"
      poll: 0
    # Check the background job; the host fails if it has not finished after one retry
    - async_status:
        jid: "{{ play_status.ansible_job_id }}"
      register: play_status
      until: play_status.finished
      retries: 1

- hosts: localhost
  tasks:
    # Report the final async status recorded for every host in the group
    - debug:
        msg: "{{ item }} finished: {{ hostvars[item].play_status.finished }}"
      loop: "{{ groups['test'] }}"
The (abridged) output shows that hosts test_01 and test_03 slept for 9 seconds, did not finish in time (max_wait_time: 5), and failed.
TASK [debug]
ok: [test_01] => {
    "msg": "Sleep 9 seconds"
}
ok: [test_02] => {
    "msg": "Sleep 1 seconds"
}
ok: [test_03] => {
    "msg": "Sleep 9 seconds"
}

TASK [async_status]
changed: [test_02]
fatal: [test_01]: FAILED! => {"ansible_job_id": "10701665445.1564", "attempts": 1, "changed": false, "finished": 0, "started": 1}
fatal: [test_03]: FAILED! => {"ansible_job_id": "752000555573.1558", "attempts": 1, "changed": false, "finished": 0, "started": 1}
...

TASK [debug]
ok: [localhost] => (item=test_01) => {
    "msg": "test_01 finished: 0"
}
ok: [localhost] => (item=test_02) => {
    "msg": "test_02 finished: 1"
}
ok: [localhost] => (item=test_03) => {
    "msg": "test_03 finished: 0"
}
...

PLAY RECAP
localhost : ok=2 changed=0 unreachable=0 failed=0
test_01   : ok=3 changed=1 unreachable=0 failed=1
test_02   : ok=4 changed=2 unreachable=0 failed=0
test_03   : ok=3 changed=1 unreachable=0 failed=1
Set max_wait_time > max_sleep_time to see all hosts finish.
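Applied to the copy task from the question, the same async pattern might look like the sketch below. The 600 second limit, the retries and delay values, and the variable names copy_job and copy_result are illustrative assumptions. A host whose job has not finished when the retries run out fails and is dropped, while the other hosts carry on.

```yaml
- hosts: ALL
  strategy: free
  tasks:
    - name: start the copy in the background
      shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
      async: 600   # assumed upper bound, in seconds, for the copy to run
      poll: 0      # do not wait here; return immediately
      register: copy_job

    - name: check on the copy until it finishes or the retries run out
      async_status:
        jid: "{{ copy_job.ansible_job_id }}"
      register: copy_result
      until: copy_result.finished
      retries: 60  # assumed: up to 60 checks ...
      delay: 10    # ... 10 seconds apart, roughly 600 seconds in total
```

Whether a server that hangs on ps can still start and report an async job would need to be tested; the sketch only bounds how long the controller waits.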
Answer 2
You can gather facts.
---
- hosts: all
  gather_facts: True
  tasks:
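By gathering facts explicitly, you force Ansible to try to connect to every host (and to update the fact cache). If a host is unreachable, it is skipped for the rest of the playbook. The default timeout for gathering facts is 10 seconds, which limits how long you have to wait.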
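To tie this back to the question, here is a minimal sketch of a two-play playbook, assuming Ansible 2.10 or later for the task-level timeout keyword. The first play gathers facts and runs the ps check with a time limit; hosts that are unreachable or fail the check are left out of the second play, which does the copy. The 10 second values and the task names are assumptions, and whether the time limit fires cleanly on a host where ps itself hangs would need to be tested.

```yaml
---
# Play 1: pre-flight check; hosts that fail here are dropped from later plays
- hosts: ALL
  strategy: free
  gather_facts: True
  gather_timeout: 10        # assumed value; 10 seconds is also the default
  tasks:
    - name: check that ps responds
      shell: "ps aux | grep app"
      timeout: 10           # assumed limit, in seconds, before the task fails
      changed_when: false   # a read-only check should not report a change

# Play 2: runs only on the hosts that passed the pre-flight check
- hosts: ALL
  serial: "{{ num_serial }}"
  tasks:
    - name: copy files
      shell: "(ssh -o StrictHostKeyChecking=no abc.com 'ls -1 /var/lib/jenkins/workspace/copy/stuff/*' | parallel -j20 'scp -o StrictHostKeyChecking=no abc.com:{} /data/records/')"
```

Keeping the check in its own play means the copy play keeps its original serial setting while the check runs with the free strategy.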