Combine Ansible async with retries - loops

Ok, now I have a really tricky case with Ansible.
I need to run my task asynchronously with retries (i.e. with until loop) and then fail task if timeout exceeds. So both two parameters must control my play: retries count (play fails if retry count exceeded) and timeout (play fails if timeout exceeded).
I can implement each strategy separately:
- name: My shell task with retries
shell: set -o pipefail && ./myscript.sh 2>&1 | tee -a "{{mylogfile}}"
args:
chdir: "{{myscript_dir}}/"
executable: /bin/bash
register: my_job
until: my_job is succeeded
retries: "{{test_retries}}"
delay: 0
or with async:
- name: My async shell task
shell: set -o pipefail && ./myscript.sh 2>&1 | tee -a "{{mylogfile}}"
args:
chdir: "{{myscript_dir}}/"
executable: /bin/bash
register: my_job
async: "{{test_timeout}}"
poll: 0
- name: Tracking for async shell task
wait_for:
path: "{{mylogfile}}"
search_regex: '^.*Done in \S+'
timeout: "{{test_timeout}}"
ignore_errors: yes
register: result
The second task parses previous task log until the job is finished - i.e. searches "Done in x seconds" string. Maybe it's not the best practice and I should use async_status but can't find how to set timeout with it (it's only have retries for checking job status which is pretty silly for me).
Sooo... Can I combine both strategies to control my task both with retries count and timeout?
UPD: I tried to run both until and async for shell module and surprisingly my play doesn't fail but retries doesn't work. The task was just started as fire-and-forget task and executed only one time without retries. So this is not an option.
- name: My shell task with retries and async
shell: set -o pipefail && ./myscript.sh 2>&1 | tee -a "{{mylogfile}}"
args:
chdir: "{{myscript_dir}}/"
executable: /bin/bash
register: my_job
until: my_job is succeeded
retries: "{{test_retries}}"
delay: 0
async: "{{test_timeout}}"
poll: 0
- name: My async shell task
wait_for:
path: "{{mylogfile}}"
search_regex: '^.*Done in \S+'
timeout: "{{test_timeout}}"
ignore_errors: yes
register: result

Have you tried using retries and delay together?
- name: Checking the build status 1
async_status:
jid: "{{ job1.ansible_job_id }}"
register: job_result
until: job_result.finished
retries: 10
delay: 1
when: job1.ansible_job_id is defined
This will check the async status 10 times with a 1 second interval and it will fail if during the 10 seconds the "until" conditions is not satisfied

Related

Getting the status of Ansible import_tasks to satisfy an until loop's condition

I'm attempting to get an until loop working for an import_tasks, and to break the loop when it meets a condition within the import_tasks. Not sure if this is even possible, or if there's a better way to acheive this? Making it slightly more tricky is one of the tasks in include_tasks is a powershell script which returns a status message which is used to satisfy the until requirement.
So yeah, the goal is to run check_output.yml until there is no longer any RUNNING status reported by the script.
main.yml:
- import_tasks: check_output.yml
until: "'RUNNING' not in {{outputStatus}}"
retries: 100
delay: 120
check_output.yml
---
- name: "Output Check"
shell: ./get_output.ps1" # this will return `RUNNING` in std_out if it's still running
args:
executable: /usr/bin/pwsh
register: output
- name: "Debug output"
debug: outputStatus=output.stdout_lines
For the record, this works just fine if I don't use import_tasks and just use an until loop on the "Output Check" task. The problem with that approach is you have to run Ansible with -vvv to get the status message for each of the loops which causes a ton of extra, unwanted debug messages. I'm trying to get the same status for each loop without having to add verbosity.
Ansible version is 2.11
Q: "(Wait) until there is no longer any RUNNING status reported by the script."
A: An option might be the usage of the module wait_for.
For example, create the script on the remote host. The script takes two parameters. The PID of a process to be monitored and DELAY monitoring interval in seconds. The script writes the status to /tmp/$PID.status
# cat /root/bin/get_status.sh
#!/bin/sh
PID=$1
DELAY=$2
ps -p $PID > /dev/null 2>&1
STATUS=$?
while [ "$STATUS" -eq "0" ]
do
echo "$PID is RUNNING" > /tmp/$PID.status
sleep $DELAY
ps -p $PID > /dev/null 2>&1
STATUS=$?
done
echo "$PID is NONEXIST" > /tmp/$PID.status
exit 0
Start the script asynchronously
- command: "/root/bin/get_status.sh {{ _pid }} {{ _delay }}"
async: "{{ _timeout }}"
poll: 0
register: get_status
In the next step wait for the process to stop running
- wait_for:
path: "/tmp/{{ _pid }}.status"
search_regex: NONEXIST
retries: "{{ _retries }}"
delay: "{{ _delay }}"
After the condition passed or the module wait_for reached the timeout run async_status to make sure the script terminated
- async_status:
jid: "{{ get_status.ansible_job_id }}"
register: job_result
until: job_result.finished
retries: "{{ _retries }}"
delay: "{{ _delay }}"
Example of a complete playbook
- hosts: test_11
vars:
_timeout: 60
_retries: "{{ (_timeout/_delay|int)|int }}"
tasks:
- debug:
msg: |-
time: {{ '%H:%M:%S'|strftime }}
_pid: {{ _pid }}
_delay: {{ _delay }}
_timeout: {{ _timeout }}
_retries: {{ _retries }}
when: debug|d(false)|bool
- command: "/root/bin/get_status.sh {{ _pid }} {{ _delay }}"
async: "{{ _timeout }}"
poll: 0
register: get_status
- debug:
var: get_status
when: debug|d(false)|bool
- wait_for:
path: "/tmp/{{ _pid }}.status"
search_regex: NONEXIST
retries: "{{ _retries }}"
delay: "{{ _delay }}"
- debug:
msg: "time: {{ '%H:%M:%S'|strftime }}"
when: debug|d(false)|bool
- async_status:
jid: "{{ get_status.ansible_job_id }}"
register: job_result
until: job_result.finished
retries: "{{ _retries }}"
delay: "{{ _delay }}"
- debug:
msg: "time: {{ '%H:%M:%S'|strftime }}"
when: debug|d(false)|bool
- file:
path: "/tmp/{{ _pid }}.status"
state: absent
when: _cleanup|d(true)|bool
On the remote, start a process to be monitored. For example,
root#test_11:/ # sleep 60 &
[1] 28704
Run the playbook. Fit _timeout and _delay to your needs
shell> ansible-playbook pb.yml -e debug=true -e _delay=3 -e _pid=28704
PLAY [test_11] *******************************************************************************
TASK [debug] *********************************************************************************
ok: [test_11] =>
msg: |-
time: 09:06:34
_pid: 28704
_delay: 3
_timeout: 60
_retries: 20
TASK [command] *******************************************************************************
changed: [test_11]
TASK [debug] *********************************************************************************
ok: [test_11] =>
get_status:
ansible_job_id: '331975762819.28719'
changed: true
failed: 0
finished: 0
results_file: /root/.ansible_async/331975762819.28719
started: 1
TASK [wait_for] ******************************************************************************
ok: [test_11]
TASK [debug] *********************************************************************************
ok: [test_11] =>
msg: 'time: 09:07:27'
TASK [async_status] **************************************************************************
changed: [test_11]
TASK [debug] *********************************************************************************
ok: [test_11] =>
msg: 'time: 09:07:28'
TASK [file] **********************************************************************************
changed: [test_11]
PLAY RECAP ***********************************************************************************
test_11: ok=8 changed=3 unreachable=0 failed=0 skipped=0 rescued=0 ignored=0

Ansible - retry failed iterations in a loop

I have a list of entities that I need to remove, but if I iterate the loop too quickly, the removal may fail as the operation can only be done serially and needs approx 10 seconds between removals. So, I am doing this
- name: Loop through removing all hosts
shell: "echo yes | gravity remove --force {{ item }}"
loop: "{{ result.stdout_lines }}"
loop_control:
pause: 12
this generally works fine, but very occasionally I may get an error when the 12 seconds is not enough. I don't want to increase the pause, so am trying to figure out how to test and retry any failures.
A simple additional pause and retry again if an individual node fails would work. Any idea how I can do this?
Duh! I had originally tried an until, but stupidly forgot to register the var, so it wasn't working, but this now works
- name: Loop through removing all hosts
shell: "echo yes | gravity remove --force {{ item }}"
register: result
loop: "{{ file_list.stdout_lines }}"
loop_control:
pause: 12
retries: 2
delay: 12
until: result is not failed

How to get a rundeck job PID and echo it

How to get a rundeck jobs PID and then echo it.
Simple task of echo a rundeck jobs PID
ls &
echo $!
When i tried to execute this commands on Local Command option, result is the an error: 'ls cannot access '&': No such file
Rundeck encapsulates the jobs in Java threads, a good way to track the Rundeck threads is to use visualvm, anyway, you can obtain the Steps PIDs in the same UNIX/Linux way. The best approach is to do it on scripts steps, for example:
ls &
echo $!
Job Definition example:
- defaultTab: nodes
description: ''
executionEnabled: true
id: 0aeaa0f4-d090-4083-b0a5-2878c5f558d1
loglevel: INFO
name: HelloWorld
nodeFilterEditable: false
plugins:
ExecutionLifecycle: null
scheduleEnabled: true
sequence:
commands:
- fileExtension: .sh
interpreterArgsQuoted: false
script: |-
ls &
echo $!
scriptInterpreter: /bin/bash
keepgoing: false
strategy: node-first
uuid: 0aeaa0f4-d090-4083-b0a5-2878c5f558d1
Result.
Update: same job but using Command step:
- defaultTab: nodes
description: ''
executionEnabled: true
id: 8c59758e-32f3-4166-92b0-50d818074368
loglevel: INFO
name: HelloWorld
nodeFilterEditable: false
plugins:
ExecutionLifecycle: null
scheduleEnabled: true
sequence:
commands:
- exec: ls & echo $!
keepgoing: false
strategy: node-first
uuid: 8c59758e-32f3-4166-92b0-50d818074368
Result.

Check if a file is 20 hours old in ansible

I'm able to get the timestamp of a file using Ansible stat module.
- stat:
path: "/var/test.log"
register: filedets
- debug:
msg: "{{ filedets.stat.mtime }}"
The above prints mtime as 1594477594.631616 which is difficult to understand.
I wish to know how can I put a when condition check to see if the file is less than 20 hours old?
You can also achieve this kind of tasks without going in the burden to do any computation via find and its age parameter:
In your case, you will need a negative value for the age:
Select files whose age is equal to or greater than the specified time.
Use a negative age to find files equal to or less than the specified time.
You can choose seconds, minutes, hours, days, or weeks by specifying the first letter of any of those words (e.g., "1w").
Source: https://docs.ansible.com/ansible/latest/modules/find_module.html#parameter-age
Given the playbook:
- hosts: all
gather_facts: no
tasks:
- file:
path: /var/test.log
state: touch
- find:
paths: /var
pattern: 'test.log'
age: -20h
register: test_log
- debug:
msg: "The file is exactly 20 hours old or less"
when: test_log.files | length > 0
- file:
path: /var/test.log
state: touch
modification_time: '202007102230.00'
- find:
paths: /var
pattern: 'test.log'
age: -20h
register: test_log
- debug:
msg: "The file is exactly 20 hours old or less"
when: test_log.files | length > 0
This gives the recap:
PLAY [all] **********************************************************************************************************
TASK [file] *********************************************************************************************************
changed: [localhost]
TASK [find] *********************************************************************************************************
ok: [localhost]
TASK [debug] ********************************************************************************************************
ok: [localhost] => {
"msg": "The file is exactly 20 hours old or less"
}
TASK [file] *********************************************************************************************************
changed: [localhost]
TASK [find] *********************************************************************************************************
ok: [localhost]
TASK [debug] ********************************************************************************************************
skipping: [localhost]
PLAY RECAP **********************************************************************************************************
localhost : ok=5 changed=2 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
- stat:
path: "/var/test.log"
register: filedets
- debug:
msg: "{{ (ansible_date_time.epoch|float - filedets.stat.mtime ) > (20 * 3600) }}"

Ansible - Loops with_fileglob - become_user not working -- running action on source machine

Env is: Ansible 1.9.4 or 1.9.2, Linux CentOS 6.5
I have a role build where:
$ cat roles/build/defaults/main.yml:
---
build_user: confman
build_group: confman
tools_dir: ~/tools
$ cat roles/build/tasks/main.yml
- debug: msg="User is = {{ build_user }} -- {{ tools_dir }}"
tags:
- koba
- name: Set directory ownership
file: path="{{ tools_dir }}" owner={{ build_user }} group={{ build_group }} mode=0755 state=directory recurse=yes
become_user: "{{ build_user }}"
tags:
- koba
- name: Set private key file access
file: path="{{ item }}" owner={{ build_user }} group={{ build_group }} mode=0600 state=touch
with_fileglob:
- "{{ tools_dir }}/vmwaretools-lib-*/lib/insecure_private_key"
# with_items:
# - ~/tools/vmwaretools/lib/insecure_private_key
become_user: "{{ build_user }}"
tags:
- koba
In my workspace: hosts file (inventory) contains:
[ansible_servers]
server01.project.jenkins
site.yml (playbook) contains:
---
- hosts: ansible_servers
sudo: yes
roles:
- build
I'm running the following command:
$ ansible-playbook site.yml -i hosts -u confman --private-key ${DEPLOYER_KEY_FILE} -t koba
I'm getting the following error and for some reason, become_user in Ansible while using Ansible loop: with_fileglob is NOT using ~ (home directory) of confman user (which is set in variable {{ build_user }}, instead of that, it's picking my own user ID (c123456).
In the console output for debug action, it's clear that the user (due to become_user) is confman and value of tools_dir variable is ~/tools.
PLAY [ansible_servers] ********************************************************
GATHERING FACTS ***************************************************************
ok: [server01.project.jenkins]
TASK: [build | debug msg="User is = {{ build_user }} -- {{ tools_dir }}"] *****
ok: [server01.project.jenkins] => {
"msg": "User is = confman -- ~/tools"
}
TASK: [build | Set directory ownership] ***************************************
changed: [server01.project.jenkins]
TASK: [build | Set private key file access] ***********************************
failed: [server01.project.jenkins] => (item=/user/home/c123456/tools/vmwaretools-lib-1.0.8-SNAPSHOT/lib/insecure_private_key) => {"failed": true, "item": "/user/home/c123456/tools/vmwaretools-lib-1.0.8-SNAPSHOT/lib/insecure_private_key", "parsed": false}
BECOME-SUCCESS-ajtxlfymjcquzuolgfrrxbssfolqgrsg
Traceback (most recent call last):
File "/tmp/ansible-tmp-1449615824.69-82085663620220/file", line 1994, in <module>
main()
File "/tmp/ansible-tmp-1449615824.69-82085663620220/file", line 372, in main
open(path, 'w').close()
IOError: [Errno 2] No such file or directory: '/user/home/c123456/tools/vmwaretools-lib-1.0.8-SNAPSHOT/lib/insecure_private_key'
OpenSSH_5.3p1, OpenSSL 1.0.1e-fips 11 Feb 2013
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Applying options for *
debug1: auto-mux: Trying existing master
debug1: mux_client_request_session: master session id: 2
debug1: mux_client_request_session: master session id: 2
Shared connection to server01.project.jenkins closed.
As per the error above, the file it's trying for variable item is /user/home/c123456/tools/vmwaretools-lib-1.0.8-SNAPSHOT/lib/insecure_private_key but there's no such file inside my user ID's home directory. But, this file does exist for user confman's home directory.
i.e. the following file exists.
/user/home/confman/tools/vmwaretools-lib-1.0.7-SNAPSHOT/lib/insecure_private_key
/user/home/confman/tools/vmwaretools-lib-1.0.7/lib/insecure_private_key
/user/home/confman/tools/vmwaretools-lib-1.0.8-SNAPSHOT/lib/insecure_private_key
All, I want is to iterate of these files in ~confman/tools/vmwaretools-lib-*/.. location containing the private key file and change the permission but using "with_fileglob" become_user to set the user during an action is NOT working.
If I comment out the with_fileglob section and use/uncomment with_items section in the tasks/main.yml, then it (become_user) works fine and picks ~confman (instead of ~c123456) and gives the following output:
TASK: [build | Set private key file access] ***********************************
changed: [server01.project.jenkins] => (item=~/tools/vmwaretools/lib/insecure_private_key)
One strange thing I found is, there is no user c123456 on the target machine (server01.project.jenkins) and that's telling me that with_fileglob is using the source/local/master Ansible machine (where I'm running ansible-playbook command) to find the GLOB Pattern (instead of finding / running it over SSH on server01.project.jenkins server), It's true that on local/source Ansible machine, I'm logged in as c123456. Strange thing is, in the OUTPUT, it still shows the target machine but pattern path is coming from source machine as per the output above.
failed: [server01.project.jenkins]
Any idea! what I'm missing here? Thanks.
PS:
- I don't want to set tools_dir: "~{{ build_user }}/tools" or hardcode it as a user can pass tools_dir variable at command line (while running ansible-playbook command using -e / --extra-vars "tools_dir=/production/slave/tools"
Further researching it, I found with_fileglob is for List of local files to iterate over, described using shell fileglob notation (e.g., /playbooks/files/fooapp/*) then, what should I use to iterate over on target/remote server (server01.project.jenkins in my case) using pattern match (fileglob)?
Using with_fileglob, it'll always run on the local/source/master machine where you are running ansible-playbook/ansible. Ansible docs for Loops doesn't clarifies this info (http://docs.ansible.com/ansible/playbooks_loops.html#id4) but I found this clarification here: https://github.com/lorin/ansible-quickref
Thus, while looking for the pattern, it's picking the ~ for user c123456.
Console output is showing [server01.project.jenkins] as it's a different processing/step to read what's there in the inventory/hosts file.
I tried to use with_lines as well as per this post: ansible: Is there something like with_fileglobs for files on remote machine?
But, when I tried the following, it still didn't work i.e. read the pattern on local machine instead of target machine (Ansible docs tells with_items doesn't run on local machine but on the controlling machine):
file: path="{{ item }}" ....
with_items: ls -1 {{ tools_dir }}/vmwaretools-lib-*/lib/insecure_private_key
become_user: {{ build_user }}
Finally to solve the issue, I just went on the plain OS command round using shell (again, this might not be a very good solution if the target env is not a Linux type OS) but for now I'm good.
- name: Set private key file access
shell: "chmod 0400 {{ tools_dir }}/vmtools-lib-*/lib/insecure_private_key"
become_user: "{{ build_user }}"
tags:
- koba

Resources