Ansible at scale 2 of 2

Template (with Jinja2) and files

In an Ansible role, we can use either files or templates to deliver configuration files. If the configuration file is identical across all targets, we can place it in the files directory and push it out as is. If its content varies depending on the host or the cluster size, we use a Jinja2 template. For example, in a ZooKeeper configuration, one entry may require the total number of nodes in the cluster, a second may require the hostname of the server itself, and a third may require a comma-separated line with the hostnames of all nodes in the cluster. This is a typical use case for a Jinja2 template.
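As a rough illustration of those three kinds of entries, the sketch below prints the values with the debug module instead of rendering a real file. The group name zookeeper is an assumption made for this example, not something taken from an actual inventory.

```yaml
# Minimal sketch: the inventory group "zookeeper" is hypothetical.
- hosts: zookeeper
  gather_facts: no
  tasks:
    - name: Show the three cluster-dependent values a template would render
      debug:
        msg:
          # total number of nodes in the cluster
          - "node_count={{ groups['zookeeper'] | length }}"
          # name of the host currently being processed
          - "this_host={{ inventory_hostname }}"
          # comma-separated list of all nodes in the cluster
          - "all_hosts={{ groups['zookeeper'] | sort | join(',') }}"
```

The same three expressions could be placed in a .j2 file and rendered with the template module.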

We need to make sure the Jinja2 version is 2.11.2 or later (as of May 2020), because older versions such as 2.7.2 have known issues with namespaces. To check the version and then upgrade Jinja2, we use pip:

pip show Jinja2
pip install -U Jinja2

The Ansible template module takes a Jinja2 file as input and delivers the rendered file to the target host. Note that if the template references host variables from the Ansible playbook, you need to gather facts about the host. This means you will have to use a basic playbook like the one below instead of an ad hoc command.

A basic playbook to test a Jinja2 template is:

- hosts: '{{ansible_limit}}'
  gather_facts: yes
  tasks:
  - template:
      src: cassandra_xml.j2
      dest: /tmp/cassandra.xml

Although Jinja2 offers a lot of flexibility with loops and if-else statements, it is a templating language, not a programming language. It takes some tricks to achieve what you could otherwise do easily in a programming language. One example is persisting a variable beyond a loop. According to the Jinja2 documentation, it is not possible to set variables inside a block and have them show up outside of it. This also applies to loops. The only exception to that rule is if statements, which do not introduce a scope. To work around this, you have to use a namespace object for each loop whose variable you need to access after the loop ends.

{% block db_cluster_config_nobackup %}
{% set ns=namespace(nodeid=0) %}
{% for host in groups[my_db_group]|sort %}
   <var name="DBHost{{ns.nodeid+1}}" value="{{hostvars[host].inventory_hostname}}" />
{% set ns.nodeid=ns.nodeid+1 %}
{% endfor %}
   <var name="DBClusterHosts" value="{% for i in range(ns.nodeid) %}
${DBHost{{i+1}}}{% if not loop.last %},{% endif %}
{% endfor %}" />
{% endblock db_cluster_config_nobackup %}

For the same reason, it is worth clearly defining the start and end of each block to avoid trouble with variable scoping. These limitations make Jinja2 templates hard to read, and troubleshooting them may take several rounds of playbook runs.

Handler vs conditional task

Sometimes you only want to run a task when the previous task results in a change. There are two ways to achieve this: a conditional task and a handler.

With a conditional task, we register the result of the previous task in a variable and execute the ensuing tasks conditionally based on that variable. We have to specify the condition on each subsequent task that needs to execute conditionally. If the condition is met, these tasks execute immediately after the task that registers the variable.
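A minimal sketch of the conditional approach is shown here. The file name app.conf and the variable name copy_result are placeholders chosen for illustration.

```yaml
# Sketch only: app.conf and copy_result are hypothetical names.
- hosts: all
  tasks:
    - name: Copy configuration file
      copy:
        src: app.conf
        dest: /tmp/app.conf
      register: copy_result          # keep the task result for later checks

    - name: Runs immediately, but only if the copy changed something
      debug:
        msg: "app.conf was updated"
      when: copy_result is changed   # repeat this condition on every dependent task
```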

The alternative is an Ansible mechanism called a handler. A handler implements a series of tasks in a separate YAML file in the handlers directory under the role. The triggering task needs to notify the handler. The tasks in the handler fire if the triggering task returns “changed” in its result. Handlers are a great way to shorten a task list or playbook. However, we need to understand several subtleties with regard to handlers:

Although a handler is notified during a task run, it is not fired until the end of each block of tasks in a play. It is not fired immediately after the triggering task.
A handler executes only once at the end of the play, even if it was notified multiple times by different tasks during the play run.
Handler tasks are executed in the order of declaration, not in the order of notification.

In summary, Ansible’s notification handling mechanism is asynchronous, once-only, and out of sequence. The points above are illustrated in the following playbook:

---
- hosts: ghdocker
  tasks:
    - name: CopyFile3
      copy:
        src: ~/ansible/file3.txt
        dest: /tmp/file3.txt
      notify:
        - handler3
        - handlergeneral
    - name: CopyFile2
      copy:
        src: ~/ansible/file2.txt
        dest: /tmp/file2.txt
      notify:
        - handler2
        - handlergeneral
    - name: CopyFile1
      copy:
        src: ~/ansible/file1.txt
        dest: /tmp/file1.txt
      notify:
        - handler1
        - handlergeneral
    - debug: msg="end of play!"
  handlers:
    - name: handler1
      debug: msg="file1.txt has been copied."
    - name: handler2
      debug: msg="file2.txt has been copied."
    - name: handler3
      debug: msg="file3.txt has been copied."
    - name: handlergeneral
      debug: msg="A file has been copied"

Here is the output of the playbook run:

PLAY [ghdocker] ******************************************************************************

TASK [Gathering Facts] ******************************************************************************
ok: [ghdocker]

TASK [CopyFile3] ******************************************************************************
changed: [ghdocker]

TASK [CopyFile2] ******************************************************************************
changed: [ghdocker]

TASK [CopyFile1] ******************************************************************************
changed: [ghdocker]

TASK [debug] ******************************************************************************
ok: [ghdocker] => {
    "msg": "end of play!"
}
RUNNING HANDLER [handler1] ******************************************************************************
ok: [ghdocker] => {
    "msg": "file1.txt has been copied."
}

RUNNING HANDLER [handler2] ******************************************************************************
ok: [ghdocker] => {
    "msg": "file2.txt has been copied."
}

RUNNING HANDLER [handler3] ******************************************************************************
ok: [ghdocker] => {
    "msg": "file3.txt has been copied."
}

RUNNING HANDLER [handlergeneral] ******************************************************************************
ok: [ghdocker] => {
    "msg": "A file has been copied"
}
PLAY RECAP ******************************************************************************
ghdocker                   : ok=9    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

Handlers are a good way to preserve idempotency. For example, Ansible does not have a way to import a yum .repo file to create a repo. We have to take two steps:

use get_url module to download the repo file (e.g. to /tmp),
use shell module to call yum-config-manager.

The problem is that these two steps are not idempotent. If you repeat them, Ansible attempts to import the same repo file again. A little trick here is to set force=no on get_url so that it does not download the file if it is already present in the target directory. Then notify a handler to import the repo file, so the shell command is only called if there is a change.

The task looks like this:

- name: download repo file
  get_url:
    url: https://download.docker.com/linux/centos/docker-ce.repo
    dest: /tmp/docker-ce.repo
    mode: '0755'
    force: no
  notify:
    - Add docker repository

The handler looks like this:

- name: Add docker repository
  shell: yum-config-manager --add-repo=/tmp/docker-ce.repo

The handler only fires when it is notified, that is, after the get_url module returns changed in its result. Running the task again will not attempt to add the same repo again.

Note that when you use the command or shell module, Ansible typically reports a changed status. If this is not desired (e.g. you don’t want it to notify a handler every time), the behaviour can be overridden with the changed_when parameter, which specifies the conditions under which the shell/command task is considered to have changed.
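A short sketch of changed_when follows. The script path and the UPDATED marker in its output are assumptions made for illustration; only the changed_when and register keywords come from Ansible itself.

```yaml
# Sketch only: sync_data.sh and the "UPDATED" marker are hypothetical.
- hosts: all
  tasks:
    - name: Read-only command that should never report a change
      command: uname -r
      register: kernel_version
      changed_when: false

    - name: Report changed only when the script says it updated something
      shell: /usr/local/bin/sync_data.sh
      register: sync_result
      changed_when: "'UPDATED' in sync_result.stdout"
```

With the second task, a notified handler would only fire on runs where the output contains the marker.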

Ansible commands

In operations, our engineers often need to run a command on a group of servers. I encourage the use of Ansible ad hoc commands whenever possible. I recommend starting with the following two commands:

ansible-inventory --graph
ansible all -m ping

The ping module triggers an “Ansible ping” to the targets in the specified group. Over the years, the Ansible community has developed many helpful modules, such as yum, yum_repository, apt_rpm, uri, synchronize, find, copy, etc., and many can be used in place of bash commands. However, sometimes the expected Ansible module is either unavailable or lacks the needed functionality. For example, Ansible’s uri module cannot replace a curl command with the following switches:

curl -s -XGET http://{{inventory_hostname}}:8080/objects/{{object_id}}/binary/all -o /dev/null -w '%{response_code} %{size_download} %{time_total} %{speed_download}\n' | awk '{if ($1==200) print "size="$2/1048576"MB,time="$3"s,speed="$4/1048576"MB/s"; else if($1==404) print "Cannot find object {{object_id}}"; else print "Unknown error. Code "$1 " when retrieving object {{object_id}}";}'

To leverage all these curl options, we still need to use the shell module in Ansible to call the command in shell.
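Wrapped in a playbook, a trimmed-down version of that call might look like the sketch below. It assumes object_id is passed in as an extra variable and leaves out the awk post-processing for brevity.

```yaml
# Sketch only: object_id is expected from the command line, e.g. -e object_id=123
- hosts: all
  gather_facts: no
  tasks:
    - name: Time an HTTP download with curl through the shell module
      shell: >-
        curl -s -XGET -o /dev/null
        -w '%{response_code} %{size_download} %{time_total} %{speed_download}'
        http://{{ inventory_hostname }}:8080/objects/{{ object_id }}/binary/all
      register: curl_result
      changed_when: false            # a read-only request should not report changed

    - name: Show the raw curl statistics
      debug:
        var: curl_result.stdout
```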

Other helpful Ansible commands include ansible-pull for pulling playbooks from a VCS repo, and ansible-console for interactive ad hoc command execution.

Tags and extra variables

Both tags (-t) and extra variables (-e) are great ways to achieve flow control in playbooks. You can choose to run only the tasks with certain tags, or to skip the tasks with certain tags. Extra variables can override the default variables from the host or the group. Both are great tools for improving the reusability of a playbook.
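The sketch below shows both mechanisms in one play. The tag names, the package_state variable and the git package are illustrative choices, not part of any real role.

```yaml
# Sketch only: tag names, package_state and the git package are hypothetical.
# Run only the install step and override the default:
#   ansible-playbook site.yml -t install -e package_state=latest
# Skip the configure step:
#   ansible-playbook site.yml --skip-tags configure
- hosts: all
  vars:
    package_state: present           # default value, overridden by -e
  tasks:
    - name: Install a package
      package:
        name: git
        state: "{{ package_state }}"
      tags: [install]

    - name: Placeholder for a configuration step
      debug:
        msg: "configuring"
      tags: [configure]
```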

Speed up execution

There are several ways to speed up the execution of Ansible tasks. For example, we can disable fact gathering by default, so that facts are only gathered when explicitly requested. To do this, set gathering = explicit under the [defaults] section of the Ansible configuration file. If you have to gather facts, you can cache them using the following:

[defaults]
gathering = smart
fact_caching_timeout = 86400
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_fact_cache

Other than caching, Ansible allows you to select from several execution strategies for a playbook. The linear strategy introduces configurable parallelization per task. The free strategy introduces parallelization per play.

linear (the default): each task runs on up to the fork limit of hosts at a time, then on the next set of hosts, until the whole batch is done, before moving on to the next task. This mode keeps progress synchronized at each task.
free: as described above, this is preferred when there is no need to coordinate progress between target hosts. Each host runs freely all the way to the end of the play.
debug: essentially the linear strategy, except that progress is controlled by an interactive debug session.

The fork limit, with a conservative default of 5, can be adjusted in Ansible configuration. The execution strategy can be either specified in Ansible configuration, or specified per play. For example, the following snippet sets the strategy to free for the current play:

---
- hosts: all
  strategy: free
  tasks:
...

The Ansible documentation also mentions some play-level keywords to control execution. The serial keyword is one of them. It can be set along with any of the strategies above, and it batches the hosts. The value can be a single number, a percentage, or even a list of numbers (if the size of each batch is different). Note that the batch size should not exceed the fork limit. This is particularly useful in rolling upgrades. For example:

---
- name: test play
  hosts: webservers
  serial: "30%"
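The list form of serial can be sketched as follows. The batch sizes and the debug task are arbitrary values picked for illustration.

```yaml
# Sketch only: batch sizes are arbitrary.
- name: rolling upgrade in growing batches
  hosts: webservers
  serial:
    - 1          # first batch: a single canary host
    - 5          # second batch: five hosts
    - "30%"      # remaining batches: 30% of the hosts each
  tasks:
    - name: Placeholder for the upgrade step
      debug:
        msg: "upgrading {{ inventory_hostname }}"
```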

With the parallelization capacity outlined above, a potential concern is that a heavy-lifting task may consume a lot of resources if it is executed on all hosts at the same time. Luckily, Ansible has a task/block-level keyword, throttle, which “de-parallelizes” the multi-host progress at a particular task or block. Here is an example provided by the Ansible documentation:

tasks:
- command: /path/to/cpu_intensive_command
  throttle: 1
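At block level the keyword can be applied as in this sketch. The sleep commands are placeholders standing in for real resource-heavy steps.

```yaml
# Sketch only: /bin/sleep stands in for real heavy commands.
- hosts: all
  tasks:
    - name: Run the heavy steps on at most two hosts at a time
      throttle: 2
      block:
        - name: First heavy step
          command: /bin/sleep 5
        - name: Second heavy step
          command: /bin/sleep 5
```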

If there are long-running tasks, we can specify async and poll values so that Ansible leaves a task running and checks back later. For example, the following task runs in the background while Ansible checks on it every 5 seconds; if the task takes longer than 45 seconds, it is considered failed:

---
  - hosts: all
    remote_user: root
    tasks:
      - name: simulate long running task for 15 sec, wait for up to 45 sec, poll every 5 sec
        command: /bin/sleep 15
        async: 45
        poll: 5
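A variant worth knowing is poll: 0, where Ansible starts the job and proceeds to the next task straight away. The sketch below then checks on the job with the async_status module; the variable names sleeper and job_result and the retry counts are arbitrary choices.

```yaml
# Sketch only: variable names and retry values are arbitrary.
- hosts: all
  tasks:
    - name: Start a long running task and do not wait for it
      command: /bin/sleep 15
      async: 45
      poll: 0
      register: sleeper

    - name: Check back until the background job has finished
      async_status:
        jid: "{{ sleeper.ansible_job_id }}"
      register: job_result
      until: job_result.finished
      retries: 10
      delay: 5
```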

Python Version

The recommendation is to use Python 3 for any new development, because new work carries no legacy dependencies. If no preference is specified, Ansible tries to discover an appropriate interpreter, and the result can be seen in the response of the Ansible ping module. You can also force the interpreter by providing the additional variable ansible_python_interpreter. To change the default interpreter, specify interpreter_python in ansible.cfg. For example:

[defaults]
inventory=~/ansible/inventories/site.yml
library=~/ansible/library/
vault_password_file = ~/ansible/.vault_key
host_key_checking = False
display_skipped_hosts = False
retry_files_enabled = False
interpreter_python=/usr/bin/python3

[privilege_escalation]
become_method=sudo

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=1h
pipelining = True
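For a single play, the ansible_python_interpreter variable mentioned earlier can be set as in this sketch. The path /usr/bin/python3 is an assumption about where Python 3 lives on the targets.

```yaml
# Sketch only: the interpreter path must exist on the target hosts.
- hosts: all
  vars:
    ansible_python_interpreter: /usr/bin/python3
  tasks:
    - name: Confirm that modules run under the chosen interpreter
      ping:
```

The same variable can also be set per host or per group in the inventory.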

My open issues

There are some minor details that I have not been able to address, even after a lot of time googling around, so I am leaving them here for future reference.

If an Ansible playbook involves multiple plays (i.e. each with its own hosts), there is no way to persist a variable across different plays. A dumb alternative is to make all the needed variables available to every single host (under the all directory).

In a Jinja2 template, if I need to access the group of a target host (as defined in the inventory), and the target belongs to multiple groups, I cannot filter to match the group I need.