Ansible at scale 1 of 2

The Ansible In Depth white paper outlines Ansible’s use cases in four categories:

  • Configuration management
  • Application deployment
  • Orchestration: coordinating a multi-machine process, such as interacting with a load balancer or performing a rolling cluster upgrade
  • As-needed task execution: ad-hoc tasks on a large number of hosts

At work, my original automation scheme involved several Ansible Playbooks that started off simple but have been sprawling ever since. I had to spend some time revamping the Ansible code base, following the best practices from the official documentation. The goals are to:

  1. Reduce the number of Playbooks;
  2. Improve code re-usability (through Ansible roles);
  3. Increase portability across different customer environments;
  4. Improve security;
  5. Re-organize the directory so that more team members can contribute to different parts of it.

The Ansible code base is used by our customer service engineers, some of whom are accustomed to the established way of running certain Playbooks in their daily tasks. Throughout the revamp, this requires me to keep the Playbook commands they rely on consistent.

Inventory and variables

We cannot guarantee that every customer environment is identical, but we can make sure that, for a new environment, the only things that change are the inventory and the variables. This is where we can strike a balance between portability and customization. No changes should be needed to tasks, roles or Playbooks when the Ansible directory is deployed in a different customer environment.

If fewer than 100 servers need to be managed, they can all be listed in a single inventory file, and that file can be set as the default inventory in ansible.cfg so that the -i switch is not required for every Ansible command. If there are more than 100 servers, it is advisable to split them across several YAML inventory files, each shorter than 200 lines. You then have to specify the inventory file with -i each time you run an Ansible command.

The inventory may contain a hierarchy of groups, in order to facilitate command calls to specific groups of servers. For example:

all:
  children:
    prod:
      children:
        prod_app:
          children:
            prod_app_dc1:
              hosts:
                apphost01:
                apphost03:
                  zk_id: 1
                apphost05:
                apphost07:
                  zk_id: 2
                apphost09:
                apphost11:
                  zk_id: 3
                apphost13:
            prod_app_dc2:
              hosts:
                apphost02:
                apphost04:
                  zk_id: 1
                apphost06:
                apphost08:
                  zk_id: 2
                apphost10:
                apphost12:
                  zk_id: 3
                apphost14:
          vars:
            has_app: yes
            has_nginx: yes
            has_db: no
        prod_db:
          children:
            prod_db_dc1:
              hosts:
                dbhost01:
                dbhost03:
                dbhost05:
                dbhost07:
                dbhost09:
                dbhost11:
              vars:
                clustered: yes
                bk_node: dbhost11
            prod_db_dc2:
              hosts:
                dbhost02:
                dbhost04:
                dbhost06:
                dbhost08:
                dbhost10:
                dbhost12:
              vars:
                clustered: yes
                bk_node: dbhost12
          vars:
            has_app: no
            has_nginx: no
            has_db: yes
      vars:
        ansible_become_pass: '{{site_prod_root_pw}}'
    test:
      children:
        test_dc1:
          hosts:
            tapphost01:
        test_dc2:
          hosts:
            tapphost02:
      vars:
        ansible_become_pass: '{{site_test_root_pw}}'
        has_app: yes
        has_nginx: yes
        has_db: yes
  vars:
    ansible_become: yes
    ansible_become_method: su
    ansible_become_user: root

Users of the Playbooks can pass -l with a pattern that matches one or more groups, such as prod_db_dc*. Note that in the above example no password is stored in clear text. Each password references a variable that is defined in a separate file encrypted with ansible-vault.

From Playbooks to roles

As a refresher from the white paper, a single “task” in Ansible is essentially a module call with parameters. A “play” consists of a series of tasks (defined under the “tasks” section) that all run on the specified hosts (defined under the “hosts” section). A Playbook consists of one or several plays. In reality though, a Playbook usually contains only one play. Even that one play can grow to an unmanageable length as complexity increases over time. This is where we need to change our approach to scalability and manageability.
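To make the terms concrete, here is a minimal sketch of a Playbook with two plays. The group names come from the inventory above. The nginx package and service names are assumptions for illustration only.

```yaml
# A Playbook is a list of plays.
- name: First play, runs on the application hosts
  hosts: prod_app
  tasks:
    # Each task is one module call with parameters.
    - name: Ensure nginx is installed   # package name is an assumption
      package:
        name: nginx
        state: present

    - name: Ensure nginx is running     # service name is an assumption
      service:
        name: nginx
        state: started

- name: Second play, runs on the database hosts
  hosts: prod_db
  tasks:
    - name: Check that the hosts are reachable
      ping:
```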

The Ansible community advocates moving logic out of Playbooks and into roles. The concept of an Ansible “role” seemed fairly abstract and confusing to me at the beginning. The word “role” suggests a static server state, whereas our existing Playbooks are full of actions (think of shell scripts). How would one convert a list of actions into static states? After some thought, I came to understand that a role should be thought of as a desired end state. Yes, the end state is static, but that is all we care about. This is essentially the whole idea of Ansible’s desired state configuration: you start from the end state and leave it to the modules to do whatever is needed to reach that state. The concept of a role reflects how Ansible wants you to think about solving an infrastructure problem: stop thinking about what you need to do. Instead, think about what you ultimately want, start from the desired state and work backwards.

In our own setup, the best practice turned out to be: if the Playbook involves a single play with fewer than five tasks, just stick to the Playbook. We don’t get rid of Playbooks just for the sake of it. Once a Playbook has grown beyond five tasks, we need to think about our desired state, and either implement a new role or incorporate the tasks into an existing role. This is when we have to move from Playbook-oriented thinking to role-oriented thinking. Each role directory can include a tasks sub-directory with a main.yml that references the rest of the task files. Each role can define its own role-related variables. If two roles have a lot in common, we can even have a common role, with or without its own main.yml.
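As a rough sketch of how a role's main.yml can reference the rest of the task files, and how a Playbook then shrinks to a role call: the file names are borrowed from the app_conf role in the directory listing in this post, but the ordering and the Playbook content are assumptions, not our actual files.

```yaml
# roles/app_conf/tasks/main.yml (illustrative ordering only)
- import_tasks: stop_app.yml
- import_tasks: bk_app_conf.yml
- import_tasks: push_app_conf.yml
- import_tasks: start_app.yml
---
# deploy-app.yml (assumed content): the play only names the role
- name: Deploy application configuration
  hosts: prod_app
  roles:
    - app_conf
```

With this layout the Playbook stays a few lines long, and growth happens inside the role, one task file at a time.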

A simplified version of our Ansible directory structure looks like this:

├── deploy-app.yml
├── deploy-db.yml
├── inventories
│   ├── group_vars
│   │   ├── all
│   │   │   ├── all.yml
│   │   │   └── vault_all.yml
│   │   ├── prod_dc1_db.yml
│   │   ├── prod_dc2_db.yml
│   │   ├── test_dc1_db.yml
│   │   └── test_dc2_db.yml
│   ├── host_vars
│   └── site_inventory.yml
├── roles
│   ├── common
│   │   ├── files
│   │   └── tasks
│   │       ├── log.yml
│   │       ├── skip_self.yml
│   │       └── validate_path.yml
│   ├── db_conf
│   │   ├── defaults
│   │   │   └── main.yml
│   │   ├── files
│   │   ├── handlers
│   │   │   └── main.yml
│   │   ├── meta
│   │   │   └── main.yml
│   │   ├── README.md
│   │   ├── tasks
│   │   │   ├── main.yml
│   │   │   ├── start_db.yml
│   │   │   ├── stop_db.yml
│   │   │   └── update_cluster_var.yml
│   │   ├── templates
│   │   │   ├── myid.j2
│   │   │   ├── db_properties.j2
│   │   │   └── zookeeper_properties.j2
│   │   └── vars
│   │       └── main.yml
│   └── app_conf
│       ├── defaults
│       │   └── main.yml
│       ├── files
│       ├── handlers
│       │   └── main.yml
│       ├── meta
│       │   └── main.yml
│       ├── README.md
│       ├── tasks
│       │   ├── bk_app_conf.yml
│       │   ├── empty_app_conf.yml
│       │   ├── main.yml
│       │   ├── push_app_conf.yml
│       │   ├── start_app.yml
│       │   ├── stop_app.yml
│       │   ├── tar_app_conf.yml
│       │   ├── untar_app_conf.yml
│       │   └── update_cluster_var.yml
│       ├── templates
│       │   └── dbref_xml.j2
│       └── vars
│           └── main.yml
├── service-app.yml
└── service-db.yml

Variables specific to a group of hosts or to individual hosts go into separate yml files under group_vars and host_vars. When the entire directory is moved to a different customer environment, our engineers only need to update the inventory and variable files. The tasks, roles and Playbooks should build their logic on those variables.
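One way a Playbook could build its logic on inventory variables, sketched with the has_app and has_db flags from the inventory example. The Playbook name site.yml is hypothetical.

```yaml
# site.yml (hypothetical Playbook name)
- name: Apply roles according to inventory flags
  hosts: all
  roles:
    # The condition is evaluated per host, using the group variables
    # set in the inventory, so nothing here is environment-specific.
    - role: app_conf
      when: has_app | bool
    - role: db_conf
      when: has_db | bool
```

Moving to a new customer environment then means changing the flags in the inventory, not the Playbook.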

Vault

Our previous Playbooks stored the sudo password base64-encoded and used no_log to avoid displaying the value. Base64 is an encoding, not encryption, so we have now moved these passwords into a variable yml file encrypted with ansible-vault. We reference the value to encrypt as a regular variable:

ansible_become_pass: '{{passtoencrypt}}'
ansible_become_method: sudo
ansible_become: yes

Then we run the following:

ansible-vault create vault_all.yml

This prompts for a key. Once you type in the key, it opens a text editor where we can store the real password. For example:

passtoencrypt: MyP@ssw0rd4real!

Save the file from the text editor. The file is now stored encrypted and can only be opened with the correct key (a.k.a. the vault password). If we run the Playbook with the --ask-vault-pass switch, it prompts for the key. Variables from a vault file can also be loaded explicitly with include_vars. If we want to skip the prompt altogether, we can store the key in a file and reference that file from vault_password_file in ansible.cfg.
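A minimal sketch of loading a vault file with include_vars. The path secrets/vault_site.yml is hypothetical. The variable name passtoencrypt is the one used in the example above.

```yaml
- name: Use a variable from an encrypted file
  hosts: prod_db
  tasks:
    - name: Load variables from the vault file
      include_vars:
        file: secrets/vault_site.yml   # hypothetical path, relative to the Playbook
      no_log: true                     # keep the decrypted values out of the output

    - name: Confirm the variable is available without printing it
      assert:
        that:
          - passtoencrypt is defined
```

The vault password is still required at run time, either from --ask-vault-pass or from the file named in vault_password_file.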

Ansible Vault has more commands, for example to edit or view the encrypted variables. They are described in the documentation.

Optimize connection

OpenSSH supports multiplexing, where multiple SSH sessions share a single TCP connection, and version 5.6 and later adds the ControlPersist option that keeps the shared connection open in the background. With this turned on, subsequent SSH connections save the time of the TCP handshake. It is configured in the Ansible configuration file under ssh_connection. Below is an example with ControlPersist=1h, which closes the shared connection once it has been idle for one hour.

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=1h

The other option we can leverage is pipelining. Ansible takes three steps to execute a task:

  1. Build a Python script based on the module used
  2. Copy the Python script to the remote host
  3. Execute the Python script on the remote host

If pipelining is turned on, the Python script is passed in through the SSH session itself instead of being copied first. This saves a round trip and improves performance. Pipelining can be configured under ssh_connection in the Ansible configuration file:

[ssh_connection]
pipelining = True

In the example below, pipelining cuts the number of connections from seven to three:

# with pipelining
[ghunch@control-host ~]$ ansible remote-host -vvvv -m ping | grep EST
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch


# without pipelining
[ghunch@control-host ~]$ ansible remote-host -vvvv -m ping | grep EST
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch
<remote-host> ESTABLISH SSH CONNECTION FOR USER: ghunch

Note that if tasks use sudo, pipelining requires requiretty to be disabled in /etc/sudoers on the remote host.
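Disabling requiretty can itself be done with Ansible. The play below is a sketch under a few assumptions: the setting lives in /etc/sudoers itself rather than in a drop-in file, visudo is on the PATH of the remote host, and pipelining is switched off for this one play because it cannot work until the change is in place.

```yaml
- name: Prepare hosts for pipelining
  hosts: all
  vars:
    ansible_ssh_pipelining: false   # override the ansible.cfg setting for this play only
  tasks:
    - name: Disable requiretty for sudo
      lineinfile:
        path: /etc/sudoers
        # Matches both "Defaults requiretty" and an existing "Defaults !requiretty"
        regexp: '^Defaults\s+!?requiretty'
        line: 'Defaults    !requiretty'
        # Let visudo check the edited file before it replaces the original
        validate: 'visudo -cf %s'
```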

Custom Module

It’s fairly straightforward to build a custom module in Ansible. Just place the module file (modulename.py) in a library directory next to the Playbook (or inside a role) and use it as you would a regular Ansible module. The module is typically written in Python and must return its result as JSON. Before creating a custom module, you should look for existing modules to avoid re-inventing the wheel. You may also need to determine whether you simply need to run a Python script on the target host (with Ansible’s script module), or whether you really need an Ansible module. The former is procedural, and the latter focuses on desired state. Custom modules are mostly used in proprietary development.
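The difference between the two approaches shows up in how the tasks read. In this sketch the script path, the module name app_license and its parameters are all hypothetical. The custom module is assumed to sit in a library directory next to the Playbook.

```yaml
- name: Script versus custom module
  hosts: prod_app
  tasks:
    # Procedural: the local script is copied to the host and executed on every run.
    - name: Run a one-off Python script
      script: files/cleanup_tmp.py        # hypothetical script
      args:
        executable: python3               # assumes python3 exists on the target

    # Desired state: the module decides whether anything needs to change.
    - name: Ensure the application license is registered
      app_license:                        # hypothetical module, library/app_license.py
        key_file: /opt/app/license.key    # hypothetical parameter
        state: present
```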