Coordination between resources in AWS CloudFormation

Update 2023: the practice outlined in this post is outdated. This post is kept for archival purposes only.

One of the reasons I prefer CloudFormation over Terraform is access to helper scripts. Many legacy applications are not designed to be stateless, and their installation depends on host information from other layers in the stack. This requires communication among instances during stack creation. The CloudFormation helper scripts (cfn-init, cfn-signal, cfn-hup and cfn-get-metadata) play a key role in bringing applications that are not cloud-optimized to life in an automated cloud environment.

Here we provide an example of a solution stack with three layers: an application layer, a database layer and a search engine layer. In each layer, configuring a newly created node requires the private IP addresses of all nodes in all layers. This is common in many enterprise applications that have no service discovery mechanism or load balancing across layers.

A self-healing mechanism is in place, but load-triggered auto scaling is not in the picture. The key is to hold off installation until all EC2 instances are provisioned with private IPs. A good way to detect this is to check the auto scaling group status through the AWS CLI and make sure a sufficient number of instances report InService. The diagram below shows a simplified scenario with auto scaling group A and auto scaling group B. Instances in each group report status to their respective group through an internal mechanism. In the meantime, instances in each group also query the status of both groups through a coordinator script that issues AWS CLI commands.
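The wait on InService counts can be sketched roughly as below. This is only an illustration under assumptions: it is written in Python against a boto3-style Auto Scaling client (passed in by the caller) rather than the AWS CLI calls described in this post, and the function names, group names, timeout and polling interval are all hypothetical.

```python
import time
from typing import Dict


def in_service_count(asg_client, group_name: str) -> int:
    """Count the InService instances of one auto scaling group."""
    resp = asg_client.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    groups = resp.get("AutoScalingGroups", [])
    if not groups:
        return 0
    return sum(
        1
        for inst in groups[0].get("Instances", [])
        if inst.get("LifecycleState") == "InService"
    )


def wait_for_groups(asg_client, required: Dict[str, int], timeout_s: float = 1800,
                    interval_s: float = 15, sleep=time.sleep,
                    clock=time.monotonic) -> bool:
    """Poll until every group has at least its required InService count.

    Returns True when all groups are ready, False if timeout_s elapses first.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if all(in_service_count(asg_client, name) >= count
               for name, count in required.items()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

On an instance this might be called as wait_for_groups(boto3.client("autoscaling"), {"group-a": 2, "group-b": 2}), where the group names are placeholders and the instance role needs permission to describe auto scaling groups.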

CloudFormation has some useful helpers:

cfn-init: you may call cfn-init from user data. It executes a config set defined in the metadata of the same resource (typically an EC2 instance or a launch configuration). Think of a config set in metadata as an Ansible playbook: cfn-init gives you a chance to run that playbook when the instance starts.
cfn-signal: cfn-signal provides a mechanism for an instance to notify an outside resource (an auto scaling group in this case) of its own success or failure. It can be called from user data, but only after cfn-init and other initialization activities.
cfn-hup: cfn-hup provides a mechanism for an instance to be notified of changes to its resource metadata in the stack and to trigger actions in response. You typically configure the cfn-hup trigger through cfn-init in metadata, which is in turn called from user data.
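The ordering described above (cfn-init first, cfn-signal last, carrying the exit status) can be sketched as follows. This is a minimal sketch in Python rather than the shell user data usually used for this; the /opt/aws/bin location assumes an Amazon Linux style install, and the stack, resource and region values are hypothetical placeholders.

```python
import subprocess
from typing import Callable, List, Sequence

# Assumption: helper scripts live here, as on Amazon Linux.
CFN_BIN = "/opt/aws/bin"


def init_command(stack: str, resource: str, region: str) -> List[str]:
    """Command line that runs the config set in the resource's metadata."""
    return [f"{CFN_BIN}/cfn-init", "-v",
            "--stack", stack, "--resource", resource, "--region", region]


def signal_command(exit_code: int, stack: str, resource: str, region: str) -> List[str]:
    """Command line that reports success (0) or failure (non-zero) to the stack."""
    return [f"{CFN_BIN}/cfn-signal", "-e", str(exit_code),
            "--stack", stack, "--resource", resource, "--region", region]


def bootstrap(stack: str, resource: str, region: str,
              run: Callable[[Sequence[str]], int] = subprocess.call) -> int:
    """Run cfn-init, then always signal its exit code to CloudFormation."""
    exit_code = run(init_command(stack, resource, region))
    run(signal_command(exit_code, stack, resource, region))
    return exit_code
```

Signalling even on failure lets the stack fail fast instead of waiting for the creation policy timeout.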

The role of each node is predetermined by the auto scaling group that creates it, and can be recorded in a file on the instance. The coordinator script mentioned above performs the following tasks:

It collects the local IP from instance metadata. The instance learns its own role through user data.
It pulls the installer from S3 based on the role of the server. Avoid using wget because it requires public access to the installer. Instead, use AWS::CloudFormation::Authentication in metadata to set up protected access.
It waits for instance creation in each stack by querying each stack periodically.
Once all stacks have the required number of instances in service, it collects the private IP addresses of all nodes in each stack and stores them.
It launches the installer on each server and configures the application using the IP information of all stacks.
It flags installation status in a file.
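The IP collection and status flagging steps above might look roughly like this. It is a sketch under assumptions, not the script from the template: it uses boto3-style Auto Scaling and EC2 clients passed in by the caller, ignores pagination, and the role names, file paths and the SUCCESS/FAILURE markers are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def group_instance_ids(asg_client, group_name: str) -> List[str]:
    """Return the IDs of the InService instances in one auto scaling group."""
    resp = asg_client.describe_auto_scaling_groups(AutoScalingGroupNames=[group_name])
    groups = resp.get("AutoScalingGroups", [])
    if not groups:
        return []
    return [inst["InstanceId"] for inst in groups[0].get("Instances", [])
            if inst.get("LifecycleState") == "InService"]


def private_ips(ec2_client, instance_ids: List[str]) -> List[str]:
    """Look up the private IP of each instance; sorted for stable output."""
    if not instance_ids:
        return []
    resp = ec2_client.describe_instances(InstanceIds=instance_ids)
    return sorted(
        inst["PrivateIpAddress"]
        for reservation in resp.get("Reservations", [])
        for inst in reservation.get("Instances", [])
        if "PrivateIpAddress" in inst
    )


def collect_layer_ips(asg_client, ec2_client, layers: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each role (e.g. 'app') to the private IPs of its auto scaling group."""
    return {role: private_ips(ec2_client, group_instance_ids(asg_client, group))
            for role, group in layers.items()}


def write_inventory(ips_by_role: Dict[str, List[str]], path: str) -> None:
    """Store the collected IPs where the installer can read them."""
    Path(path).write_text(json.dumps(ips_by_role, indent=2, sort_keys=True))


def flag_status(path: str, ok: bool) -> None:
    """Record the installation outcome in a marker file."""
    Path(path).write_text("SUCCESS\n" if ok else "FAILURE\n")
```

The installer for each role would then read the inventory file to fill in the addresses of the other layers.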

This script is executed at the end of AWS::CloudFormation::Init but before cfn-signal. Because the script has to wait for instance availability and coordinate the actual installation, it may run for a long time, so it is important to configure the creation policy of the auto scaling group with a sufficient timeout. Otherwise, the time the coordinator script takes before the signal is sent may cause the auto scaling group creation to fail on timeout.
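One way to reason about the timeout is to derive the coordinator's waiting budget from the creation policy value. The sketch below assumes the timeout is written as a simple ISO 8601 duration such as PT45M (hours, minutes and seconds only). The function names, the installation time estimate and the safety margin are hypothetical.

```python
import re

# Accepts simple durations such as PT30M, PT1H30M or PT90S.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def creation_timeout_seconds(duration: str) -> int:
    """Convert a PT#H#M#S duration into seconds."""
    match = _DURATION.match(duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def coordinator_budget(duration: str, install_s: int, margin_s: int = 120) -> int:
    """Seconds the coordinator may spend waiting for other instances.

    The wait, the installation and a safety margin must all fit inside the
    creation policy timeout, otherwise the signal arrives too late.
    """
    budget = creation_timeout_seconds(duration) - install_s - margin_s
    if budget <= 0:
        raise ValueError("creation policy timeout is too short for the installation")
    return budget
```

For example, with a PT45M timeout and an installation estimated at 15 minutes, coordinator_budget("PT45M", install_s=900) leaves 1680 seconds for waiting on the other groups.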

The cfn-hup service is configured in /etc/cfn/hooks.d/cfn-auto-reloader.conf with triggers=post.update and an action that executes cfn-init again. This ensures that on stack update, the coordinator script is launched again with the updated stack information.
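For illustration, the content of that hook file can be rendered as below. In a template this file would normally be written by the files section of AWS::CloudFormation::Init; the Python function here is only a sketch to show the layout, and the stack, resource and region values as well as the cfn-init path are hypothetical.

```python
def render_reloader_hook(stack: str, resource: str, region: str,
                         cfn_init: str = "/opt/aws/bin/cfn-init") -> str:
    """Render a cfn-hup hook that re-runs cfn-init after a stack update."""
    lines = [
        "[cfn-auto-reloader-hook]",
        # Fire after the watched metadata has been updated.
        "triggers=post.update",
        # Metadata path that cfn-hup watches for changes.
        f"path=Resources.{resource}.Metadata.AWS::CloudFormation::Init",
        # Re-run cfn-init, which in turn launches the coordinator script again.
        f"action={cfn_init} -v --stack {stack} --resource {resource} --region {region}",
        "runas=root",
    ]
    return "\n".join(lines) + "\n"
```

Calling render_reloader_hook("demo-stack", "AppLaunchConfig", "us-east-1") returns text suitable for /etc/cfn/hooks.d/cfn-auto-reloader.conf.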

The implementation template is available on my GitHub, as the coordination-example project. The template creates an application layer in a public subnet, and a database layer and a search engine layer both in private subnets. The private subnets connect to the Internet through a NAT instance, which serves as a bastion host as well. There is still some work left (multi-AZ, load balancer, smart addressing, etc.), but this should be a sufficient jump start to move many legacy environments to the cloud.

The resource types in CloudFormation might come off as overwhelming. Here is an incomplete diagram of their relationships:

Happy Cloud.
