AWS Systems Manager is an Omnipotent Hodgepodge

Introduction to Systems Manager

AWS Systems Manager addresses many SysOps requirements around configuration management, including server automation. AWS has another service in this domain, OpsWorks. However, with OpsWorks Stacks, OpsWorks for Chef Automate and OpsWorks for Puppet Enterprise all reaching end of life in 2024, the OpsWorks service is effectively retired. Built by partnering with leaders such as Chef and Puppet, the OpsWorks services represent the era when AWS needed to mirror on-premises configuration management capabilities in order to convince customers to migrate to the cloud. Today, AWS Systems Manager has evolved to fill many of the gaps around configuration management for servers in the cloud.

Although AWS Systems Manager sounds like a single service, it is a collection of many seemingly disparate capabilities that serve similar requirements around configuration management. In fact, many Systems Manager capabilities are built on top of a few of what I call core capabilities: Session Manager, Run Command and Automation. This post reviews these core capabilities and how Systems Manager builds its other capabilities on them.

SSM Agent and Session Manager

What enables all the other capabilities is the SSM agent installed on the EC2 instances. On Linux the agent runs as a systemd service (amazon-ssm-agent) with root privileges; the ssm-user account is the OS user that Session Manager sessions log in as by default. Many AMIs come with the agent pre-installed. It stores its logs in /var/log/amazon/ssm/. To communicate with the AWS Systems Manager backend (ssm.<region>.amazonaws.com), the instance needs an instance profile whose role carries the AmazonSSMManagedInstanceCore managed policy. It also needs a network path to the backend endpoint, either over the Internet or through an interface endpoint.
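
The two prerequisites above (an instance role and a network path) can be sketched in a few lines of Python. This is only an illustration: the function names are mine, the ssmmessages and ec2messages hostnames are additions taken from my reading of the AWS documentation rather than from this post, and the hostnames assume the standard amazonaws.com partition.

```python
import json

# AWS managed policy named in the text, in the standard ARN form
# used for AWS managed policies.
SSM_CORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def ec2_trust_policy():
    """Trust policy that lets EC2 assume the instance role (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def ssm_endpoints(region):
    """Hostnames the agent needs a network path to.

    Assumption: besides "ssm", the agent also talks to "ssmmessages"
    and "ec2messages" in the same region.
    """
    services = ("ssm", "ssmmessages", "ec2messages")
    return [f"{svc}.{region}.amazonaws.com" for svc in services]


if __name__ == "__main__":
    print(json.dumps(ec2_trust_policy(), indent=2))
    print(SSM_CORE_POLICY_ARN)
    print(ssm_endpoints("us-east-1"))
```

If an instance does not show up as managed, checking these two things first (role attached, endpoints reachable) is usually quicker than reading the agent logs.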

This communication also allows an IAM user to connect to an instance’s shell through Session Manager. A common use case is private instances that do not have Internet access but do have access to the SSM backend endpoint. In a previous post I discussed using Session Manager to replace a bastion host to connect to EKS nodes.

When launching an instance from an AMI with the SSM agent pre-installed, the agent should launch only after all the config sets from CloudFormation Init have finished. As a result, the CloudFormation Init script cannot communicate with the SSM backend via the agent, unless you install and start the agent yourself in CloudFormation Init. To troubleshoot SSM, it is important to review the agent logs.

Through Systems Manager Hybrid Activation, the SSM agent can also run on virtual machines outside of AWS and report back to the SSM backend. This gives on-prem servers the identities (instance tags, instance profiles) required for Systems Manager to manage them as if they were EC2 instances. As a result, you can extend Systems Manager capabilities to your on-prem fleet (some capabilities require the advanced-instances tier).
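
As a rough sketch of what a hybrid activation request looks like, the snippet below builds the arguments for the CreateActivation API. The helper names, default values and tag are illustrative assumptions; nothing is sent to AWS here.

```python
def build_activation_request(iam_role, registration_limit=10,
                             default_name="onprem-server", tags=None):
    """Build keyword arguments for the SSM CreateActivation API.

    iam_role is the name of an existing IAM service role that the
    on-prem servers will use (hypothetical value in the example below).
    """
    if registration_limit < 1:
        raise ValueError("registration_limit must be at least 1")
    request = {
        "IamRole": iam_role,
        "RegistrationLimit": registration_limit,
        "DefaultInstanceName": default_name,
        "Description": "Hybrid activation for on-prem servers",
    }
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in sorted(tags.items())]
    return request


def is_hybrid_node(node_id):
    """Hybrid-activated nodes get IDs prefixed with "mi-" rather than "i-"."""
    return node_id.startswith("mi-")


if __name__ == "__main__":
    # With boto3 installed and credentials configured, this dict could be
    # passed as: boto3.client("ssm").create_activation(**request)
    request = build_activation_request("SSMServiceRole", tags={"site": "dc1"})
    print(request)
```

The API returns an activation ID and code, which you then feed to the agent on each on-prem server when registering it.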

Types of SSM Documents

There are several types of documents that SSM uses, including:

  • Command Document
  • Automation Document
  • Package Document
  • Session Document
  • Policy Document
  • Change Calendar Document

The AWS documentation has a table on what each type is for. Here I’ll focus on three of them: the Command Document, the Automation Document and the Session Document.

The Command Document is for the Run Command capability. It executes on EC2 instances, usually performing tasks related to the operating system or application. I think of a Command Document as an Ansible playbook that consists of Ansible tasks. We can author a Command Document that runs configuration steps using plugins such as aws:downloadContent and aws:runShellScript. This feature competes directly with Ansible. To troubleshoot why a command fails on an instance, check the file ssm-document-worker.log in the SSM agent log directory. Each log entry should carry a command ID for reference.
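
To make the playbook comparison concrete, here is a minimal sketch of a Command Document expressed as a Python dictionary and serialized to JSON. The package-install step, the step name and the parameter name are made up for illustration; the overall shape (schemaVersion 2.2, parameters, mainSteps with a plugin action) follows the published document schema.

```python
import json


def build_command_document(package):
    """Return a schema 2.2 Command Document that installs one package.

    The yum command and the "packageName" parameter are illustrative
    assumptions, not something taken from an AWS-provided document.
    """
    return {
        "schemaVersion": "2.2",
        "description": "Install a package with yum (illustrative only)",
        "parameters": {
            "packageName": {"type": "String", "default": package}
        },
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "installPackage",
                # Only run this step on Linux targets.
                "precondition": {"StringEquals": ["platformType", "Linux"]},
                "inputs": {
                    "runCommand": ["yum install -y {{ packageName }}"]
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_command_document("htop"), indent=2))
```

Each entry in mainSteps plays the role of an Ansible task: a named step, a plugin (the "module"), and its inputs.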

The Automation Document (aka runbook) is for the Automation capability. It defines a sequence of actions to automate. There are many pre-defined actions, such as making AWS API calls (aws:executeAwsApi), running commands (aws:runCommand), or invoking a Lambda function (aws:invokeLambdaFunction). Because these actions call AWS APIs, a runbook requires an IAM role (the Automation role). The schema of the action sequence (YAML or JSON) looks very similar to an Ansible playbook. The web console comes with a UI to visualize the action sequence but most of the time I’d rather .

The Session Document is for the Session Manager capability. Session Manager uses Session documents to determine which type of session to start, such as a standard session, a port forwarding session, or a session to run an interactive command. In most cases, automation developers do not need to create their own Session document, because the pre-built ones are sufficient:
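
A small runbook makes the action sequence easier to picture. The sketch below builds one as a Python dictionary with two steps: an API call and a Run Command invocation. The step names, the output name and the uptime command are illustrative choices of mine; the action names and input keys follow the Automation schema as I understand it.

```python
import json


def build_runbook():
    """Return a schema 0.3 Automation runbook with two illustrative steps."""
    return {
        "schemaVersion": "0.3",
        "description": "Describe an instance, then run a command on it",
        # The Automation role is passed in as a parameter.
        "assumeRole": "{{ AutomationAssumeRole }}",
        "parameters": {
            "InstanceId": {"type": "String"},
            "AutomationAssumeRole": {"type": "String", "default": ""},
        },
        "mainSteps": [
            {
                "name": "describeInstance",
                "action": "aws:executeAwsApi",
                "inputs": {
                    "Service": "ec2",
                    "Api": "DescribeInstances",
                    "InstanceIds": ["{{ InstanceId }}"],
                },
                "outputs": [
                    {
                        "Name": "State",
                        "Selector": "$.Reservations[0].Instances[0].State.Name",
                        "Type": "String",
                    }
                ],
            },
            {
                "name": "runUptime",
                "action": "aws:runCommand",
                "inputs": {
                    "DocumentName": "AWS-RunShellScript",
                    "InstanceIds": ["{{ InstanceId }}"],
                    "Parameters": {"commands": ["uptime"]},
                },
            },
        ],
    }


def step_names(runbook):
    """List the step names in execution order."""
    return [step["name"] for step in runbook["mainSteps"]]


if __name__ == "__main__":
    print(json.dumps(build_runbook(), indent=2))
```

Note how the second step is itself a Run Command invocation: the runbook orchestrates from the AWS side and delegates on-instance work to a Command Document.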

  • AWS-PasswordReset
  • AWS-StartInteractiveCommand
  • AWS-StartPortForwardingSession
  • AWS-StartPortForwardingSessionToSocket
  • AWS-StartSSHSession

In my experience, I use the AWS-StartSSHSession and AWS-StartPortForwardingSession documents most often: the former to establish an SSH connection, the latter to forward a port to the connecting host for a Remote Desktop session.

To author your own document, reference the schema correctly and use the latest SSM agent. Before doing so, however, I would check whether an existing shared document in the library already covers what you need. For example, the command document AWS-JoinDirectoryServiceDomain helps join a Windows server to a managed Active Directory domain. The command document AWS-RunPatchBaseline is used by the Patch Manager capability to check for and apply operating system patches; it includes steps for Windows, macOS and Linux instances. The automation runbook AWS-AttachIAMToInstance helps you add an IAM role to an EC2 instance.
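
For the Remote Desktop case, the port forwarding session is started from the AWS CLI. The sketch below only assembles the command line; the instance ID and local port are placeholders, and actually running the command requires the AWS CLI plus the Session Manager plugin on the connecting host.

```python
import json
import shlex


def port_forward_command(instance_id, remote_port=3389, local_port=13389):
    """Build the argv for an AWS-StartPortForwardingSession session.

    Defaults assume RDP (3389) on the instance and an arbitrary free
    local port; both are illustrative.
    """
    params = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(params),
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command; the instance ID is a placeholder.
    print(shlex.join(port_forward_command("i-0123456789abcdef0")))
```

Once the session is up, the Remote Desktop client connects to localhost on the local port.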

Run Command and Automation

The Run Command capability runs on top of the SSM agent. You specify one or more target instances, along with other options such as command parameters, rate control and where the output goes. This capability allows an IAM user to run commands directly on the OS of an instance (with the agent’s privileges, which is root on Linux) and centrally keep track of those command runs on the AWS side. The most common commands to run on the OS are packaged into Command Documents. There is even a Command Document that allows you to run a pre-built Ansible playbook.
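
The options mentioned above (targets, parameters, rate control, output location) map onto the arguments of the SendCommand API. A minimal sketch, assuming tag-based targeting and illustrative rate-control values:

```python
def build_send_command_request(tag_key, tag_value, commands, bucket=None):
    """Build keyword arguments for the SSM SendCommand API.

    The rate-control values (10% concurrency, stop after 1 error) and
    the timeout are illustrative defaults, not recommendations.
    """
    request = {
        "DocumentName": "AWS-RunShellScript",
        # Target every managed instance carrying this tag.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"commands": list(commands)},
        "MaxConcurrency": "10%",
        "MaxErrors": "1",
        "TimeoutSeconds": 600,
    }
    if bucket:
        # Optional: ship command output to S3.
        request["OutputS3BucketName"] = bucket
    return request


if __name__ == "__main__":
    # With boto3 installed and credentials configured, this dict could be
    # passed as: boto3.client("ssm").send_command(**request)
    print(build_send_command_request("Env", "dev", ["uptime"], bucket="my-ssm-output"))
```

The bucket name and tag are hypothetical. The response to SendCommand carries a command ID, which is the same ID you look for in the agent logs when troubleshooting.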

Another way this capability is extremely helpful is that it reduces the load on the cloud-init process. Traditionally, we put a lot of logic in the user data script for cloud-init to execute. The cloud-init mechanism comes from the Linux world, and the execution of the user data script is not very transparent to troubleshoot: you have to check the cloud-init-output log on the OS. The user data script should be reserved for establishing communication with the CloudFormation endpoint and the SSM endpoint. From there, other automation tasks should be done using SSM capabilities (e.g. State Manager) for better manageability.

Take the example of joining a newly provisioned Windows server to a domain. If we do this in the user data script, we have a few problems. First, we can only tell success from failure by reading logs on the OS. Second, if an OS user inadvertently removes the instance from the domain, there is no mechanism to capture that. If we use the Run Command capability along with a State Manager association, the AWS management console can tell whether the domain join succeeded, and the association can detect when the instance is removed from the domain, report this finding as out of compliance, and remediate the issue. We’ll discuss State Manager in more detail in the next section.

As part of automation, we often have to invoke AWS API calls, which happen outside of any target VM. The Automation capability of Systems Manager is for this scenario. You can orchestrate your API calls using Automation runbooks. These automation steps do not execute on any target EC2 instance, so they do not rely on the SSM agent. However, a runbook needs its own IAM role to perform API tasks. This capability saves you from having to make API calls from a separate shell environment running the AWS CLI, or from your own Lambda function using the boto3 SDK.

When we combine the Automation and Run Command capabilities, we can perform most automation orchestration steps. They are the core capabilities that enable a variety of other Systems Manager capabilities.

Maintenance Window and State Manager

Maintenance Window is a very straightforward capability for scheduling activity with a cron or rate expression. You specify targets by instance tags, define one or more tasks, and define the window’s duration and a cutoff, which is the point before the end of the window at which no new tasks are started. Each task can be one of four types: a Run Command command, an Automation runbook, a Lambda function, or a Step Functions state machine.

State Manager is a similar capability with a lot of feature overlap with Maintenance Window. State Manager operates on the concept of associations. An association connects target instances to a command document or automation runbook to execute. As with Maintenance Window, you can specify a schedule expression, document parameters and instance tags. State Manager was brought in to combat configuration drift. The associated document should consist of idempotent scripts, so that a State Manager association can execute it repeatedly to ensure compliance.

Maintenance Window is more about scheduling one or more tasks. On the State Manager side, however, an association failure is reported as out of compliance by default. This is useful in scenarios such as keeping a Windows instance in the domain, or keeping the SSM agent up to date. You can choose either capability for many common setups, but they have subtle differences. For example, for patch management, you can use State Manager to detect missing patches and report compliance, and Maintenance Window to actually apply the missing patches. In fact, there is a documentation page on choosing between State Manager and Maintenance Windows that distinguishes their best use cases.
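
The relationship between schedule, duration and cutoff is easier to see in the request itself. The sketch below builds the arguments for the CreateMaintenanceWindow API and checks the duration and cutoff values before returning them. The limits in the checks reflect my understanding of the API constraints; the window name and schedule are placeholders.

```python
def build_maintenance_window(name, schedule, duration_hours, cutoff_hours):
    """Build keyword arguments for the SSM CreateMaintenanceWindow API.

    duration_hours: how long the window stays open.
    cutoff_hours: how long before the end of the window new tasks
    stop being scheduled.
    """
    if not 1 <= duration_hours <= 24:
        raise ValueError("duration must be between 1 and 24 hours")
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be non-negative and shorter than the duration")
    return {
        "Name": name,
        "Schedule": schedule,
        "Duration": duration_hours,
        "Cutoff": cutoff_hours,
        # Only registered targets may run tasks in this window.
        "AllowUnassociatedTargets": False,
    }


if __name__ == "__main__":
    # Every Sunday at 02:00 UTC, open for 4 hours, no new tasks in the last hour.
    # With boto3: boto3.client("ssm").create_maintenance_window(**window)
    window = build_maintenance_window("weekly-patching", "cron(0 2 ? * SUN *)", 4, 1)
    print(window)
```

Targets and tasks are registered against the window afterwards, in separate API calls.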
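
Tying this back to the domain-join example, an association that keeps Windows instances in a domain could be sketched as below. The function builds the arguments for the CreateAssociation API; the association name, schedule, severity and directory values are illustrative assumptions.

```python
def build_domain_join_association(tag_key, tag_value, directory_id,
                                  directory_name, dns_ips):
    """Build keyword arguments for the SSM CreateAssociation API.

    Re-applies AWS-JoinDirectoryServiceDomain on a schedule so that an
    instance that drops out of the domain is reported and re-joined.
    """
    return {
        "Name": "AWS-JoinDirectoryServiceDomain",
        "AssociationName": "keep-windows-in-domain",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "ScheduleExpression": "rate(30 minutes)",
        "Parameters": {
            "directoryId": [directory_id],
            "directoryName": [directory_name],
            "dnsIpAddresses": list(dns_ips),
        },
        # Severity shown in the Compliance view when the association fails.
        "ComplianceSeverity": "HIGH",
    }


if __name__ == "__main__":
    # All values below are placeholders.
    # With boto3: boto3.client("ssm").create_association(**association)
    association = build_domain_join_association(
        "Domain", "corp", "d-1234567890", "corp.example.com",
        ["10.0.0.10", "10.0.1.10"],
    )
    print(association)
```

Because the association re-runs the same document on every schedule tick, the document has to be safe to run against an instance that is already in the domain, which is the idempotency requirement mentioned above.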

Fleet Manager and Inventory

Fleet Manager presents a centralized view of all instances, from which users can perform common administration tasks such as exploring file systems and logs, administering users and groups, managing the registry and events on Windows instances, and checking processes and performance metrics. It also gives shortcuts to patch nodes, run commands, start sessions, etc. I think of Fleet Manager as a minimalist configuration management UI. It is not as sophisticated as those of Ansible Tower or Puppet, but it comes at no additional cost.

A very useful feature of Fleet Manager is the web-based remote desktop for connecting to Windows instances. This saves the need for a bastion host as long as the instances have an SSM connection. One login option is to supply the RSA private key to decrypt the Administrator password, which I would not recommend. If the Windows server is on a domain, you can enter your domain credentials via Fleet Manager. If users log in via IAM Identity Center, Fleet Manager also offers them single sign-on using their IAM Identity Center identity. When a user logs in this way, Fleet Manager uses the Run Command capability to execute the AWSSSO-CreateSSOUser document against the server to create a local admin user.

Another aspect of configuration management is inventory management. Unlike in Ansible, the term inventory in the context of Systems Manager refers to the metadata of instances, which includes installed applications, AWS components, network configurations, instance details, services, Windows registry and roles, etc. The full list of what is part of the metadata is in the documentation, and you can even define your own inventory items. To gather inventory data, we can make use of a State Manager association that executes the AWS-GatherSoftwareInventory document. Once we set up the association, the agents report inventory data back to Systems Manager. More importantly, we can create Resource Data Sync objects to write inventory data (along with compliance data) to S3 buckets for downstream applications to consume. A common use case is to run Athena queries against those buckets and produce QuickSight dashboards.
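
The two pieces involved here, the inventory association and the Resource Data Sync, can be sketched as request payloads for the CreateAssociation and CreateResourceDataSync APIs. The names, schedule, bucket and prefix are placeholders I chose for the example.

```python
def build_inventory_association():
    """Association that gathers inventory from all managed instances."""
    return {
        "Name": "AWS-GatherSoftwareInventory",
        "AssociationName": "gather-inventory",
        # "*" targets every managed instance in the account and region.
        "Targets": [{"Key": "InstanceIds", "Values": ["*"]}],
        "ScheduleExpression": "rate(12 hours)",
    }


def build_resource_data_sync(sync_name, bucket, region, prefix=None):
    """Resource Data Sync that writes inventory data to an S3 bucket."""
    destination = {
        "BucketName": bucket,
        "SyncFormat": "JsonSerDe",
        "Region": region,
    }
    if prefix:
        destination["Prefix"] = prefix
    return {"SyncName": sync_name, "S3Destination": destination}


if __name__ == "__main__":
    # With boto3 installed and credentials configured:
    #   ssm = boto3.client("ssm")
    #   ssm.create_association(**build_inventory_association())
    #   ssm.create_resource_data_sync(**sync)
    sync = build_resource_data_sync("inventory-to-s3", "my-inventory-bucket",
                                    "us-east-1", prefix="ssm")
    print(build_inventory_association())
    print(sync)
```

The bucket also needs a policy that allows Systems Manager to write to it, which is not shown here.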

Patch Manager and Compliance

Patch Manager also operates on State Manager associations. The command document is AWS-RunPatchBaseline, which can either just scan for missing patches or install them as well. The document runs on all three platforms (Windows, Linux and macOS) and determines which patches are missing relative to the patch baseline. There should be at least one default patch baseline. Each OS (e.g. Ubuntu, Debian, Amazon Linux) classifies patches differently, and a patch baseline is a configuration that defines whether a patch is approved based on the operating system and its patch classifications. The document also allows you to override the patch baseline. When it scans for patches, it records patch compliance information using the PutInventory API. When using the document to install patches, you can run it from a Maintenance Window and specify whether to reboot the target instance if required.

The Compliance capability reports compliance status for instances. By default there are two types of compliance: association and patch. Association compliance detects whether a State Manager association has failed on certain instances. Patch compliance, as just mentioned, checks whether patches are up to date relative to the specified patch baseline. You can also define custom compliance items (with the put-compliance-items command), but the documentation isn’t clear on what exactly this can achieve and where on the instance the compliance status is pulled from. From the example in put-compliance-items, a custom compliance type seems to check the installation of additional software packages in the inventory.
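
The scan-versus-install choice and the reboot behaviour are both plain parameters of the AWS-RunPatchBaseline document. A minimal sketch that builds a SendCommand request for either mode; the tag used for targeting is a placeholder.

```python
VALID_OPERATIONS = {"Scan", "Install"}
VALID_REBOOT_OPTIONS = {"RebootIfNeeded", "NoReboot"}


def build_patch_request(tag_key, tag_value, operation="Scan",
                        reboot_option="RebootIfNeeded"):
    """Build SendCommand arguments that run AWS-RunPatchBaseline.

    operation: "Scan" only reports missing patches, "Install" applies them.
    reboot_option: only matters for "Install".
    """
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {sorted(VALID_OPERATIONS)}")
    if reboot_option not in VALID_REBOOT_OPTIONS:
        raise ValueError(f"reboot_option must be one of {sorted(VALID_REBOOT_OPTIONS)}")
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {
            "Operation": [operation],
            "RebootOption": [reboot_option],
        },
    }


if __name__ == "__main__":
    # Scan continuously (e.g. from an association), install in a window.
    print(build_patch_request("PatchGroup", "web"))
    print(build_patch_request("PatchGroup", "web", "Install", "NoReboot"))
```

The same Parameters block can be reused in a State Manager association (for scanning) or a Maintenance Window task (for installing).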
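
Looking at the shape of the request helps a little with the question above. As I read the API, nothing is pulled from the instance: the caller decides the status and pushes it. The sketch below builds a PutComplianceItems request under that assumption; the compliance type name, item ID and title are made up.

```python
from datetime import datetime, timezone


def build_compliance_request(instance_id, item_id, title, compliant,
                             compliance_type="Custom:SoftwareCheck"):
    """Build keyword arguments for the SSM PutComplianceItems API.

    The caller decides whether the item is compliant; how that is
    determined (a script, an inventory query, ...) is up to you.
    """
    if not compliance_type.startswith("Custom:"):
        raise ValueError("custom compliance types must start with 'Custom:'")
    return {
        "ResourceId": instance_id,
        "ResourceType": "ManagedInstance",
        "ComplianceType": compliance_type,
        "ExecutionSummary": {"ExecutionTime": datetime.now(timezone.utc)},
        "Items": [
            {
                "Id": item_id,
                "Title": title,
                "Severity": "MEDIUM",
                "Status": "COMPLIANT" if compliant else "NON_COMPLIANT",
            }
        ],
    }


if __name__ == "__main__":
    # With boto3: boto3.client("ssm").put_compliance_items(**request)
    request = build_compliance_request(
        "i-0123456789abcdef0", "htop-installed", "htop is installed", False
    )
    print(request)
```

So a custom compliance type is essentially a label plus a status that your own tooling reports, which then shows up next to association and patch compliance.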

Other capabilities

Amongst the other capabilities, the one I use most often is Parameter Store, which is a way to store a variable for different services to consume.

In the domain of change management, Change Manager is a mini change management system. Organizations can use it to manage their change process, such as approvals. More importantly, you can fire an automation runbook from Change Manager and tie it back to the change control item. Change Calendar allows you to block changes during specific periods. Both of them are organization-level capabilities.

When it comes to operations management, the Incident Manager capability allows you to create response plans for incidents. A response plan can execute runbook actions once an incident is logged. It also helps you notify the on-call incident response team. The OpsCenter capability, on the other hand, allows you to create OpsItems, which also include a way to execute runbooks. The OpsData can be aggregated into Explorer, which is a centralized dashboard for operations data. The Explorer, OpsCenter and Incident Manager capabilities can operate at the organization level.

These capabilities around change management and operations management come nowhere close to full-fledged ITSM solutions such as ServiceNow or SMAX. However, they have the ability to trigger runbooks and natively integrate with other AWS services.

There is also a Quick Setup capability, which uses pre-baked CloudFormation templates to configure other services. For Patch Manager, the current recommendation is to use Quick Setup to configure a patch policy.

Summary

Systems Manager has so many capabilities that I cannot cover everything in a single post. Here is a good walk-through. Some capabilities, like Session Manager, Fleet Manager and State Manager, are extremely helpful. However, in my opinion, there are two problems with grouping all these capabilities under Systems Manager. First, with so many different capabilities, the service lacks focus, which makes it difficult to learn. Second, some capabilities overlap with other capabilities, or with other AWS services, which also makes it confusing. I try to sort out how these capabilities enable each other in the diagram below:

This diagram may not be 100% accurate, but it demonstrates the dependencies and can assist troubleshooting. For example, when compliance is missing data, check the execution history of Run Command. It also illustrates the key role of the SSM agent as the underlying enabler of most of the other capabilities.

Overall, Systems Manager is extremely powerful. You can try to replace your server management solutions (e.g. Ansible, Chef and Puppet) with Systems Manager configurations. With a good understanding of its capabilities, you can build your fleet automation in an efficient and scalable way.
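
The same troubleshooting idea can be written down as a small dependency graph. The edges below are my own reading of the relationships described in this post, not a transcription of the diagram, so treat them as an approximation to adjust.

```python
from collections import deque

# capability -> capabilities it builds on (approximate, see note above)
DEPENDS_ON = {
    "Compliance": ["State Manager", "Patch Manager"],
    "Patch Manager": ["State Manager", "Maintenance Window"],
    "Inventory": ["State Manager"],
    "State Manager": ["Run Command", "Automation"],
    "Maintenance Window": ["Run Command", "Automation"],
    "Fleet Manager": ["Run Command", "Session Manager"],
    "Run Command": ["SSM Agent"],
    "Session Manager": ["SSM Agent"],
    "Automation": [],
    "SSM Agent": [],
}


def troubleshooting_path(capability):
    """Return the capabilities to inspect, nearest dependency first.

    Breadth-first walk over DEPENDS_ON starting at the capability
    that is misbehaving; the capability itself is not included.
    """
    seen = {capability}
    order = []
    queue = deque([capability])
    while queue:
        current = queue.popleft()
        for dependency in DEPENDS_ON.get(current, []):
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


if __name__ == "__main__":
    for name in troubleshooting_path("Compliance"):
        print("check:", name)
```

For missing compliance data, the walk reaches Run Command early and ends at the SSM agent, which matches the troubleshooting order suggested above.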