Introduction to Systems Manager
AWS Systems Manager addresses many SysOps requirements for configuration management, including server automation. Another AWS service in this domain is OpsWorks. However, with OpsWorks Stacks, OpsWorks for Chef Automate and OpsWorks for Puppet Enterprise all reaching end of life in 2024, the OpsWorks service is effectively retired. Built in partnership with leaders such as Chef and Puppet, the OpsWorks services represent an era when AWS needed to mirror on-premises configuration management capabilities in order to convince customers to migrate to the cloud. Today, AWS Systems Manager has evolved to fill many of the gaps around configuration management for servers in the cloud.
AWS Systems Manager sounds like a single service, but it is a collection of many seemingly disparate capabilities that serve similar requirements around configuration management. In fact, many Systems Manager capabilities are built on top of a few of what I call core capabilities, such as Session Manager, Run Command and Automation. This post reviews these core capabilities and how Systems Manager builds its other capabilities on them.
SSM Agent and Session Manager
What enables all the other capabilities is the SSM agent installed on the EC2 instances. On Linux, the agent runs as a systemd service (amazon-ssm-agent) with root privileges; ssm-user is the account that Session Manager sessions log in as by default, not the account that runs the agent. Most AMIs come with the agent pre-installed. It stores its logs in /var/log/amazon/ssm/. To communicate with the AWS Systems Manager backend (ssm.<region>.amazonaws.com), the agent needs an instance profile whose role has the AmazonSSMManagedInstanceCore managed policy attached. You also need to provide a network path to the backend endpoint, either via the Internet or via an interface endpoint.
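To make the instance profile requirement concrete, here is a minimal sketch of the pieces involved. The role name is hypothetical, and the functions only build the arguments; they do not call AWS.

```python
import json

# Managed policy named in the text, in the standard AWS-managed policy ARN format.
SSM_CORE_POLICY_ARN = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"


def ec2_trust_policy():
    """Trust policy that lets EC2 assume the role behind the instance profile."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_role_kwargs(role_name):
    """Keyword arguments for IAM create_role; role_name is a hypothetical name."""
    return {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(ec2_trust_policy()),
    }
```

With boto3, these would be passed as `iam.create_role(**create_role_kwargs("DemoSsmRole"))`, followed by `iam.attach_role_policy(RoleName="DemoSsmRole", PolicyArn=SSM_CORE_POLICY_ARN)`. The role still has to be wrapped in an instance profile and attached to the instance.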
This communication also allows an IAM user to open a shell on an instance through Session Manager. A common use case is a private instance that has no Internet access but can reach the SSM backend endpoint. In a previous post I discussed using Session Manager to replace a bastion host to connect to EKS nodes.
When you launch an instance from an AMI with the SSM agent pre-installed, the agent typically starts only after all the config sets from CloudFormation Init have finished. As a result, the CloudFormation Init script cannot communicate with the SSM backend via the agent unless you install and start the agent yourself first, within CloudFormation Init. To troubleshoot SSM, it is important to review its logs.
Through Systems Manager hybrid activation, the SSM agent can also run on servers and virtual machines outside AWS and report back to the SSM backend. This gives on-prem servers the identities (instance tags, instance profiles) required for Systems Manager to manage them as if they were EC2 instances, which extends Systems Manager capabilities to the on-prem fleet (some features require the advanced-instances tier).
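As a rough illustration of what connecting looks like from a workstation, the sketch below builds the AWS CLI command line for a session. It assumes the AWS CLI and the Session Manager plugin are installed locally; the instance ID and port numbers are placeholders.

```python
import json


def start_session_argv(instance_id, local_port=None, remote_port=None):
    """Build the argv for `aws ssm start-session`.

    Without ports this opens an interactive shell. With both ports it uses the
    AWS-StartPortForwardingSession document to tunnel a local port to the instance.
    """
    argv = ["aws", "ssm", "start-session", "--target", instance_id]
    if local_port is not None and remote_port is not None:
        params = {
            "portNumber": [str(remote_port)],
            "localPortNumber": [str(local_port)],
        }
        argv += [
            "--document-name", "AWS-StartPortForwardingSession",
            "--parameters", json.dumps(params),
        ]
    return argv
```

The resulting list can be handed to `subprocess.run(...)`. No inbound security group rule or public IP is involved, because the agent opens the connection outbound to the SSM backend.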
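A hybrid activation is a two-step handshake: create an activation in AWS, then register the agent on the server with the returned code and ID. The sketch below only builds the inputs for both steps; the role name, server name and region are hypothetical.

```python
def activation_kwargs(iam_role, default_name, registration_limit=10):
    """Keyword arguments for the SSM create_activation call.

    iam_role is the service role the on-prem agent will assume.
    """
    return {
        "IamRole": iam_role,
        "DefaultInstanceName": default_name,
        "RegistrationLimit": registration_limit,
    }


def register_agent_argv(activation_code, activation_id, region):
    """Command to run (as root) on the on-prem server to register the agent."""
    return [
        "amazon-ssm-agent", "-register",
        "-code", activation_code,
        "-id", activation_id,
        "-region", region,
    ]
```

With boto3, `ssm.create_activation(**activation_kwargs(...))` returns an `ActivationCode` and `ActivationId`, which feed into the register command on the server.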
RunCommand and Automation
SSM uses many document types. When you author an SSM document you need to follow the document schema, but I would check the predefined document library first because it covers many common use cases. The most commonly used document types, and the capabilities that use them, are as follows:
|Document type|Used by|
|---|---|
|Command document|Run Command, State Manager, Maintenance Windows|
|Automation runbook|Automation, State Manager, Maintenance Windows|
|Session document|Session Manager|
|Policy document|State Manager|
|Change Calendar document|Change Calendar|
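For cases the predefined library does not cover, a custom Command document is fairly small. Here is a minimal sketch using schema version 2.2 and the aws:runShellScript action; the description, step name and commands are placeholders.

```python
import json


def command_document(description, commands):
    """Build a minimal Command document (schema version 2.2) for Linux targets."""
    return {
        "schemaVersion": "2.2",
        "description": description,
        "mainSteps": [
            {
                "action": "aws:runShellScript",
                "name": "runCommands",  # hypothetical step name
                "inputs": {"runCommand": list(commands)},
            }
        ],
    }


def create_document_kwargs(name, document):
    """Keyword arguments for the SSM create_document call."""
    return {
        "Name": name,
        "DocumentType": "Command",
        "DocumentFormat": "JSON",
        "Content": json.dumps(document),
    }
```

With boto3, the result would go to `ssm.create_document(**create_document_kwargs("Demo-CheckDisk", doc))`, after which the document can be used by Run Command, State Manager or Maintenance Windows like any predefined one.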
AWS packages scripts for different platforms (e.g. Linux, Windows) into SSM documents and has published a whole library of predefined documents shared with all AWS customers. For example, the Command document AWS-JoinDirectoryServiceDomain helps join a Windows server to a managed Active Directory domain. The Command document AWS-RunPatchBaseline is used by the Patch Manager capability to check for and apply operating system patches; it includes steps for Windows, macOS and Linux instances. To troubleshoot why a command fails on an instance, check the file ssm-document-worker.log in the SSM agent log directory. Each log entry should have a command ID as reference.
The Run Command and Automation capabilities run on top of the SSM agent. You specify one or more target instances, along with other options such as command parameters, rate control and where the output goes. As an IAM user you can run a command directly on an instance and have those events logged centrally. In addition, the Automation capability can call AWS APIs as part of the execution and supports multiple steps in a document.
Run Command and Automation are core capabilities that in turn enable a variety of other Systems Manager capabilities.
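The options mentioned above (targets, parameters, rate control, output location) map fairly directly onto the Run Command API. A minimal sketch, assuming the predefined AWS-RunShellScript document and tag-based targeting; the tag, commands and bucket name are placeholders.

```python
def send_command_kwargs(tag_key, tag_value, commands, output_bucket=None):
    """Keyword arguments for the SSM send_command call."""
    kwargs = {
        "DocumentName": "AWS-RunShellScript",
        # Target every managed instance carrying this tag.
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": {"commands": list(commands)},
        # Rate control: run on 10% of targets at a time, stop after one error.
        "MaxConcurrency": "10%",
        "MaxErrors": "1",
    }
    if output_bucket:
        kwargs["OutputS3BucketName"] = output_bucket
    return kwargs
```

With boto3, `ssm.send_command(**send_command_kwargs("Environment", "dev", ["uptime"]))` returns a command ID, which is the same ID to look for in the agent logs when a command fails.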
Maintenance Window and State Manager
Maintenance Windows is a straightforward capability for scheduling tasks with a cron or rate expression. You specify targets by instance tags, define one or more tasks, and define the duration of the window and how long before the end of the window Systems Manager should stop starting new tasks (the cutoff). Each task can be a Run Command command, an Automation runbook, a Lambda function or a Step Functions state machine.
State Manager is a similar capability with a lot of feature overlap with Maintenance Windows. State Manager operates on the concept of associations. An association connects target instances to an SSM document to run. You can even run Ansible playbooks or Chef recipes. As with Maintenance Windows, you can specify a schedule expression, document parameters and instance tags. State Manager was introduced to combat configuration drift. The associated document should consist of idempotent scripts so that a State Manager association can execute it repeatedly to ensure compliance.
Maintenance Windows is more about scheduling one or more tasks. On the State Manager side, however, an association failure is reported as out of compliance by default. This is useful in scenarios such as keeping a Windows instance in the domain, or keeping the SSM agent up to date. You can choose either capability for many common setups, but they have subtle differences. For example, for patch management you can use State Manager to detect missing patches and report compliance, and Maintenance Windows to actually apply the missing patches. In fact, there is a documentation page on choosing between State Manager and Maintenance Windows that distinguishes their best use cases.
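The relationship between duration and cutoff is easiest to see with numbers. A minimal sketch, with a hypothetical window name and schedule; the check that the cutoff is shorter than the duration is my own sanity check rather than something taken from the documentation.

```python
def maintenance_window_kwargs(name, schedule, duration_hours, cutoff_hours):
    """Keyword arguments for the SSM create_maintenance_window call."""
    if not 0 <= cutoff_hours < duration_hours:
        raise ValueError("cutoff must be shorter than the window duration")
    return {
        "Name": name,
        "Schedule": schedule,
        "Duration": duration_hours,  # length of the window, in hours
        "Cutoff": cutoff_hours,      # stop starting new tasks this many hours before the end
        "AllowUnassociatedTargets": False,
    }


def hours_accepting_new_tasks(duration_hours, cutoff_hours):
    """How long after the window opens new tasks can still be started."""
    return duration_hours - cutoff_hours
```

For a 4-hour window with a 1-hour cutoff, new tasks can start during the first 3 hours, and tasks already running get the last hour to finish.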
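To illustrate an association, here is a minimal sketch for the "keep the SSM agent up to date" scenario, assuming the predefined AWS-UpdateSSMAgent document. The association name, tag and schedule are placeholders.

```python
def association_kwargs(document, association_name, tag_key, tag_value,
                       schedule, parameters=None, severity="MEDIUM"):
    """Keyword arguments for the SSM create_association call."""
    kwargs = {
        "Name": document,  # the SSM document to run
        "AssociationName": association_name,
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "ScheduleExpression": schedule,
        # Severity reported in Compliance when the association fails.
        "ComplianceSeverity": severity,
    }
    if parameters:
        kwargs["Parameters"] = parameters
    return kwargs


agent_update = association_kwargs(
    "AWS-UpdateSSMAgent", "keep-ssm-agent-current",
    "Environment", "prod", "rate(14 days)",
)
```

With boto3 this would be `ssm.create_association(**agent_update)`. Because the association re-runs on a schedule, the document behind it has to be safe to run repeatedly.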
Fleet Manager and Inventory
Fleet Manager presents a centralized view of all instances where users can perform common administration tasks, such as exploring file systems and logs, administering users and groups, managing the registry and events on Windows instances, and checking processes and performance metrics. It also gives shortcuts to patch nodes, run commands, start sessions, etc. I think of Fleet Manager as a minimalist configuration management UI. It is not as sophisticated as those from Ansible Tower or Puppet, but it comes at no additional cost.
A very useful feature of Fleet Manager is a web-based remote desktop for connecting to Windows instances. This removes the need for a bastion host as long as the instances have an SSM connection. One login option requires the RSA private key to decrypt the Administrator password, which I would not recommend. If the Windows server is joined to a domain, you can enter your domain credentials in Fleet Manager. If users signed in through IAM Identity Center, Fleet Manager also offers single sign-on with their IAM Identity Center identity. When a user logs in this way, Fleet Manager uses the Run Command capability to execute the AWSSSO-CreateSSOUser document against the server to create a local admin user.
Another aspect of configuration management is inventory management. Unlike in Ansible, the term inventory in the context of Systems Manager refers to the metadata of instances, which includes installed applications, AWS components, network configurations, instance details, services, Windows registry and roles, etc. The full list of metadata types is in the documentation, and you can even define your own inventory items. To gather inventory data, we can use a State Manager association to execute the AWS-GatherSoftwareInventory document. Once we set up the association, the agents report inventory data back to Systems Manager. More importantly, we can create Resource Data Sync objects to write inventory data (along with compliance data) to S3 buckets for downstream applications to consume. A common use case is to run Athena queries against those buckets and produce QuickSight dashboards.
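The two pieces described above, the inventory association and the Resource Data Sync, can be sketched as follows. The association name, sync name, bucket and region are placeholders, and the bucket policy that allows Systems Manager to write to the bucket is assumed to exist already.

```python
def inventory_association_kwargs(schedule="rate(30 minutes)"):
    """Keyword arguments for create_association that gathers inventory from all managed instances."""
    return {
        "Name": "AWS-GatherSoftwareInventory",
        "AssociationName": "gather-inventory",  # hypothetical name
        "Targets": [{"Key": "InstanceIds", "Values": ["*"]}],
        "ScheduleExpression": schedule,
    }


def resource_data_sync_kwargs(sync_name, bucket, region, prefix=None):
    """Keyword arguments for create_resource_data_sync writing to an S3 bucket."""
    destination = {
        "BucketName": bucket,
        "SyncFormat": "JsonSerDe",
        "Region": region,
    }
    if prefix:
        destination["Prefix"] = prefix
    return {"SyncName": sync_name, "S3Destination": destination}
```

With boto3 these go to `ssm.create_association(...)` and `ssm.create_resource_data_sync(...)` respectively. The JSON written to the bucket is what Athena then queries.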
Patch Manager and Compliance
Patch Manager also operates on State Manager associations. The SSM document behind it is AWS-RunPatchBaseline, which can either just scan for missing patches or install them as well. The document runs on all three platforms (Windows, Linux and macOS) and determines which patches are missing relative to a patch baseline. There should be at least one default patch baseline. Each OS (e.g. Ubuntu, Debian, Amazon Linux) classifies patches differently, and a patch baseline is a configuration that defines whether a patch is approved based on the operating system and its patch classifications. The document also allows you to override the patch baseline. When you execute the document to scan for patches, it records patch compliance information using the PutInventory API command. When you use the document to install patches, you can run it from a Maintenance Window and specify whether to reboot the target instance if required.
The Compliance capability reports compliance status for instances. By default there are two types of compliance: association and patch. Association compliance detects whether a State Manager association has failed on certain instances. Patch compliance, as just mentioned, checks whether patches are up to date relative to the specified patch baseline. You can also define custom compliance items (with the put-compliance-items API), but the documentation isn't clear on what exactly this can achieve or where on the instance the compliance status comes from. Judging from the example in put-compliance-items, a custom compliance type seems to check the installation of additional software packages in the inventory.
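The scan versus install distinction comes down to the parameters passed to AWS-RunPatchBaseline. A minimal sketch via Run Command, with a placeholder tag for targeting:

```python
VALID_OPERATIONS = {"Scan", "Install"}


def patch_command_kwargs(tag_key, tag_value, operation="Scan", reboot=False):
    """Keyword arguments for send_command running AWS-RunPatchBaseline."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"operation must be one of {sorted(VALID_OPERATIONS)}")
    parameters = {"Operation": [operation]}
    if operation == "Install":
        # Only meaningful when installing: reboot if a patch requires it, or never.
        parameters["RebootOption"] = ["RebootIfNeeded" if reboot else "NoReboot"]
    return {
        "DocumentName": "AWS-RunPatchBaseline",
        "Targets": [{"Key": f"tag:{tag_key}", "Values": [tag_value]}],
        "Parameters": parameters,
    }
```

A scan such as `patch_command_kwargs("Environment", "prod")` fits a frequent State Manager schedule, while the `Install` variant is the one to put inside a Maintenance Window.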
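Looking at the shape of the API helps here: the caller supplies the status of each item, so a custom compliance check is whatever script or pipeline you run yourself and then report. A minimal sketch, with a hypothetical compliance type, item ID and title:

```python
from datetime import datetime, timezone


def custom_compliance_kwargs(instance_id, compliance_name, item_id, title, compliant):
    """Keyword arguments for the SSM put_compliance_items call."""
    return {
        "ResourceId": instance_id,
        "ResourceType": "ManagedInstance",
        # Custom compliance types are prefixed with "Custom:".
        "ComplianceType": f"Custom:{compliance_name}",
        "ExecutionSummary": {"ExecutionTime": datetime.now(timezone.utc)},
        "Items": [
            {
                "Id": item_id,
                "Title": title,
                "Severity": "MEDIUM",
                "Status": "COMPLIANT" if compliant else "NON_COMPLIANT",
            }
        ],
    }
```

With boto3 this would be `ssm.put_compliance_items(**custom_compliance_kwargs(...))`, typically called at the end of whatever check produced the result.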
Amongst the other capabilities, the one I use most often is Parameter Store, which is a way to store configuration values for different services to consume.
In the domain of change management, Change Manager is a mini change management system. Organizations can use it to manage their change processes, such as approvals. More importantly, you can start an Automation runbook from Change Manager and tie it back to the change control item. Change Calendar allows you to block changes during specific periods. Both are organization-level capabilities.
When it comes to operations management, the Incident Manager capability allows you to create response plans for incidents. A response plan can execute runbook actions once an incident is logged. It also helps you notify the on-call incident response team. The OpsCenter capability, on the other hand, allows you to create OpsItems, which also include a way to execute runbooks. The OpsData can be aggregated in Explorer, which is a centralized dashboard for operations data. The Explorer, OpsCenter and Incident Manager capabilities can operate at the organization level.
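As a quick illustration, writing and reading a parameter takes very little. The parameter name below is a placeholder, and the functions only build the call arguments.

```python
def put_parameter_kwargs(name, value, secret=False):
    """Keyword arguments for the SSM put_parameter call."""
    return {
        "Name": name,
        "Value": value,
        # SecureString values are encrypted with KMS; String values are stored as plain text.
        "Type": "SecureString" if secret else "String",
        "Overwrite": True,
    }


def get_parameter_kwargs(name):
    """Keyword arguments for the SSM get_parameter call."""
    return {"Name": name, "WithDecryption": True}
```

With boto3, `ssm.get_parameter(**get_parameter_kwargs("/demo/app/db_host"))["Parameter"]["Value"]` returns the stored value.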
These capabilities around change management and operations management come nowhere close to full-fledged ITSM solutions such as ServiceNow or SMAX. However, they have the ability to trigger runbooks and natively integrate with other AWS services.
There is also a Quick Setup capability, which uses pre-baked CloudFormation templates to configure other services. For Patch Manager, the current recommendation is to use Quick Setup to configure a patch policy.
Systems Manager has so many capabilities that I cannot cover everything in a single post. Here is a good walk-through. Some capabilities, like Session Manager, Fleet Manager and State Manager, are extremely helpful. However, in my opinion, there are two problems with grouping all these capabilities under Systems Manager. First, with so many different capabilities, the service lacks focus, which makes it difficult to learn what Systems Manager does. Second, some capabilities overlap with other capabilities, or with other AWS services, which also makes it confusing. I try to sort out how these capabilities enable each other in the diagram below:
This diagram may not be 100% accurate, but it demonstrates the dependencies and can assist troubleshooting. For example, when compliance is missing data, check the execution history of Run Command. It also illustrates the key role of the SSM agent as the underlying enabler of most of the other capabilities.