We monitor the running system with CloudWatch alarms. When an alarm is triggered, it automatically sends an email to TBD. To ensure an immediate response, we are supported by the Amazon team located in Berlin: when an alarm is triggered, an on-call person is paged immediately and looks into the case as soon as possible.
We currently have the following alarms in place:
- AuthorizationFailuresAlarm: Detect any unauthorized attempts to access resources
- LimitExceededFailuresAlarm: Detect if any account limits are reached
- IAMPolicyChangesAlarm: Detect IAM policy changes
- HighCpuUsageAlarm: Detect high CPU usage
- HighNetworkUsageAlarm: Detect high network usage
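As an illustration of what one of these alarms could look like in a CloudFormation template, here is a minimal sketch for a CPU alarm that notifies an SNS topic with an email subscription. It is not the actual definition used in this repository: the parameter names (InstanceId, NotificationEmail), the topic name (AlarmNotificationTopic), the threshold, and the evaluation window are all assumptions chosen for the example.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative sketch of a high CPU usage alarm (values are assumptions)

Parameters:
  # Hypothetical parameter: the EC2 instance to watch
  InstanceId:
    Type: AWS::EC2::Instance::Id
  # Hypothetical parameter: the address that receives alarm emails
  NotificationEmail:
    Type: String

Resources:
  # Hypothetical SNS topic that forwards alarm notifications by email
  AlarmNotificationTopic:
    Type: AWS::SNS::Topic
    Properties:
      Subscription:
        - Protocol: email
          Endpoint: !Ref NotificationEmail

  HighCpuUsageAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Detect high CPU usage
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref InstanceId
      Statistic: Average
      Period: 300              # seconds per data point (assumed)
      EvaluationPeriods: 2     # consecutive breaching periods before alarming (assumed)
      Threshold: 90            # percent CPU (assumed)
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlarmNotificationTopic
```

Routing notifications through an SNS topic keeps the recipient list separate from the alarm itself, so the email target can change without touching the alarm definition.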
Deployment
Deployment is done using CloudFormation. Before you continue, make sure that an AWS CLI profile named 'mxnetci' (prod) or 'mxnetcidev' (test) is configured.
To deploy, run ./deploy-stack.sh in the cloudformation_metrics directory. The output looks as follows:
Structure
To separate concerns, we use nested CloudFormation stacks. These allow us to keep mostly independent templates, each of which can be managed without influencing the others. They are all managed in the master-metrics.yml file. Dependencies between them are declared using the DependsOn attribute. Further documentation is available at https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-nested-stacks.html.
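The following sketch shows how a parent template can declare two nested stacks and order them with DependsOn. It is a simplified illustration, not the contents of master-metrics.yml: the resource names (NotificationsStack, AlarmsStack) and the template URLs, including the bucket name, are hypothetical placeholders.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: Illustrative parent template with two nested stacks (names are assumptions)

Resources:
  # Hypothetical nested stack, created first
  NotificationsStack:
    Type: AWS::CloudFormation::Stack
    Properties:
      # Placeholder URL; nested templates must be reachable by CloudFormation (e.g. in S3)
      TemplateURL: https://s3.amazonaws.com/EXAMPLE-BUCKET/notifications.yml

  # Hypothetical nested stack that must wait for NotificationsStack
  AlarmsStack:
    Type: AWS::CloudFormation::Stack
    DependsOn: NotificationsStack   # created only after NotificationsStack completes
    Properties:
      TemplateURL: https://s3.amazonaws.com/EXAMPLE-BUCKET/alarms.yml
```

With this layout, CloudFormation creates NotificationsStack before AlarmsStack and deletes them in the reverse order, while each nested template can still be edited on its own.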