Provisioning an AWS ECS cluster using Terraform

Lessons learned while automating the infrastructure provisioning of an ECS cluster of EC2 virtual machines that run Docker and scale with your apps, using Terraform as the infrastructure orchestration tool.

What are AWS, Docker, ECS and Terraform?

Amazon Web Services is the obvious choice when it comes to deploying apps on the cloud.

Its staggering 47% market share speaks for itself.

Docker, a containerization tool, has also been around for a while and the DevOps community has seen its potential, responding with rapid adoption and community support.

Amazon also saw this potential and created an innovative solution for deploying and managing containers on a fleet of virtual machines: AWS ECS. Under the hood, ECS builds on AWS's well-known EC2 virtual machines, CloudWatch for monitoring them, auto scaling groups (for provisioning and deprovisioning machines depending on the current load of the cluster) and, most importantly, Docker as the containerization engine.

Terraform is an infrastructure orchestration tool (an approach also known as “infrastructure as code”). Using Terraform, you declare every single piece of your infrastructure once, in static files. This lets you deploy and destroy cloud infrastructure easily (through a single command), make incremental changes, roll back, and keep the infrastructure under version control.

The goal of this article is to teach you how to write the Terraform “recipes” that define an AWS ECS cluster, so that you can deploy or redeploy this cluster in a repeatable, predictable and scalable way. Each step comes with a brief description.

Preparation

We will not go into much detail on how to download Terraform or how to run it locally, because that information is readily available on the Terraform website.

What you will need, though, is an AWS user with administrator-level privileges, along with that user's access key and secret key (generated from the AWS IAM console). The Terraform AWS provider must also be configured; this is relatively easy and out of scope for this document.
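For orientation only, a minimal provider configuration might look like the sketch below. The region is an assumption chosen for illustration, and the credentials are assumed to come from the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables rather than from the Terraform files.

```hcl
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
  }
}

provider "aws" {
  # Assumption: replace with the region you want to deploy to
  region = "eu-west-1"
}
```

Keeping the keys in environment variables means they never end up in version control together with the rest of the code.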

Terraform structure

ecs-cluster.tf

We’ll start by creating the AWS ECS cluster, which is the most basic building block of the AWS ECS service. It has no dependencies (e.g. it doesn’t need a VPC), so we just give it a name that comes from a Terraform variable that we’ll pass during the creation of the infrastructure. This parameterization allows us to easily create multiple ECS clusters (and their satellite resources) using the same set of Terraform files, if needed.

resource "aws_ecs_cluster" "ecs_cluster" {
  name = var.cluster_name
}

vpc.tf

For this tutorial, we’ll assume you want to create a brand new AWS VPC within the current region. This VPC will contain the EC2 instances that are launched within the ECS cluster and will allow them to communicate securely and privately, without resorting to the public internet and public IPs (as a matter of fact, the EC2 instances could be entirely hidden from the internet). Let’s create a new VPC now:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "VPC of cluster ${var.cluster_name}"
  cidr = "10.0.0.0/16"

  azs = [
    data.aws_availability_zones.available.names[0],
    data.aws_availability_zones.available.names[1],
    data.aws_availability_zones.available.names[2],
  ]
  private_subnets = [
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24",
  ]
  public_subnets = [
    "10.0.101.0/24",
    "10.0.102.0/24",
    "10.0.103.0/24",
  ]

  # If you want the EC2 instances to live within the private subnets but still be able to communicate with the public internet
  enable_nat_gateway = true
  # Save some money but less resilient to AZ failures
  single_nat_gateway = true
}

We’ll refer to the above new VPC and its private/public subnets in other resources below.
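As a rough sketch of what those references look like, the outputs below expose the values that the VPC module makes available. The output names on the left are hypothetical; the `vpc_id`, `private_subnets` and `public_subnets` attributes are outputs of the `terraform-aws-modules/vpc/aws` module.

```hcl
# Hypothetical outputs, handy for inspecting what the module created
output "vpc_id" {
  value = module.vpc.vpc_id
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}

output "public_subnet_ids" {
  value = module.vpc.public_subnets
}
```

After the infrastructure is created, `terraform output` prints these values.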

data.tf

You may have noticed that the VPC definition above has a hardcoded part (the subnet ranges) and a dynamic part (the three AZs, or Availability Zones). The dynamic part is useful because it allows you to redeploy the ECS cluster to different AWS regions without changing hardcoded values every time. Another advantage is that your deployments will adapt seamlessly to future changes Amazon makes to the AZs (should they retire an Availability Zone or add new ones). The data block below retrieves the up-to-date list of AZs in the current AWS region during deployment. This data source is used in the VPC definition above.

data "aws_availability_zones" "available" {
  state = "available"
}

Another data source we will need later is one that gets the most up-to-date ECS-optimized AWS EC2 AMI. An AMI ID is nothing more than a codename (e.g. “ami-1234567”) that identifies a template you can use to jump-start a brand new EC2 instance. There are AMIs for the popular Linux distributions: Ubuntu, Debian, etc. The one we retrieve below is a Linux-based AMI that is created and maintained by Amazon and includes the essential tools an EC2 instance needs to work as an ECS instance (Docker, Git, the ECS agent, SSH).

data "aws_ami" "ecs" {
  most_recent = true # get the latest version

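  filter {
    name = "name"
    # ECS optimized Amazon Linux 2 image. The pattern is narrowed to x86_64 so that
    # GPU or arm64 variants are not picked up for x86 instance types such as t3a.medium
    values = ["amzn2-ami-ecs-hvm-*-x86_64-ebs"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }

  owners = ["amazon"] # Only official images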
}
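As an alternative to filtering AMIs by name, AWS also publishes the ID of the recommended ECS-optimized Amazon Linux 2 image as a public SSM parameter. The sketch below reads that parameter; this is a swapped-in technique rather than what the rest of this article uses, and the names `ecs_recommended_ami` and `ecs_ami_id` are hypothetical.

```hcl
# Public parameter maintained by AWS with the recommended ECS-optimized AMI ID
data "aws_ssm_parameter" "ecs_recommended_ami" {
  name = "/aws/service/ecs/optimized-ami/amazon-linux-2/recommended/image_id"
}

locals {
  # Could be used as the image_id of a launch configuration
  ecs_ami_id = data.aws_ssm_parameter.ecs_recommended_ami.value
}
```

With either approach, a newer AMI only reaches running instances once they are replaced.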

autoscaling_groups.tf

Now comes the interesting part. Using AWS autoscaling groups, we can automate the launch of EC2 instances when the load of the ECS cluster crosses a certain threshold (e.g. 70% or more of the cluster's RAM is reserved).

First we create an autoscaling group that defines the minimum, the maximum and the desired number of EC2 instances. These parameters guarantee a minimum of available resources when server load is low, while keeping costs under control during periods of high load and unexpected traffic spikes.

resource "aws_autoscaling_group" "ecs_cluster_spot" {
  name_prefix = "${var.cluster_name}_asg_spot_"
  termination_policies = [
     "OldestInstance" # When a “scale down” event occurs, which instances to kill first?
  ]
  default_cooldown          = 30
  health_check_grace_period = 30
  max_size                  = var.max_spot_instances
  min_size                  = var.min_spot_instances
  desired_capacity          = var.min_spot_instances

  # Use this launch configuration to define “how” the EC2 instances are to be launched
  launch_configuration      = aws_launch_configuration.ecs_config_launch_config_spot.name

  lifecycle {
    create_before_destroy = true
  }

  # Refer to vpc.tf for more information
  # You could use the private subnets here instead,
  # if you want the EC2 instances to be hidden from the internet
  vpc_zone_identifier = module.vpc.public_subnets

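  # The "tag" block is used because recent AWS provider versions
  # no longer accept a list of maps in "tags" for this resource
  tag {
    key   = "Name"
    value = var.cluster_name

    # Make sure EC2 instances are tagged with this tag as well
    propagate_at_launch = true
  }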
}

# Attach an autoscaling policy to the spot cluster to target 70% MemoryReservation on the ECS cluster.
resource "aws_autoscaling_policy" "ecs_cluster_scale_policy" {
  name = "${var.cluster_name}_ecs_cluster_spot_scale_policy"
  policy_type = "TargetTrackingScaling"
  adjustment_type = "ChangeInCapacity"
  lifecycle {
    ignore_changes = [
      adjustment_type
    ]
  }
  autoscaling_group_name = aws_autoscaling_group.ecs_cluster_spot.name

  target_tracking_configuration {
    customized_metric_specification {
      metric_dimension {
        name = "ClusterName"
        value = var.cluster_name
      }
      metric_name = "MemoryReservation"
      namespace = "AWS/ECS"
      statistic = "Average"
    }
    target_value = 70.0
  }
}

The above basically automates the following:

  • Whenever less than 70% of the cluster's memory is reserved, the autoscaling policy scales the cluster in, but never below “var.min_spot_instances” instances.
  • As soon as 70% or more of the cluster's memory is reserved, the autoscaling policy kicks in and creates new EC2 instances inside the cluster (using the “aws_launch_configuration.ecs_config_launch_config_spot” launch configuration). It keeps doing so (the group is configured with default_cooldown = 30 seconds) until the condition is no longer met (less than 70% of memory reserved) or the group reaches “var.max_spot_instances” instances.
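The same pattern can track other cluster metrics. As a sketch, assuming you also want to scale on reserved CPU, a second target tracking policy could look like the one below. The resource name and the 70% target are illustrative assumptions; `CPUReservation` is a metric that ECS publishes in the `AWS/ECS` namespace.

```hcl
# Hypothetical companion policy: target 70% CPUReservation on the ECS cluster
resource "aws_autoscaling_policy" "ecs_cluster_cpu_scale_policy" {
  name                   = "${var.cluster_name}_ecs_cluster_spot_cpu_scale_policy"
  policy_type            = "TargetTrackingScaling"
  autoscaling_group_name = aws_autoscaling_group.ecs_cluster_spot.name

  target_tracking_configuration {
    customized_metric_specification {
      metric_dimension {
        name  = "ClusterName"
        value = var.cluster_name
      }
      metric_name = "CPUReservation"
      namespace   = "AWS/ECS"
      statistic   = "Average"
    }
    target_value = 70.0
  }
}
```

When several target tracking policies are attached to one group, the group scales out if any of them asks for more capacity and scales in only when all of them allow it.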

launch_configuration.tf

We saw above how the scaling of the cluster is automated, but we still haven’t defined what type of EC2 instances will be launched when scaling occurs. For example: what AMI will they use, will they be larger (and cost more per month) or smaller (and cost less per month), will they assume an IAM role (to be able to access other AWS resources on your behalf), etc.

The launch configuration defines all of these parameters:

resource "aws_launch_configuration" "ecs_config_launch_config_spot" {
  name_prefix                 = "${var.cluster_name}_ecs_cluster_spot"
  image_id                    = data.aws_ami.ecs.id # Use the latest ECS optimized AMI
  instance_type               = var.instance_type_spot # e.g. t3a.medium

  # e.g. "0.013": the maximum price (per hour) that you are willing to pay for every instance
  # See the EC2 Spot Pricing page for more information:
  # https://aws.amazon.com/ec2/spot/pricing/
  spot_price                  = var.spot_bid_price

  enable_monitoring           = true
  associate_public_ip_address = true
  lifecycle {
    create_before_destroy = true
  }

  # This user data represents a collection of "scripts" that will be executed the first time the machine starts.
  # This specific example makes sure the EC2 instance is automatically attached to the ECS cluster that we created earlier
  # and marks the instance as purchased through Spot pricing
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${var.cluster_name} >> /etc/ecs/ecs.config
echo ECS_INSTANCE_ATTRIBUTES={\"purchase-option\":\"spot\"} >> /etc/ecs/ecs.config
EOF

  # We’ll see security groups later
  security_groups = [
    aws_security_group.sg_for_ec2_instances.id
  ]

  # If you want to SSH into the instance and manage it directly:
  # 1. Make sure this key exists in the AWS EC2 dashboard
  # 2. Make sure your local SSH agent has it loaded
  # 3. Make sure the EC2 instances are launched within a public subnet (are accessible from the internet)
  key_name             = var.ssh_key_name

  # Allow the EC2 instances to access AWS resources on your behalf, using this instance profile and the permissions defined there
  iam_instance_profile = aws_iam_instance_profile.ec2_iam_instance_profile.arn
}
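The launch configuration above references `aws_iam_instance_profile.ec2_iam_instance_profile`, which is not defined in any of the files shown in this article. A minimal sketch of what it could look like follows. The role, policy document and attachment names are hypothetical, and the sketch assumes the AWS managed policy `AmazonEC2ContainerServiceforEC2Role` is enough for your instances to register with the cluster.

```hcl
# Let the EC2 service assume the role on behalf of the instances
data "aws_iam_policy_document" "ec2_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ec2_iam_role" {
  name_prefix        = "${var.cluster_name}_ec2_"
  assume_role_policy = data.aws_iam_policy_document.ec2_assume_role.json
}

# AWS managed policy with the permissions the ECS agent needs
resource "aws_iam_role_policy_attachment" "ecs_for_ec2" {
  role       = aws_iam_role.ec2_iam_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ec2_iam_instance_profile" {
  name_prefix = "${var.cluster_name}_ec2_"
  role        = aws_iam_role.ec2_iam_role.name
}
```

If your containers need access to other AWS services, you would attach further policies to the same role.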

security_groups.tf

AWS is big on security, and almost every resource that you create is locked down against outside access by default. The same goes for EC2 instances. If you want these instances to receive any internet traffic (e.g. they run an HTTP server), or if you want to SSH into the machines from your computer through the public internet, you need to make sure the security group attached to the EC2 instances allows all this:

# Allow EC2 instances to receive HTTP/HTTPS/SSH traffic IN and any traffic OUT
resource "aws_security_group" "sg_for_ec2_instances" {
  name_prefix = "${var.cluster_name}_sg_for_ec2_instances_"
  description = "Security group for EC2 instances within the cluster"
  vpc_id      = module.vpc.vpc_id # The VPC created in vpc.tf
  lifecycle {
    create_before_destroy = true
  }
  tags = {
    Name = var.cluster_name
  }
}

resource "aws_security_group_rule" "allow_ssh" {
  type      = "ingress"
  from_port = 22
  to_port   = 22
  protocol  = "tcp"
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
}

resource "aws_security_group_rule" "allow_http_in" {
  type      = "ingress"
  from_port = 80
  to_port   = 80
  protocol  = "tcp"
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
}

resource "aws_security_group_rule" "allow_https_in" {
  protocol  = "tcp"
  from_port = 443
  to_port   = 443
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
  type              = "ingress"
}

resource "aws_security_group_rule" "allow_egress_all" {
  type      = "egress"
  from_port = 0
  to_port   = 0
  protocol  = "-1"
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
}

Of course, if you plan on launching something fancy like MySQL within the EC2 instances, you may want to expose other port ranges as well (e.g. 3306 for a MySQL server). Feel free to play around and add new security group rules to the above security group as needed.
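For example, a rule for MySQL could be sketched as follows. The rule name is hypothetical, and the sketch assumes the database should only be reachable from inside the VPC, so it uses the VPC's own CIDR block (the `vpc_cidr_block` output of the VPC module) instead of 0.0.0.0/0.

```hcl
# Hypothetical rule: allow MySQL traffic from within the VPC only
resource "aws_security_group_rule" "allow_mysql_in" {
  type              = "ingress"
  from_port         = 3306
  to_port           = 3306
  protocol          = "tcp"
  cidr_blocks       = [module.vpc.vpc_cidr_block]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
}
```

Opening a database port to the whole internet is rarely what you want, which is why the CIDR range is narrower here than in the HTTP and SSH rules.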

variables.tf

This file defines all the variables that you will pass in while creating the infrastructure:

variable "cluster_name" {
  type        = string
  description = "The name to use to create the cluster and the resources. Only alphanumeric characters and dash allowed (e.g. 'my-cluster')"
}
variable "ssh_key_name" {
  type        = string
  description = "SSH key to use to enter and manage the EC2 instances within the cluster. Optional"
  default     = ""
}
variable "instance_type_spot" {
  type        = string
  default     = "t3a.medium"
  description = "The EC2 instance type to launch as spot instances"
}
variable "spot_bid_price" {
  type        = string
  default     = "0.0113"
  description = "How much you are willing to pay as an hourly rate for an EC2 instance, in USD"
}
variable "min_spot_instances" {
  type        = number
  default     = 1
  description = "The minimum EC2 spot instances to have available within the cluster when the cluster receives less traffic"
}
variable "max_spot_instances" {
  type        = number
  default     = 5
  description = "The maximum EC2 spot instances that can be launched during periods of high traffic"
}
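One way to pass these values is a `terraform.tfvars` file placed next to the other files; Terraform loads it automatically. The values below are illustrative assumptions, not recommendations, and any variable you leave out falls back to its default.

```hcl
# terraform.tfvars (hypothetical values)
cluster_name       = "my-cluster"
ssh_key_name       = "my-ssh-key" # must already exist in the AWS EC2 dashboard
instance_type_spot = "t3a.medium"
spot_bid_price     = "0.0113"
min_spot_instances = 1
max_spot_instances = 5
```

Single values can also be supplied on the command line, e.g. `terraform apply -var="cluster_name=my-cluster"`.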

Conclusion

We’ve published all of the above as a Terraform module on GitHub that you can easily include in your project.

Now that you are ready to launch your AWS ECS cluster using Terraform, what will you build next? Let us know in the comments.
