AWS CodeBuild vs GitHub Actions – Pricing Comparison

AWS CodeBuild, one of the services in Amazon Web Services' CI/CD suite (alongside CodePipeline, CodeDeploy, and CodeCommit), is a common choice for CI needs among projects that are ultimately deployed to the AWS Cloud.

GitHub, on the other hand, started as a Git-repository-as-a-service, but has evolved well beyond those early days and now offers a CI solution of its own – GitHub Actions.

Although offered by different companies, both satisfy the same requirement: setting up a CI/CD pipeline for your company's project. Both also work for projects that are just getting started as well as for ones with thousands of contributors.

In this article, we’ll try to shed some light on how these two services compare in terms of cost when used on a real-life scale.

Free Tier comparison

AWS CodeBuild includes 100 free build minutes per month on the "general1.small" compute type, which comes with 2 vCPUs, 3 GB of memory, and 64 GB of disk space, as per the documentation.

GitHub Actions includes 2000 free build minutes per month, and the hardware of each job is fixed at 2 vCPUs, 7 GB of memory, and 14 GB of disk space, as per the documentation.

Real life example

Let's assume developers commit to your project's repository, on average, 20 times per day, each commit triggering a CI run, and that each build takes, on average, 15 minutes to complete. This brings us to 9000 build minutes per month (20 × 15 × 30).

Comparing the cost of running the same CI pipeline on GitHub Actions vs AWS CodeBuild:

AWS CodeBuild:

  • First 100 minutes from the free tier are consumed
  • The remaining 8900 minutes are charged at $0.005 per minute
  • Total cost: $44.50 per month

GitHub Actions:

  • First 2000 minutes from the free tier are consumed
  • The remaining 7000 minutes are charged at $0.008 per minute
  • Total cost: $56 per month
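
For completeness, here's a minimal TypeScript sketch of the arithmetic above (the prices and free-tier numbers are the ones quoted in this article and may change over time):

const monthlyMinutes = 20 * 15 * 30; // 20 builds/day × 15 min/build × 30 days = 9000 minutes

// Only the minutes above each provider's free tier are billed
const codeBuildCost = Math.max(0, monthlyMinutes - 100) * 0.005;      // 8900 × $0.005 = $44.50
const gitHubActionsCost = Math.max(0, monthlyMinutes - 2000) * 0.008; // 7000 × $0.008 = $56.00

console.log({ codeBuildCost, gitHubActionsCost });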

Conclusion

Of course, the decision whether to use CodeBuild or GitHub Actions is rarely based purely on price. Other factors come into play, including flexibility, integration with the other services your project uses (e.g. other AWS services), observability (how easy it is for your developers to inspect build logs), scalability, parallelism, etc.

For this reason, we cannot single out one service as superior to the other. We, at ScavaSoft, use both AWS CodeBuild and GitHub Actions for our internal projects. We use GitHub Actions mostly for linting and unit-testing jobs, whereas for Continuous Deployment to production environments we rely on AWS CodeBuild and CodePipeline's tight integration with other AWS services like CloudFormation.

Fix ordered MySQL table with messed up data, using only SQL queries

Recently I came across a very interesting issue with SQL databases that contain ordered information (e.g. imagine a table of TV channels, where each channel has an index: 1, 2, 3 and so on) that got messed up – positions were out of order and there were "holes" in the positioning.

In the perfect scenario, such a table would look like this:

  • id – The unique identifier of the channel
  • ordering – The order of the channel on your TV remote

However, a recent bug in one of the applications that we are maintaining caused the data to get messed up – "holes" started appearing in the "ordering" column as channels were "deleted".

This caused hard-to-debug issues in the functionality for reordering channels, counting channels, and so on. After we patched the original issue that was causing the holes in the ordering, we considered multiple approaches for fixing the affected features.

We could either modify every single feature that relies on the "ordering" column to tolerate holes (not very feasible, since many features rely on a correct "ordering"), or we could implement a mechanism for correcting the data.

We went with the latter approach: a simple SQL query that iterates over all rows and increments an "ordering" counter in memory. The result of this incrementing selection is then used as a subquery in an "UPDATE" statement.

Here's the final code that we built to handle this:

-- Reset the in-memory counter
SET @channelPosition := 0;

-- Re-number every channel, walking the rows in their current (gapped) order
UPDATE channels c1
INNER JOIN
(SELECT c.id, @channelPosition := @channelPosition + 1 AS channelPosition
  FROM channels c
  ORDER BY c.ordering) c2
ON c2.id = c1.id
SET c1.ordering = c2.channelPosition;

After running the query, the "ordering" column once again contains consecutive values (1, 2, 3, and so on) with no gaps.

Hope you find this useful. Do you have suggestions for better ways to approach this issue?

Filter DynamoDB using Query/Scan and implement a full-text search

DynamoDB is Amazon's take on a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. It is a great solution for projects that require single-digit-millisecond read/write operations and easy integration with other AWS services like Lambda, Elasticsearch, etc.

In essence, DynamoDB provides a couple of ways to filter the returned results. First, let's look at the properties that can be provided as criteria.

When you create a table in DynamoDB, you have to specify a Partition Key, by which records in the database are uniquely identified.

The second parameter by which you can further refine the list of returned data is the Sort Key. The name implies that it is used for sorting purposes, but in fact it can also be used as a unique filter for the data coming out of the DB.

The third option that DynamoDB provides to its users is the GSI (Global Secondary Index). You can create as many global secondary indexes as you need, and this is a great way to cover various usage patterns for data retrieval. GSIs also let you point at nested structures as values. So let's say we have the following structure in the DynamoDB table "apartments_to_rent", with a column named "rentor":

{
    "apartment_id": {"N": 666},
    "rentor": {
        "id": {"N": 1234},
        "name": {"S": "John Doe"},
        "rented_on": {"N": 1234123421}
    }
}

If you are curious what N and S stand for in these structured values – it's just DynamoDB's way to "marshall" and "unmarshall" data, in other words, to keep track of the type of the stored value and not just the value itself. N stands for "Number" and S for "String". Let's forget about that for the purpose of this example and imagine that the object is saved as plain JSON.

"rentor": {
    "id": "1234"
    "name": "John Doe"
    "rented_on": "1234123421"
}

To make the “id” of this nested object a GSI Partition Key, we have to specify it when we create our resources. We will use AWS CDK code below to illustrate the example. It will look something like this:

this.table.addGlobalSecondaryIndex({
    indexName: "index-rentor",
    partitionKey: { name: "rentor.id", type: dynamodb.AttributeType.STRING },
});

Here, this.table is an instance of the "Table" class that creates and defines the DynamoDB table. On this instance we call the addGlobalSecondaryIndex method, passing the name the index will have and binding it to the nested attribute that we will use as a partition key for later queries.

That's all you need to define the GSI.

Now for the interesting part! How do we filter by such a nested structure?

DynamoDB provides two main mechanisms to retrieve data in a filtered manner – query and scan. They both have their use cases, so let's look at which one to use when.

The query method is the better performer compared to the scan method. The reason lies in the way DynamoDB works under the hood: query looks up items by the passed Partition Key and only reads the results that match. This gives the query operation a big advantage in both execution time and cost.
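
For contrast, here is what a minimal query call could look like – a sketch using the AWS SDK v2 DocumentClient, assuming a table whose Partition Key is the top-level "apartment_id" attribute from the example above (key conditions can only target key attributes):

import { DynamoDB } from "aws-sdk";

const documentClient = new DynamoDB.DocumentClient();

// Query only reads the items that share the given partition key value
const queryOutput = await documentClient.query({
    TableName: "apartments_to_rent",
    KeyConditionExpression: "#id = :id",
    ExpressionAttributeNames: { "#id": "apartment_id" },
    ExpressionAttributeValues: { ":id": 666 }
}).promise();

console.log(queryOutput.Items);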

You can use the scan operation for smaller tables with a reasonable number of records. Keep this firmly in mind, because scan reads the full contents of the table first and only then starts filtering by your criteria, which will certainly impact both your bill and the execution time. Despite that, scan has some advantages over query, and one of them is filtering on nested data. Yes – query doesn't allow it. Query lets you search on flat values like strings, numbers, and booleans, but not on an object. Scan, on the other hand, doesn't even have a limit on the depth of that object; you just have to specify all of the object keys in the right order. On top of that, scan allows you to use the "contains" keyword to do a full-text-style search. Neat!

So sorry, but query is out of the question for this topic.

So how do we implement the search by the GSI that we created previously? Let's explain the params that have to be passed and how the method works. Example request:

// Filter on the nested "rentor.id" attribute; values are passed as plain JS values
const params: ScanInput = {
    TableName: "tableName",
    FilterExpression: "#rentor.#id = :rentorId",
    ExpressionAttributeNames: {
        "#rentor": "rentor",
        "#id": "id"
    },
    ExpressionAttributeValues: {
        ":rentorId": "abcd-efgh-abcd-efgh" // UUID id from the database
    }
};

// DynamoDB() is assumed to return a configured DynamoDB document client
const scanOutput = await DynamoDB().scan(params).promise();
return scanOutput.Items;

So what just happened? First, we declared a variable named params, in which we constructed our input parameters. We started off with the table we will be searching, and after that we created our filter expression. The placeholders in the filter string are bound via the "ExpressionAttributeNames" and "ExpressionAttributeValues" properties. As you can see, in "ExpressionAttributeNames" we added two key-value pairs: first the column name and then the nested "id" of the object. The "ExpressionAttributeValues" hold the values we passed in our API request for the data.

After we have built the requirements for the filtering, we just need to pass them to DynamoDB's scan method and return the "Items" it has found.

This method comes with its pros and cons. Yes, we can filter by nested object attributes, but we have to be careful about the size of the table we want to filter. On bigger tables, a better solution is to create a separate top-level column in the DB holding the id and query on it, instead of using the scan technique we just covered – something like the sketch below.
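
As a rough sketch of that alternative – assuming a flat rentor_id attribute backed by a hypothetical GSI named "index-rentor-id" – the lookup becomes a cheap query instead of a full table scan:

// Both the attribute and the index name below are hypothetical – adapt them to your table
const queryOutput = await new DynamoDB.DocumentClient().query({
    TableName: "apartments_to_rent",
    IndexName: "index-rentor-id",
    KeyConditionExpression: "#rentorId = :rentorId",
    ExpressionAttributeNames: { "#rentorId": "rentor_id" },
    ExpressionAttributeValues: { ":rentorId": "abcd-efgh-abcd-efgh" }
}).promise();

return queryOutput.Items;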

Hope this helps you on your quest to greatness!

Automatically delete an S3 bucket with the AWS CDK stack

AWS CDK is the latest Infrastructure-as-Code tool, made by AWS itself. It makes it super easy to deploy the various pieces of infrastructure that your application needs. However, it has a hard time cleaning up stale S3 buckets when you no longer need them.

The basic mechanism for creating an S3 bucket as part of CDK stack is this:

const bucket = new Bucket(this, 'my-data-bucket');

However, try deleting the CDK stack using cdk destroy later. You will see that the CloudFormation stack is deleted but the S3 bucket remains. Why?


By default, the Bucket construct that comes from the S3 package has its removalPolicy prop set to cdk.RemovalPolicy.RETAIN. This makes sense and is a meaningful default for a lot of use cases, where you wouldn't want important data (e.g. user avatars or file uploads) to disappear with a simple command.

However, even if we set removalPolicy: cdk.RemovalPolicy.DESTROY, the stack removal fails with an error saying the bucket is not empty (CloudFormation does not delete buckets that still contain objects).
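
For reference, the attempt looks roughly like this (a sketch based on the CDK v1 packages this article uses):

import * as cdk from '@aws-cdk/core';
import { Bucket } from '@aws-cdk/aws-s3';

// Inside your Stack class: the bucket is now deleted together with the stack,
// but CloudFormation will still refuse to delete it while it contains objects
const bucket = new Bucket(this, 'my-data-bucket', {
  removalPolicy: cdk.RemovalPolicy.DESTROY,
});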

We have a couple of solutions here, and picking the right one very much depends on your needs:

Option 1: Manually clean the bucket contents before destroying the stack

This is okay for most cases. You could do this from the AWS S3 user interface or through the command line, using the AWS CLI:

# Cleanup bucket contents without removing the bucket itself
aws s3 rm s3://bucket-name --recursive

Then cdk destroy should proceed without errors. However, this can quickly become a tedious activity if your stacks contain multiple S3 buckets, or if you use stacks as temporary resources (e.g. you programmatically deploy a CDK stack for every client of your platform). In any case, some automation would help, which brings us to the next option.

Option 2: Automatically clear bucket contents and delete the bucket

An interesting third-party package called @mobileposse/auto-delete-bucket comes to the rescue. It provides a custom CDK construct that wraps the standard S3 construct and internally uses the CloudFormation Custom Resources framework to trigger an automated cleanup of the bucket contents when a stack destroy is triggered. The usage is pretty trivial.

Install the package:

npm i @mobileposse/auto-delete-bucket

Use the new CDK construct instead of the standard one:

import { AutoDeleteBucket } from '@mobileposse/auto-delete-bucket'

const bucket = new AutoDeleteBucket(this, 'my-data-bucket')

Enjoy.


Need AWS CDK consulting? We are here to help. Drop us a line.

Provisioning an AWS ECS cluster using Terraform

Lessons learned while automating the infrastructure provisioning of an ECS cluster of EC2 virtual machines that run Docker and scale with your apps – using Terraform as the infrastructure orchestration tool.

What is AWS, Docker, ECS, Terraform?

Amazon Web Services is the obvious choice when it comes to deploying apps on the cloud.

Its staggering 47% market share speaks for itself.

Docker, a containerization tool, has also been around for a while and the DevOps community has seen its potential, responding with rapid adoption and community support.

Amazon also saw this potential and created an innovative solution for deploying and managing a fleet of virtual machines – AWS ECS. Under the hood, ECS utilizes AWS's well-known EC2 virtual machines, as well as CloudWatch for monitoring them, auto scaling groups (for provisioning and deprovisioning machines depending on the current load of the cluster), and most importantly – Docker as the containerization engine.

Terraform is an infrastructure orchestration tool (also known as “infrastructure as code”). Using Terraform, you declare every single piece of your infrastructure once, in static files, allowing you to deploy and destroy cloud infrastructure easily (through a single command), make incremental changes to the infrastructure, do rollbacks, infrastructure versioning, etc.

The goal of this article is to show you how to write the Terraform "recipes" that define an AWS ECS cluster as Terraform code, so that you can deploy or redeploy this cluster in a repeatable, predictable, scalable, and error-free way. A brief description is included for every step involved.

Preparation

We will not go into much detail about how to download Terraform or how to run it locally, because such information is readily available on their website.

What you will need, though, is an AWS user with administrator-level privileges, plus the access key and secret key of that user (generated from the AWS IAM console). Configuring the AWS provider for Terraform is also a requirement; it is relatively easy and likewise out of scope for this document.

Terraform structure

ecs-cluster.tf

We'll start by creating the AWS ECS cluster, which is the most basic building block of the AWS ECS service. It has no dependencies (e.g. it doesn't need a VPC), so we just give it a name that comes from a Terraform variable we'll pass in during the creation of the infrastructure. This parameterization allows us to easily create multiple ECS clusters (and their satellite resources) from the same set of Terraform files, if needed.

resource "aws_ecs_cluster" "ecs_cluster" {
  name = var.cluster_name
}

vpc.tf

For this tutorial, we’ll assume you want to create a brand new AWS VPC within the current region. This VPC will contain the EC2 instances that are launched within the ECS cluster and will allow them to communicate securely and privately, without resorting to the public internet and public IPs (as a matter of fact, the EC2 instances could be entirely hidden from the internet). Let’s create a new VPC now:

module "vpc" {
  source = "terraform-aws-modules/vpc/aws"

  name = "VPC of cluster ${var.cluster_name}"
  cidr = "10.0.0.0/16"

  azs = [
    data.aws_availability_zones.available.names[0],
    data.aws_availability_zones.available.names[1],
    data.aws_availability_zones.available.names[2]
  ]
  private_subnets = [
    "10.0.1.0/24",
    "10.0.2.0/24",
  "10.0.3.0/24"]
  public_subnets = [
    "10.0.101.0/24",
    "10.0.102.0/24",
  "10.0.103.0/24"]

  # If you want the EC2 instances to live within the private subnets but still be able to communicate with the public internet
  enable_nat_gateway = true
  # Save some money but less resilient to AZ failures
  single_nat_gateway = true
}

We’ll refer to the above new VPC and its private/public subnets in other resources below.

data.tf

You may have noticed that the above VPC resource has a hardcoded part (the subnet ranges) and a dynamic part (the three AZs, or Availability Zones). The dynamic part is useful because it allows you to redeploy the ECS cluster to different AWS regions without having to change hardcoded values every time. Another advantage of this dynamic lookup is that your deployments will adapt seamlessly to future changes Amazon makes to AZs (should they decide to bring down an entire Availability Zone or add new ones). The data block below "retrieves" the up-to-date AZs of the current AWS region during deployment. This data source is used in the VPC creation above.

data "aws_availability_zones" "available" {
  state = "available"
}

Another data source we will need later is one that helps us get the most up-to-date AWS EC2 AMI that is ECS optimized. An AMI is nothing more than a codename (e.g. "ami-1234567") that identifies a template you can use to jump-start a brand new EC2 instance. There are AMIs for the popular Linux distributions: Ubuntu, Debian, etc. The one we retrieve below is a Linux-based AMI that is created and maintained by Amazon and includes the essential tools for an EC2 instance to work as an ECS instance (Docker, Git, the ECS agent, SSH).

data "aws_ami" "ecs" {
  most_recent = true # get the latest version

  filter {
    name = "name"
    values = [
      "amzn2-ami-ecs-*"] # ECS optimized image
  }

  filter {
    name = "virtualization-type"
    values = [
      "hvm"]
  }

  owners = [
    "amazon" # Only official images
  ]
}

autoscaling_groups.tf

Now comes the interesting part. Using AWS autoscaling groups, we can automate the launch of EC2 instances when the load of the ECS cluster reaches a certain threshold (e.g. the cluster has 70%+ of its RAM reserved).

First, we create an autoscaling group that defines the minimum, maximum, and desired EC2 instance counts. These parameters guarantee a minimum of available resources during periods of low load, while keeping costs under control during periods of high load and unexpected traffic spikes.

resource "aws_autoscaling_group" "ecs_cluster_spot" {
  name_prefix = "${var.cluster_name}_asg_spot_"
  termination_policies = [
     "OldestInstance" # When a “scale down” event occurs, which instances to kill first?
  ]
  default_cooldown          = 30
  health_check_grace_period = 30
  max_size                  = var.max_spot_instances
  min_size                  = var.min_spot_instances
  desired_capacity          = var.min_spot_instances

  # Use this launch configuration to define “how” the EC2 instances are to be launched
  launch_configuration      = aws_launch_configuration.ecs_config_launch_config_spot.name

  lifecycle {
    create_before_destroy = true
  }

  # Refer to vpc.tf for more information
  # You could use the private subnets here instead,
  # if you want the EC2 instances to be hidden from the internet
  vpc_zone_identifier = module.vpc.public_subnets

  tags = [
    {
      key                 = "Name"
      value               = var.cluster_name,

      # Make sure EC2 instances are tagged with this tag as well
      propagate_at_launch = true
    }
  ]
}

# Attach an autoscaling policy to the spot cluster to target 70% MemoryReservation on the ECS cluster.
resource "aws_autoscaling_policy" "ecs_cluster_scale_policy" {
  name = "${var.cluster_name}_ecs_cluster_spot_scale_policy"
  policy_type = "TargetTrackingScaling"
  adjustment_type = "ChangeInCapacity"
  lifecycle {
    ignore_changes = [
      adjustment_type
    ]
  }
  autoscaling_group_name = aws_autoscaling_group.ecs_cluster_spot.name

  target_tracking_configuration {
    customized_metric_specification {
      metric_dimension {
        name = "ClusterName"
        value = var.cluster_name
      }
      metric_name = "MemoryReservation"
      namespace = "AWS/ECS"
      statistic = "Average"
    }
    target_value = 70.0
  }
}

The above basically automates the following:

  • Whenever the cluster's memory reservation is below the 70% target, the autoscaling policy scales the group back in, but never below “var.min_spot_instances” instances
  • As soon as the cluster hits 70% memory reservation or more, the autoscaling policy kicks in and launches new EC2 instances inside the cluster (using the “aws_launch_configuration.ecs_config_launch_config_spot” launch configuration), re-evaluating every 30 seconds (default_cooldown = 30) until the criteria is no longer met (the cluster is back below 70% memory reservation), and never exceeding “var.max_spot_instances” instances

launch_configuration.tf

We saw above how the scaling of the cluster is automated, but we still haven't defined what type of EC2 instances will be launched when scaling occurs: which AMI they will use, whether they will be larger instances (costing more per month) or smaller ones (costing less), whether they will assume an IAM role (to be able to access other AWS resources on your behalf), etc.

The launch configuration defines all of these parameters:

resource "aws_launch_configuration" "ecs_config_launch_config_spot" {
  name_prefix                 = "${var.cluster_name}_ecs_cluster_spot"
  image_id                    = data.aws_ami.ecs.id # Use the latest ECS optimized AMI
  instance_type               = var.instance_type_spot # e.g. t3a.medium

  # e.g. “0.013”, which represents the most you are willing to pay (per hour) for every instance
  # See the EC2 Spot Pricing page for more information:
  # https://aws.amazon.com/ec2/spot/pricing/
  spot_price                  = var.spot_bid_price

  enable_monitoring           = true
  associate_public_ip_address = true
  lifecycle {
    create_before_destroy = true
  }

  # This user data represents a collection of “scripts” that will be executed the first time the machine starts.
  # This specific example makes sure the EC2 instance is automatically attached to the ECS cluster that we created earlier
  # and marks the instance as purchased through the Spot pricing
  user_data = <<EOF
#!/bin/bash
echo ECS_CLUSTER=${var.cluster_name} >> /etc/ecs/ecs.config
echo ECS_INSTANCE_ATTRIBUTES={\"purchase-option\":\"spot\"} >> /etc/ecs/ecs.config
EOF

  # We’ll see security groups later
  security_groups = [
    aws_security_group.sg_for_ec2_instances.id
  ]

  # If you want to SSH into the instance and manage it directly:
  # 1. Make sure this key exists in the AWS EC2 dashboard
  # 2. Make sure your local SSH agent has it loaded
  # 3. Make sure the EC2 instances are launched within a public subnet (are accessible from the internet)
  key_name             = var.ssh_key_name

  # Allow the EC2 instances to access AWS resources on your behalf, using this instance profile and the permissions defined there
  iam_instance_profile = aws_iam_instance_profile.ec2_iam_instance_profile.arn
}

security_groups.tf

AWS is big on security, and almost every resource you create is locked down from outside access by default. The same goes for EC2 instances. If you want these instances to receive any internet traffic (e.g. they run an HTTP server), or if you want to SSH into the machines from your computer over the public internet, you need to make sure the Security Group attached to the EC2 instances allows all this:

# Allow EC2 instances to receive HTTP/HTTPS/SSH traffic IN and any traffic OUT
resource "aws_security_group" "sg_for_ec2_instances" {
  name_prefix = "${var.cluster_name}_sg_for_ec2_instances_"
  description = "Security group for EC2 instances within the cluster"
  vpc_id      = module.vpc.vpc_id # The VPC created in vpc.tf
  lifecycle {
    create_before_destroy = true
  }
  tags = {
    Name = var.cluster_name
  }
}

resource "aws_security_group_rule" "allow_ssh" {
  type      = "ingress"
  from_port = 22
  to_port   = 22
  protocol  = "tcp"
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
}
resource "aws_security_group_rule" "allow_http_in" {
  from_port         = 80
  protocol          = "tcp"
  security_group_id = aws_security_group.sg_for_ec2_instances.id
  to_port           = 80
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  type = "ingress"
}

resource "aws_security_group_rule" "allow_https_in" {
  protocol  = "tcp"
  from_port = 443
  to_port   = 443
  cidr_blocks = [
    "0.0.0.0/0"
  ]
  security_group_id = aws_security_group.sg_for_ec2_instances.id
  type              = "ingress"
}
resource "aws_security_group_rule" "allow_egress_all" {
  security_group_id = aws_security_group.sg_for_ec2_instances.id
  type              = "egress"
  from_port         = 0
  to_port           = 0
  protocol          = "-1"
  cidr_blocks = [
  "0.0.0.0/0"]
}

Of course, if you plan on launching something fancy like MySQL within the EC2 instances, you may want to expose other port ranges as well (e.g. 3306 for a MySQL server). Feel free to play around and add new security group rules to the above security group as needed.

variables.tf

This file defines all the variables that you will pass in while creating the infrastructure:

variable "cluster_name" {
  description = "The name to use to create the cluster and the resources. Only alphanumeric characters and dash allowed (e.g. 'my-cluster')"
}
variable "ssh_key_name" {
  description = "SSH key to use to enter and manage the EC2 instances within the cluster. Optional"
  default     = ""
}
variable "instance_type_spot" {
  default = "t3a.medium"
}
variable "spot_bid_price" {
  default = "0.0113"
 description “How much you are willing to pay as an hourly rate for an EC2 instance, in USD”
}
variable "min_spot_instances" {
  default     = "1"
  description = "The minimum EC2 spot instances to have available within the cluster when the cluster receives less traffic"
}
variable "max_spot_instances" {
  default     = "5"
  description = "The maximum EC2 spot instances that can be launched during period of high traffic"
}

Still having trouble?

We are available for Terraform consulting. Get in touch.


Conclusion

We've published all of the above as a Terraform module on GitHub that you can easily drop into your project.

Now that you are ready to launch your AWS ECS cluster using Terraform, what will you build next? Let us know in the comments.