Zero-Downtime Deployments with Ansible and EC2? Yes, please!

02 April 2023

Deploying with Ansible is easy: copy the files over, restart the services, and the new version of the app is live. But a restart can drop connections for any user who was waiting for the server's response at that moment. What to do then?

Base infrastructure

An obvious answer is to load balance the traffic between two instances. Let's model our infrastructure. On a diagram it will look like this:

Infrastructure diagram

The bottom part of the diagram can be written in Terraform, roughly like this:

resource "aws_instance" "app" {
  count = 2
  ami =   # Amazon Linux 2 for ARM
  instance_type = "t4g.nano"
  key_name = "app-key"
}

# This will be attached to our ELB listener on some path like /app
resource "aws_lb_target_group" "app" {
    name = "Apps"
    port = 8080
    protocol = "HTTP"
    vpc_id =     # default VPC
    health_check {
        path = "/"
        port = "8080"
        protocol = "HTTP"
        interval = 10                   # Every 10 seconds
    }
}

resource "aws_lb_target_group_attachment" "apps" {
    count = 2
    target_group_arn =
    target_id =[count.index].id
}

The whole Terraform configuration can be found in the repository here.

Deployment in practice

Just deploying to both instances in parallel can make the app unavailable for a second or two, because the load balancer is not aware that we are changing anything on the targets. We could deploy to one instance at a time with the serial keyword in Ansible and a short sleep between hosts, so that only one instance fails its health check while its app is down. This works to some extent, once the target group notices the unhealthy instance and stops sending traffic to it. However, the load balancer can still direct users to our instance just before we shut the service down.

Ideally we want to deregister a target before stopping the app and updating the code, and register it again afterwards. So the play in practice would have the following steps:

  1. Deregister App-1 from the target group.
  2. Wait until App-1 is drained (unused).
  3. Update the code and restart/reload the service on App-1.
  4. Register App-1 to the target group.
  5. Wait until registration is done and health check passes.
  6. Repeat steps 1-5 for App-2.
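
The ordering of these steps can be sketched outside Ansible as a small Python function. This is only an illustration of the control flow: `rolling_deploy` and the five callables it takes are hypothetical names, not part of the playbook or of any library.

```python
from typing import Callable, Iterable, List

Step = Callable[[str], None]


def rolling_deploy(
    targets: Iterable[str],
    deregister: Step,
    wait_drained: Step,
    update: Step,
    register: Step,
    wait_healthy: Step,
) -> List[str]:
    """Run steps 1-5 for each target, strictly one target at a time."""
    finished = []
    for target in targets:
        deregister(target)    # step 1: take the target out of the target group
        wait_drained(target)  # step 2: block until in-flight requests are done
        update(target)        # step 3: deploy the code and restart the service
        register(target)      # step 4: put the target back
        wait_healthy(target)  # step 5: block until the health check passes
        finished.append(target)
    return finished
```

Because every step blocks before the next one starts, at most one instance is out of rotation at any time, which is the same guarantee `serial: 1` gives us in the playbook.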

Our main playbook will look like this:

- hosts: apps
  become: yes
  serial: 1

  vars:
    deploy_dir: /opt/app

  tasks:
    - import_tasks: tasks/deregister.yml
    - import_tasks: tasks/deploy.yml
    - import_tasks: tasks/reload-service.yml
    - import_tasks: tasks/register.yml

So how do we implement these steps? Deploying the app and reloading the service will be very specific to the app we want to update. But deregistering and registering the instance should look similar for any solution on EC2.

Although Ansible has modules for interacting with AWS, there is no built-in one for EC2 target groups. For that we can use an AWS SDK, such as boto3 for Python, or just the AWS CLI. We could also send raw HTTP requests, but that is more complex and requires more work than using a library.
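
For the boto3 route, a minimal sketch of deregistering and waiting for the drain could look like the code below. It assumes `elbv2` is a client created with `boto3.client("elbv2")`; the helper names, the default of 10 attempts and the injectable `sleep` are my own choices for illustration. The client is passed in as an argument so the functions can be exercised without AWS access.

```python
import time
from typing import Any, Callable


def target_state(elbv2: Any, target_group_arn: str, instance_id: str) -> str:
    """Return the target's state, e.g. 'healthy', 'draining' or 'unused'."""
    response = elbv2.describe_target_health(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id}],
    )
    return response["TargetHealthDescriptions"][0]["TargetHealth"]["State"]


def deregister_and_drain(
    elbv2: Any,
    target_group_arn: str,
    instance_id: str,
    attempts: int = 10,
    delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Deregister the instance, then poll until it is reported as 'unused'.

    Returns True once the target is drained, False if we ran out of attempts.
    """
    elbv2.deregister_targets(
        TargetGroupArn=target_group_arn,
        Targets=[{"Id": instance_id}],
    )
    for attempt in range(attempts):
        if target_state(elbv2, target_group_arn, instance_id) == "unused":
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no point sleeping after the final check
    return False
```

On an instance this would be called roughly as `deregister_and_drain(boto3.client("elbv2", region_name="eu-central-1"), target_group_arn, instance_id)`, with both values supplied by you.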

Deregistering and registering

For EC2 we first have to set up a role with an attached policy that allows an instance to be deregistered from and registered to a target group. It's also possible to do this from the host running the playbook using connection: local, but let's keep the structure simple and let our instances register and deregister themselves, much like microservices do. The policy will look like this:

resource "aws_iam_policy" "app-target-group-policy" {
    name = "app-target-group-policy"
    policy = <<-EOF
        "Version": "2012-10-17",
        "Statement": [
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                "Resource": "${aws_alb_target_group.apps.arn}"
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": [
                "Resource": "*"

Here the Resource of the first statement is the ARN of our target group from above. This is just the policy; see this file in the repository for the full IAM role recipe.
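
The action lists are not visible in the snippet above, so here is a hedged sketch of what such a policy document could contain, built in Python so it is easy to inspect. The three `elasticloadbalancing:` actions are my assumption, chosen to match the CLI calls used later (`register-targets`, `deregister-targets`, `describe-target-health`); the repository's policy may list different ones. `target_group_policy` is a hypothetical helper name.

```python
import json


def target_group_policy(target_group_arn: str) -> str:
    """Build a policy document as a JSON string (the actions are assumptions)."""
    document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Changing membership can be scoped to a single target group.
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": [
                    "elasticloadbalancing:RegisterTargets",
                    "elasticloadbalancing:DeregisterTargets",
                ],
                "Resource": target_group_arn,
            },
            {
                # Reading target health is granted on all resources here.
                "Sid": "VisualEditor1",
                "Effect": "Allow",
                "Action": ["elasticloadbalancing:DescribeTargetHealth"],
                "Resource": "*",
            },
        ],
    }
    return json.dumps(document, indent=2)
```

Printing the result for your own target group ARN gives a document you can compare against the one Terraform renders.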

Once we have a role with the policy attached, we can test it on our instances by installing the AWS CLI and running the following command:

$ yum install awscli   # if you use AmazonLinux2, you can also use any other SDK
$ aws sts get-caller-identity
{
    "Account": "999901234567",
    "UserId": "AROA6RABCDEFGHJKL1234:i-0abcde12345678901",
    "Arn": "arn:aws:sts::999901234567:assumed-role/app-role/i-0abcde12345678901"
}

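Once we have the permissions to modify target groups from within our instance, let's write tasks that will:

  1. Get an IMDSv2 token from the instance metadata service.
  2. Use the token to read the instance ID.
  3. Deregister the instance from the target group.
  4. Wait until the target becomes unused.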

This will be our tasks/deregister.yml file:

- name: Get IMDSv2 token
  uri:
    # Standard IMDSv2 token endpoint
    url: http://169.254.169.254/latest/api/token
    headers:
      X-aws-ec2-metadata-token-ttl-seconds: 21600
    method: PUT
    return_content: yes
  register: token

- name: Get instance ID
  uri:
    # Standard instance metadata path for the instance ID
    url: http://169.254.169.254/latest/meta-data/instance-id
    headers:
      X-aws-ec2-metadata-token: "{{ token.content }}"
    method: GET
    return_content: yes
  register: instance_id

- name: Deregister instance from ELB
  command:
    cmd: >-
      aws elbv2 deregister-targets --region=eu-central-1
      --target-group-arn "{{ target_group_arn }}"
      --targets "Id={{ instance_id.content }}"

- name: Wait for target to become unused
  command:
    cmd: >-
      aws elbv2 describe-target-health --region=eu-central-1
      --target-group-arn "{{ target_group_arn }}"
      --targets "Id={{ instance_id.content }}"
  register: health
  until: "( health.stdout | from_json ).TargetHealthDescriptions[0].TargetHealth.State == 'unused'"
  changed_when: false
  retries: 10
  delay: 10

The last task will wait up to 100 seconds (10 retries with a 10-second delay) for the instance to become unused. We can control how long draining takes on the AWS side by adding an argument to the aws_lb_target_group resource in Terraform:

resource "aws_lb_target_group" "app" {
    name = "Apps"
    port = 8080
    deregistration_delay = 60   # This will make the target drain for 60 seconds
}

Be sure to adapt this value to your use case. It should be at least as long as the longest request your app can serve before timing out.
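
One thing worth checking is that the polling window in the wait task is not shorter than the drain time, otherwise the task gives up while the target is still draining. A tiny sketch of that arithmetic, with hypothetical helper names:

```python
def polling_budget(retries: int, delay: int) -> int:
    """Approximate number of seconds the `until` loop keeps polling."""
    return retries * delay


def covers_drain(retries: int, delay: int, deregistration_delay: int) -> bool:
    """True when the polling window is at least as long as the drain time."""
    return polling_budget(retries, delay) >= deregistration_delay


# Values used in this post: 10 retries, 10 s delay, 60 s drain.
print(covers_drain(10, 10, 60))   # True
# With the AWS default deregistration delay of 300 s the same task is too short.
print(covers_drain(10, 10, 300))  # False
```

So if you leave deregistration_delay at its default, raise retries or delay in the wait task accordingly.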

Registration will look almost the same. We already have the instance ID, so we can skip the first two tasks. We will just use aws elbv2 register-targets and wait until .TargetHealth.State is equal to healthy. How quickly the instance reaches this status is controlled by the health_check block parameters. The complete file with these changes is available here.
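
If you go with boto3 instead of the CLI, the registration half can lean on the SDK's waiter rather than a hand-written loop. The sketch below assumes `elbv2` is a `boto3.client("elbv2")` and uses its `target_in_service` waiter; `register_and_wait` and its defaults are hypothetical and simply mirror the retries and delay from the tasks above.

```python
from typing import Any


def register_and_wait(
    elbv2: Any,
    target_group_arn: str,
    instance_id: str,
    delay: int = 10,
    max_attempts: int = 10,
) -> None:
    """Register the instance and block until the waiter sees it in service.

    If the target does not become healthy in time, the waiter raises an
    exception, which we let propagate so the deployment stops there.
    """
    targets = [{"Id": instance_id}]
    elbv2.register_targets(TargetGroupArn=target_group_arn, Targets=targets)
    waiter = elbv2.get_waiter("target_in_service")
    waiter.wait(
        TargetGroupArn=target_group_arn,
        Targets=targets,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Letting the exception bubble up is a deliberate choice here: with `serial: 1`, a failure on the first instance should stop the rollout before the second one is touched.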

End Result

We can test how effective our deployment is by running the playbook alongside a loop that will send requests constantly to our app.

$ for i in {1..10000}; do\
 curl; echo; sleep 0.1;\
 done
 # If you don't have a domain, use the load balancer's DNS name with the -k flag in curl

We can also observe the graphs in the AWS console. They show that some instances become unhealthy, but there are no errors in the curl loop. Observe also how the reported hostname stays constant while one of the instances is deregistered and then, when Ansible starts deploying to the other one, switches to the second instance.

If we remove the deregistration and registration routines, we get a lot of 502 responses before the load balancer reacts to the failing health check.
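
To count those responses instead of eyeballing the curl output, the probe loop can be sketched in Python with only the standard library. The function names are hypothetical, and the URL you pass in is whatever address your load balancer serves the app on.

```python
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if the connection itself failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses, including 502
    except OSError:
        return 0  # refused, reset, timed out, DNS failure


def probe(
    url: str,
    count: int,
    interval: float = 0.1,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> Counter:
    """Send `count` requests and tally the status codes seen."""
    statuses: Counter = Counter()
    for _ in range(count):
        statuses[fetch(url)] += 1
        sleep(interval)
    return statuses
```

Run it next to the playbook, for example `probe(url, 10000)`, and compare the tallies with and without the deregistration tasks: a clean run should contain nothing but 200.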