pabis.eu

Speeding up ECS containers with SOCI

04 December 2024

Waiting for your deployment to finish. What a painful experience. You have to wait for the build. You have to wait for the tests. And then you have to wait for the startup. Ten minutes later, the new task definition is up and running. Then you realize that you made a mistake in the configuration and it was not covered by the health checks. So you rush to roll back the image, and that takes even more time.

Usually, starting a new container on ECS takes quite long, especially if you have a lot of big layers (such as Datadog plugins 😅). Recently, a colleague of mine sent me a link to an article about SOCI containers on AWS. Seekable image sounds like some kind of tape you loaded back in the C64 days 📼. But according to the article, it speeds up the process of starting a new container. By how much? Today, we will find out!

To find complete code for this post, you can visit this GitHub repository: https://github.com/ppabis/ecs-soci-experiments/.

Base boilerplate

First, we need some basic infrastructure to perform our tests in. I generated (using Cursor and Sonnet) the following resources in Terraform:

Base infrastructure

The load balancer will be deployed into two public subnets and allow connections from the ECS security group. The ECS service will be deployed into two private subnets and reference the target group to register with the load balancer. As we have enabled DNS and hostname resolution in the VPC, the ECS service should automatically pick up all the VPC endpoints.

For the above infrastructure, the Terraform code is available in the repository on GitHub. For my own convenience I also created an EC2 instance to build and upload the images, but it's not necessary - the upload times are just much faster than using my home internet 😄. If you also want to use it, change count to 1 in the file bastion.tf and connect to the instance using Session Manager. Optionally, also specify an S3 bucket that you will use to share the scripts with the machine - again using the copy-resources.sh script.

Enter the infrastructure directory and deploy it using Terraform/OpenTofu. You can also change some of the variables such as aws_region and vpc_cidr.

$ cd infrastructure
$ tofu init
$ tofu apply
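
You can also pass the variables directly on the command line instead of editing the files - the values below are just examples:

$ tofu apply -var aws_region=eu-west-2 -var vpc_cidr=10.0.0.0/16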

Building the first image

For both images, we will use Nginx, which will simply serve files. Additionally, I asked Sonnet to create a simple CLI tool in Go that generates several types of hashes for a specified file. This procedure will simulate a boot sequence that might happen when you start your own application in a container.

The first image will have just one large file in a single layer. I used the Big Buck Bunny movie, which is around 700 MB. The Dockerfile below will first compile the Go application, copy it over to the target image, install curl (which will help us get some ECS metadata later on), copy the script that calls hasher for each found file, and finally copy the movie file. Check the recipe here: github.com/ppabis/ecs-soci-experiments/tree/main/image

# Build stage for Go application
FROM golang:1.21-alpine AS builder

WORKDIR /app
COPY hasher/ .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o hasher

FROM nginx:alpine-slim

COPY --from=builder /app/hasher /usr/local/bin/hasher
RUN chmod +x /usr/local/bin/hasher

RUN apk add --no-cache curl
COPY 60-generate-hashes.sh /docker-entrypoint.d/60-generate-hashes.sh
RUN chmod +x /docker-entrypoint.d/60-generate-hashes.sh

COPY *.mov /usr/share/nginx/html/

The script for generating hashes is also very simple. We just loop through all files found in /usr/share/nginx/html/ and call the hasher application on each of them. We store the hash output in a file in /tmp and copy it back to the Nginx root. At the end of the file, we append the task ARN and the current date. This will give us an estimate of how long it took from the creation of the task to finishing this startup sequence.

#!/bin/sh

# Process each file in the nginx html directory
for file in /usr/share/nginx/html/*; do
    if [ -f "$file" ]; then
        hasher "$file" >> "/tmp/hashes.txt"
        echo "" >> "/tmp/hashes.txt"
        echo "Generated hashes for $file"
    fi
done

# Get the container ARN "TaskARN":"arn:aws:ecs:eu-west-2:123456789012:task/demo-cluster/0123abcd0123abcd0123abcd"
curl -X GET ${ECS_CONTAINER_METADATA_URI_V4}/task | grep -oE "TaskARN\":\"arn:aws:ecs:${AWS_DEFAULT_REGION}:[0-9]+:task/[a-z0-9\\-]+/[a-z0-9]+" | cut -d':' -f 2- | tr -d '"' >> "/tmp/hashes.txt"
echo -n " Finished at " >> "/tmp/hashes.txt"
date >> "/tmp/hashes.txt"
echo "" >> "/tmp/hashes.txt"

# Move the hash files to nginx html directory
mv /tmp/hashes.txt /usr/share/nginx/html/

The image generated this way can be pushed to ECR and tagged appropriately. Avoid using latest, as some time ago ECS introduced a feature that prevents accidental updates of the image under the same tag. I just use the current date and time as the tag. To speed up the process even more, I created a script that downloads the movie, builds the image, saves the tag into infrastructure/tag.txt and uploads the image to ECR. The next step is to simply redeploy the infrastructure and generate the task definition.

#!/bin/bash

# You can alternatively use $(terraform output -raw) but this is easier to read
source infrastructure/outputs.env

ECR_REGION=$(echo $ECR_REPO_URL | cut -d '.' -f 4)

TAG=$(date "+%y%m%d%H%M")

# if the movie does not exist, download it
if [ ! -f "image/big_buck_bunny_1080p_h264.mov" ]; then
    wget https://download.blender.org/peach/bigbuckbunny_movies/big_buck_bunny_1080p_h264.mov -O image/big_buck_bunny_1080p_h264.mov
fi

aws ecr get-login-password --region $ECR_REGION | docker login --username AWS --password-stdin $ECR_REPO_URL
docker build -f image/Dockerfile.v1 -t $ECR_REPO_URL:$TAG image
docker push $ECR_REPO_URL:$TAG

echo "Pushed $ECR_REPO_URL:$TAG"

echo "$TAG" > infrastructure/tag.txt

Next, run the deployment of the infrastructure again so that the task definition loads tag.txt and switches to the new image. For now, the desired count of tasks will remain 0 - we will use Python to control the scaling later.

$ ./docker-build-v1.sh
$ cd infrastructure
$ tofu apply

Testing startup times

This script is a bit more complicated. We need to control the scaling of the ECS service and also wait for all the processes to complete. Let's define some basic functions that will be useful for us. Assuming that you already have the requests and boto3 Python packages installed, import the following modules.

import datetime, time, re, csv, requests, argparse, boto3

Next, we will load some arguments that we need in order to control scaling and read the boot time from the hashes.txt file. For that we will need three arguments: the ECS service ARN, the ECS cluster name and the ELB DNS name (the load balancer fronting the service). We will also define a global client for ECS.

parser = argparse.ArgumentParser(description="Test startup time of an ECS service")
parser.add_argument("--service-arn", help="ARN of the ECS service")
parser.add_argument("--cluster-name", help="Name of the ECS cluster")
parser.add_argument("--load-balancer-dns", help="DNS name of the load balancer")
args = parser.parse_args()

ECS = boto3.client("ecs")

The first function we will define is the one for scaling the ECS service. Once the service is scaled, we need to wait until its status changes to ACTIVE. We should keep some delay in the loop to not overwhelm the API. It works for both scaling up and down.

def scale_service(service_arn: str, cluster_name: str, desired_count: int):
    """Change desired tasks count of a given service and wait for completion."""
    status = "?"
    ECS.update_service(
        cluster=cluster_name,
        service=service_arn,
        desiredCount=desired_count,
    )
    time.sleep(5)

    # Wait for the service to scale
    while status != "ACTIVE":
        status = ECS.describe_services(
            cluster=cluster_name,
            services=[service_arn],
        )["services"][0]["status"]
        time.sleep(10)

The next thing is to define a function that will get the list of tasks assigned to the service. This seems trivial, but I found out the hard way that AWS likes to lag behind and you often get an empty result right after scaling up. For that reason I added retry logic to the function.

def get_tasks(service_arn: str, cluster_name: str) -> list[str]:
    """Get a list of task ARNs for a given ECS service."""
    response = ECS.list_tasks(
        cluster=cluster_name,
        serviceName=service_arn,
    )

    repeats = 8
    while len(response["taskArns"]) == 0 and repeats > 0:
        print(f"Weird, I can't see any tasks, {repeats} repeats left")
        time.sleep(5)
        response = ECS.list_tasks(
            cluster=cluster_name,
            serviceName=service_arn,
        )
        repeats -= 1

    return response["taskArns"]

The next function will get information about the tasks. We pass a list of task ARNs to it and it formats the interesting bits into something more practical than the raw AWS response. We calculate the start time, connectivity time and pull time from the response.

def get_task_details(cluster_name: str, tasks: list[str]) -> list[dict]:
    """Calculates times for the task and gives information about it."""
    response = ECS.describe_tasks(
        cluster=cluster_name,
        tasks=tasks,
    )

    filtered_tasks = []
    for task in response["tasks"]:
        filtered_task = {
            "connectivity": task.get("connectivity"),
            "lastStatus": task.get("lastStatus"),
            "startTime": (task.get("startedAt") - task.get("createdAt")).total_seconds() if task.get("startedAt") and task.get("createdAt") else None,
            "firstConnectionTime": (task.get("connectivityAt") - task.get("createdAt")).total_seconds() if task.get("connectivityAt") and task.get("createdAt") else None,
            "pullTime": (task.get("pullStoppedAt") - task.get("pullStartedAt")).total_seconds() if task.get("pullStoppedAt") and task.get("pullStartedAt") else None,
        }
        filtered_tasks.append(filtered_task)
    return filtered_tasks

Now, obviously not all of the tasks start immediately. What makes matters worse, not all tasks stop immediately either (from experience, the latter is way slower for no apparent reason). Before we do the actual measurements, we need all the tasks to be in the RUNNING state. When scaling down, we also want to make sure that no tasks are left running in the service. The following function will wait for all the tasks to be in the running or stopped states.

def wait_for_all_tasks(cluster_name: str, tasks: list[str]):
    while True:
        _tasks = get_task_details(cluster_name, tasks)
        # check if all lastStatus are "RUNNING"
        if all(task.get("lastStatus") == "RUNNING" or task.get("lastStatus") == "STOPPED" for task in _tasks):
            return tasks
        time.sleep(5)

Finally, we can finish the script that will:

- scale up the tasks to a desired number - I tested it 3 times with 1, 2, 3 and 8 tasks,
- wait for all the tasks to start,
- wait 45 seconds to check if the boot sequence is finished,
- save the recorded times,
- scale down the service to 0 and wait for it to finish.

def main():
    results = []

    for test_case in [1, 2, 3, 8] * 3:
        # Scaling up
        scale_service(args.service_arn, args.cluster_name, test_case)
        time.sleep(5)
        print(f"Service scaled to {test_case}")

        # All tasks are up
        tasks = get_tasks(args.service_arn, args.cluster_name)
        tasks = wait_for_all_tasks(args.cluster_name, tasks)
        print(f"All tasks are running")

        # Boot should be finished by then.
        time.sleep(45)

        # Get times.
        task_details = get_task_details(args.cluster_name, tasks)
        averageFirstConnectionTime = sum(task.get("firstConnectionTime") for task in task_details) / len(task_details)
        averageStartTime = sum(task.get("startTime") for task in task_details) / len(task_details)
        averagePullTime = sum(task.get("pullTime") for task in task_details) / len(task_details)

        results.append({
            "taskCount": test_case,
            "averageFirstConnectionTime": averageFirstConnectionTime,
            "averageStartTime": averageStartTime,
            "averagePullTime": averagePullTime,
        })

        # Scale to 0
        scale_service(args.service_arn, args.cluster_name, 0)
        time.sleep(15)
        print(f"Service scaled to 0")

        # Wait for the tasks to stop. Mind you, the list given here is the same
        # as when scaling up. It is important so that you wait for the correct
        # tasks to stop.
        tasks = wait_for_all_tasks(args.cluster_name, tasks)
        print(f"All tasks are stopped")

However, we are not finished yet. We still have to get the boot time directly from the script in the container to have a somewhat more realistic measurement of task readiness. This is a bit more tricky, as we need to parse the date and the task ARN - I used regular expressions for that. And because we sometimes start many tasks, we will sample several of the boot times (assuming the ELB routes us to different tasks) and return the average.

def sample_boot_time(load_balancer_dns: str, count: int):
    """Samples boot time by requesting the /hashes.txt file from the load balancer."""
    # The following pattern should match <task ARN> Finished at <date time>
    pattern = r'(arn:aws:ecs:.*:task.*)\s*\n?\s*Finished at \w{3} (\w{3}\s+\d+\s+\d+:\d+:\d+ UTC \d+)'
    results = []

    for i in range(count):
        response = requests.get(f"http://{load_balancer_dns}/hashes.txt")
        match = re.search(pattern, response.text)

        if match:
            task = match.group(1)
            date = match.group(2)
            # Get created time from ECS using task ARN
            created_time = ECS.describe_tasks(cluster = args.cluster_name, tasks=[task])['tasks'][0]['createdAt']
            boot_time = datetime.datetime.strptime(date, "%b %d %H:%M:%S %Z %Y").replace(tzinfo=datetime.timezone.utc)
            created_time = created_time.astimezone(datetime.timezone.utc)
            results.append((boot_time - created_time).total_seconds())
        else:
            print(f"No match found for {i}")

    return (sum(results) / len(results)) if len(results) > 0 else None

Add this function to the main test loop and append its result to results as well. At the end, we can finally save all the results to a CSV file.

...
        averagePullTime = sum(task.get("pullTime") for task in task_details) / len(task_details)
        averageBootTime = sample_boot_time(args.load_balancer_dns, test_case * 2)

        results.append({
            "taskCount": test_case,
            "averageFirstConnectionTime": averageFirstConnectionTime,
            "averageStartTime": averageStartTime,
            "averagePullTime": averagePullTime,
            "averageBootTime": averageBootTime,
        })
...
    with open("results.csv", "w") as f:
        writer = csv.DictWriter(f, fieldnames=["taskCount", "averageFirstConnectionTime", "averageStartTime", "averagePullTime", "averageBootTime"])
        writer.writeheader()
        writer.writerows(results)
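
With that in place (note that main() still needs to be called at the bottom of the script), the test can be started with the three arguments coming from the Terraform outputs. A sketch - the script file name and the output names are my assumptions, adjust them to your setup:

$ python3 test-startup.py \
    --service-arn $(tofu -chdir=infrastructure output -raw ecs_service_arn) \
    --cluster-name $(tofu -chdir=infrastructure output -raw ecs_cluster_name) \
    --load-balancer-dns $(tofu -chdir=infrastructure output -raw alb_dns_name)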

First results

I measured the boot times for this setup. The number of tasks does not change the times much - although the difference can be around 7%, it seems to be luck-dependent and can be treated as noise. For further comparisons I will simply take the average of all the times for each measurement type. First connection time, boot time and start time are measured from the moment the task was created in AWS. Pull time is measured independently.
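
Since I treat the per-count differences as noise, a quick way to get those averages is to aggregate the columns of results.csv produced above. A rough sketch (it ignores the fact that averageBootTime can be empty when sampling fails):

$ awk -F',' 'NR>1 {c+=$2; s+=$3; p+=$4; b+=$5; n++} END {printf "connection=%.1f start=%.1f pull=%.1f boot=%.1f\n", c/n, s/n, p/n, b/n}' results.csv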

Non-SOCI image, one big layer

Installing SOCI

I use commands for Amazon Linux 2023 - adapt them to your distribution. I've never tried installing this on Mac or Windows, so I can't tell how it works there. But what is SOCI anyway? It's a special manifest that indexes the layers of a container image so that the container runtime can detect which file you want to access and fetch its data on demand, instead of downloading everything up front. Compare this to the standard behavior, where all layers must be pulled completely before the container starts. SOCI provides the tooling you need to create these manifests. Installation is quite simple. Assuming that you have the containerd runtime installed (modern distributions install it alongside Docker), you can do the following:

mkdir -p ~/soci && cd ~/soci
wget https://github.com/awslabs/soci-snapshotter/releases/download/v0.8.0/soci-snapshotter-0.8.0-linux-arm64.tar.gz
tar -xf soci-snapshotter-0.8.0-linux-arm64.tar.gz
sudo mv soci /usr/local/bin/
sudo mv soci-snapshotter-grpc /usr/local/bin/
sudo chmod +x /usr/local/bin/soci
sudo chmod +x /usr/local/bin/soci-snapshotter-grpc

You should now have the soci command available. But this is not all. In the next step, install the soci-snapshotter service that will run as a daemon. Download the following systemd unit file and activate it.

wget https://github.com/awslabs/soci-snapshotter/raw/refs/heads/main/soci-snapshotter.service
sudo mv soci-snapshotter.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now soci-snapshotter
sudo systemctl status soci-snapshotter

If you install Docker on the latest Amazon Linux, it is very likely that the SOCI plugin is already included, but still make sure that you have the following lines in your containerd config at /etc/containerd/config.toml:

# ...
[proxy_plugins]
  [proxy_plugins.soci]
    type = "snapshot"
    address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
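
If you had to add this section yourself, restart containerd so that the proxy plugin is picked up:

$ sudo systemctl restart containerd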

Obviously, if you don't have this file, you might not even have containerd installed. In that case, proceed with the following commands:

$ sudo yum install docker
$ sudo systemctl enable --now docker
$ sudo systemctl enable --now containerd
$ sudo ctr plugins ls | grep soci

Generating an index

Now, we use Docker to build the image, but for SOCI we need the image in the containerd store (Docker and ctr use different places to store images, even though they are compatible in practice). There are three possibilities to do that: export the image with docker save and import it with ctr image import, switch Docker to the containerd image store, or pull the image we already pushed to ECR back with ctr.

I decided to go with the last option because it is easy and foolproof, with no need to store an additional tar. Authentication for ECR is a bit different, because ctr does not store credentials the way Docker does, so you have to specify them as an argument.

$ ECR_PASSWORD=$(aws ecr get-login-password --region eu-west-2)
$ sudo ctr image pull --user "AWS:$ECR_PASSWORD" 123456789012.dkr.ecr.eu-west-2.amazonaws.com/sample/nginx:2412011229
$ sudo ctr i ls
REF                                                                  TYPE                                       DIGEST                                                                  SIZE      PLATFORMS   LABELS
12345689012.dkr.ecr.eu-west-2.amazonaws.com/sample/nginx:2412011229 application/vnd.oci.image.manifest.v1+json sha256:f2cf117facb0075a6977f463923baea68305a6913f0a41a8b672c4822813ade8 720.2 MiB linux/arm64 -

Now we can create the index. Normally, the minimum layer size for indexing is 10 MB, but I decided to lower it to 2 MB. Once the image is in the local containerd store, just run the following command - and three more for verification.

$ sudo soci create --min-layer-size 2097152 12345689012.dkr.ecr.eu-west-2.amazonaws.com/sample/nginx:2412011229
$ sudo soci index list
DIGEST                                                                     SIZE    IMAGE REF                                                               PLATFORM          MEDIA TYPE                                    CREATED      
sha256:342219668da34dcc20c328e1f21e9d27f57dc5598c2a95a29718361cc2bbda80    1868    12345689012.dkr.ecr.eu-west-2.amazonaws.com/sample/nginx:2412011229     linux/arm64/v8    application/vnd.oci.image.manifest.v1+json    5s ago
$ sudo soci ztoc list # pick a random digest for the below command
$ sudo soci ztoc info sha256:104e0df268aa5c96f12e86d44bb59b537ebe1d50a0be15adf125dd9b38ef07a3

The last step is to push this index to ECR. This is also very straightforward, and you can do it directly from soci. After that, in AWS you should see both the image and the SOCI index. The image itself doesn't change - only a new artifact is uploaded to ECR without a tag.

$ ECR_PASSWORD=$(aws ecr get-login-password --region eu-west-2)
$ sudo soci push --user AWS:$ECR_PASSWORD 12345689012.dkr.ecr.eu-west-2.amazonaws.com/sample/nginx:2412011229

Soci index appeared
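
You can also verify it from the CLI - the repository should now contain an additional untagged entry for the index (repository name and region taken from the example above):

$ aws ecr list-images --repository-name sample/nginx --region eu-west-2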

Startup times with SOCI

The times improved significantly. The connection time did not change much but pull time is down 80%, start time 40% and boot time 24%.

Soci image times

Soci vs Non-Soci

However, as we have one large file in the image that we access at boot time, what if we only needed to access a small part of the large layers and could leave the others at rest, ready to be accessed when needed? I created another Dockerfile that contains some random Parquet files from Hugging Face. I will only hash the test set and leave the training data alone. What is more, I will make sure that Docker does not optimize the layers in any way by running a stat command on each file and saving the output.

# Build stage for Go application
FROM golang:1.21-alpine AS builder

WORKDIR /app
COPY hasher/ .
RUN CGO_ENABLED=0 GOOS=linux GOARCH=arm64 go build -o hasher

FROM nginx:alpine-slim

COPY --from=builder /app/hasher /usr/local/bin/hasher
RUN chmod +x /usr/local/bin/hasher

RUN apk add --no-cache curl

COPY 62-generate-one-hash.sh /docker-entrypoint.d/62-generate-one-hash.sh
COPY default.conf /etc/nginx/conf.d/default.conf

RUN chmod +x /docker-entrypoint.d/62-generate-one-hash.sh

# Create a lot of layers to fragment the image a bit
COPY test-00000-of-00001.parquet /usr/share/nginx/html/
RUN stat /usr/share/nginx/html/test-00000-of-00001.parquet > /usr/share/nginx/html/test-00000-of-00001.parquet.stat

COPY train-00000-of-00009.parquet /usr/share/nginx/html/
RUN stat /usr/share/nginx/html/train-00000-of-00009.parquet > /usr/share/nginx/html/train-00000-of-00009.parquet.stat

COPY train-00001-of-00009.parquet /usr/share/nginx/html/
RUN stat /usr/share/nginx/html/train-00001-of-00009.parquet > /usr/share/nginx/html/train-00001-of-00009.parquet.stat

COPY train-00002-of-00009.parquet /usr/share/nginx/html/
RUN stat /usr/share/nginx/html/train-00002-of-00009.parquet > /usr/share/nginx/html/train-00002-of-00009.parquet.stat

The startup script for this image, 62-generate-one-hash.sh, hashes only the test Parquet file:

#!/bin/sh
# Process only the test parquet in the nginx html directory
for file in /usr/share/nginx/html/test*; do
    if [ -f "$file" ]; then
        hasher "$file" >> "/tmp/hashes.txt"
        echo "" >> "/tmp/hashes.txt"
        echo "Generated hashes for $file"
    fi
done

# Get the container ARN "TaskARN":"arn:aws:ecs:eu-west-2:123456789012:task/demo-cluster/0123abcd0123abcd0123abcd"
curl -X GET ${ECS_CONTAINER_METADATA_URI_V4}/task | grep -oE "TaskARN\":\"arn:aws:ecs:${AWS_DEFAULT_REGION}:[0-9]+:task/[a-z0-9\\-]+/[a-z0-9]+" | cut -d':' -f 2- | tr -d '"' >> "/tmp/hashes.txt"
echo -n " Finished at " >> "/tmp/hashes.txt"
date >> "/tmp/hashes.txt"
echo "" >> "/tmp/hashes.txt"

# Move the hash files to nginx html directory
mv /tmp/hashes.txt /usr/share/nginx/html/

I also updated the build script to download the files from Hugging Face if they do not exist yet. The script below contains the list of files I chose to make the image also around 700 MB (3 x 200 MB + 100 MB).

#!/bin/bash

# You can alternatively use $(terraform output -raw) but this is easier to read
source infrastructure/outputs.env
ECR_REGION=$(echo $ECR_REPO_URL | cut -d '.' -f 4)

TAG=$(date "+%y%m%d%H%M")

# download some files from huggingface to make image larger
HF_FILES=(\
 "https://huggingface.co/datasets/HuggingFaceTB/smoltalk/resolve/main/data/all/test-00000-of-00001.parquet" \
 "https://huggingface.co/datasets/HuggingFaceTB/smoltalk/resolve/main/data/all/train-00000-of-00009.parquet" \
 "https://huggingface.co/datasets/HuggingFaceTB/smoltalk/resolve/main/data/all/train-00001-of-00009.parquet" \
 "https://huggingface.co/datasets/HuggingFaceTB/smoltalk/resolve/main/data/all/train-00002-of-00009.parquet" \
)

for file in "${HF_FILES[@]}"; do
    if [ ! -f "image/${file##*/}" ]; then
        wget $file -O "image/${file##*/}"
    fi
done

aws ecr get-login-password --region $ECR_REGION | docker login --username AWS --password-stdin $ECR_REPO_URL
docker build -f image/Dockerfile.v2 -t $ECR_REPO_URL:$TAG image
docker push $ECR_REPO_URL:$TAG
echo "Pushed $ECR_REPO_URL:$TAG"

echo "$TAG" > infrastructure/tag.txt

With that setup, the gains are even more significant. Start time and pull time are still similar, but the boot time improved by 46% compared to the non-SOCI image.

Times for non-SOCI with many layers

Times for SOCI with many layers

Comparison of the above

Summary

With that experiment, we have shown that SOCI indexes can significantly improve the time to start ECS tasks. Creating the index is very simple and doesn't require much setup. There is even a serverless flow created by AWS, based on EventBridge and Lambda, that can do it for you in the background for each image you push to ECR. You can find it here: https://aws-ia.github.io/cfn-ecr-aws-soci-index-builder/