pabis.eu

Daily Git diff into S3 - external Git repository

31 July 2024

This is an update to my recent post about creating daily Git diffs between today's and the previous day's HEAD commits. In that post I relied on CodeCommit, which is unfortunately being slowly shut down. What's more, many users prefer external Git providers such as GitHub, GitLab, or Bitbucket, so I modified the project to be compatible with both. My examples are based on a private GitHub repository, but you can easily replicate them on other platforms.

New diagram

The previous post can be found here.

The completed project, supporting both an external Git repository and CodeCommit, is tagged v2.

Getting Personal Access Token

I created a new private GitHub repository. With CodeCommit we relied on IAM authentication, which made the integration very smooth. GitHub, however, is not part of AWS, so we must authenticate with a password of some sort (or an SSH key). GitHub uses access tokens for this. To generate one, click your GitHub profile picture at the top, select Settings, and at the bottom of the left panel open Developer settings.

How to create access token

On that page select Fine-grained tokens and create a new one. Select the repository you want to give access to (for public ones you don't need a token 🤨). Under repository permissions set Contents to Read-only. Scroll to the bottom and create the token. Copy it to a password manager: GitHub shows the token only once, so if you lose it you will need to create a new one.

Updates to the current infrastructure

GitHub credentials

First we will define an SSM parameter (or a Secrets Manager secret if you prefer) that will hold the token obtained above. In a new file called secret.tf I created the following resource and output:

resource "aws_ssm_parameter" "git_http_auth" {
  name  = "/git-diff/git-http-auth"
  type  = "SecureString"
  value = "-"
  lifecycle { ignore_changes = [value] }
}

output "git_http_auth_name" {
  value = aws_ssm_parameter.git_http_auth.name
}

Unfortunately the value cannot be empty, so I used - as a placeholder. To put the real value in, it is better to use the AWS CLI or the Console. After applying the infrastructure, run the following commands:

$ read -rs GIT_HTTP_CREDS # type username:password. It won't be shown
$ aws ssm put-parameter --name "$(tofu output -raw git_http_auth_name)" --value "$GIT_HTTP_CREDS" --overwrite --region eu-west-1
$ unset GIT_HTTP_CREDS

I specified my credentials as ppabis:github_pat_xxxyyyzzz (not an actual token 😄). This way it's simpler to insert them into the Git repository URL.
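Since the value is later pasted straight into the clone URL, it can help to check its shape before uploading it. The sketch below is not part of the original project: valid_git_creds is a hypothetical helper, and the allowed character set is an assumption that fits GitHub usernames and fine-grained tokens but may be too strict for other providers.

```shell
# Hypothetical helper (not in the original project).
# Returns 0 when the argument looks like "username:token" and contains only
# characters that are safe to embed in a URL without encoding.
valid_git_creds() {
  case "$1" in
    *[!A-Za-z0-9._:-]*) return 1 ;;  # a character outside the assumed safe set
    *:*:*)              return 1 ;;  # more than one colon
    ?*:?*)              return 0 ;;  # non-empty username and non-empty token
    *)                  return 1 ;;  # no colon, or one side is empty
  esac
}
```

It could be called between the read and put-parameter steps, for example `valid_git_creds "$GIT_HTTP_CREDS" || echo "unexpected format"`.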

Task definition

Because secrets cannot be passed via overrides, we have to set the secret in the task definition itself, which makes it a static reference. You will still be able to update the value in SSM, though. The new template for the task definition looks like this, and we pass one more template variable in Terraform.

---
- image: "${image}"
  name: "git"
  logConfiguration:
    logDriver: "awslogs"
    options:
      awslogs-group: "${log_group}"
      awslogs-region: "${region}"
      awslogs-stream-prefix: "diff"
  secrets:
    - name: "GIT_HTTP_CREDENTIALS"
      valueFrom: "${git_http_credentials_arn}"

resource "aws_ecs_task_definition" "GitDiffTask" {
  family       = "GitDiffTask"
  ...
  container_definitions = jsonencode(yamldecode(
    templatefile(
      "${path.module}/taskdef.yaml",
      {
        image                    = "${module.ecr.repository}:latest",
        results_bucket           = aws_s3_bucket.results_bucket.bucket,
        log_group                = module.iam.log_group_name,
        region                   = var.region,
        git_http_credentials_arn = aws_ssm_parameter.git_http_auth.arn
      }
    )
  ))
}

As the values are read from SSM by the ECS controller rather than by the container itself, we need to give the execution role permission to read this parameter. In the iam module I updated the policy to include the following statement:

data "aws_iam_policy_document" "ExecutionRolePolicy" {
  ...
  statement {
    sid       = "GetSecret"
    actions   = [ "ssm:GetParameters" ]
    resources = [ var.git_auth_parameter_arn ]
  }
}

External repo URL and conditional CodeCommit

I created a new variable that lets you specify an external Git repository URL. If it is provided (not empty), the CodeCommit repository will not be created (or will be destroyed if it already exists). It also controls many other parts of the resources so that no references to CodeCommit remain. First, let's define the variable in Terraform.

variable "external_repo_url" {
  description = "In case we don't want to use CodeCommit, we can use an external repository. It will override and discard repo_name."
  type        = string
  default     = ""
}

Next I changed the resource for the CodeCommit repository. We use count to either create the repo or rely on the external one. If you apply these changes to existing infrastructure with external_repo_url set, the current CodeCommit repository will be destroyed. Because we use count, all references to the repository resource must use index syntax, such as aws_codecommit_repository.CodeCommitRepo[0].clone_url_http. The same change is needed where the iam module is called.

resource "aws_codecommit_repository" "CodeCommitRepo" {
  count           = var.external_repo_url == "" ? 1 : 0
  repository_name = var.repo_name
}

module "iam" {
  source                 = "./iam"
  ecr_repository_arn     = module.ecr.repository_arn
  # Lines below changed
  codecommit_repo_arn    = var.external_repo_url == "" ? aws_codecommit_repository.CodeCommitRepo[0].arn : ""
  git_auth_parameter_arn = aws_ssm_parameter.git_http_auth.arn
}

The schedule module also needs some rework in how the repository URL is passed. We select the URL of the CodeCommit repository if it exists, or fall back to the external repo URL.

module "schedule" {
  source = "./schedule"

  ...
  repo_url                    = var.external_repo_url == "" ? aws_codecommit_repository.CodeCommitRepo[0].clone_url_http : var.external_repo_url
  bucket_name                 = "${aws_s3_bucket.results_bucket.bucket}/${var.repo_name}"
  parameter_name              = "/git-diff/${var.repo_name}"
}

One more change is needed in the iam module itself. The task role policy has a statement that allows the codecommit:GitPull action on the CodeCommit repository. However, if an external repo is specified, we pass an empty string as the ARN. This is problematic because an empty resource might make the policy creation fail. We will use another Terraform trick for conditional creation, similar to count, namely a dynamic block. If the passed string is empty, the block iterates over an empty array (so zero statements are created); otherwise it iterates over an array with a single element (it can be anything, I used the value "1") and creates exactly one statement. In the task role policy definition I turned the statement block into a dynamic block.

data "aws_iam_policy_document" "TaskRolePolicy" {
  ...
  dynamic "statement" {
    # If repo ARN is empty, don't create this block (use [] array)
    for_each = var.codecommit_repo_arn == "" ? [] : ["1"]
    content {
      sid       = "CodeCommitClone"
      actions   = [ "codecommit:GitPull" ]
      resources = [ var.codecommit_repo_arn ]
    }
  }
}

Changes to the script

The script itself also needs to be updated. We relied on AWS CLI acting as the Git credential helper and adapter between Git and IAM. However, GitHub, GitLab and many other providers don't support IAM as they are outside of AWS. Instead they use standard HTTP authentication over HTTPS and this is an official method of Git authentication.

In the script we check whether credentials from SSM were provided, that is, whether they are non-empty and different from - (the "empty" placeholder). The two git config lines therefore turn into the if-else structure below. If HTTP credentials are present, they are inserted into the Git HTTPS clone URL.

### Configure git - if no HTTP credentials were provided, default to IAM authentication
if [[ -z "$GIT_HTTP_CREDENTIALS" || "$GIT_HTTP_CREDENTIALS" == "-" ]]; then
  echo "Authenticating with IAM (CodeCommit only)"
  git config --global credential.helper '!aws codecommit credential-helper $@'
  git config --global credential.UseHttpPath true
else
  ### Otherwise, use HTTP basic auth: insert user:token@ right after the scheme
  echo "Authenticating with HTTP basic auth (GitHub, etc.)"
  GIT_REPO=$(echo "$GIT_REPO" | sed -e "s|https://|https://$GIT_HTTP_CREDENTIALS@|")
fi
### Clone and create diff (variables quoted to avoid word splitting)
git clone "$GIT_REPO" /tmp/repo
cd /tmp/repo
git diff "$LAST_COMMIT..HEAD" > /tmp/changes.diff
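The branching and the URL rewrite can be tried in isolation, without cloning anything. The following is a minimal sketch under my own assumptions: auth_mode, with_credentials and redact_url are hypothetical helper names that do not exist in the project, and the repository URL in the usage note is a placeholder.

```shell
# Hypothetical helpers mirroring the logic of the script above.

# Prints "iam" when no usable HTTP credentials are set, "basic" otherwise.
auth_mode() {
  if [ -z "$1" ] || [ "$1" = "-" ]; then
    echo "iam"
  else
    echo "basic"
  fi
}

# Inserts "user:token@" after the scheme of an HTTPS clone URL.
# $1 = clone URL, $2 = credentials in user:token form
with_credentials() {
  printf '%s\n' "$1" | sed -e "s|^https://|https://$2@|"
}

# Strips embedded credentials again, e.g. before printing the URL to logs.
redact_url() {
  printf '%s\n' "$1" | sed -e 's|^https://[^/@]*@|https://|'
}
```

For example, `with_credentials "https://github.com/example/repo.git" "ppabis:github_pat_xxxyyyzzz"` prints the URL with the credentials placed before the host. Note that credentials containing |, & or a backslash would be misread by this sed expression, so such values would need escaping or URL-encoding first.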

Testing the new solution

I didn't have to rebuild the entire infrastructure. It was enough to just add the external repo variable and fill out the SSM Parameter credentials. After a while changes started appearing in S3 as usual. The diff file below aligns with the commit I see on GitHub.

diff --git a/README.md b/README.md
index 375fbe3..a265c8f 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 This is a sample README
+This is an update to the readme

Committed change
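To sanity-check the diff step without waiting for the scheduled task, it can be reproduced locally. The sketch below is not part of the project: make_sample_diff is a hypothetical helper that builds a throwaway repository with two commits resembling the ones above and runs the same form of git diff as the script. It assumes git is installed, and the demo identity is a placeholder.

```shell
# Hypothetical helper: create a temporary repo with two commits and print
# the diff between the first commit and HEAD, as the task script does.
make_sample_diff() {
  repo=$(mktemp -d)
  git -C "$repo" init --quiet

  # First commit plays the role of "yesterday's" HEAD
  echo "This is a sample README" > "$repo/README.md"
  git -C "$repo" add README.md
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet -m "Initial commit"
  last_commit=$(git -C "$repo" rev-parse HEAD)

  # Second commit plays the role of "today's" HEAD
  echo "This is an update to the readme" >> "$repo/README.md"
  git -C "$repo" -c user.name=demo -c user.email=demo@example.com \
    commit --quiet -am "Update readme"

  git -C "$repo" diff "$last_commit..HEAD"
  rm -rf "$repo"
}
```

Running `make_sample_diff` should print a unified diff of README.md in which the update line is marked as added.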