Daily Git diff into S3 - external Git repository
31 July 2024
This is an update to my recent post about creating daily Git diffs between
today's and the previous day's HEAD commits. However, in the previous post, I
relied on CodeCommit, which is unfortunately being slowly shut down. What's more,
a lot of users prefer external Git providers like GitHub, GitLab, or Bitbucket.
I modified the project to be compatible with both solutions. I will base my
examples on a private GitHub repository but you can easily replicate it with
other platforms.
The previous post can be found here.
The completed project, supporting both an external Git repository and
CodeCommit, is tagged v2.
Getting Personal Access Token
I created a new private GitHub repository. With CodeCommit we relied on IAM
authentication, which made the integration very smooth. However, GitHub is not
part of AWS, so we must authenticate with a password of some sort (or an SSH
key). GitHub uses access tokens for this. You can generate one by clicking on
your GitHub profile at the top, selecting Settings, and then opening Developer
settings at the bottom of the left panel.
On that page select Fine-grained tokens and create a new one. Select the
repository you want access to (for public ones you don't need the token 🤨).
Under repository permissions set Contents to Read-only. Scroll to the bottom
and create the token. Copy it to a password manager: you can't read the token
later, so if you lose it you will need to create a new one.
Updates to the current infrastructure
GitHub credentials
First we will define an SSM parameter (or a Secrets Manager secret if you
prefer) holding the token obtained above. In a new file called secret.tf I
created the following resource and output:
resource "aws_ssm_parameter" "git_http_auth" {
  name  = "/git-diff/git-http-auth"
  type  = "SecureString"
  value = "-"
  lifecycle { ignore_changes = [value] }
}
output "git_http_auth_name" {
  value = aws_ssm_parameter.git_http_auth.name
}
Unfortunately the value cannot be empty, so I used - as a placeholder. To put
our own value in, it's better to use the AWS CLI or the Console, so that the
token is not written into the Terraform code. After applying the
infrastructure, run the following commands:
$ read -rs GIT_HTTP_CREDS # type username:password. It won't be shown
$ aws ssm put-parameter --name "$(tofu output -raw git_http_auth_name)" --value "$GIT_HTTP_CREDS" --overwrite --region eu-west-1
$ unset GIT_HTTP_CREDS
I specified my credentials as ppabis:github_pat_xxxyyyzzz (not an actual
token). This way it's simpler to insert them into the Git repository URL.
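Since the script later splices this value straight into the clone URL, it may be worth checking its shape before uploading it. The sketch below is not part of the tagged project: validate_git_http_creds is a hypothetical helper that assumes the value should be a username:token pair without whitespace, @ or /, and that rejects the - placeholder.

```shell
# Hypothetical helper (not part of the project).
# Returns 0 only for a "username:token" value that is safe to splice into a URL.
validate_git_http_creds() {
  case "$1" in
    ''|-) return 1 ;;                   # empty value or the "-" placeholder
    *[[:space:]]*|*@*|*/*) return 1 ;;  # characters that would break the URL
    :*|*:) return 1 ;;                  # leading or trailing colon
    *:*) return 0 ;;                    # a colon separates username and token
    *) return 1 ;;                      # no colon at all
  esac
}
```

It could be called right after the read command, for example: validate_git_http_creds "$GIT_HTTP_CREDS" || echo "Unexpected credentials format" >&2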
Task definition
Because secrets cannot be passed via overrides, we have to set them in the task definition itself, which makes the reference static. You will still be able to update the value in SSM, though. The new template for the task definition looks like this, and we pass one more template variable in Terraform.
---
- image: "${image}"
  name: "git"
  logConfiguration:
    logDriver: "awslogs"
    options:
      awslogs-group: "${log_group}"
      awslogs-region: "${region}"
      awslogs-stream-prefix: "diff"
  secrets:
    - name: "GIT_HTTP_CREDENTIALS"
      valueFrom: "${git_http_credentials_arn}"
resource "aws_ecs_task_definition" "GitDiffTask" {
  family = "GitDiffTask"
  ...
  container_definitions = jsonencode(yamldecode(
    templatefile(
      "${path.module}/taskdef.yaml",
      {
        image                    = "${module.ecr.repository}:latest",
        results_bucket           = aws_s3_bucket.results_bucket.bucket,
        log_group                = module.iam.log_group_name,
        region                   = var.region,
        git_http_credentials_arn = aws_ssm_parameter.git_http_auth.arn
      }
    )
  ))
}
As the values are read from SSM by ECS rather than by the container itself, we
need to give the execution role permission to read this parameter from SSM. In
the iam module I updated the policy to include the following statement:
data "aws_iam_policy_document" "ExecutionRolePolicy" {
  ...
  statement {
    sid       = "GetSecret"
    actions   = ["ssm:GetParameters"]
    resources = [var.git_auth_parameter_arn]
  }
}
External repo URL and conditional CodeCommit
I created a new variable that lets you specify an external Git repository URL. If it is provided (not empty), the CodeCommit repository will not be created (or will be destroyed if it already exists). The variable also controls several other resources so that no references to CodeCommit remain. First, let's define the variable in Terraform.
variable "external_repo_url" {
  description = "In case we don't want to use CodeCommit, we can use an external repository. It will override and discard repo_name."
  type        = string
  default     = ""
}
Next I changed the resource for the CodeCommit repository. We will use count
to either create the repo or rely on the external one. If you apply these
changes to existing infrastructure while setting an external URL, the current
CodeCommit repository will be destroyed. Because we use count, all references
to the repository resource must use index syntax, such as
aws_codecommit_repository.CodeCommitRepo[0].clone_url_http. The same change is
needed where the iam module is called.
resource "aws_codecommit_repository" "CodeCommitRepo" {
  count           = var.external_repo_url == "" ? 1 : 0
  repository_name = var.repo_name
}
module "iam" {
  source             = "./iam"
  ecr_repository_arn = module.ecr.repository_arn
  # Lines below changed
  codecommit_repo_arn    = var.external_repo_url == "" ? aws_codecommit_repository.CodeCommitRepo[0].arn : ""
  git_auth_parameter_arn = aws_ssm_parameter.git_http_auth.arn
}
The schedule module also needs some rework in how the repository URL is
passed. We need to select the URL of the CodeCommit repository if it exists,
or fall back to the external repo URL otherwise.
module "schedule" {
  source = "./schedule"
  ...
  repo_url       = var.external_repo_url == "" ? aws_codecommit_repository.CodeCommitRepo[0].clone_url_http : var.external_repo_url
  bucket_name    = "${aws_s3_bucket.results_bucket.bucket}/${var.repo_name}"
  parameter_name = "/git-diff/${var.repo_name}"
}
One more change is needed in the iam module itself. When creating the task
role policy, we have a statement that allows the codecommit:GitPull action on
the CodeCommit repository resource. However, if an external repo is specified,
we pass an empty string, which is problematic because an empty resource might
make the policy creation fail. We will use another Terraform trick for
conditional blocks, similar to count, namely a dynamic block. If the passed
string is empty, we iterate over an empty array (so zero such blocks are
created); otherwise we iterate over an array with one element, which creates
exactly one block (the element can be anything, I used the value "1"). In the
task role policy definition I transformed this statement block into a dynamic
block.
data "aws_iam_policy_document" "TaskRolePolicy" {
  ...
  dynamic "statement" {
    # If repo ARN is empty, don't create this block (use [] array)
    for_each = var.codecommit_repo_arn == "" ? [] : ["1"]
    content {
      sid       = "CodeCommitClone"
      actions   = ["codecommit:GitPull"]
      resources = [var.codecommit_repo_arn]
    }
  }
}
Changes to the script
The script itself also needs to be updated. So far we relied on the AWS CLI acting as the Git credential helper and as an adapter between Git and IAM. However, GitHub, GitLab and many other providers don't support IAM, as they are outside of AWS. Instead they use standard HTTP authentication over HTTPS, which is an official method of Git authentication.
In the script we will check whether the credentials from SSM were provided,
meaning that they are not empty and differ from - (our "empty" placeholder).
The two git config lines are therefore wrapped in an if-else structure. If the
HTTP credentials are present, they are inserted into the Git HTTP clone URL.
### Configure git - if no HTTP credentials were provided, default to IAM authentication
if [[ -z "$GIT_HTTP_CREDENTIALS" || "$GIT_HTTP_CREDENTIALS" == "-" ]]; then
  echo "Authenticating with IAM (CodeCommit only)"
  git config --global credential.helper '!aws codecommit credential-helper $@'
  git config --global credential.UseHttpPath true
else
  ### Otherwise, use HTTP basic auth
  echo "Authenticating with HTTP basic auth (GitHub, etc.)"
  # Variables are quoted so the shell does not split or glob their contents
  GIT_REPO=$(echo "$GIT_REPO" | sed -e "s|https://|https://$GIT_HTTP_CREDENTIALS@|")
fi
### Clone and create diff
git clone "$GIT_REPO" /tmp/repo
cd /tmp/repo
git diff "$LAST_COMMIT"..HEAD > /tmp/changes.diff
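The sed substitution works for typical tokens, but a value containing |, & or a backslash would be interpreted by sed itself. A minimal sketch of an alternative, assuming the repository URL always starts with https://, is to splice the credentials in with shell parameter expansion. Both function names below are hypothetical and not part of the project, and the example.com-style repository paths are placeholders. The second helper masks the credentials so that the URL can be printed to the logs.

```shell
# Hypothetical helpers, not part of the original script.

# Prints the URL with credentials inserted after "https://".
# An empty value, the "-" placeholder or a non-HTTPS URL leaves it untouched.
inject_credentials() {
  # $1 = repository URL, $2 = "username:token"
  case "$2" in
    ''|-) printf '%s\n' "$1"; return 0 ;;
  esac
  case "$1" in
    https://*) printf 'https://%s@%s\n' "$2" "${1#https://}" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}

# Prints the URL with any "user:token@" part replaced by "***@" for logging.
redact_url() {
  local rest authority
  case "$1" in
    https://*)
      rest="${1#https://}"
      authority="${rest%%/*}"   # host part, possibly with credentials
      case "$authority" in
        *@*) printf 'https://***@%s\n' "${rest#*@}" ;;
        *)   printf '%s\n' "$1" ;;
      esac
      ;;
    *) printf '%s\n' "$1" ;;
  esac
}
```

In the script this could replace the sed line with GIT_REPO=$(inject_credentials "$GIT_REPO" "$GIT_HTTP_CREDENTIALS"), and any log message could print $(redact_url "$GIT_REPO") instead of the raw URL.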
Testing the new solution
I didn't have to rebuild the entire infrastructure. It was enough to just add the external repo variable and fill out the SSM Parameter credentials. After a while changes started appearing in S3 as usual. The diff file below aligns with the commit I see on GitHub.
diff --git a/README.md b/README.md
index 375fbe3..a265c8f 100644
--- a/README.md
+++ b/README.md
@@ -1 +1,2 @@
 This is a sample README
+This is an update to the readme