Profiling AWS DocumentDB
31 October 2023
MongoDB is a NoSQL database that has long been used as the primary database of many web applications. AWS offers a MongoDB-compatible managed database called DocumentDB. It is highly available, it lets you scale read capacity horizontally, and its storage scales automatically. However, even with those features, we are still likely to hit a bottleneck here or there when querying the database. Today, we will look at the profiling feature of DocumentDB, which can help us identify queries that take a long time.
Creating a DocumentDB parameter group
Before we start, we need to create a cluster parameter group that enables profiling for our database (and TLS as well). We will use Terraform/OpenTofu for that; I assume that you will prepare the provider configuration and credentials yourself. Our profiler should trigger as early as possible, which means the lowest supported threshold: queries taking more than 50 ms. We will also sample every query (a sampling rate of 1.0), as we don't have much traffic.
resource "aws_docdb_cluster_parameter_group" "profiler-params" {
family = "docdb4.0"
name = "param-group-profiler"
description = "Parameter group with TLS and profiling enabled"
parameter {
name = "tls"
value = "enabled"
}
parameter {
name = "profiler"
value = "enabled"
}
parameter {
name = "profiler_sampling_rate"
value = "1.0"
}
parameter {
name = "profiler_threshold_ms"
value = "50"
}
}
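If you want to confirm that the parameters actually took effect after applying, a quick sketch using boto3 can list them (assuming default AWS credentials; the region matches the one used later in this post):
import boto3

# Print the TLS and profiler parameters of our cluster parameter group.
docdb = boto3.client("docdb", region_name="eu-west-1")
response = docdb.describe_db_cluster_parameters(
    DBClusterParameterGroupName="param-group-profiler"
)
for param in response["Parameters"]:
    if param["ParameterName"] in ("tls", "profiler", "profiler_sampling_rate", "profiler_threshold_ms"):
        print(param["ParameterName"], "=", param.get("ParameterValue"))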
Configuring the VPC
Next up, let's define the VPC that our DocumentDB cluster will use. I will use data sources to look up the default VPC and its subnets. For production, it's advisable to use a dedicated VPC and maintain full control over it. We will also define a subnet group for DocumentDB, so it knows where to launch instances.
data "aws_vpc" "default" { default = true }
data "aws_subnet" "public-a" {
vpc_id = data.aws_vpc.default.id
availability_zone = "eu-west-1a"
}
data "aws_subnet" "public-b" {
vpc_id = data.aws_vpc.default.id
availability_zone = "eu-west-1b"
}
resource "aws_security_group" "documentdb" {
vpc_id = data.aws_vpc.default.id
name = "mycluster-sg"
description = "Allows VPC access to DocumentDB cluster"
ingress {
from_port = 27017
to_port = 27017
protocol = "tcp"
cidr_blocks = [data.aws_vpc.default.cidr_block]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
}
resource "aws_docdb_subnet_group" "subnets" {
subnet_ids = [data.aws_subnet.public-a.id, data.aws_subnet.public-b.id]
name = "default-ab"
}
Apply these resources; creating them should be pretty straightforward and fast.
Starting a cluster and instance
Now it's time for the main guest of today's episode. We will create a DocumentDB cluster with a single instance. The cluster will use the previously defined parameter group as well as the subnet and security groups. The password will also be stored here, so we can use it later to connect to the database. In a real-world scenario, you should add a lifecycle block to the Terraform resource to ignore changes to master_password, and rotate the password to something complex.
resource "aws_docdb_cluster" "mycluster" {
cluster_identifier = "mycluster"
# Network settings
db_subnet_group_name = aws_docdb_subnet_group.subnets.id
vpc_security_group_ids = [aws_security_group.documentdb.id]
# Cluster settings
engine = "docdb"
engine_version = "4.0.0"
master_username = "administrator"
master_password = "Staple3-Battery2-Horse1"
enabled_cloudwatch_logs_exports = ["profiler"]
db_cluster_parameter_group_name = aws_docdb_cluster_parameter_group.profiler-params.id
apply_immediately = true
}
resource "aws_docdb_cluster_instance" "myinstance" {
cluster_identifier = aws_docdb_cluster.mycluster.id
instance_class = "db.t4g.medium"
engine = "docdb"
identifier = "myinstance"
}
output "mongo_url" {
value = "${aws_docdb_cluster.mycluster.master_username}:${aws_docdb_cluster.mycluster.master_password}@${aws_docdb_cluster.mycluster.endpoint}:${aws_docdb_cluster.mycluster.port}"
sensitive = true
}
Because we enabled profiling, it's also important that the cluster exports its logs to CloudWatch (the enabled_cloudwatch_logs_exports argument above); only then will we see the slow queries after performing the tests. To get the connection string, type the following:
$ terraform output mongo_url
Creating an instance to run tests
DocumentDB is only reachable from inside its VPC. That means we need to have a machine inside it as well. We will use a simple EC2 instance for that. I created one that is managed by SSM for safety and convenience (once it's up, aws ssm start-session --target <instance-id> gets you a shell), but you can use plain SSH instead.
data "aws_ssm_parameter" "AmazonLinux" {
name = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-arm64"
}
resource "aws_security_group" "all-egress" {
name = "all-egress"
vpc_id = data.aws_vpc.default.id
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
ipv6_cidr_blocks = ["::/0"]
}
# Add ingress for SSH here if you want SSH from your PC
}
... # Skipped SSM role and profile config, refer to GitHub repo
resource "aws_instance" "TestInstance" {
ami = data.aws_ssm_parameter.AmazonLinux.value
instance_type = "t4g.nano"
iam_instance_profile = aws_iam_instance_profile.test-ssm.name # Optional SSM
subnet_id = data.aws_subnet.public-a.id
vpc_security_group_ids = [aws_security_group.all-egress.id]
}
After the instance boots (you can also use user data for this), we need to install some tools: pip for Python, the pymongo MongoDB client library, the requests library, and the Amazon RDS public certificate bundle (despite RDS in the name, DocumentDB uses the same certificates).
$ sudo yum install -y python3-pip
$ sudo pip3 install pymongo requests
$ sudo mkdir -p /usr/local/share/ca-certificates/
$ sudo wget https://truststore.pki.rds.amazonaws.com/global/global-bundle.pem -O /usr/local/share/ca-certificates/rds.pem
$ sudo chmod 0644 /usr/local/share/ca-certificates/rds.pem
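With the tools in place, a minimal connectivity check could look like the following sketch (the TLS query parameters are standard DocumentDB connection options; MONGO_URL is the user:password@host:port string we export in the next section):
import os

from pymongo import MongoClient

# Build a full URI around the user:pass@host:port string from Terraform
# and point TLS verification at the downloaded RDS certificate bundle.
uri = (
    f"mongodb://{os.environ['MONGO_URL']}/"
    "?tls=true&tlsCAFile=/usr/local/share/ca-certificates/rds.pem&retryWrites=false"
)
client = MongoClient(uri)
print(client.server_info()["version"])  # prints the server version on success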
Time for testing
Now we have to somehow put pressure on DocumentDB. db.t4g.medium is a pretty strong instance, so the data set has to be large. I created a simple Python script, available in the repo, that inserts 5,000 student IDs, 1,000 test questions, and 5,000,000 answers (one from every student for every question). Merely inserting these records generates enough load on the database to trigger the profiler. Be sure to set the environment variable with the correct connection string; a condensed sketch of the script follows the commands below.
$ export MONGO_URL=<output of terraform output mongo_url>
$ python3 insert_plenty.py
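The real script lives in the repo; as an illustration only, a condensed version could look like this. The database, collection, and field names here are my assumptions, not necessarily those of insert_plenty.py.
import os

from pymongo import MongoClient

# Connect with the same TLS options as in the earlier sketch.
uri = (
    f"mongodb://{os.environ['MONGO_URL']}/"
    "?tls=true&tlsCAFile=/usr/local/share/ca-certificates/rds.pem&retryWrites=false"
)
db = MongoClient(uri)["school"]

db.students.insert_many([{"student_id": i} for i in range(5_000)])
db.questions.insert_many([{"question_id": i, "text": f"Question {i}?"} for i in range(1_000)])

# 5,000,000 answers: every student answers every question. Insert in
# 10,000-document batches to keep each insert_many call bounded.
batch = []
for s in range(5_000):
    for q in range(1_000):
        batch.append({"student_id": s, "question_id": q, "answer": (s + q) % 4})
        if len(batch) == 10_000:
            db.answers.insert_many(batch)
            batch = []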
After you load all the records (this can take a good minute), try querying the database with another script. This one does join-like lookups in an aggregation pipeline; the query script is also available in the repo, and an illustrative sketch follows below.
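The repo script remains the reference; purely as an illustration, a join-like aggregation over the collections from the sketch above could look like this (the $lookup stage is the kind of work that can be expensive enough to show up in the profiler):
import os

from pymongo import MongoClient

uri = (
    f"mongodb://{os.environ['MONGO_URL']}/"
    "?tls=true&tlsCAFile=/usr/local/share/ca-certificates/rds.pem&retryWrites=false"
)
db = MongoClient(uri)["school"]

# For one student, join every answer with its question document.
pipeline = [
    {"$match": {"student_id": 42}},
    {"$lookup": {
        "from": "questions",
        "localField": "question_id",
        "foreignField": "question_id",
        "as": "question",
    }},
    {"$unwind": "$question"},
    {"$limit": 5},
]
for doc in db.answers.aggregate(pipeline):
    print(doc["question"]["text"], "->", doc["answer"])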
Watching the profiler
Now let's go to the AWS Console and open CloudWatch. You should find a log group named /aws/docdb/mycluster/profiler. Inside it, you will find a log stream with all the queries that were slow.
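If you prefer the terminal to the console, a small boto3 sketch can pull recent profiler events from that log group:
import time

import boto3

# Fetch profiler events logged during the last hour.
logs = boto3.client("logs", region_name="eu-west-1")
response = logs.filter_log_events(
    logGroupName="/aws/docdb/mycluster/profiler",
    startTime=int((time.time() - 3600) * 1000),
)
for event in response["events"]:
    print(event["message"])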
The advantage of this solution is that it is very easy to set up and requires no additional software. It's not the most advanced profiling tool, but it's a good starting point for finding out which parts of your application put too much pressure on the database. You can adjust the profiler to your needs by changing the slow query threshold.