pabis.eu

Let's Play Machine Learning

10 February 2024

I came to the conclusion that a thing that is not written down and explained is not truly learned. So, as I am preparing for the AWS MLS-C01 exam, I need to refresh my modest machine learning knowledge and expand it further. I decided to go with an exercise where I use at least one unsupervised algorithm and one supervised one. I will create my own dataset, which might not be representative, but at least it will be useful for demonstration. I will use a Jupyter Docker image to have easy access to notebooks. Unfortunately, on macOS, I cannot use GPU acceleration. But as this is an AWS exam, I also need to know how to use SageMaker, so I can access more powerful hardware there.

The dataset

So my idea for the dataset is a list of MMORPG players and their stats. It will contain columns with level, class, gold, skills, weekly time played, etc. I will try to introduce some correlations between the columns but also keep them somewhat random. Logically, higher-level and more skilled players will spend more gold and play longer. The generator for the dataset is available in the GitHub repository here.
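The real generator lives in that repository; purely as an illustration, a minimal sketch of such a generator could look like the following. The column names match the dataset, but the distributions and correlation factors here are my own guesses, not what the real script does.

# Hypothetical generator sketch -- distributions and factors are assumptions,
# only the column names match the real dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 50_000

level = rng.integers(1, 501, n)                                   # roughly uniform levels
profession = rng.choice(["Cleric", "Wizard", "Knight", "Rogue"], n)

def skill(scale):
    # skills loosely follow the level, with some noise
    return (scale * level / 500 * rng.uniform(0.5, 1.5, n)).astype(int) + 2

pd.DataFrame({
    "Level": level,
    "Profession": profession,
    "Sword": skill(800),
    "Shield": skill(900),
    "Magic_Level": skill(1000),
    "Average_Weekly_Time_Minutes": (level * rng.uniform(0.3, 4.0, n)).astype(int) + 15,
    "Gold_Spent": (level ** 2 * rng.uniform(2, 16, n)).astype(int) + 3000,
}).to_csv("players.csv", index=False)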

After generating players.csv we can describe it with pandas:

import pandas as pd
pd.read_csv('players.csv').describe()
              Level         Sword        Shield   Magic_Level  Average_Weekly_Time_Minutes    Gold_Spent
count  50000.000000  50000.000000  50000.000000  50000.000000                 50000.000000  5.000000e+04
mean     250.178180    434.595500    484.750320    535.391920                   543.259320  5.585367e+05
std      143.914529    305.446319    328.761113    370.838816                   585.400904  9.022953e+05
min        1.000000      8.000000      8.000000      2.000000                    15.000000  3.102000e+03
25%      126.000000    203.000000    224.000000    235.000000                   109.000000  1.916550e+04
50%      250.000000    392.000000    436.000000    461.000000                   282.000000  1.122970e+05
75%      374.000000    585.000000    648.000000    791.000000                   793.000000  6.599965e+05
max      500.000000   1413.000000   1413.000000   1408.000000                  2369.000000  3.992809e+06

Running Jupyter notebook and loading the dataset

I use Jupyter in Docker to simplify the setup. I want to just install the basic scipy stack for now.

docker run --rm -it\
 -p 18888:8888\
 -v $HOME/ML:/home/jovyan/work\
 quay.io/jupyter/scipy-notebook:latest

Wait for the notebook to start and copy the URL with the token (the one that starts with 127.0.0.1). Edit the URL and replace port 8888 with 18888 (or whichever port you specified in the command above). Once you are in the Jupyter GUI, change the directory on the left to work so that all changes are saved to your local computer.

You can then upload players.csv via Jupyter or just copy it to your local directory at ~/ML.

We now need to do some cleaning of the data. Firstly, our data contains player professions, which are strings but have only 4 possible values (verify this with csv['Profession'].unique() in pandas). We will use one-hot encoding, where each string becomes a new column and the value 1 in that column means that the player has this profession.

import pandas as pd
csv = pd.read_csv('players.csv')
csv['Profession'].unique()
professions = pd.get_dummies(csv['Profession'], dtype=int)
csv = pd.concat([csv, professions], axis=1)
csv.head()[ ['Cleric', 'Wizard', 'Knight', 'Rogue', 'Profession'] ]
    Cleric  Wizard  Knight  Rogue   Profession
0   1       0       0       0       Cleric
1   0       1       0       0       Wizard
2   0       0       1       0       Knight
3   0       1       0       0       Wizard
4   0       0       0       1       Rogue

We can drop the Profession column now so that it doesn't get in our way and convert all the values to floats. We can then plot some of the columns as histograms to see what the distributions look like.

csv = csv.drop(columns=['Profession'])
csv = csv.astype(dtype='float32')
hist = csv.hist(column=['Level', 'Gold_Spent', 'Average_Weekly_Time_Minutes', 'Sword'])

Data distribution

We can see that the number of samples per bin is not equally distributed. Level is roughly uniform, but the other values seem to follow different distributions.

Using machine learning to cluster the players

Unsupervised learning is a category of machine learning where we don't have specific values for the model to predict; rather, we want the training phase to find similarities and generate the labels by itself. We will use one of the simplest algorithms, called K-Means. It will classify our players into K clusters, each containing the most "similar" players. Because this is machine learning, we can't easily predict what the clusters will be. For example, if we set K=2 we might get players segmented by their profession, or just by percentiles of levels. We will try multiple values of K and see what the results are. In our Jupyter notebook we will use the sklearn package to create our first predictions and prototype. To learn more, I recommend this video tutorial.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(csv)

But that is just fitting the model once. We can test it by predicting the cluster for all the players and adding it as a column. Then we can split the dataset into two groups and draw histograms to see what the pattern is.

csv['Cluster'] = kmeans.predict(csv)
cluster0 = csv[csv['Cluster'] == 0]
cluster1 = csv[csv['Cluster'] == 1]

cluster0.hist(column=['Level', 'Gold_Spent', 'Average_Weekly_Time_Minutes', 'Sword'])
cluster1.hist(column=['Level', 'Gold_Spent', 'Average_Weekly_Time_Minutes', 'Sword'])

Cluster 0

Cluster 1

By looking at the x-axis, we can see that the players were roughly split by experience. Players with more time played, more gold spent and higher skills are in one cluster, while the rest are in the other. However, in the case of the sword skill, there's a slight overlap. Let us also check whether any clustering happened on the one-hot professions.

print( cluster0[['Cleric', 'Wizard', 'Knight', 'Rogue']].sum() )
print( cluster1[['Cleric', 'Wizard', 'Knight', 'Rogue']].sum() )
Cleric    10703.0
Wizard    10430.0
Knight    10582.0
Rogue     10699.0
dtype: float64
Cleric    1968.0
Wizard    1903.0
Knight    1847.0
Rogue     1868.0
dtype: float64

Interestingly, the profession of the players didn't seem to matter. Why would that be? Well, if we simplify the K-Means algorithm to the Euclidean distance between some midpoint and the vector made from each row, at level=1 the level has roughly the same weight as the Knight or Rogue column, but at level=500 the one-hot column doesn't matter at all. We should first scale the data so that each column gets roughly the same weight (although it's not always perfect).
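To make the distance argument concrete, here is a toy comparison with made-up numbers: for two high-level players of different professions, the level difference dominates the Euclidean distance and the one-hot columns barely contribute.

# Toy illustration with made-up numbers: the Level difference dominates the
# Euclidean distance, the one-hot profession columns barely contribute.
import numpy as np

#                   Level  Knight  Rogue
knight = np.array([500.0,   1.0,   0.0])
rogue  = np.array([490.0,   0.0,   1.0])

print(np.linalg.norm(knight - rogue))          # ~10.10, driven almost entirely by Level
print(np.linalg.norm(knight[1:] - rogue[1:]))  # ~1.41, the profession part alone

Scaling with StandardScaler evens this out: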

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Drop the previously predicted cluster labels so they don't leak into the features
scaled = scaler.fit_transform(csv.drop(columns=['Cluster']))

kmeans = KMeans(n_clusters=2, random_state=0).fit(scaled)
csv['Cluster'] = kmeans.predict(scaled)

cluster0 = csv[csv['Cluster'] == 0]
cluster1 = csv[csv['Cluster'] == 1]

print( cluster0[['Cleric', 'Wizard', 'Knight', 'Rogue']].sum() )
print( cluster1[['Cleric', 'Wizard', 'Knight', 'Rogue']].sum() )
Cleric    9838.0
Wizard    9546.0
Knight    8392.0
Rogue     9676.0
dtype: float64
Cleric    2833.0
Wizard    2787.0
Knight    4037.0
Rogue     2891.0
dtype: float64

After scaling

After scaling

When we look at the histograms of the same columns after running K-Means on the scaled dataset, we don't see much difference in terms of clustering. The high-level players and high spenders are still favored by the second cluster. But as you can see from the profession counts, Knights are now a bit favored by the second cluster.

We can try to find the optimal number of clusters by using the elbow method where we plot the sum of squared errors (inertia) for each number of clusters.

import matplotlib.pyplot as plt

kRange = range(2, 20)
inertias = []
for k in kRange:
    kmeans = KMeans(n_clusters=k, random_state=0).fit(scaled)
    inertias.append(kmeans.inertia_)

fig, ax = plt.subplots()
ax.plot(kRange, inertias, '-o')
ax.set_xlabel('k')
ax.set_xticks(kRange)
fig.show()

Elbow

From this graph we can select either 6 or 12 clusters, as both are points where the slope changes direction the most. I will select 6, as it will then be easier to make a graph of how the data was clustered. We will then plot the values of some of the columns and color them by cluster.

csv['Cluster'] = KMeans(n_clusters=6, random_state=0).fit(scaled).labels_
csv.plot.scatter(x = "Level", y = "Gold_Spent", c = "Cluster", colormap="plasma")
csv.plot.scatter(x = "Sword", y = "Magic_Level", c = "Cluster", colormap="plasma")
csv.plot.scatter(x = "Average_Weekly_Time_Minutes", y = "Gold_Spent", c = "Cluster", colormap="plasma")

Correlation with cluster

These don't seem to be meaningful clusters. However, we removed the profession column in the earlier steps. Let's try to add it back and see if it helps us find any clustering.

csv['Profession'] = pd.read_csv("players.csv")['Profession']
csv.plot.scatter(x = "Profession", y = "Level", c = "Cluster", colormap="plasma")
csv.plot.scatter(x = "Profession", y = "Sword", c = "Cluster", colormap="plasma")
csv.plot.scatter(x = "Profession", y = "Magic_Level", c = "Cluster", colormap="plasma")

Correlation of cluster and profession

We have a clear winner here. K-Means determined that the profession is one of the most significant lines of the split. On the graph we can see that high-skilled Clerics, Wizards and Rogues are in one cluster, while Knights have two clusters dedicated to themselves. All low- and medium-skilled players have their own distinct clusters. Because the generating script based the skills, time and gold spent on the level, even with randomization, the K-Means algorithm picked that up and, after scaling, found the correlation. However, the one-hot encoding of the professions makes them a hard border between the data points.

Running on SageMaker

But as the AWS exam is coming up for me, we will run the same code on AWS SageMaker - a platform for all things machine learning. To do that easily, we will use SageMaker Notebooks running in AWS. To construct our infrastructure easily and destroy it afterwards so we don't pay too much, we will use Terraform.

If we want the notebook to be able to access our dataset, spawn SageMaker jobs and so on, we need a proper IAM role. AWS offers a quite permissive managed policy for SageMaker that can, for example, access all S3 buckets whose names start with sagemaker-*. We will use Terraform to create the role and attach this policy.

resource "aws_iam_role" "sagemaker" {
    name = "SageMakerNotebookRole"
    assume_role_policy = jsonencode({
        Version = "2012-10-17"
        Statement = [ {
            Action = "sts:AssumeRole"
            Effect = "Allow"
            Principal = { Service = "sagemaker.amazonaws.com" }
        } ]
    })
}

resource "aws_iam_role_policy_attachment" "sagemaker" {
    role = aws_iam_role.sagemaker.name
    policy_arn = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"
}

Next we will create an S3 bucket where we will upload our dataset and the actual notebook. For the test we will generate two dataset versions: a smaller one to test the Python code inside the notebook and a larger one that will be used on the actual SageMaker instances. Modify the script from the beginning of this post to create a bigger dataset of 2_000_000 players. The smaller dataset we already have will be used for testing the code inside the notebook. Change the bucket name to something random enough, but include sagemaker in the name. Upload both datasets to the bucket (you can also do it with the AWS Console).

resource "aws_s3_bucket" "sagemaker-data" {
    bucket = "sagemaker-data-2024-02-10-xyz345"
}
aws s3 cp players.csv s3://sagemaker-data-2024-02-10-xyz345/kmeans/players.csv
aws s3 cp players-big.csv s3://sagemaker-data-2024-02-10-xyz345/kmeans/players-big.csv

It's time to create a notebook. The easiest way is to create an Internet-facing notebook instance. It's not as scary as it sounds, since you need a token to log in there either way. However, the role for this notebook will be very permissive, so be sure to destroy everything after playing. We will also enable root access on the instance so that it's easy to install Linux packages if any are missing.

resource "aws_sagemaker_notebook_instance" "SageMakerNotebook" {
  role_arn               = aws_iam_role.sagemaker.arn
  instance_type          = "ml.t3.medium"
  name                   = "SageMakerNotebook"
  direct_internet_access = "Enabled"
  root_access            = "Enabled"
}

output "Jupyter" {
  value = "https://${aws_sagemaker_notebook_instance.SageMakerNotebook.url}"
}

You now have two possibilities to access the notebook. Either log in to the AWS Console and use the output value Jupyter by pasting it into the browser, or generate a presigned URL with the AWS CLI without touching the AWS Console. You then have to copy this very long URL into your browser window.

aws sagemaker create-presigned-notebook-instance-url\
 --notebook-instance-name SageMakerNotebook\
 --region us-east-2\
 --query AuthorizedUrl\
 --output text

If we want to run it on SageMaker, we need an actual Python script. I converted the above notebook cells into a full-length Python script compatible with AWS and S3.

Before we can actually run our KMeans script, we need to prepare the data. We could do this in the script itself or in the notebook, but for the sake of using SageMaker, we will also submit a job that does the data preparation for us. The only things we need to do are convert the professions to one-hot encoding and perform scaling. The output will be stored in the same S3 bucket.

The code is available here on GitHub. Download it, upload to your notebook instance. From there you will be able to schedule it on SageMaker instances.

First, we will create a SageMaker session. Then we will use the SKLearnProcessor from SageMaker to run a processing job. The list of framework versions is available here in the AWS docs. We will put it into a function so that we can easily run it with different parameters. First we will test whether the script even works on the local notebook instance. Then we can run the larger file on a bigger, remote instance. This code is supposed to go into the notebook on SageMaker. Also upload prepare-kmeans.py to the Jupyter local filesystem.

The script will also save the scaler weights, which will be used later for scaling the input data during inference. For simplicity, we will store them in the same place as our processed data.
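I won't paste the full prepare-kmeans.py here since it's in the repository, but a minimal sketch of such a processing script could look like the following. The /opt/ml/processing paths match the inputs and outputs configured below; everything else is my assumption of what the real script does.

# prepare-kmeans.py -- minimal sketch of the processing script (assumed; see the repository for the real one)
import os
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

file_name = os.environ["SM_PROCESS_FILE"]
csv = pd.read_csv(f"/opt/ml/processing/input/{file_name}")

# One-hot encode the profession but keep the original column for reference
professions = pd.get_dummies(csv["Profession"], dtype=int)
csv = pd.concat([csv, professions], axis=1)

# Scale everything except the Profession string and persist the scaler for later inference
numeric = csv.columns.difference(["Profession"])
scaler = StandardScaler()
csv[numeric] = scaler.fit_transform(csv[numeric])

os.makedirs("/opt/ml/processing/output", exist_ok=True)
csv.to_csv(f"/opt/ml/processing/output/{file_name}", index=False)
joblib.dump(scaler, "/opt/ml/processing/output/scaler.joblib")

With that script uploaded, the notebook code below creates the session and schedules the processing jobs.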

import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

BUCKET = "sagemaker-data-2024-02-10-xyz345"
sess = sagemaker.Session(
    default_bucket=BUCKET,
    default_bucket_prefix="work"
)
role = sagemaker.get_execution_role()

def process(instance_type, file_name):
    preprocessor = SKLearnProcessor(
        role=role,
        framework_version="1.2-1",
        instance_type=instance_type,
        instance_count=1,
        sagemaker_session=sess,
        env={"SM_PROCESS_FILE": file_name}
    )

    inputs = [ProcessingInput(
        source=f"s3://{BUCKET}/kmeans/{file_name}",
        destination='/opt/ml/processing/input'
    )]

    outputs = [ProcessingOutput(
        source='/opt/ml/processing/output',
        destination=f"s3://{BUCKET}/kmeans/processed"
    )]

    preprocessor.run(
        code='prepare-kmeans.py',
        inputs=inputs,
        outputs=outputs
    )

process("local", "players.csv")
## Line below will use more resources
process("ml.t3.xlarge", "players-big.csv")

If the job is successful (you see exited with code 0 and Job Complete), you can check that the file really is in the S3 bucket. Directly in the notebook, issue the command:

!aws s3 ls s3://sagemaker-data-2024-02-10-xyz345/kmeans/processed/

If everything is fine, the file should be here.

2024-02-01 19:19:31   10467734 players.csv

So we can now preview it, also directly in the notebook, to verify that the job did its thing correctly. Copy the file with another shell command in the notebook and head() it with pandas.

!aws s3 cp s3://sagemaker-data-2024-02-10-xyz345/kmeans/processed/players.csv .
import pandas as pd
pd.read_csv('players.csv').head()
     Level      Sword       Shield      Magic_Level Average_Weekly_Time_Minutes Gold_Spent  Cleric      Knight      Rogue       Wizard      Profession
0    -1.512848  -1.244164   -1.310136   -1.182956   -0.830235                   -0.620895   1.725059    -0.582215   -0.576149   -0.571346   Cleric
1    0.103096   -0.269317   -0.389385   0.578072    -0.368809                   -0.470166   1.725059    -0.582215   -0.576149   -0.571346   Cleric
2    -1.561188  -1.162927   -1.225608   -1.362817   -0.843856                   -0.621319   -0.579690   1.717577    -0.576149   -0.571346   Knight
3    -1.533565  -1.266910   -1.334287   -1.201747   -0.854072                   -0.621140   -0.579690   -0.582215   -0.576149   1.750254    Wizard
4    -0.262908  0.546304    0.365330    -0.807127   -0.569725                   -0.553066   -0.579690   1.717577    -0.576149   -0.571346   Knight

Now that the data is cleaned, we can run it on a larger, remote instance. I used ml.t3.xlarge, as requesting a c4 or c5 failed immediately with quota limits (absurdly ridiculous). This will take a long time. After it finishes, check that the file was also put into S3.

!aws s3 ls s3://sagemaker-data-2024-02-10-xyz345/kmeans/processed/
2024-02-01 19:47:35  407803754 players-big.csv
2024-02-01 19:35:33   10178843 players.csv

Performing KMeans on SageMaker

So now that we know how to run a processing job on SageMaker, the next thing is to run a training job. We will run the same KMeans code in the same way as we did with the processing job. Upload the script to the notebook instance and create a new function that will perform KMeans training on the processed data. The output will be a model which we will later use for inference. Download fit-kmeans.py from here and upload it to the notebook instance.
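For reference, here is a rough sketch of what such a fit-kmeans.py can contain. The environment variables match the ones we pass below, the paths are the standard SageMaker training container paths, n_clusters=6 follows the elbow choice from earlier; the rest is my assumption, the linked script is the authoritative one.

# fit-kmeans.py -- minimal sketch of the training script (assumed; see the repository for the real one)
import os
import joblib
import pandas as pd
from sklearn.cluster import KMeans

data_file = os.environ["SM_INPUT_DATA_FILE"]
model_file = os.environ["SM_MODEL_FILE"]

# The "train" channel is mounted by SageMaker under /opt/ml/input/data/train
csv = pd.read_csv(f"/opt/ml/input/data/train/{data_file}")
features = csv.drop(columns=["Profession"])

kmeans = KMeans(n_clusters=6, random_state=0).fit(features)

# Anything saved to /opt/ml/model ends up in model.tar.gz in S3
joblib.dump(kmeans, os.path.join("/opt/ml/model", model_file))

The notebook-side function that schedules this training job looks like this: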

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.inputs import TrainingInput
from pathlib import Path
import re
from datetime import datetime

def basename(f):
    return Path(f).stem

def train(instance_type, file_name):
    est = SKLearn(
        entry_point="fit-kmeans.py",
        role=role,
        framework_version="1.2-1",
        instance_type=instance_type,
        instance_count=1,
        sagemaker_session=sess,
        environment={
            "SM_INPUT_DATA_FILE": file_name,
            "SM_MODEL_FILE": basename(file_name) + ".joblib"
        },
        # If you happen to do some code experiments and need to test around
        # keep this line. It will charge you an extra 6 minutes of instance time, but
        # the benefit of not needing to wait for all the downloads outweighs it.
        # Unless you get: "Instances not retained as a result of warmpool resource limits being exceeded."
        # Ask AWS support to increase the warmpool limit.
        # Amazon needs to invest better lol.
        keep_alive_period_in_seconds=360
    )

    # Regex: replace all non-alphanumeric characters with a hyphen and collapse duplicate hyphens.
    # Unfortunately the job has to have a unique name, so you have to somehow remember what you trained.
    job_name = "KMeans-" + re.sub("\\-+", "-", re.sub(r'\W+', '-', file_name)) + str(int(datetime.now().timestamp()))

    # The inputs have to be a prefix in the S3 bucket. It will be all copied over.
    # The specific file to use is specified in the environment variable.
    inputs = {
        "train":
            TrainingInput(f"s3://{BUCKET}/kmeans/processed")
    }

    est.fit(inputs, job_name=job_name)
    return est

estimator = train("ml.m5.xlarge", "players-big.csv")

We can now check if the model was created and stored in S3.

!aws s3 ls s3://sagemaker-data-2024-02-10-xyz345/work/
    PRE KMeans-players-big-csv/
    PRE KMeans-players-csv/

The actual model file (joblib) is under s3://sagemaker-data-2024-02-10-xyz345/work/KMeans-players-big-csv/output/model.tar.gz. We can later refer to it in the inference script. Although the train function above returns the actual estimator, which could also be used directly for inference after training, we will practice loading the model from S3 instead.

Running KMeans inference on SageMaker

We have to make another script, similar to fit-kmeans.py. However, this time the script must follow the SageMaker inference structure. We have to create a model_fn function that loads the model (SageMaker downloads and extracts the model archive for us). Then input_fn deserializes the input data (such as JSON) into the format that our model requires. predict_fn is the function called for each prediction; it uses the model and the input data to make the prediction. As the last step, output_fn serializes the prediction into the format that the caller requests. The ready file is available here on GitHub.
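As a rough skeleton under my assumptions (the four function names and their roles are what the SageMaker scikit-learn container expects, the preprocessing inside predict_fn is simplified), it looks roughly like this:

# predict-kmeans.py -- rough skeleton of the inference script (assumed details; see the repository)
import io
import json
import os
import joblib
import pandas as pd

PROFESSIONS = ["Cleric", "Knight", "Rogue", "Wizard"]

def model_fn(model_dir):
    # SageMaker extracts model.tar.gz into model_dir before calling this
    return joblib.load(os.path.join(model_dir, os.environ["SM_MODEL_FILE"]))

def input_fn(request_body, content_type):
    # Parse the raw CSV string (with header) that we send from the notebook
    if isinstance(request_body, bytes):
        request_body = request_body.decode("utf-8")
    return pd.read_csv(io.StringIO(request_body))

def predict_fn(data, model):
    # The real script also re-applies the StandardScaler saved during preprocessing;
    # this sketch only redoes the one-hot encoding.
    one_hot = pd.get_dummies(data["Profession"], dtype=int).reindex(columns=PROFESSIONS, fill_value=0)
    features = pd.concat([data.drop(columns=["Profession"]), one_hot], axis=1)
    return model.predict(features)

def output_fn(prediction, accept):
    return json.dumps([{"Cluster": int(c)} for c in prediction])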

In order to use the model, we have to specify the location of the tar.gz file we got from training. I found mine in the S3 bucket we specified earlier, in the latest directory. Copy the S3 URI and paste it into the model_data parameter.

List of trained models

Copy S3 URI

from sagemaker.sklearn.model import SKLearnModel
model = SKLearnModel(
    model_data=f"s3://{BUCKET}/work/KMeans-players-big-csv1706969709/output/model.tar.gz",
    role=role,
    entry_point="predict-kmeans.py",
    framework_version="1.2-1",
    env={"SM_MODEL_FILE": "players-big.joblib"}
)

endpoint = model.deploy(
    instance_type="ml.c4.xlarge",
    initial_instance_count=1,
    container_startup_health_check_timeout=240
)

The above lines will automatically create a model, an endpoint configuration and an inference endpoint for us in SageMaker. Upon endpoint deletion, the model record and configuration will persist. Even though we gave an absolute path when creating the model, SageMaker automatically picked up which training job created this artifact and linked the two together. If provisioning of the endpoint takes too long, check the CloudWatch log group named after the endpoint for any possible errors. SageMaker will retry deploying the model many times before giving up, hence we set container_startup_health_check_timeout to a small value to speed up a possible failure. See this conversation for some hacks that didn't work for me. As of February 2024, the default timeout seems to be more or less 20 minutes.

SageMaker endpoints

After you are done with the experiments and inferences, remember to delete the endpoint or you will pay for the running instance.

# Run this AFTER you did everything.
endpoint.delete_endpoint() # Also doable via AWS Console and CLI

Because we used header names when training the model, we have to use our own method of loading the CSV data. Hence we just pass the input as a string and let the endpoint script parse it. The response will be JSON, as we requested.

from sagemaker.deserializers import JSONDeserializer
from sagemaker.base_serializers import StringSerializer

# The serializer sets the text/csv content type, the deserializer parses the JSON response.
endpoint.deserializer = JSONDeserializer()
endpoint.serializer = StringSerializer(content_type='text/csv')

response = endpoint.predict(
"""Level,Profession,Sword,Shield,Magic_Level,Average_Weekly_Time_Minutes,Gold_Spent
233,Rogue,338,522,421,246,88030
35,Wizard,54,54,102,71,5267
52,Rogue,81,127,97,90,6634
434,Cleric,571,577,1223,1336,1555266
186,Wizard,249,251,524,181,44998
"""
)

The response will be a simple array with [{"Cluster": 2}, {"Cluster": 2}, ...], one entry per row we sent. You can convert it to a DataFrame and combine it with what you sent, roughly as in the sketch below. Remember to delete the endpoint now if you have finished experimenting.
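For example, a small sketch (response here is the list of dicts returned by the endpoint call above):

# Combine the JSON predictions with the rows we sent (sketch; `response` comes from the call above)
import io
import pandas as pd

payload = """Level,Profession,Sword,Shield,Magic_Level,Average_Weekly_Time_Minutes,Gold_Spent
233,Rogue,338,522,421,246,88030
35,Wizard,54,54,102,71,5267
52,Rogue,81,127,97,90,6634
434,Cleric,571,577,1223,1336,1555266
186,Wizard,249,251,524,181,44998
"""
sent = pd.read_csv(io.StringIO(payload))
combined = pd.concat([sent, pd.DataFrame(response)], axis=1)
combined.head()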

Predicting gold spent with XGBoost

XGBoost is an amazing tree-based algorithm, but better than its predecessors: it isn't random like Random Forest and it isn't a single tree. It improves with each boosting round by correcting the errors of the previous one. We will use Amazon's provided XGBoost image, which trains on SageMaker automatically once we provide the data.

However, we first need to prepare this data. We will use a modified version of our previous prepare-kmeans.py script that also splits the data into train, validation and test sets. Upload the following file to the notebook work directory: prepare-xgb.py
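Again, the linked script is the authoritative one; a minimal sketch of what such a script can do looks like the following. The paths and environment variables match the processXGB function below, while the split ratios, file names and header handling are my assumptions.

# prepare-xgb.py -- minimal sketch (assumed; the real script is linked above).
# The built-in XGBoost CSV format expects no header and the target (Gold_Spent) as the first column.
import os
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

file_name = os.environ["SM_PROCESS_FILE"]
stem = Path(file_name).stem
csv = pd.read_csv(f"/opt/ml/processing/input/{file_name}")

# One-hot encode the profession and put Gold_Spent first
csv = pd.concat([csv.drop(columns=["Profession"]),
                 pd.get_dummies(csv["Profession"], dtype=int)], axis=1)
features = [c for c in csv.columns if c != "Gold_Spent"]
csv = csv[["Gold_Spent"] + features]

if os.environ.get("SM_SCALER_ENABLE") == "1":
    csv[features] = StandardScaler().fit_transform(csv[features])

train, rest = train_test_split(csv, test_size=0.3, random_state=0)
val, test = train_test_split(rest, test_size=0.5, random_state=0)

for name, part in [("train", train), ("val", val), ("test", test)]:
    out_dir = f"/opt/ml/processing/output/{name}"
    os.makedirs(out_dir, exist_ok=True)
    # The test split keeps a header so we can read it back comfortably in the notebook
    header = name == "test"
    suffix = "testing" if name == "test" else name
    part.to_csv(f"{out_dir}/{stem}.{suffix}.csv", index=False, header=header)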

And next we will create a new function for running this processor.

def processXGB(instance_type, file_name, enable_scaler=False):
    env = {"SM_PROCESS_FILE": file_name}
    destination = f"s3://{BUCKET}/xgb/{basename(file_name)}-split"

    if enable_scaler:
        env["SM_SCALER_ENABLE"] = "1"
        destination = f"s3://{BUCKET}/xgb/{basename(file_name)}-split-scaled"

    preprocessor = SKLearnProcessor(
        role=role,
        framework_version="1.2-1",
        instance_type=instance_type,
        instance_count=1,
        sagemaker_session=sess,
        env=env
    )

    inputs = [ProcessingInput(
        source=f"s3://{BUCKET}/kmeans/{file_name}",
        destination='/opt/ml/processing/input'
    )]

    outputs = [ProcessingOutput(
        source='/opt/ml/processing/output',
        destination=destination
    )]

    preprocessor.run(
        code='prepare-xgb.py',
        inputs=inputs,
        outputs=outputs
    )

We can test this function with the smaller dataset on the local notebook processor. If this succeeds, we can run it on the bigger dataset, with and without the scaler turned on, to see whether the model performs better with or without scaling.

processXGB("local", "players.csv")
processXGB("ml.t3.xlarge", "players-big.csv")
processXGB("ml.t3.xlarge", "players-big.csv", True)

XGB splits

The datasets should land in our S3 bucket. We can now train the models. I will experiment with the loss function used by XGBoost to check which one performs best on this data. I also included a wait parameter so that several models can be trained in parallel.

from sagemaker.estimator import Estimator
from sagemaker.image_uris import retrieve
from sagemaker.inputs import TrainingInput
import re
from datetime import datetime

def train(instance_type, prefix, loss_type="squaredlogerror", rounds=25, wait=True):
    # Last part after slash and convert all non-alphanumeric characters to hyphen
    name = re.sub("\\-+", "-", re.sub(r'\W+', '-', prefix.split('/')[-1]))
    # Add number of rounds, loss type and some timestamp to the name
    name = f"{name}-{rounds}-{loss_type}-{str(int(datetime.now().timestamp()))[-6:]}"
    # Shorten the name
    name = re.sub("square", "sq", re.sub("error", "e", name))

    xgb = Estimator(
        retrieve("xgboost", sess.boto_region_name, "1.7-1"),
        role,
        instance_count=1,
        instance_type=instance_type,
        output_path=f"s3://{BUCKET}/output/{prefix}",
        sagemaker_session=sess
    )

    # Large depth, eta, gamma and weight seem to work better on this dataset
    # but it overfits around 20th iteration.
    xgb.set_hyperparameters(
        max_depth=80,
        eta=0.475,
        gamma=2.5,
        min_child_weight=24,
        subsample=0.8,
        verbosity=0,
        objective=f"reg:{loss_type}",
        num_round=rounds
    )

    xgb.fit({
            "train": TrainingInput( s3_data=f"s3://{BUCKET}/{prefix}/train", content_type="csv" ),
            "validation": TrainingInput( s3_data=f"s3://{BUCKET}/{prefix}/val", content_type="csv" )
        },
        job_name=name,
        wait=wait
    )

    return xgb

We can now train the models. I will run them in parallel, three at a time. You can observe the results in the console. With a dataset of this size, the training shouldn't take too long, around 3-5 minutes.

xgb_sqlog = train("ml.m5.xlarge", "xgb/players-big-split", wait = False)
xgb_sq = train("ml.m5.xlarge", "xgb/players-big-split", "squarederror", wait = False)
xgb_abs = train("ml.m5.xlarge", "xgb/players-big-split", "absoluteerror", wait = True)
# Wait here for the above to finish
xgb_sqlog_sc = train("ml.m5.xlarge", "xgb/players-big-split-scaled", wait = False)
xgb_sq_sc = train("ml.m5.xlarge", "xgb/players-big-split-scaled", "squarederror", wait = False)
xgb_abs_sc = train("ml.m5.xlarge", "xgb/players-big-split-scaled", "absoluteerror", wait = True)

Training several models in parallel

Now we can evaluate the models. We will copy our testing split over to the notebook instance and submit it to an endpoint. As it's a small subset, we can safely manipulate it in the notebook. The function will spin up an endpoint from the trained estimator, run the predictions, delete the endpoint, and return the mean squared error and the R-squared score.

from sagemaker.serializers import CSVSerializer
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def evaluate(model, instance_type, data):
    pred = model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        serializer=CSVSerializer()
    )

    actual = data.to_numpy()[:, 0]  # the first column is the target (Gold_Spent)
    data = data.to_numpy()[:, 1:]   # the remaining columns are the features
    split_array = np.array_split(data, int(data.shape[0] / float(100) + 1))
    predictions = ""
    for array in split_array:
        predictions = "".join([predictions, pred.predict(array).decode("utf-8")])

    pred.delete_endpoint()

    predictions = predictions.split("\n")[:-1]
    predictions = np.array([float(x) for x in predictions])

    for p in range(5):
        print(f"{p} (predicted, actual) = {predictions[p]:.3f}, {actual[p]:.3f}")

    return (mean_squared_error(actual, predictions), r2_score(actual, predictions))

!aws s3 cp s3://sagemaker-data-2024-02-10-xyz345/xgb/players-big-split/test/players-big.testing.csv testing.csv
!aws s3 cp s3://sagemaker-data-2024-02-10-xyz345/xgb/players-big-split-scaled/test/players-big.testing.csv testing-sc.csv
testing = pd.read_csv('testing.csv')
testing_sc = pd.read_csv('testing-sc.csv')

As the first experiment, I looked mostly at the squarederror model (the second one), which showed some overfitting around the 20th round. Despite that, I decided to measure it against the test set. And the predictions were pretty accurate.

0 (predicted, actual) = 352036.250, 351934.000
1 (predicted, actual) = 31445.012, 31469.000
2 (predicted, actual) = 67149.734, 67084.000
3 (predicted, actual) = 219671.359, 219695.000
4 (predicted, actual) = 17301.877, 17374.000
MSE: 3948.7184, R2: 1.0000

Let's try with other models and scaled data.

mse, r2 = evaluate(xgb_sqlog, "ml.m5.large", testing)
print(f"Squaredlog: MSE: {mse:.4f}, R2: {r2:.4f}")
mse, r2 = evaluate(xgb_sq, "ml.m5.large", testing)
print(f"Squared: MSE: {mse:.4f}, R2: {r2:.4f}")
mse, r2 = evaluate(xgb_abs, "ml.m5.large", testing)
print(f"Absolute: MSE: {mse:.4f}, R2: {r2:.4f}")
mse, r2 = evaluate(xgb_sqlog_sc, "ml.m5.large", testing_sc)
print(f"Squaredlog Scaled: MSE: {mse:.4f}, R2: {r2:.4f}")
mse, r2 = evaluate(xgb_sq_sc, "ml.m5.large", testing_sc)
print(f"Squared Scaled: MSE: {mse:.4f}, R2: {r2:.4f}")
mse, r2 = evaluate(xgb_abs_sc, "ml.m5.large", testing_sc)
print(f"Absolute Scaled: MSE: {mse:.4f}, R2: {r2:.4f}")

The results show that the squarederror loss function performs best (and is the only one that performs reasonably) on this dataset, in both the scaled and raw versions. This may be due to the hyperparameters we selected. Either way, as a model that predicts how much gold a player will spend based on their statistics, it is good enough.

0 (predicted, actual) = 617.065, 351934.000
1 (predicted, actual) = 528.865, 31469.000
2 (predicted, actual) = 528.865, 67084.000
MSE: 1127183191751.4749, R2: -0.3847

0 (predicted, actual) = 352036.250, 351934.000
1 (predicted, actual) = 31445.012, 31469.000
2 (predicted, actual) = 67149.734, 67084.000
Squared: MSE: 3948.7184, R2: 1.0000

0 (predicted, actual) = 113837.922, 351934.000
1 (predicted, actual) = 113814.094, 31469.000
2 (predicted, actual) = 113814.094, 67084.000
Absolute: MSE: 1013288335448.9435, R2: -0.2448

0 (predicted, actual) = 617.065, 351934.000
1 (predicted, actual) = 528.865, 31469.000
2 (predicted, actual) = 528.865, 67084.000
Squaredlog Scaled: MSE: 1127183191751.4749, R2: -0.3847

0 (predicted, actual) = 352038.406, 351934.000
1 (predicted, actual) = 31466.570, 31469.000
2 (predicted, actual) = 67120.523, 67084.000
Squared Scaled: MSE: 3948.1590, R2: 1.0000

0 (predicted, actual) = 113837.922, 351934.000
1 (predicted, actual) = 113814.094, 31469.000
2 (predicted, actual) = 113814.094, 67084.000
Absolute Scaled: MSE: 1013288335448.9435, R2: -0.2448

Another task would be to find a better set of hyperparameters for the model to make it even more accurate. That can be done with SageMaker Model Tuning. However, this is a task for another day.