How To: XGBoost and Hyperparameter Tuning with AWS

Adam Davis
Oct 20, 2022
L’Aurora, Salvador Dali, 1948

“Don’t bother about being modern. Unfortunately, it is the one thing that, whatever you do, you cannot avoid.” -Salvador Dali

Are you a generalist? I think of myself as a generalist of sorts when it comes to math and programming. I’ve had experience with most of the different aspects, from operations research and game theory to analyzing data and even writing the right SQL commands to improve efficiency. One area where I struggle is combining everything into one neat package. I hadn’t used AWS much, but that has changed now: I got myself on the free tier and am plugging away deploying models.

While learning more about deployment and even the start-up of a notebook (and what a data bucket is), I noticed that there are a lot of general tutorials. Not many of them delve into visualization; instead they rely on blind trust that your model works. Here I build on an example and also add in my preferred metrics for model accuracy and effectiveness.

Amazon Web Services (AWS) is a cloud computing service that allows multiple people to work on the same models and to deploy them and keep them up and running for different applications. In this case I use it to deploy a model that is fully trained on specific data and can be used to make future predictions. Fair warning: this dataset is fairly imbalanced and I didn’t do any actual balancing; this article is meant to show how to train and deploy a model and check its accuracy in case we would like to re-train it.

Start:

You’ll need to create your session using the correct container. These all need to be specified according to the AWS region you are in and the model you are using. The “prefix” you create is the S3 path your session artifacts will be saved under.
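A minimal sketch of that setup, assuming the SageMaker Python SDK (v2) running inside a SageMaker notebook; the bucket name `my-xgb-bucket` and the `xgboost-tuning` prefix are placeholders:

```python
import boto3
import sagemaker
from sagemaker import image_uris

# Create the SageMaker session and pull the region from the boto3 session.
session = sagemaker.Session()
region = boto3.Session().region_name
role = sagemaker.get_execution_role()  # works inside a SageMaker notebook

# Placeholder bucket/prefix -- substitute your own.
bucket = "my-xgb-bucket"
prefix = "xgboost-tuning"

# Retrieve the region-specific container image for the built-in XGBoost algorithm.
container = image_uris.retrieve("xgboost", region, version="1.5-1")
```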

Next, index the data bucket that you already created.
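One way to do that with boto3, listing whatever is already sitting under the prefix (the `bucket` and `prefix` names come from the sketch above):

```python
s3 = boto3.client("s3")

# List the objects already stored under our prefix.
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```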

Load your dataframe. This is the standard way to read data using pandas.
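For example, reading a CSV straight from S3 (the file name `data.csv` is a placeholder, and pandas needs the `s3fs` package installed to handle `s3://` paths):

```python
import pandas as pd

# Read the raw CSV from S3 into a dataframe (requires s3fs).
df = pd.read_csv(f"s3://{bucket}/{prefix}/data.csv")
print(df.shape)
df.head()
```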

Next we’ll do a standard 70/30 train/test split and create a correlation matrix:
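A sketch of both steps, assuming a column named `target` holds the label (swap in your own column name):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# 70/30 split; stratify on the (assumed) target column since the data is imbalanced.
train_df, test_df = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["target"]
)

# Correlation matrix as a heatmap to eyeball feature relationships.
plt.figure(figsize=(10, 8))
sns.heatmap(train_df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
```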

Wow, not the best fit. The next post will deal with some feature engineering and ways to address this.

To tune the hyperparameters of our model, we need the data in the specific format the built-in XGBoost algorithm in AWS expects: the target column has to be the first (left-most) column, so we delete it from the end of the dataset and re-insert it at the front. Just switching places, really. For the validation set of our model, we are using the test dataset. We’ll also need to upload both files to S3 and register them with our AWS session.
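A sketch of the reshuffle and upload, again assuming the `target` column name; SageMaker’s built-in XGBoost wants header-free CSVs with the label first:

```python
from sagemaker.inputs import TrainingInput

def to_sagemaker_csv(frame, path, label="target"):
    """Move the label to the first column and write a header-free CSV."""
    ordered = pd.concat([frame[label], frame.drop(columns=[label])], axis=1)
    ordered.to_csv(path, index=False, header=False)

to_sagemaker_csv(train_df, "train.csv")
to_sagemaker_csv(test_df, "validation.csv")

# Upload both files under our prefix and wrap them as training channels.
s3_train = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/train")
s3_val = session.upload_data("validation.csv", bucket=bucket, key_prefix=f"{prefix}/validation")
train_input = TrainingInput(s3_train, content_type="text/csv")
val_input = TrainingInput(s3_val, content_type="text/csv")
```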

As we tune our models, we need to specify the ranges and types of the hyperparameters we want tuned. I also used two different types of tuners to check which has the best performance on our data.
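A minimal sketch with the SageMaker SDK, assuming a binary-classification objective and AUC on the validation channel as the tuning metric; the ranges here are illustrative, and I’m interpreting the two “types of tuners” as the Bayesian and Random search strategies:

```python
from sagemaker.estimator import Estimator
from sagemaker.tuner import (
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

# Base estimator using the XGBoost container we retrieved earlier.
xgb = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/{prefix}/output",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Illustrative ranges for a few common XGBoost knobs.
ranges = {
    "eta": ContinuousParameter(0.01, 0.3),
    "max_depth": IntegerParameter(3, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

# Two tuners with different search strategies, to compare performance.
tuners = {
    strategy: HyperparameterTuner(
        xgb,
        objective_metric_name="validation:auc",
        hyperparameter_ranges=ranges,
        strategy=strategy,
        max_jobs=10,
        max_parallel_jobs=2,
    )
    for strategy in ("Bayesian", "Random")
}

# Fit one tuner (repeat with tuners["Random"] to compare).
tuner = tuners["Bayesian"]
tuner.fit({"train": train_input, "validation": val_input})
```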

Deployment

We are now ready to deploy our model. There is some post-processing that we can do on our deployed model’s predictions as well.
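A sketch of deploying the tuner’s best training job and turning the raw prediction string into class labels; the instance type and the 0.5 cutoff are assumptions:

```python
import numpy as np
from sagemaker.serializers import CSVSerializer

# Deploy the best model found by the tuner behind a real-time endpoint.
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
predictor.serializer = CSVSerializer()

# Send the features (label column dropped), then decode the CSV response.
test_x = test_df.drop(columns=["target"])
raw = predictor.predict(test_x.values).decode("utf-8")
probs = np.array(raw.strip().split(","), dtype=float)

# Post-process: threshold the probabilities into 0/1 class labels.
preds = (probs > 0.5).astype(int)
```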

Let’s create a confusion matrix to test the accuracy of our model, along with a classification report.
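A sketch with scikit-learn, assuming the `preds` array and the true labels from the test split above:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = test_df["target"].values

# Confusion matrix: rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, preds))

# Precision, recall, and F1 per class -- useful on an imbalanced dataset.
print(classification_report(y_true, preds))
```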

Not fantastic, but the purpose of this wasn’t to create the most accurate model, just a working model in general. Lastly, don’t forget to delete the endpoint and the bucket contents, and to stop the notebook instance from running, so you don’t keep paying for them.
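Most of the cleanup can be done from the notebook too; a sketch, assuming the same names as above (the notebook instance itself is stopped from the SageMaker console):

```python
# Tear down the real-time endpoint so it stops billing.
predictor.delete_endpoint()

# Empty the objects we wrote under our prefix.
boto3.resource("s3").Bucket(bucket).objects.filter(Prefix=prefix).delete()
```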
