Spark your Infrastructure: Terraform to deploy an AWS Glue PySpark job

17 April 2023

The days when infrastructure was managed and provisioned through manual processes are, or at least should be, long behind us. Nowadays, it is a best practice to treat infrastructure as code. Infrastructure as Code (IaC) is a technique based on software development principles, and several tools are available to adopt it. It goes without saying that building a modern data platform means embracing the practice of Infrastructure as Code.

One of the most popular IaC tools is Terraform. In this blog post we want to show you how to set up a simple AWS Glue PySpark job using Terraform. To familiarize yourself with AWS Glue Python jobs, you can read one of our previous blog posts.

Step-by-step guide

1. Develop an AWS Glue PySpark job

First of all, you need to develop a Glue PySpark job. The example below creates a Spark DataFrame from a list of data, performs some basic transformations and stores the DataFrame as a Parquet file inside an S3 bucket.

blogpost_job.py
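The original script is not reproduced here, so the following is a minimal sketch of what such a job could look like. The column names, sample rows, transformations and output prefix are illustrative assumptions; only the bucket and script conventions come from this post.

```python
# blogpost_job.py -- sketch of a Glue PySpark job (runs inside the Glue runtime).
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Standard Glue boilerplate: resolve job arguments and obtain a Spark session
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Create a DataFrame from an in-memory list of data (sample rows are made up)
data = [("Alice", 34), ("Bob", 45), ("Carol", 29)]
df = spark.createDataFrame(data, schema=["name", "age"])

# Basic transformations: uppercase the name, keep only rows with age >= 30
df = (
    df.withColumn("name", F.upper(F.col("name")))
      .filter(F.col("age") >= 30)
)

# Store the result as Parquet in S3 (output prefix is an assumption)
df.write.mode("overwrite").parquet("s3://datashift-playground-dev/blogpost/output/")
```

The `awsglue` imports are only available inside the Glue runtime, so this script is meant to be executed as a Glue job rather than locally.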

2. AWS infrastructure

Once you have a Glue PySpark job, you need to set up the necessary AWS resources to run it. This is where Terraform comes into play. In the following code snippets we use Terraform to create the required resources.

2.1. Terraform input variables

Before we start declaring the AWS resources, we define two input variables. Terraform input variables allow you to parameterize your Terraform modules, which makes them composable and reusable.

variables.tf
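The variables themselves are not shown in this post, so here is a plausible sketch. The variable names and defaults are assumptions, chosen so that the resource names used later (such as the `datashift-playground-dev` bucket) can be derived from them:

```hcl
# variables.tf -- two input variables to parameterize the module.
# Names and defaults are assumptions, not the original declarations.

variable "project" {
  description = "Project prefix used to name the AWS resources"
  type        = string
  default     = "blogpost"
}

variable "environment" {
  description = "Deployment environment, e.g. dev or prd"
  type        = string
  default     = "dev"
}
```

Overriding `environment` (for example via `terraform apply -var="environment=prd"`) would then deploy the same resources under a different suffix.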

2.2. AWS S3 bucket

The script containing the Glue job must be stored in an S3 bucket. We will therefore first create an S3 bucket called datashift-playground-dev. Apply the Terraform execution plan to deploy the S3 bucket.

main.tf
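A sketch of the bucket declaration, assuming the `environment` variable from the previous step so that the resulting name is `datashift-playground-dev`:

```hcl
# main.tf -- S3 bucket that will hold the Glue job script.
# The resource label and the use of var.environment are assumptions.

resource "aws_s3_bucket" "playground" {
  bucket = "datashift-playground-${var.environment}"
}
```

Running `terraform init` followed by `terraform apply` creates the bucket.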

Once deployed, you can upload the Python script into the bucket. Note that in the following steps we assume the script is located in s3://datashift-playground-dev/blogpost/glue-jobs/blogpost_job.py.

2.3. AWS Glue

The rest of the needed AWS resources can be deployed after uploading the Python script. Add the declarations below to your main.tf and apply the execution plan again so that the additional resources are created.

main.tf
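Since the original declarations are not reproduced here, the following is a minimal sketch of what they could look like: an IAM role that Glue can assume, the AWS-managed Glue service policy, read access to the script bucket, and the Glue job itself pointing at the uploaded script. Resource labels and the exact permissions are assumptions.

```hcl
# main.tf (continued) -- IAM role and Glue job. Labels and policy scope are
# assumptions; the script location follows the convention from this post.

# IAM role that the Glue job assumes at runtime
resource "aws_iam_role" "glue_job" {
  name = "${var.project}-glue-job-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "glue.amazonaws.com" }
    }]
  })
}

# AWS-managed service policy for Glue
resource "aws_iam_role_policy_attachment" "glue_service" {
  role       = aws_iam_role.glue_job.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"
}

# Access to the bucket holding the script and the Parquet output
resource "aws_iam_role_policy" "s3_access" {
  name = "${var.project}-s3-access"
  role = aws_iam_role.glue_job.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::datashift-playground-${var.environment}",
        "arn:aws:s3:::datashift-playground-${var.environment}/*",
      ]
    }]
  })
}

# The Glue PySpark job itself
resource "aws_glue_job" "blogpost" {
  name         = "${var.project}-job"
  role_arn     = aws_iam_role.glue_job.arn
  glue_version = "4.0"

  command {
    name            = "glueetl" # Spark ETL job type
    script_location = "s3://datashift-playground-${var.environment}/blogpost/glue-jobs/blogpost_job.py"
    python_version  = "3"
  }

  default_arguments = {
    "--job-language" = "python"
  }
}
```

After applying this plan, the job can be started from the Glue console or with `aws glue start-job-run --job-name blogpost-job`.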

Congrats! You have now deployed your AWS Glue PySpark job using Terraform.

Need more detail?

Interested in setting up AWS Glue PySpark jobs or in managing your infrastructure with Terraform? Get in touch with us and we will be happy to discuss this in more detail.