How to query your S3 Data Lake using Athena within an AWS Glue Python shell job

16 November 2022

If you’re somewhat familiar with AWS Glue, then you probably know that this serverless AWS ETL service supports two types of jobs: Spark and Python shell. In this article, we’ll focus on Python shell jobs and explain how you can make optimal use of your S3 Data Lake by querying it with Athena from within a Python shell job.

So, keep reading if you want to know how to use an Athena query result in Python using boto3, the AWS SDK for Python.

Why use AWS Glue Python shell jobs in combination with AWS Athena?

1. It’s very cost-efficient.

    Glue Python shell jobs cost $0.40 per hour (for an instance with 16 GB of RAM, which equals 1 DPU), while AWS Athena comes at $5 per TB of data scanned.

    Consider the example where we query a table with approximately 1,900,000,000 rows (and 50 columns) in one of our ETL jobs. Did you know that such a table only takes up 170 GB of storage space on S3 when using the Parquet storage format? On average, around 8 GB of data is scanned per query when using partition-based filtering. At roughly 0.008 TB scanned per query, that works out to an average cost of only about $0.04 per query.

2. It delivers top-notch performance.

    An analytical query with basic aggregation on a month’s worth of data (approximately 148,000,000 rows) takes only about 15 seconds on average. Behind the scenes, AWS Athena uses Presto, and that clearly pays off.

3. It’s serverless.

    No more worries about the underlying architecture and resources, because Glue and Athena are serverless services. You can assign DPU capacity to a Glue Python shell job (with a maximum of 1 DPU, which equals an instance with 16 GB of RAM), and you’re up and running.

A step-by-step guide to using AWS Glue Python shell jobs

Let's briefly go through the steps, following existing best practices and code templates. We assume that you already know how AWS Athena works and have created the relevant tables using the Glue Data Catalog as a Hive metastore. And before we get started with our step-by-step guide, here's one more tip: make sure to write your production pipelines and jobs following the PEP 8 coding conventions.

1. Import the required modules.

```python
import io
import time

import boto3
```

2. Create a function that runs a given query and returns the result as a file-like object.

This function has the following parameters (a sketch of the function follows the list):

- athena_client: a boto3 Athena client object.
- query: the SQL query as a string value (this is the query that we launch on AWS Athena).
- database: the AWS Athena database to which the tables that you query belong.
- s3_query_output_path: the full S3 output path as a string. This is the location on S3 where the query result is stored. AWS Athena uses CSV as the default output file format; unfortunately, the boto3 Athena client doesn't offer a parameter to write the query result directly in the Parquet format.
- s3_bucket: the S3 bucket of the Athena query output location.
- s3_prefix: the prefix of the s3_query_output_path variable. This parameter is used to fetch the query result from the relevant S3 location as a file-like object. You could also generate this parameter dynamically from the s3_query_output_path parameter by concatenating strings; we just pass it as a fixed parameter in this example for educational purposes.
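Putting those parameters together, the function could look roughly like the sketch below. It is a minimal example rather than a production-ready template: the function name run_query, the simple polling loop, and the way the result key is derived from s3_prefix and the query execution ID are assumptions you may want to adapt to your own setup.

```python
def run_query(athena_client, query, database, s3_query_output_path, s3_bucket, s3_prefix):
    """Run a query on AWS Athena and return the CSV result as a file-like object."""
    # Start the query execution and point Athena to the S3 output location.
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": s3_query_output_path},
    )
    query_execution_id = response["QueryExecutionId"]

    # Poll until the query has finished.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_execution_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_execution_id} ended in state {state}")
        time.sleep(5)

    # Athena writes the result as <query_execution_id>.csv under the output location
    # (this assumes s3_prefix has no trailing slash).
    s3_client = boto3.client("s3")
    result_key = f"{s3_prefix}/{query_execution_id}.csv"
    result_object = s3_client.get_object(Bucket=s3_bucket, Key=result_key)
    return io.BytesIO(result_object["Body"].read())
```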

3. Create variables to pass as parameters and run the function.
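For example (the bucket, prefix, database, and table names below are placeholders, and the query assumes a table partitioned by year and month):

```python
# Placeholder values -- replace these with your own bucket, prefix, database, and table.
s3_bucket = "my-athena-results-bucket"
s3_prefix = "athena-query-results"
s3_query_output_path = f"s3://{s3_bucket}/{s3_prefix}/"
database = "my_data_lake_db"
query = """
    SELECT order_id, order_amount
    FROM sales
    WHERE year = '2022' AND month = '10'
"""

athena_client = boto3.client("athena")
query_result = run_query(
    athena_client, query, database, s3_query_output_path, s3_bucket, s3_prefix
)
```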

4. Optionally, load the file-like object into a Pandas DataFrame.
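Pandas comes pre-installed in Glue Python shell jobs, so you can read the CSV file-like object directly into a DataFrame:

```python
import pandas as pd

# The Athena result is a CSV file-like object, so pandas can read it directly.
df = pd.read_csv(query_result)
print(df.head())
```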

Always keep the following in mind when using AWS Glue Python shell jobs

Make sure to use partition-based filtering within your SQL queries. Assuming that the source data in your Data Lake are partitioned, you should use those partition columns in your query filters! If you don't, you will undoubtedly run into memory issues when retrieving big CSV file-like objects. To give you some more context: when we run jobs at maximum capacity (1 DPU / 16 GB of RAM), we can retrieve results of approximately 30 million rows (with 20 columns) in a single job. And if you do encounter memory errors, always check the size of the generated CSV file first!
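A quick way to check that size is to call head_object on the result file; this sketch assumes query_execution_id holds the ID Athena returned for your query (as in the function above):

```python
s3_client = boto3.client("s3")

# Inspect the size of the generated CSV without downloading it.
response = s3_client.head_object(
    Bucket=s3_bucket, Key=f"{s3_prefix}/{query_execution_id}.csv"
)
print(f"Query result size: {response['ContentLength'] / (1024 ** 2):.1f} MB")
```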

Need more detail?

Interested in using AWS Glue Python shell jobs in combination with AWS Athena? Get in touch with us, and we'll be happy to discuss this in more detail.