Categories
Architecture & Resilience

Event-Based Cost Control In AWS Glue: Architecture

In this post, I examine some unexpected AWS Glue costs and design an event-based cost control process architecture.

Table of Contents

Introduction

Last month, I finished a series of data pipeline posts using, among other services, AWS Glue. During this series I made many discoveries – some more desirable than others. One such undesirable was a cost spike in early June! Not enough to trigger a budget alarm, but still higher than expected at that time.

To Cost Explorer! These were the results:

2024 06 24 AWSCostsStartJune

Those Glue costs were…unexpected. While this doesn’t look like much, in contrast my entire May 2024 bill was $1.08. So June saw an almost 150% cost increase over just three days!

This post has two sections. Firstly, the Discovery section examines the costs in closer detail and considers potential solutions. Secondly, the Architecture section examines the decisions made for and the technical implementation of the chosen solution.

Discovery

This section examines the costs in closer detail and considers potential solutions. I’ll structure the cost analysis using three questions:

  • How are the costs made up?
  • What specifically is generating the costs?
  • Why are the costs being generated?

The How

Question 1: How are the costs made up?

Firstly, let’s break down the costs. The earlier chart shows that Glue is the main cost driver – I now want to drill down into the API-level costs. I can do this by changing the chart’s dimension to API Operation.

This updates it to:

2024 06 24 AWSCostsStartJuneDimAPI

And the raw data to:

2024 06 24 AWSCostsStartJuneTable

The main costs here are all Glue APIs, with the top two being:

  • GlueInteractiveSession
  • Jobrun

No operation is tax – Ed

Jobrun was easy to account for, as I was testing some Glue ETL jobs at the time. But I was unfamiliar with GlueInteractiveSession, and as it was the biggest cost driver it became the focus of my ongoing investigation.

The What

Question 2: What specifically is generating the costs?

So what is the GlueInteractiveSession API? What does it do? And how does it accrue costs? Let’s begin with the AWS User Guide definition:

The interactive sessions API describes the AWS Glue API related to using AWS Glue interactive sessions to build and test extract, transform, and load (ETL) scripts for data integration.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-interactive-sessions.html

AWS Glue Interactive Sessions offer serverless, on-demand Apache Spark environments that work seamlessly with Glue ETL jobs. These sessions allow for the live development, testing, and enhancement of data processing steps and ETL tasks. They can easily connect to data from various AWS services such as S3, DynamoDB, and Redshift.

Interactive Sessions let users preview data without running full ETL jobs. This offers several benefits during development and testing:

  • Data modifications are only temporary during an Interactive Session, protecting the original data from undesired and unintended changes.
  • Jobs can be evaluated step by step rather than after each complete run, allowing for quicker development and testing compared to always executing the full job. And because of this…
  • When testing ETL steps, interactive sessions usually use fewer resources than a Glue job, thus reducing costs.

Speaking of costs, Glue Interactive Sessions billing is similar to Glue ETL Job billing and is based on the following factors:

  1. Duration: How long the session runs, measured in seconds.
  2. Resource Usage: The resources consumed during the session, such as CPU, memory, and storage.

This all sounds good. So why is my bill so high?

The Why

Question 3: Why are the costs being generated?

So I now know that:

  • The GlueInteractiveSession API is the main cost driver.
  • My Glue Interactive Sessions are linked to my AWS Glue ETL Jobs.

Let’s now examine why the GlueInteractiveSession API is suddenly generating higher costs.

The How chart shows that GlueInteractiveSession costs can happen irrespective of Jobrun costs. Indeed – on June 03 there were no Jobrun costs. So running Glue ETL jobs isn’t causing these charges.

Helpfully, the AWS Glue console has a dedicated Interactive Sessions section that shows session instance histories. Upon inspection, I found lots of this:

2024 12 08 GlueSessionTImeout

So, timeouts. Timeouts are good. They stop Interactive Sessions from running indefinitely, and sessions started from the Glue console automatically get a 30-minute timeout.

What was more concerning was the number of timeouts I found: three on June 02 and six on June 03. That’s nine sessions, each of which timing out after 30 minutes. That’s four and a half hours of unused compute I’m being billed for! How are these timeouts happening?

…About that. I often open multiple browser tabs to compare screens quickly when I’m trying things out. Here, each new Glue ETL Job browser tab starts a new interactive session based on my commands, and I forget to close these sessions afterwards. Oops!

Solutions

So now I know the cost’s root cause is my own ineptitude, how do I fix this? There are several options:

Permission Blocking: I could deny CreateSession requests using IAM and SCPs. This solution works for non-data-facing AWS accounts but creates unreasonable barriers for Glue-based console workstreams elsewhere.

Parameter Adjustment: The CreateSession API has an IdleTimeout parameter that controls the number of minutes when idle before the session times out. Although this can be easily configured through the CLI or SDK, I haven’t found a way to adjust it in the console yet.

Local Sessions: AWS maintains a Glue Labs Docker image intended for local AWS Glue job script development and testing. This would replace the cloud-based Interactive Sessions entirely and is arguably the best solution for data teams and at scale. The main reason I’m not using it here is that I’m the only user of this particular AWS account.

Event-Based Automation: All Interactive Sessions are stopped using the StopSession API regardless of reason. This includes the timeout process. An automated mechanism that invokes this API after a set period would effectively emulate a timeout. Additionally, since I oversee this process, I’m able to swiftly adjust the duration as needed.

And so I finally have a user story:

As an AWS account owner, I want Glue interactive sessions to stop automatically after a chosen duration so that I don’t accidentally generate unexpected and avoidable costs.

Finally, there is one further topic I want to address…

Event-Based Vs Event-Driven

Let’s examine the difference between event-based and event-driven. Mainly because I thought this was an event-driven process for months until I did some digging.

Now, I’m no expert on this. However, James Eastham is. Go watch this. It’s only six minutes – I’ll wait.

Ok good. For those who are time-strapped or want the highlights:

  • Event-based systems are technical events. Represented in a data context as API calls like ObjectCreated and CrawlerStarted.
  • Event-driven systems are business events. Represented in a data context as processes like Refresh Started and Sales Data Ingested.

My Glue Cost Control system is event-based because it is governed entirely by AWS events and API calls: StartSession will trigger some AWS automation that ultimately invokes StopSession.

So what does that automation look like? Well…

Architecture

This section examines the decision-making and technical implementation of my AWS Glue event-based cost control architecture. In my investigations, I discovered that AWS is way ahead of me!

Existing AWS Solution

The AWS Big Data blog has a 2023 post about enforcing boundaries on AWS Glue interactive sessions using this architecture:

The whole process is listed here, and the post’s code is in a GitHub repo. In summary:

  • The Glue Interactive Session creates a CloudTrail Event Record.
  • An EventBridge Rule captures the event and invokes a Lambda function.
  • The Lambda function inspects the event and acts depending on set boundaries.
  • SNS handles user notifications.
  • SQS and CloudWatch handle errors.

I’m using this architecture as a basis for my event-based Glue cost control process with some changes.

Architectural Decisions

This section outlines my adjustments to the AWS architecture to better align with my event-based Glue cost control process.

Replace Lambda With Step Functions

The AWS solution uses a Lambda function for event inspection and API interaction. This function has lots going on. But my needs are far simpler and fall well within the remit of a Step Functions workflow.

Many AWS heavyweights evangelize Step Functions over Lambda. Most recently, Eric Johnson dedicated a slide of his 2024 re:Invent session to this mantra:

“Step Functions first,
Step Functions always.”

For this use case, I’m inclined to agree. Step Functions offers several advantages over Lambda here:

Service Integration: Lambda’s interactivity with other AWS services requires manual code (e.g. a Python boto3 client). Step Functions offer no-code AWS service integrations that interact directly with AWS APIs. So my Step Function will be faster to develop.

Error Handling: Lambda relies on the function code for error handling and retries. In contrast, Step Functions offer configurable built-in no-code error handling and retry mechanisms, making my Step Function more resilient.

Ongoing Maintenance: While AWS manages the Lambda service, the function code still needs runtime maintenance, security patching and general refactoring as it ages. Conversely, Step Functions use static JSON and YAML-based ASL, so my Step Function will require less ongoing maintenance.

Step Function Model

There are two Step Function models: Standard Workflows and Express Workflows. I’ll be using a Standard workflow here. Two factors drive this decision:

API Behaviour: Changing a Glue Interactive Session is not an idempotent action. Requesting a change to a session in an invalid state produces an IllegalSessionState exception. For example, the below error is thrown when trying to stop a Glue job that hasn’t yet been fully provisioned:

JSON
{
  "cause": "Session is in PROVISIONING status (Service: Glue, Status Code: 400, Request ID: null)",
  "error": "Glue.IllegalSessionStateException",
  "resource": "stopSession",
  "resourceType": "aws-sdk:glue"
}

Express Workflows utilize an at-least-once model, meaning an execution might run multiple times. Sending several requests that are very likely to fail will create confusion and waste resources. In contrast, Standard Workflows adhere to an exactly-once model with optional retries, significantly reducing the likelihood of these problems.

And speaking of resource use…

Cost: Express Workflow executions are charged according to how often they run, the duration of each run and the memory consumed during the process. Standard Workflow executions are billed based on the number of state transitions and feature a generous and indefinite free tier.

Standard Workflows are a better option here because my workflow requires waiting. While Express Workflows may not be too costly, I’d still be paying for the wait. And remember – the whole point is to reduce avoidable costs! Conversely, Standard Workflows would stay entirely within the free tier at the expected volumes.

Remove The SQS Queue

I’ve removed the SQS queue simply because I don’t need it here. It was originally intended to record events that triggered a Lambda function failure. However, the Step Function workflow’s inbuilt auditing will now capture this.

Considering the Frugal Architect Mindset and AWS Well-Architected Framework‘s Cost Optimization Pillar, the SQS queue’s financial and development costs are no longer justified. This cements its removal.

Architecture Diagram

This is my event-based Glue Cost Control process architecture diagram:

In this solution:

  1. User interacts with a Glue ETL Job and creates an Interactive Session.
  2. Glue CreateSession event is created.
  3. Glue CreateSession event creates a CloudTrail event record.
  4. EventBridge matches the event record to an event rule.
  5. Eventbridge extracts the event’s SessionID and passes it to the Step Functions workflow, which waits for the set duration.
  6. Workflow passes SessionID to the Glue StopSession API. This action retries twice if it is unsuccessful.
  7. Finally, Workflow triggers an SNS email confirming the session’s stop.

Additionally, several services send logs to CloudWatch and gain permissions using IAM. If the Step Function fails, a CloudWatch alarm triggers a user email.

Summary

In this post, I examined some unexpected AWS Glue costs and designed an event-based cost control process architecture.

Once I understood the problem clearly, I iterated on an existing AWS architecture to build my bespoke event-based process. My architecture diagram shows how the key components work together and provides a clear implementation roadmap. In the next post I’ll start the build!

If you found this post helpful, the button below will take you to my contact details, socials, projects, and sessions.

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Data & Analytics

Gold Layer PySpark ETL With AWS Glue Studio

In this post, I create my WordPress data pipeline’s Gold ETL process using PySpark and the AWS Glue Studio visual interface.

Table of Contents

Introduction

Time to finish my WordPress AWS data pipeline! Here it is so far:

AWS Cloud
AWS Cloud
EventBridge
Schedule
EventBridge…
AWS Step Functions workflow
AWS Step Functions workflow
3
3
AWS Lambda Raw Function
AWS Lambda Ra…
AWS SNS Topic
AWS SNS Topic
2
2
State
Machine
State…
AWS Lambda Bronze Function
AWS Lambda Br…
F
F
5
5
AWS Glue
Bronze Crawler
AWS Glue…
4
4
AWS Glue
Silver ETL Job
AWS Glue…
F
F
F
F
1
1
F
F
EventBridge
Scheduler
EventBridge…
AWS SNS Topic
AWS SNS Topic
User
User
CloudWatch Logs
CloudWatch Lo…
F
F
F
F
Text is not SVG – cannot display

In which;

In the Medallion Lakehouse Architecture, this covers both the Bronze and Silver layers that handle raw and processed data respectively. Now I’ll start aggregating my WordPress data for reporting and analytics. For this, I’ll use AWS Glue Studio.

Firstly, I’ll explore Glue Studio and its features. Next, I’ll architect and build an ETL job using Glue Studio’s visual editor while examining some of Glue’s behaviours. Finally, I’ll update my WordPress Data Pipeline Step Functions workflow and examine costs.

Let’s begin with Glue Studio.

AWS Glue Studio

This section introduces Glue Studio and examines Apache Spark.

AWS Glue Studio

AWS Glue Studio is a serverless tool designed for data-centric tasks like automating data preparation, orchestrating data quality checks and creating ETL jobs. It integrates with other AWS services, and also interacts with data from sources like RDS, Redshift and S3. It is ideal for simplifying data transformation and integration processes. The AWS documentation contains full details of Glue Studio’s features.

Under the hood, Glue Studio uses PySpark, the Python API for Apache Spark. Workflows can be created both as code and via Glue Studio’s visual interface. Glue Studio supports Git version control systems for change management, and integrates several observability tools including AWS IAM for security and Amazon CloudWatch for logging. Additionally, Glue also has its own monitoring and orchestration tools.

But wait – Spark? PySpark? What?!

Apache Spark

Apache Spark is an open-source framework designed to process large-scale data quickly. Spark enables distributed computing, allowing tasks to be performed across multiple machines for faster and more efficient data processing. It has existed since 2014.

Known for its speed, Spark processes data in memory, significantly reducing the need for slower disk operations associated with older systems. Spark is commonly used for big data analytics, machine learning and real-time data processing in industries that handle massive datasets.

PySpark

PySpark is a Python interface for Apache Spark. It allows operations to be distributed across clusters of machines while maintaining the accessibility and ease of Python. PySpark’s combination of Python’s simplicity and Spark’s power makes it a practical, accessible solution for handling extensive datasets in a fast and scalable way.

Glue Studio’s visual interface automatically writes PySpark code in real time. For example, this boilerplate Python script is created with each new Glue PySpark job:

Python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

For those curious, this DataEng video provides a technical explanation of each import:

So that’s the basics of AWS Glue Studio. Now let’s see what the solution looks like.

Architecture

This section examines my proposed solution’s architecture. Much of this architecture is similar to both the Bronze and Silver layers. I’ll examine the new Gold Glue PySpark ELT job first, followed by the updated WordPress data pipeline Step Function workflow.

Glue Gold ETL Job

Firstly, this is the Gold Glue PySpark ETL job:

While updating CloudWatch Logs throughout:

  1. Gold Glue ETL job extracts data from wordpress-api Silver S3 objects and then performs PySpark transformations.
  2. Gold Glue PySpark ETL job loads the transformed data into Gold S3 bucket as Parquet objects.

Step Function Workflow

Next, the updated Step Function workflow:

While updating the workflow’s CloudWatch Log Group throughout:

  1. An EventBridge Schedule executes the Step Functions workflow. Lambda Raw function is invoked.
    • Invocation Fails: Publish SNS message. Workflow then ends.
    • Invocation Succeeds: Invoke Lambda Bronze function.
  2. Lambda Bronze function is invoked.
    • Invocation Fails: Publish SNS message. Workflow then ends.
    • Invocation Succeeds: Run Glue Bronze Crawler.
  3. Glue Bronze Crawler runs.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Update Glue Data Catalog. Run Glue Silver ETL job.
  4. Glue Silver ETL job runs.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Run Glue Silver Data Quality Checks.
  5. Glue Silver Data Quality Checks run.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Run Glue Silver Crawler.
  6. Glue Silver Crawler runs.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Update Glue Data Catalog. Run Glue Gold ETL job.
  7. Glue Gold PySpark ETL job runs.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Run Glue Gold Crawler.
  8. Glue Gold Crawler runs.
    • Run Fails: Publish SNS message. Workflow then ends.
    • Run Succeeds: Update Glue Data Catalog. Workflow then ends.

Additionally, an SNS message is published if the Step Functions workflow fails.

Gold ETL Job

In this section, I create my Gold Glue PySpark ETL job. Firstly, I’ll define the job’s requirements. Next, I’ll build the job in Glue Studio, and finally I’ll examine Glue’s inbuilt monitoring.

Requirements

Let’s begin by understanding the Gold Layer. Databricks defines it as curated, business-level data:

Data in the Gold layer of the lakehouse is typically organised in consumption-ready “project-specific” databases. The Gold layer is for reporting and uses more de-normalised and read-optimised data models with fewer joins. The final layer of data transformations and data quality rules are applied here.

https://www.databricks.com/glossary/medallion-architecture

The concept of a gold layer is nothing new. Other names include aggregated, enriched and consumption layers. The idea is the same in all cases – producing refined and aggregated datasets that are easily consumable by analytics tools, machine learning models and production applications.

This Gold ETL job will produce an aggregation of both the posts and statistics_pages Silver datasets. The Gold dataset will contain view statistics and post creation data, limited to blog posts.

This will involve:

  • Joining the Silver datasets.
  • Removing unneeded columns to reduce the Gold dataset’s size.
  • Renaming columns to improve the Gold dataset’s legibility.
  • Filtering the Gold dataset to remove unneeded data.

So let’s get started!

Job Creation

This section splits the Gold Glue PySpark ETL job creation process into separate steps for each part.

Sources

Firstly, let’s define the data sources. There are two sources, both of which are folders in the data-lakehouse-silver S3 bucket:

  • wordpress_api/posts/
  • wordpress_api/statistics_pages/

Each source needs a separate node specifying the S3 path and data format. This example shows the Silver posts dataset, where the wordpress_api/posts/ S3 path is selected:

2024 10 25 AWSGlueStudioNodeSource

Finally, this is the Source node’s PySpark code for both posts and statistics_pages:

Python
# Script generated for node S3 Silver statistics_pages
S3Silverstatistics_pages_node1724058965930 = glueContext.create_dynamic_frame.from_options(
  format_options={}, 
  connection_type="s3", 
  format="parquet", 
  connection_options={
    "paths": ["s3://data-lakehouse-silver/wordpress_api/statistics_pages/"], 
    "recurse": True
    },
  transformation_ctx="S3Silverstatistics_pages_node1724058965930"
 )

# Script generated for node S3 Silver posts
S3Silverposts_node1724058915313 = glueContext.create_dynamic_frame.from_options(
  format_options={}, 
  connection_type="s3", 
  format="parquet", 
  connection_options={
    "paths": ["s3://data-lakehouse-silver/wordpress_api/posts/"], 
    "recurse": True
    },
  transformation_ctx="S3Silverposts_node1724058915313"
 )

Join Transformation

From AWS:

The Join transform allows you to combine two datasets into one. You specify the key names in the schema of each dataset to compare.

https://docs.aws.amazon.com/glue/latest/dg/transforms-configure-join.html

This node essentially creates a SQL join using columns from the selected sources. Here, I’ve inner joined posts.ID to statistics_pages.ID:

2024 10 25 AWSGlueStudioNodeJoin

Rows from the Silver datasets that match the join condition are merged into a new row in an output DynamicFrame that will ultimately become the Gold dataset. This frame includes all columns from both Silver datasets.

The ETL visual now shows two source nodes linked to the Join node:

2024 10 25 AWSGlueStudioDAGSourceJoin

Finally, this is the Join node’s PySpark code:

Python
# Script generated for node Join
Join_node1724059035756 = Join.apply(
  frame1=S3Silverposts_node1724058915313,
  frame2=S3Silverstatistics_pages_node1724058965930,
  keys1=["ID"],
  keys2=["id"],
  transformation_ctx="Join_node1724059035756"
  )

Change Schema Transformation

Now it’s time to do some cleaning!

From AWS:

Change Schema transform remaps the source data property keys into the desired configured for the target data. In a Change Schema transform node, you can:

  • Change the name of multiple data property keys.
  • Change the data type of the data property keys, if the new data type is supported and there is a transformation path between the two data types.
  • Choose a subset of data property keys by indicating which data property keys you want to drop.
https://docs.aws.amazon.com/glue/latest/dg/transforms-configure-applymapping.html

Firstly, I set the Join node as the Change Schema node’s parent to update the ETL visual:

2024 10 25 AWSGlueStudioDAGJoinSchema

Following the join, the Gold dataset can be simplified and optimised. Here’s an example of what the Change Schema node looks like in action:

2024 10 25 AWSGlueStudioNodeSchema

Here

  • Source Key shows the current column name.
  • Target Key handles column name changes.
  • Data Type sets the data type.
  • Ticking a Drop box removes that column from the output DynamicFrame

I’ve listed my changes below. Bold items appear in the example.

Firstly, these columns are dropped due to duplication or redundancy:

posts:

  • posts.post_modified
  • post_modified_day
  • post_modified_month
  • post_modified_todate
  • post_modified_year

statistics_pages:

  • date_todate
  • id
  • type
  • uri

Additionally, these columns are renamed to add context:

posts:

  • post_date_todate to post_date

statistics_pages:

  • page_id to statistics_id
  • date to statistics_date
  • date_year to statistics_date_year
  • date_month to statistics_date_month
  • date_day to statistics_date_day

Finally, this is the Change Schema node’s PySpark code:

Python
# Script generated for node Change Schema
ChangeSchema_node1724059144495 = ApplyMapping.apply(
  frame=Join_node1724059035756, 
  mappings=[
    ("ID", "bigint", "post_ID", "long"), 
    ("post_title", "string", "post_title", "string"), 
    ("post_status", "string", "post_status", "string"), 
    ("post_parent", "bigint", "post_parent", "long"), 
    ("post_type", "string", "post_type", "string"), 
    ("post_date_todate", "timestamp", "post_date", "timestamp"), 
    ("post_date_year", "bigint", "post_date_year", "long"), 
    ("post_date_month", "bigint", "post_date_month", "long"), 
    ("post_date_day", "bigint", "post_date_day", "long"), 
    ("page_id", "bigint", "statistics_id", "long"), 
    ("date", "timestamp", "statistics_date", "timestamp"), 
    ("count", "bigint", "statistics_count", "long"), 
    ("date_year", "bigint", "statistics_date_year", "long"), 
    ("date_month", "bigint", "statistics_date_month", "long"), 
    ("date_day", "bigint", "statistics_date_day", "long")
    ], 
  transformation_ctx="ChangeSchema_node1724059144495"
  )

Filter Transformation

The joined, cleaned dataset contains data about all amazonwebshark content. I only want the posts data, so next I’ll filter everything else out.

From AWS:

Use the Filter transform to create a new dataset by filtering records from the input dataset based on a regular expression. Rows that don’t satisfy the filter condition are removed from the output.

https://docs.aws.amazon.com/glue/latest/dg/transforms-filter.html

Firstly, I set the Change Schema node as the Filter node’s parent to update the ETL visual:

2024 10 25 AWSGlueStudioDAGSchemaFilter

Next, I set the filter conditions. I only need one condition here – keep all dataset rows where post_type matches post:

2024 10 25 AWSGlueStudioNodeFilter

Finally, this is the Filter node’s PySpark code:

Python
# Script generated for node Filter
Filter_node1724060106174 = Filter.apply(
  frame=ChangeSchema_node1724059144495, 
  f=lambda row: (bool(re.match("post", row["post_type"]))),
 transformation_ctx="Filter_node1724060106174"
 )

Target

Finally, I must choose a target location for my Gold dataset.

Target uses the same interface as the Source node. This time, a Gold S3 bucket folder path wordpress_api/statistics_postname/ is specified. Everything else is the same as Source. The Target node offers significant versatility, detailed in the AWS target node documentation.

In summary, this is the Target node’s PySpark code:

Python
# Script generated for node S3 Gold
S3Gold_node1724060393283 = glueContext.write_dynamic_frame.from_options(
  frame=Filter_node1724060106174, 
  connection_type="s3", 
  format="glueparquet", 
  connection_options={
    "path": "s3://data-lakehouse-gold/wordpress_api/statistics_postname/", 
    "partitionKeys": []
    },
 format_options={"compression": "snappy"}, 
 transformation_ctx="S3Gold_node1724060393283"
 )

And here’s the full ETL visual:

2024 10 25 AWSGlueStudioDAGFinal

The full Glue job PySpark script is available in this post’s GitHub repo.

Job Properties

Next, I’ll examine some of my Glue job’s properties. This section only covers some key properties as there are loads. For a fuller view, please review the AWS Job Property documentation.

Additional properties like bookmarks, quality checks, scheduling and version control are also available. I’ve written about quality checks before, and the other properties could all be posts in themselves. For now, let’s move on to execution.

Job Execution

Each PySpark Glue job has several logging sources that are aggregated into the job’s Run tab. The summary shows properties including job status, durations and DPU capacity:

2024 10 25 AWSGlueStudioRunsLowerDetails

Each job can then be viewed in further detail, with insights including:

These resources are increasingly useful as Glue jobs scale. They show resource utilisation, query plans and node configuration which is essential when optimising and troubleshooting big data processes.

Ok, so my job is configured and running successfully. Now let’s review the outputs.

Glue Outputs & Behaviours

This section examines the outputs of my Gold Glue PySpark ETL job and the behaviours influencing them.

For clarity, this is not a case of finding and fixing errors. Rather, this is an exploration of how a Glue PySpark job’s output can differ from expectations. Coming in, I was more familiar with using pandas for ETL and initially found these behaviours confusing. So I wrote this section with that in mind, as it may help others in similar positions down the road.

Firstly I’ll demonstrate a behaviour. Next, I’ll explain why it happens. Finally, I’ll examine if it can be changed. Although, just because something can be done doesn’t mean that it should be.

Run 1: Multiple Objects

Previously, the Bronze and Silver layers ultimately produced single objects for each dataset. Conversely, my Gold PySpark job creates four objects with the same RunID:

2024 10 29 TestingObjectsFour

Ok – that’s unexpected. What’s more, if I run the job again then I get another four files with a new RunID. So that’s eight in total:

2024 10 29 TestingObjectsEight

There’s two behaviours here that differ from the previous layers:

  • Each run produces multiple objects instead of one.
  • Each run creates new objects instead of replacing existing ones.

Let’s examine the multiple objects first.

What’s Happening?

This occurs due to data partitioning.

As mentioned earlier, AWS Glue uses Apache Spark. Spark enables distributed computing by breaking down data into smaller parts. The presence of multiple objects is a direct outcome of this partitioning approach, offering benefits such as:

  • Parallel Processing: With data spread across multiple files, Spark workers can access different parts of the dataset simultaneously instead of fighting for a single object. This approach balances the workload and accelerates both read and write operations.
  • Fault Tolerance: If a write operation fails, only the impacted object needs reprocessing rather than the entire dataset. This design enhances resilience and reduces the risk of complete data loss.
  • Memory Management: Each Spark worker processes only its assigned data partition rather than the full dataset. This improves data loading efficiency and helps prevent memory exhaustion.

Can I Change It?

I couldn’t find a way to change this behaviour within Glue Studio. Glue is very capable of deriving partitions, so this isn’t surprising.

While it can be done, this involves manually changing the autogenerated PySpark script. Glue allows this at the cost of disabling the job’s visual design features:

Unlocking the job script will convert your job from visual mode to script-only mode. This action cannot be undone. To keep a copy of the visual-mode job, clone the job on the Jobs page of Glue Studio.

The change itself uses the coalesce method of Glue’s DynamicFrame class to control the number of partitions. This involves:

  • An additional import:
Python
from awsglue.dynamicframe import DynamicFrame
  • Converting the dynamic frame to a Spark DataFrame using coalesce(n). Here, coalesce(1) forces the output into a single object:
Python
single_file_df = Filter_node1724060106174.toDF().coalesce(1)
Python
single_file_dyf = DynamicFrame.fromDF(single_file_df, glueContext, "single_file_dyf")

The Glue job now produces a single Parquet object.

This should be used with care. Too many partitions can reduce response times by requiring more reads than necessary. Too few can hinder Spark’s workload distribution abilities. Here, having one object cripples it completely thus removing a key Spark benefit.

Run 2: Objects Not Replaced

Ok, let’s keep coalesce(1) in place because it makes this example easier. Running this job variant creates a single object:

2024 10 29 TestingObjectsOne

Running it again produces a second object with a new RunID:

2024 10 29 TestingObjectsTwo

Why isn’t the first object being replaced?

What’s Happening?

There are good reasons for this. Here’s why a replace function isn’t built in:

  • Spark Architecture: Spark processes data in parallel, with each task running separately. With this setup, replacing a single piece of data in an object is challenging. So instead, Spark jobs either create entirely new objects or replace data partitions.
  • S3 Architecture: S3 stores data as objects rather than files, so it doesn’t have folder-level replacements like a typical file system. When S3 ‘replaces’ an object, it actually creates a new version of the object with the same name and removes the old one.
  • Data Management Features: Writing new objects for each job run enables features like versioning, time travel and incremental processing with formats like Apache Iceberg and Delta Lake. It also avoids issues like access conflicts and deadlocks, since existing data remains unchanged while new data is written.

Can I Change It?

So…yes. Creating a boto3 S3 client and running a conditional delete during the job would achieve the desired effect:

Python
# Define S3 bucket and prefix for output path
output_bucket = "data-lakehouse-gold"
output_prefix = "wordpress_api/statistics_postname/"

# Initialize S3 client and clear existing objects in the output path
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket=output_bucket, Prefix=output_prefix)

# Check if there are any files and delete them
if 'Contents' in response:
    for obj in response['Contents']:
        s3.delete_object(Bucket=output_bucket, Key=obj['Key'])

But, at this point, is this really a Spark use case anymore? For an ETL job requiring object replacement, I would initially lean towards using a Glue Python Shell job or the AWS SDK for pandas Lambda layer because:

  • Fewer cloud resources would be used, making the job cheaper than a PySpark job.
  • Fewer Python imports would be needed, reducing the script size and dependencies.
  • With appropriate settings, Lambda may run the script faster than Glue.

Suitability should always be a key consideration with cloud architectures. Taking time to choose the right service saves a lot of headaches later on.

Step Functions Update

This section integrates the Gold resources into my existing WordPress Data Pipeline Step Function workflow.

The Gold workflow update is similar to the Silver one. Firstly, I need a new Glue: StartJobRun action running the Gold Glue PySpark ETL job:

JSON
{
  "JobName": "WordPress_Gold_statisticspagespostsjoin"
}

Also, a new Glue: StartCrawler action running the Gold crawler:

JSON
{
  "Name": "wordpress-gold"
}

Here is how my Step Function workflow looks with these changes:

stepfunctions graph

The workflow’s IAM role needs new allow permissions too. Firstly, glue:StartJobRun and glue:GetJobRun on the WordPress_Gold_statisticspagespostsjoin Glue job:

JSON
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"glue:StartJobRun",
				"glue:GetJobRun"
			],
			"Resource": [
				"arn:aws:glue:eu-west-1: REDACTED:job/WordPress_Gold_statisticspagespostsjoin"
			]
		}
	]
}

(glue:GetJobRun lets the workflow check the job’s progress – Ed)

Next, glue:StartCrawler on the wordpress-gold crawler:

JSON
{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Sid": "VisualEditor0",
			"Effect": "Allow",
			"Action": [
				"glue:StartCrawler"
			],
			"Resource": [
				"arn:aws:glue:eu-west-1:REDACTED:crawler/wordpress-gold"
			]
		}
	]
}

With these permissions, the workflow executes successfully:

2024 10 29 StepFunctionResultsGraph

I can get further details from the workflow’s Table view. This includes task durations, resource log links and a visual timeline of each state:

2024 10 29 StepFunctionResultsTable

Further Step Functions console details are in this 2022 Ben Smith AWS post.

Cost Analysis

This section examines my costs for the updated Step Function workflow.

Here, my Cost Explorer chart runs from 04 November to 14 November. It is grouped by API Operation and excludes tax.

2024 11 15 CostsGold

My main costs are from Glue’s Jobrun and CrawlerRun operations. Each ruleset now costs around $0.17 a day to run. This has increased from last time’s $0.09, but that’s to be expected as I’m running two Glue jobs now.

My crawlers now cost $0.06 a day, averaging $0.02 for each of the Bronze, Silver and Gold crawlers. The purple blip is for Glue Interactive Sessions – I have something coming up on those. Beyond that, I’m paying for some S3 PutObject calls and everything else is within the free tier.

Note that on Nov 06, it….broke. A failed call to the WordPress API brought the whole workflow down:

stepfunctions graph error

This proves my error handling works though! A forced stop and graceful failure is preferable to having data in an unknown state, especially in a production environment!

Summary

In this post, I created my WordPress data pipeline’s Gold ETL process using PySpark and the AWS Glue Studio visual interface.

I found Glue Studio to be highly user-friendly. It enhances job observability with comprehensive monitoring tools, and makes PySpark script creation significantly easier through its visual editor. Additionally, it integrates smoothly with other Glue features and the broader AWS ecosystem, offering extensive and intuitive customisation options.

This wraps up the WordPress AWS Data Pipeline project. This series aimed to demonstrate how different AWS services can work together to build efficient and cost-effective data pipelines. Through it, I’ve gained new insights and have several fresh ideas to explore!

If this post has been useful then the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Training & Community

Three Mantras For The Anxious New Speaker

In this post, I share three mantras for the anxious new speaker and some helpful resources for session development that I’ve used this year.

Table of Contents

Introduction

This year has been a whirlwind! After presenting my first in-person session at AWS Community Summit London in April, I’ve since spoken at user groups, a paid event and even internationally!

This hasn’t come easy for someone as naturally anxious as me. I work procedurally in many areas, which doesn’t lend itself fully to organic and spontaneous pursuits like public speaking.

I use mantras during exercise when I need quick guidance that’s easy to recall, and after London I realised that a similar approach while speaking would calm my nerves and refocus my attention. These mantras have since become invaluable, so now I’ll present them here.

Firstly, I’ll share three mantras that have guided me through developing, presenting, and evaluating my sessions. Following this, I’ll add some helpful resources for session creation, slide deck development and delivery mindset.

Mantras

This section contains three mantras for the anxious new speaker that I’ve used this year.

No One Wants You To Fail

I begin with some imposter syndrome goodness. Public speaking offers rich ammunition for imposter syndrome sufferers, like:

  • “The audience will be full of experts and I don’t belong in front of them.”
  • “No one will enjoy my session or find it useful.”
  • “No one will take me seriously.”

But the reality is very different. Audiences want speakers to succeed because it creates a more enjoyable and informative event, fostering better knowledge sharing and a positive atmosphere.

And it’s not like the session’s content is a secret! Consider delivering a session to a user group. The session’s abstract has likely been seen by a user group leader and several audience members before the doors even open. People know what they are getting into and are choosing to be there!

Audiences want to enjoy and benefit from the session, and they’re often patient, understanding, and forgiving when things don’t go perfectly. They’re not there to criticise or judge – they’re hoping the speaker succeeds and provides value.

AWSUGLeeds

People Are Watching The Slides, Not You

Flashback to April 2024 – my first in-person session at AWS Summit London. I was presenting midway through the day, and my anxiety brain entered high gear while waiting:

  • How should I stand?
  • What should I do with my hands?
  • How often should I look at the audience?
  • OMGOMGOMG

Anxiety brain then fixated on the various keynotes and TED Talks I’ve seen instead of focusing on my session. Great.

Then I had an idea. I started watching the audience. Some people were checking their lanyards and swag bags. Others were glancing at the passing crowds or grabbing a coffee from the nearby Serverlesspresso stand. But most were fully focused on the slides.

No one was fixated on the speaker.

Next, I watched the speaker. They looked up and down, occasionally gesturing. Nothing about their delivery felt like a finely choreographed routine.

In those moments, I realised that I was holding myself to impossibly high standards for my first in-person session. This wasn’t reality TV or theatre. This was a group of enthusiasts with common interests seeking knowledge. The audience wasn’t here to watch me. They were here to watch the slides.

That shift in perspective helped me so much. Without that mantra in April, there’s no way I would have been capable of doing Comsum (a filmed session in front of a paying audience) in September:

Every Minute Is A Victory

I said I use mantras during exercise earlier, and this one is pretty much a straight rip from those. Being a speaker (especially an anxious new speaker!) demands time and energy for tasks like:

  • Developing an abstract, submitting it to a user group or call for papers and awaiting the outcome.
  • Curating and preparing a session, updating and refining a slide deck and practising delivery.
  • Making sure you’re in the right place, at the right time with the right materials. And waiting for the day to arrive!
  • Delivering the session, maintaining flow and addressing audience questions.
  • Evaluating the session, tweaking the content and reflecting personally on the experience.

It would be easy to look at all this, nope out and spend your time elsewhere. So every minute spent on a session, from inception to post-delivery, is a victory.

AWSUGLiverpool

Resources

This section contains some helpful resources for session development.

New Stars Of Data Library

New Stars of Data is an event dedicated to mentoring and promoting new speakers in the Microsoft community. It is run by Ben Weissman and William Durkin, and is supported by a team of experienced speakers. I participated in NSOD6 in 2023.

New Stars Of Data has a Speaker Improvement Library supported by the Microsoft Azure Data Community. This library was invaluable during my New Stars of Data journey, and I still refer to it regularly for guidance and inspiration.

Here are some of my personal favourites from the library:

Cult Of Done

Next, let’s discuss Bre Pettis and Kio Stark’s Cult of Done Manifesto. I have previously written about the Cult Of Done, and actively use it for creative and professional tasks. The following CoD principles relate well to the anxious new speaker:

“Accept that everything is a draft. It helps to get it done.”

Cult of Done Manifesto Principle 2

In my experience, a session is never truly finished. Slide optimisations and delivery improvements often become evident during the presentation. Audience questions and comments may prompt revisions. And as technology and the cloud evolve, the session itself may need to change.

Currently, I’ve presented Building And Automating Serverless Auto-Scaling Data Pipelines In AWS five times this year. No two sessions have ever been the same. Each time has essentially been a draft!

“Pretending you know what you’re doing is almost the same as knowing what you are doing, so just accept that you know what you’re doing even if you don’t and do it.”

Cult of Done Manifesto Principle 4

This is already pretty descriptive – an eloquent version of “fake it till you make it”. To paraphrase Tris Oaten, you’re watching me learn how to construct a session in real-time. You’re watching me learn to present in real-time, and how to submit abstracts, build confidence and answer audience questions in real-time.

This is a continual journey that even seasoned presenters are on. There is no shame in such a journey, so embrace it.

“Done is the engine of more.”

Cult of Done Manifesto Principle 13

Every finished session offers something in exchange. This ranges from improved confidence and skills to increased momentum and drive. And the more abstracts written, the more sessions submitted and the more presentations delivered, the more you build a foundation for better talks, deeper insights, and greater confidence in your abilities.

Note that ‘more’ doesn’t necessarily mean ‘more sessions’ here. ‘More’ can mean:

  • Personal growth
  • Networking with fellow enthusiasts and community members.
  • Development opportunities (you never know who’s in the audience!)

I’ll end it here, but many other principles apply. Be sure to check out the full manifesto and Tris’ No Boilerplate video for more insights:

Summary

In this post, I shared three mantras for the anxious new speaker and some helpful resources for session development that I’ve used this year.

Public speaking comes more easily to some than others. I never expected to find myself in this position, and I’m not sure I would have believed anyone who told me this is how 2024 would unfold! Mantras are powerful tools for calming nerves and building confidence, and if these don’t resonate with you then there are plenty of others to explore.

If this post has been useful, check out the button below for links to my contact information, social media, projects, and upcoming sessions:

SharkLinkButton 1

Thanks for reading ~~^~~