Categories
Me

Video Thrilled The Dataflow Shark

In this post, I debut both the amazonwebshark YouTube channel and my first demonstration video and shark shorts.

Table of Contents

Introduction

So, this is amazonwebshark’s fiftieth post! It’s also my first post as a consultant! I joined the Steamhaus team this month and am looking forward to what the future brings!

In my last YearCompass review, I committed to building my personal brand. So far 2024 has seen this shark presenting at AWS Summit London and AWS User Group Liverpool, and making video content seems like the logical next step.

Before this, the amazonwebshark YouTube channel held playlists for videos I’ve referenced in previous posts. And as of July 2024 I’ve joined the video content creator ranks with some humble contributions of my very own!

So what’s this post about? Firstly, I’ll examine my motivation for making video content. Then I’ll link my current uploads, and finally I’ll set initial expectations for the channel.

Why?

So why am I doing this? This section explains my motivation for recording videos and what I hope to achieve.

Practice & Improve Speaking

Generally, doing something more often reveals improvements, efficiencies and optimisations. And I want to improve my speaking. So I need to do more of it!

I got some great advice from Laurie Kirk a while ago on this topic. She has a habit of filming herself daily and reviewing the footage for improvements. By her own admission, this has improved her confidence and quality.

Besides this, speaking practice draws parallels with training runs. Becoming an optimal runner involves various types of training. Want to build endurance? Run slower over distance. Want to improve speed? Focus on faster, shorter bursts.

And it’s the same with speaking. Want to practise lightning talks? Make short videos. Want to improve sessions? Film a demo. My work so far made me more confident at AWS UG Liverpool earlier this month, so I hope to see more improvements in the coming months by practising different types of speaking.

Audience Diversification

In addition to improving my current abilities, I want to upskill my ability to communicate with people outside my field of expertise.

It’s well known that technical people love speaking with technical people. Getting into the weeds about operating systems, functions, architectures and paradigms regularly see hours fly by at meetups.

This is, however, inherently limiting to those less knowledgeable in those areas. I was originally going to label this as ‘non-tech’, but on reflection this extends in all directions. For example, I won’t understand an architectural discussion that turns to von Neumann architectures. I then have equal potential to confuse if I start talking about Lakehouse architectures. This can even happen with people sharing a speciality: I had no idea what instantiate meant when I first heard it from another Data Engineer.

Amongst the best sessions and videos I’ve seen are ones where topics are made accessible and inclusive to a diverse audience with ranging skills and experience. Some viewers will have years of experience in the field and are looking for the latest insights. Others will be hearing about the topic for the very first time. Appealing to both ends of the spectrum is the ideal scenario.

This is the skill level I’m aiming towards. Creating sessions and videos that appeal to a diverse range of viewers will make me a more inclusive and effective communicator. And it’s not just about audience diversification…

Content Diversification

Next, producing videos will let me make different kinds of content.

Applying the Diátaxis framework, my blog posts lean more towards Tutorial than the others. This is intentional, as I’ve always preferred practice over theory and like sharing cool stuff that enables people.

That’s not to say I don’t get curious about the other Diátaxis ‘needs’ of How-To Guide, Explanation and Reference. While past exploration of these with text hasn’t worked out, videos offer new opportunities here such as:

  • Trying out new AWS services and features.
  • Running through concepts and architectures.
  • Exploring unfamiliar and less common settings and parameters.

In short, content ideas that lend themselves better to video than text. And speaking of ideas…

Failing Fast

Videos may save a post idea that has promise but isn’t working out.

I am never stuck for blog post ideas. There’s always something to write about – from services and architectures to current events. I probably have more potential post topics than I can ever write.

This isn’t to say that every post I start is completed though. Some ideas begin well but start to unravel. They might meander, lack cohesion or simply become uninteresting. And while I’m getting better at seeing the early signs of this, occasionally some still slip through.

I’m a big believer in avoiding the sunken cost fallacy. So in those situations, I admit defeat and defer to the Cult of Done Manifesto’s fifth principle:

“If you wait more than a week to get an idea done, abandon it.”

The Cult of Done Manifesto – Bre Pettis

There are several interpretations of this principle. NoBoilerplate‘s advice is that:

“Ideas in your brain are like a pipe full of random stuff. Some of it will be good; some not so good. If you’re not feeling it, don’t try to make a bad idea better – try the next idea.”

The Cult of Done: How To Get *Started* – No Boilerplate

I agree. But it’s still disheartening sometimes to delete something that still feels like it has legs – just not for the body you’re trying to stitch them to. More recently, I found Jason Fladlien‘s interpretation which has a different take:

“The longer you go not getting something done the more baggage you create around getting it done. “Abandoning” an idea simply means throwing this version of it in the trash. You can start it fresh later.”

The Cult of Done – The Drive-Contentment Connection

This applies very well to situations where an idea loses traction as a blog post but is still worth pursuing. Instead of deleting everything, post material could be repurposed into video material.

Additionally, I’m likely to have session abstracts that either don’t work out or aren’t accepted. If I consider the idea to be sound, videos are ideal solutions to these situations too.

Chasing Internet Stardom

Yeah ok not really.

Current Uploads

So what have I produced so far on the video and shark short front?

Firstly, I’ve filmed a pair of data-themed YouTube shorts. The first examines one of the functions of an AWS Glue Crawler:

The other considers one of the differences between Parquet files and CSV files:

Future shorts will be uploaded to YouTube, Instagram and TikTok. I’ll see how this goes over the coming months.

I’ve also uploaded an extended demo for my Building And Automating Serverless Auto-Scaling Data Pipelines In AWS session:

The demo I use in this session begins with some existing AWS resources. This keeps me within the session’s time limit, but at the cost of an incomplete picture of what the Step Function workflow is doing. This extended version starts with a blank workflow and shows the Glue and Athena setup behind the scenes.

I’m not holding these up as works of art! They are rough around the edges, and I’m sure I’ll improve over time. In the meantime, a different Cult of Done Manifesto principle applies:

“Pretending you know what you’re doing is almost the same as knowing what you are doing, so just accept that you know what you’re doing even if you don’t and do it.”

The Cult of Done Manifesto – Bre Pettis

Expectations

So what are my expectations for the video and shark shorts?

This is all very much early days. There are no grand plans or ambitions, and I don’t have an upload schedule planned. I already have lots going on personally and professionally and don’t want to burn myself out. Much like this blog, it’s something for me to experiment and upskill with.

That said, I recently bought some streaming gear and a posh microphone in the Prime Day sales. So let’s see where this goes!

Summary

In this post, I debuted both the amazonwebshark YouTube channel and my first demonstration video and shark shorts.

As I said, there’s no grand vision for any of this and I’m totally winging it. It’s a bit of fun and I’m interested to see where it goes. In the meantime, the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Data & Analytics

SQL Workbench: On-Demand DuckDB-Wasm With No Bill

In this post, I try Tobias Müller‘s free SQL Workbench tool powered by DuckDB-Wasm. And maybe make one or two gratuitous duck puns.

Table of Contents

Introduction

Hopefully like many people in tech, I have a collection of articles, repos, projects and whatnot saved under the banner of “That sounds cool / looks interesting / feels useful – I should check that out at some point.” Sometimes things even come off that list…

DuckDB was on my list, and went to the top when I heard about Tobias Müller’s free SQL Workbench tool powered by DuckDB-Wasm. What I heard was very impressive and pushed me to finally examine DuckDB up close.

So what is DuckDB?

About DuckDB

This section examines DuckDB and DuckDB-Wasm – core components of SQL Workbench.

DuckDB

DuckDB is an open-source SQL Online Analytical Processing (OLAP) database management system. It is intended for analytical workloads, and common use cases include in-process analytics, exploration of large datasets and machine learning model prototyping. DuckDB was released in 2019 and reached v1.0.0 on June 03 2024.

DuckDB is an embedded database that runs within a host process. It is lightweight, portable and requires no separate server. It can be embedded into applications, scripts, and notebooks without bespoke infrastructure.

DuckDB uses a columnar storage format and supports standard SQL features like complex queries, joins, aggregations and window functions. It can be used with programming languages like Python and R.

In this Python example, DuckDB:

  • Creates an in-memory database
  • Defines a users table.
  • Inserts data into users.
  • Runs a query to fetch results.
Python
import duckdb

# Create a new DuckDB database in memory
con = duckdb.connect(database=':memory:')

# Create a table and insert some data
con.execute("""
CREATE TABLE users (
    user_id INTEGER,
    user_name VARCHAR,
    age INTEGER
);
""")

con.execute("INSERT INTO users VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35)")

# Run a query
results = con.execute("SELECT * FROM users WHERE age > 25").fetchall()
print(results)

DuckDB-Wasm

Launched in 2021, DuckDB WebAssembly (Wasm) is a version of the DuckDB database that has been compiled to run in WebAssembly. This lets DuckDB run in web browser processes and other environments with WebAssembly support.

So what’s WebAssembly? Don’t worry – I didn’t know either and so deferred to an expert:

Since all data processing happens in the browser, DuckDB-Wasm brings new benefits to DuckDB. In-browser dashboards and data analysis tools with DuckDB-Wasm enabled can operate without any server-side processing, improving their speed and security. If data is being sent somewhere then DuckDB-Wasm can handle ETL operations first, reducing both processing time and cost.

Running in-browser SQL queries also removes the need for setting up database servers and IDEs, reducing patching and maintenance (but don’t forget about the browser!) while increasing availability and convenience for remote work and educational purposes.

Like DuckDB, it is open-source and the code is on GitHub. A DuckDB Web Shell is available at shell.duckdb.org and there are also more technical details on DuckDB’s blog.

SQL Workbench

This section examines SQL Workbench and tries out some of its core features.

About SQL Workbench

Tobias Müller produced SQL Workbench in January 2024. He integrated DuckDB-Wasm into an AWS serverless static site and built an online interactive tool for querying data, showing results and generating visualizations.

SQL Workbench is available at sql-workbench.com. Note that it doesn’t currently support mobile browsers. Tobias has also written a great tutorial post that includes:

and loads more content in addition that I won’t reproduce here. So go and give Tobias some traffic!

Layout

The core SQL Workbench components are:

  • The Object Explorer section shows databases, schema and tables:
2024 06 26 SQLWorkbenchSchema
  • The Tools section shows links, settings and a drag-and-drop section for adding files (more on that later):
2024 06 26 SQLWorkbenchOptions
  • The Query Pane for writing SQL, which preloads with the below script upon each browser refresh:
2024 06 26 SQLWorkbenchQueryWindow
  • And finally, the Results Pane shows query results and visuals:
2024 06 26 SQLWorkbenchResultsWindow

Next, let’s try it out!

Sample Queries

SQL Workbench opens with a pre-loaded script that features:

  • Instructions:
SQL
-- WELCOME TO THE ONLINE SQL WORKBENCH!
-- To run a SQL query in your browser, select the query text and press:
-- CTRL + Enter (Windows/Linux) / CMD + Enter (Mac OS)

-- See https://duckdb.org/docs/sql/introduction for more info about DuckDB SQL syntax
  • Example queries using Parquet files:
SQL
-- Remote Parquet scans:
SELECT * FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet' LIMIT 1000;

SELECT avg(c_acctbal) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';

SELECT count(*)::int as aws_service_cnt FROM 'https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet';

SELECT * FROM 'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet';

SELECT cloud_provider, sum(ip_address_cnt)::int as cnt FROM 'https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet' GROUP BY cloud_provider;

SELECT * FROM 'https://raw.githubusercontent.com/tripl-ai/tpch/main/parquet/lineitem/part-0.parquet';
  • Example query using a CSV file:
SQL
-- Remote CSV scan
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv');

These queries are designed to show SQL Workbench’s capabilities and the style of SQL queries it can run (expanded on further here). They can also be used for experimentation, so here goes!

Firstly, I’ve added an ORDER BY to the first Parquet query to make sure the first 1000 orders are returned:

SQL
-- Show first 1000 orders for Parquet file
SELECT * 
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'
ORDER BY o_orderkey
LIMIT 1000;
2024 06 26 SQL WorkbenchQueryLimit1000

Secondly, I’ve used the COUNT and CAST functions to count the total orders and return the value as an integer:

SQL
-- Show number of orders for Parquet file
SELECT CAST(COUNT(o_orderkey) AS INT) AS orderkeycount 
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet';
2024 06 26 SQL WorkbenchQueryOrderCount

Finally, this query captures all 1996 orders in a CTE and uses it to calculate the 1996 order price totals grouped by order priority:

SQL
-- Show 1996 order totals in priority order
WITH cte_orders1996 AS (
    SELECT o_orderkey, o_custkey, o_totalprice, o_orderdate, o_orderpriority
    FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'
    WHERE YEAR(o_orderdate) = 1996
)

SELECT SUM(o_totalprice) AS totalpriorityprice, o_orderpriority 
FROM cte_orders1996 
GROUP BY o_orderpriority 
ORDER BY o_orderpriority;
2024 06 26 SQL WorkbenchQueryCTE

Now let’s try some of my data!

File Imports

SQL Workbench can also query data stored locally (as well as remote data which is out of this post’s scope). For this section I’ll be using some Parquet files generated by my WordPress bronze data orchestration process: posts.parquet and statistics_pages.parquet.

After dragging each file onto the assigned section they quickly appear as new tables:

2024 06 26 SQLWorkbenchSchemaWordPress

These tables show column names and inferred data types when expanded:

2024 06 26 SQLWorkbenchSchemaWordPressExpanded

Queries can be written manually or by right-clicking the tables in the Object Explorer. A statistics_pages SELECT * test query shows the expected results:

2024 06 26 SQLWorkbenchQueryStatsPages

As posts.parquet‘s ID primary key column matches statistics_pages.parquet‘s ID foreign key column, the two files can be joined on this column using an INNER JOIN:

SQL
SELECT 
  p.post_title,
  sp.type, 
  sp.date as viewdate,
  sp.count
FROM 
  'statistics_pages.parquet' AS sp
  INNER JOIN 'posts.parquet' AS p ON sp.id = p.id
 ORDER BY sp.date
2024 06 26 SQLWorkbenchQueryJoin

I can use this query to start making visuals!

Visuals

Next, let’s examine SQL Workbench’s visualisation capabilities. To help things along, I’ve written a new query showing the cumulative view count for my Using Athena To Query S3 Inventory Parquet Objects post (ID 92):

SQL
 SELECT 
  sp.date AS viewdate,
  SUM(sp.count) OVER (PARTITION BY sp.id ORDER BY sp.date) AS cumulative_views
FROM 
  'statistics_pages.parquet' AS sp
  INNER JOIN 'posts.parquet' AS p ON sp.id = p.id
WHERE p.id = 92
ORDER BY sp.date;
2024 06 26 SQLWorkbenchQueryCumulViews

(On reflection the join wasn’t necessary for this output, but anyway… – Ed)

Tobias has written visualisation instructions here so I’ll focus more on how my visual turns out. After the query finishes running, several visual types can be selected:

2024 06 26 SQLWorkbenchVisualizationsOptions

Selecting a visual quickly renders it in the browser. Customisation options are also available. That said, in my case the first chart was mostly fine!

2024 06 26 SQLWorkbenchVisualizationsLineChart

All I need to do is move viewdata (bottom right) to the Group By bin (top right) to add dates to the X-axis.

Exports & Cleanup

Finally, let’s examine SQL Workbench’s export options. And there’s plenty to choose from!

2024 06 26 SQLWorkbenchExports

The CSV and JSON options export the query results in their respective formats. The Arrow option exports query results in Apache Arrow format, which is designed for high-speed in-memory data processing and so compliments DuckDB very well. Dremio wrote a great Arrow technical guide for those curious.

The HTML option produces an interactive graph with tooltips and configuration options. Handy for sharing and I’m sure there’ll be a way to use the backend code for embedding if I take a closer look. The PNG has me covered in the meantime, immediately producing this image:

athenainventorycumulativeCropped
(Image cropped for visibility – Ed)

Finally there’s the Config.JSON option, which exports the visual’s settings for version control and reproducibility:

JSON
{"version":"2.10.0","plugin":"Y Line","plugin_config":{},"columns_config":{},"settings":true,"theme":"Pro Light","title":"Using Athena To Query S3 Inventory Parquet Objects Cumulative Views","group_by":["viewdate"],"split_by":[],"columns":["cumulative_views"],"filter":[],"sort":[],"expressions":{},"aggregates":{}}

All done! So how do I remove my data from SQL Workbench? I…refresh the site! Because SQL Workbench uses DuckDB, my data has only ever been stored locally and in memory. This means the data only persists until the SQL Workbench page is closed or reloaded.

Summary

In this post, I tried Tobias Müller’s free SQL Workbench tool powered by DuckDB-Wasm.

SQL Workbench and DuckDB are both very impressive! The results they produce with no servers or data movement are remarkable, and their tooling greatly simplifies working with modern data formats.

DuckDB makes lots of sense as an AWS Lambda layer, and DuckDB-Wasm simplifies in-browser analytics at a time when being data-driven and latency-averse are constantly gaining importance. I don’t feel like I’ve even scratched the surface of SQL Workbench either, so go and check it out!

Phew – no duck puns. No one has time for that quack. If this post has been useful then the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Data & Analytics

Discovering Data With AWS Glue

In this post, I use the data discovering features of AWS Glue to crawl and catalogue my WordPress API pipeline data.

Table of Contents

Introduction

By the end of my WordPress Bronze Data Orchestration post, I had created an AWS Step Function workflow that invokes two AWS Lambda functions:

stepfunctions graph

The data_wordpressapi_raw function gets data from the WordPress API and stores it as CSV objects in Amazon S3. The data_wordpressapi_bronze function transforms these objects to Parquet and stores them in a separate bucket. If either function fails, AWS SNS publishes an alert.

While this process works fine, the extracted data is not currently utilized. To derive value from this data, I need to consider transforming it. Several options are available, such as:

  • Creating new Lambda functions.
  • Importing the data into a database.
  • Third-party solutions like Databricks.

Here, I’ve chosen to use AWS Glue. As a fully managed ETL service, Glue automates various data processes in a low-code environment. I’ve not written much about it, so it’s time that changed!

Firstly, I’ll examine AWS Glue and some of its concepts. Next, I’ll create some Glue resources that interact with my WordPress S3 objects. Finally, I’ll integrate those resources into my existing Step Function workflow and examine their costs.

Let’s begin with some information about AWS Glue.

AWS Glue Concepts

This section explores AWS Glue and some of its data discovering features.

AWS Glue

From the AWS Glue User Guide:

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, machine learning, and application development. It also includes additional productivity and data ops tooling for authoring, running jobs, and implementing business workflows.

https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html

Glue can be accessed using the AWS Glue console web interface and the AWS Glue Studio graphical interface. It can also be accessed programmatically via the AWS Glue CLI and the AWS Glue SDK.

Benefits of AWS Glue include:

  • Data-specific features like data cataloguing, schema discovery, and automatic ETL code generation.
  • Infrastructure optimised for ETL tasks and data processes.
  • Built-in scheduling capabilities and job execution.
  • Integration with other AWS services like Athena and Redshift.

AWS Glue’s features fall into Discover, Prepare, Integrate and Transform categories. The Glue features used in this post come from the data discovering category.

Glue Data Catalog

An AWS Glue Data Catalog is a managed repository serving as a central hub for storing metadata about data assets. It includes table and job definitions, and other control information for managing an AWS Glue environment. Each AWS account has a dedicated AWS Glue Data Catalog for each region.

The Data Catalog stores information as metadata tables, with each table representing a specific data store and its schema. Glue tables can serve as sources or targets in job definitions. Tables are organized into databases, which are logically grouped collections of related table definitions.

Each table contains column names, data type definitions, partition information, and other metadata about a base dataset. Data Catalog tables can be populated either manually or using Glue Crawlers.

Glue Crawler

A Glue Crawler connects to a data store, analyzes and determines its schema, and then creates metadata tables in the AWS Glue Data Catalog. They can run on-demand, be automated by services like Amazon EventBridge Scheduler and AWS Step Functions, and be started by AWS Glue Triggers.

Crawlers can crawl several data stores including:

  • Amazon S3 buckets via native client.
  • Amazon RDS databases via JDBC.
  • Amazon DocumentDB via MongoDB client.

An activated Glue Crawler performs the following processes on the chosen data store:

  • Firstly, data within the store is classified to determine its format, schema and properties.
  • Secondly, data is grouped into tables or partitions.
  • Finally, the Glue Data Catalog is updated. Glue creates, updates and deletes tables and partitions, and then writes the metadata to the Data Catalog accordingly.

Now let’s create a Glue Crawler!

Creating A Glue Crawler

In this section, I use the AWS Glue console to create and run a Glue Crawler for discovering my WordPress data.

Crawler Properties & Sources

There are four steps to creating a Glue Crawler. Step One involves setting the crawler’s properties. Each crawler needs a name and can have optional descriptions and tags. This crawler’s name is wordpress_bronze.

Step Two sets the crawler’s data sources, which is greatly influenced by whether the data is already mapped in Glue. If it is then the desired Glue Data Catalog tables must be selected. Since my WordPress data isn’t mapped yet, I need to add the data sources instead.

My Step Function workflow puts the data in S3, so I select S3 as my data source and supply the path of my bronze S3 bucket’s wordpress_api folder. The crawler will process all folders and files contained in this S3 path.

Finally, I need to configure the crawler’s behaviour for subsequent runs. I keep the default setting, which re-crawls all folders with each run. Other options include crawling only folders added since the last crawl or using S3 Events to control which folders to crawl.

Classifiers are also set here but are out of scope for this post.

Crawler Security & Targets

Step Three configures security settings. While most of these are optional, the crawler needs an IAM role to interact with other AWS services. This role consists of two IAM policies:

  • An AWSGlueServiceRole AWS managed policy which allows access to related services including EC2, S3, and Cloudwatch Logs.
  • A customer-managed policy with s3:GetObject and s3:PutObject actions allowed on the S3 path given in Step Two.

This role can be chosen from existing roles or created with the crawler.

Step Four begins with setting the crawler’s output. The Crawler creates new tables, requiring the selection of a target database for these tables. This database can be pre-existing or created with the crawler.

An optional table name prefix can also be set, which enables easy table identification. I create a wordpress_api database in the Glue Data Catalog, and set a bronze- prefix for the new tables.

The Crawler’s schedule is also set here. The default is On Demand, which I keep as my Step Function workflow will start this crawler. Besides this, there are choices for Hourly, Daily, Weekly, Monthly or Custom cron expressions.

Advanced options including how the crawler should handle detected schema changes and deleted objects in the data store are also available in Step Four, although I’m not using those here.

And with that, my crawler is ready to try out!

Running The Crawler

My crawler can be tested by accessing it in the Glue console and selecting Run Crawler:

2024 05 23 RunCrawler

The crawler’s properties include run history. Each row corresponds to a crawler execution, recording data including:

  • Start time, end time and duration.
  • Execution status.
  • DPU hours for billing.
  • Changes to tables and partitions.
2024 05 23 GlueCrawlerRuns

AWS stores the logs in an aws-glue/crawlers CloudWatch Log Group, in which each crawler has a dedicated log stream. Logs include messages like the crawler’s configuration settings at execution:

Crawler configured with Configuration 
{
    "Version": 1,
    "CreatePartitionIndex": true
}
 and SchemaChangePolicy 
{
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DEPRECATE_IN_DATABASE"
}

And details of what was changed and where:

Table bronze-statistics_pages in database wordpress_api has been updated with new schema

Checking The Data Catalog

So what impact has this had on the Data Catalog? Accessing it and selecting the wordpress_api database now shows five tables, each matching S3 objects created by the Step Functions workflow:

2024 05 23 GlueDataCatalogTables

Data can be viewed by selecting Table Data on the desired row. This action executes an Athena query, triggering a message about the cost implications:

You will be taken to Athena to preview data, and you will be charged separately for Athena queries.

If accepted, Athena generates and executes a SQL query in a new tab. In this example, the first ten rows have been selected from the wordpress_api database’s bronze-posts table:

SQL
SELECT * 
FROM "AwsDataCatalog"."wordpress_api"."bronze-posts" 
LIMIT 10;

When this query is executed, Athena checks the Glue Data Catalog for the bronze-posts table in the wordpress_api database. The Data Catalog provides the S3 location for the data, which Athena reads and displays successfully:

2024 05 23 AthenaBronzeQueryResults

Now that the crawler works, I’ll integrate it into my Step Function workflow.

Crawler Integration & Costs

In this section, I integrate my Glue Crawler into my existing Step Function workflow and examine its costs.

Architectural Diagrams

Let’s start with some diagrams. This is how the crawler will behave:

While updating the crawler’s wordpress_bronze CloudWatch Log Stream throughout:

  1. The wordpress_bronze Glue Crawler crawls the bronze S3 bucket’s wordpress-api folder.
  2. The crawler updates the Glue Data Catalog’s wordpress-api database.

This is how the Crawler will fit into my existing Step Functions workflow:

wordpress api stepfunction rawbronze

While updating the workflow’s CloudWatch Log Group throughout:

  1. An EventBridge Schedule executes the Step Functions workflow.
  2. Raw Lambda function is invoked.
    • Invocation Fails: Publish SNS message. Workflow ends.
    • Invocation Succeeds: Invoke Bronze Lambda function.
  3. Bronze Lambda function is invoked.
    • Invocation Fails: Publish SNS message. Workflow ends.
    • Invocation Succeeds: Run Glue Crawler.
  4. Glue Crawler runs.
    • Run Fails: Publish SNS message. Workflow ends.
    • Run Succeeds: Update Glue Data Catalog. Workflow ends.

An SNS message is published if the Step Functions workflow fails.

Step Function Integration

Time to build! Let’s begin with the crawler’s requirements:

  • The crawler must only run after both Lambda functions.
  • It must also only run if both functions invoke successfully first.
  • If the crawler fails it must alert via the existing PublishFailure SNS topic.

This requires adding an AWS Glue: StartCrawler action to the workflow after the second AWS Lambda: Invoke action:

2024 05 23 StepFunctionStartCrawler

This action differs from the ones I’ve used so far. The existing actions all use optimized integrations that provide special Step Functions workflow functionality.

Conversely, StartCrawler uses an SDK service integration. These integrations behave like a standard AWS SDK API call, enabling more fine-grained control and flexibility than optimised integrations at the cost of needing more configuration and management.

Here, the Step Functions StartCrawler action calls the Glue API StartCrawler action. After adding it to my workflow, I update the action’s API parameters with the desired crawler’s name:

JSON
{
  "Name": "wordpress_bronze"
}

Next, I update the action’s error handling to catch all errors and pass them to the PublishFailure task. These actions produce these additions to the workflow’s ASL code:

JSON
  "Start Bronze Crawler": {
      "Type": "Task",
      "End": true,
      "Parameters": {
        "Name": "wordpress_bronze"
      },
      "Resource": "arn:aws:states:::aws-sdk:glue:startCrawler",
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "PublishFailure"
        }
      ]
    },

And result in an updated workflow graph:

stepfunctions graph gluecrawler

Additionally, the fully updated Step Functions workflow ASL script can be viewed on my GitHub.

Finally, I need to update the Step Function workflow IAM role’s policy so that it can start the crawler. This involves allowing the glue:StartCrawler action on the crawler’s ARN:

JSON
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBronzeGlueCrawler",
            "Effect": "Allow",
            "Action": [
                "glue:StartCrawler"
            ],
            "Resource": [
                "arn:aws:glue:eu-west-1:[REDACTED]:crawler/wordpress_bronze"
            ]
        }

My Step Functions workflow is now orchestrating the Glue Crawler, which will only run once both Lambda functions are successfully invoked. If either function fails, the SNS topic is published and the crawler does not run. If the crawler fails, the SNS topic is published. Otherwise, if everything runs successfully, the crawler updates the Data Catalog as needed.

So how much does discovering data with AWS Glue cost?

Glue Costs

This is from AWS Glue’s pricing page for crawlers:

There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single DPU provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl.

$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run

https://aws.amazon.com/glue/pricing/

And for the Data Catalog:

With the AWS Glue Data Catalog, you can store up to a million objects for free. If you store more than a million objects, you will be charged $1.00 per 100,000 objects over a million, per month. An object in the Data Catalog is a table, table version, partition, partition indexes, statistics or database.

The first million access requests to the Data Catalog per month are free. If you exceed a million requests in a month, you will be charged $1.00 per million requests over the first million. Some of the common requests are CreateTable, CreatePartition, GetTable , GetPartitions, and GetColumnStatisticsForTable.

https://aws.amazon.com/glue/pricing/

So how does this relate to my workflow? The below Cost Explorer chart shows my AWS Glue API costs from 01 May to 28 May. Only the CrawlerRun API operation has generated charges, with a daily average of $0.02:

2024 05 28 GlueAPICostMay28

My May 2024 AWS bill shows further details on the requests and storage items. The Glue Data Catalog’s free tier covers my usage:

2024 05 28 GlueCostsMay28

Finally, let’s review the entire pipeline’s costs for April and May. Besides Glue, my only other cost remains S3:

2024 05 28 CostExplorerAprMay

Summary

In this post, I used the data discovering features of AWS Glue to crawl and catalogue my WordPress API pipeline data.

Glue’s native features and integration with other AWS services make it a great fit for my WordPress pipeline’s pending processes. I’ll be using additional Glue features in future posts, and wanted to spotlight the Data Catalog early on as it’ll become increasingly helpful as my use of AWS Glue increases.

If this post has been useful then the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~