Categories
Me

Video Thrilled The Dataflow Shark

In this post, I debut both the amazonwebshark YouTube channel and my first demonstration video and shark shorts.

Introduction

So, this is amazonwebshark’s fiftieth post! It’s also my first post as a consultant! I joined the Steamhaus team this month and am looking forward to what the future brings!

In my last YearCompass review, I committed to building my personal brand. So far 2024 has seen this shark presenting at AWS Summit London and AWS User Group Liverpool, and making video content seems like the logical next step.

Before this, the amazonwebshark YouTube channel held playlists for videos I’ve referenced in previous posts. And as of July 2024 I’ve joined the video content creator ranks with some humble contributions of my very own!

So what’s this post about? Firstly, I’ll examine my motivation for making video content. Then I’ll link my current uploads, and finally I’ll set initial expectations for the channel.

Why?

So why am I doing this? This section explains my motivation for recording videos and what I hope to achieve.

Practice & Improve Speaking

Generally, doing something more often reveals improvements, efficiencies and optimisations. And I want to improve my speaking. So I need to do more of it!

I got some great advice from Laurie Kirk a while ago on this topic. She has a habit of filming herself daily and reviewing the footage for improvements. By her own account, this has improved her confidence and quality.

Besides this, speaking practice draws parallels with training runs. Becoming an optimal runner involves various types of training. Want to build endurance? Run slower over distance. Want to improve speed? Focus on faster, shorter bursts.

And it’s the same with speaking. Want to practise lightning talks? Make short videos. Want to improve sessions? Film a demo. My work so far made me more confident at AWS UG Liverpool earlier this month, so I hope to see more improvements in the coming months by practising different types of speaking.

Audience Diversification

In addition to improving my current abilities, I want to upskill my ability to communicate with people outside my field of expertise.

It’s well known that technical people love speaking with technical people. Getting into the weeds about operating systems, functions, architectures and paradigms regularly sees hours fly by at meetups.

This is, however, inherently limiting to those less knowledgeable in those areas. I was originally going to label this as ‘non-tech’, but on reflection this extends in all directions. For example, I won’t understand an architectural discussion that turns to von Neumann architectures. I then have equal potential to confuse if I start talking about Lakehouse architectures. This can even happen with people sharing a speciality: I had no idea what instantiate meant when I first heard it from another Data Engineer.

Amongst the best sessions and videos I’ve seen are ones where topics are made accessible and inclusive to a diverse audience with wide-ranging skills and experience. Some viewers will have years of experience in the field and are looking for the latest insights. Others will be hearing about the topic for the very first time. Appealing to both ends of the spectrum is the ideal scenario.

This is the skill level I’m aiming towards. Creating sessions and videos that appeal to a diverse range of viewers will make me a more inclusive and effective communicator. And it’s not just about audience diversification…

Content Diversification

Next, producing videos will let me make different kinds of content.

Applying the Diátaxis framework, my blog posts lean more towards Tutorial than the others. This is intentional, as I’ve always preferred practice over theory and like sharing cool stuff that enables people.

That’s not to say I don’t get curious about the other Diátaxis ‘needs’ of How-To Guide, Explanation and Reference. While past exploration of these with text hasn’t worked out, videos offer new opportunities here such as:

  • Trying out new AWS services and features.
  • Running through concepts and architectures.
  • Exploring unfamiliar and less common settings and parameters.

In short, content ideas that lend themselves better to video than text. And speaking of ideas…

Failing Fast

Videos may save a post idea that has promise but isn’t working out.

I am never stuck for blog post ideas. There’s always something to write about – from services and architectures to current events. I probably have more potential post topics than I can ever write.

This isn’t to say that every post I start is completed though. Some ideas begin well but start to unravel. They might meander, lack cohesion or simply become uninteresting. And while I’m getting better at seeing the early signs of this, occasionally some still slip through.

I’m a big believer in avoiding the sunk cost fallacy. So in those situations, I admit defeat and defer to the Cult of Done Manifesto’s fifth principle:

“If you wait more than a week to get an idea done, abandon it.”

The Cult of Done Manifesto – Bre Pettis

There are several interpretations of this principle. NoBoilerplate‘s advice is that:

“Ideas in your brain are like a pipe full of random stuff. Some of it will be good; some not so good. If you’re not feeling it, don’t try to make a bad idea better – try the next idea.”

The Cult of Done: How To Get *Started* – No Boilerplate

I agree. But it’s still disheartening sometimes to delete something that still feels like it has legs – just not for the body you’re trying to stitch them to. More recently, I found Jason Fladlien‘s interpretation which has a different take:

“The longer you go not getting something done the more baggage you create around getting it done. “Abandoning” an idea simply means throwing this version of it in the trash. You can start it fresh later.”

The Cult of Done – The Drive-Contentment Connection

This applies very well to situations where an idea loses traction as a blog post but is still worth pursuing. Instead of deleting everything, post material could be repurposed into video material.

Additionally, I’m likely to have session abstracts that either don’t work out or aren’t accepted. If I consider the idea to be sound, videos are ideal solutions to these situations too.

Chasing Internet Stardom

Yeah ok not really.

Current Uploads

So what have I produced so far on the video and shark short front?

Firstly, I’ve filmed a pair of data-themed YouTube shorts. The first examines one of the functions of an AWS Glue Crawler:

The other considers one of the differences between Parquet files and CSV files:

Future shorts will be uploaded to YouTube, Instagram and TikTok. I’ll see how this goes over the coming months.

I’ve also uploaded an extended demo for my Building And Automating Serverless Auto-Scaling Data Pipelines In AWS session:

The demo I use in this session begins with some existing AWS resources. This keeps me within the session’s time limit, but at the cost of an incomplete picture of what the Step Function workflow is doing. This extended version starts with a blank workflow and shows the Glue and Athena setup behind the scenes.

I’m not holding these up as works of art! They are rough around the edges, and I’m sure I’ll improve over time. In the meantime, a different Cult of Done Manifesto principle applies:

“Pretending you know what you’re doing is almost the same as knowing what you are doing, so just accept that you know what you’re doing even if you don’t and do it.”

The Cult of Done Manifesto – Bre Pettis

Expectations

So what are my expectations for the video and shark shorts?

This is all very much early days. There are no grand plans or ambitions, and I don’t have an upload schedule planned. I already have lots going on personally and professionally and don’t want to burn myself out. Much like this blog, it’s something for me to experiment and upskill with.

That said, I recently bought some streaming gear and a posh microphone in the Prime Day sales. So let’s see where this goes!

Summary

In this post, I debuted both the amazonwebshark YouTube channel and my first demonstration video and shark shorts.

As I said, there’s no grand vision for any of this and I’m totally winging it. It’s a bit of fun and I’m interested to see where it goes. In the meantime, the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Training & Community

Shark’s Summit Session

In this post, I discuss my recent Building And Automating Data Pipelines session presented at 2024’s AWS Summit London.

Introduction

One of my YearCompass 2024 goals was to build a personal brand and focus on my soft skills and visibility. After participating in 2023’s New Stars Of Data 6 event, I wrote some new session abstracts and considered my next move. Then in February I saw Matheus Guimaraes‘ LinkedIn invite to submit sessions for 2024’s AWS Summit London event:

2024 04 09 LinkedInMatheus

I mulled it over, deciding to submit an abstract using my recent WordPress Data Pipeline project. At the very least, it’d be practice for both writing abstracts and pushing myself to submit them.

And that’s where I expected it to end. Until…

2024 04 09 GmailAccept

Just like that, I was heading to the capital again! And this time as a speaker!

My AWS Summit London (ASL) experience was going to be different from my New Stars Of Data (NSOD) one in several ways:

  • While NSOD was virtual, ASL would be in person.
  • I had four months to prepare for NSOD, and five weeks for ASL.
  • My NSOD session was 60 minutes, while ASL would be 30.

So I dusted off my NSOD notes, put a plan together and got to work!

Preparation

This section examines the preparation of the slides and demo for my summit session.

Slides

Firstly, I brainstormed what the session should include from my recent posts. Next, I storyboarded what the session’s sections would be. These boiled down to:

  • Defining the problem. I wanted to use an existing framework for this, ultimately choosing the 4Vs Of Big Data. These have been around since the early 2000s, and are equally valid today for EDA and IoT events, API requests, logging metrics and many other modern technologies.
  • Examining the AWS services comprising the data pipeline, and highlighting features of each service that relate to the 4Vs.
  • Demonstrating the AWS services in a real pipeline and showing further use cases.

This yielded a rough schedule for the session:

  • 00:00-05:00 Introduction
  • 05:00-10:00 Problem Definition
  • 10:00-15:00 Solution Architecture
  • 15:00-20:00 Demo
  • 20:00-25:00 Summary
  • 25:00-30:00 Questions

Creating and editing the slide deck was much simpler with this in place. Each slide now needed to conform with and add value to its section. It became easier to remove bloat and streamline the wordier slides.

Several slides then received visual elements. This made them more audience-friendly and gave me landmarks to orient myself within the deck. I used AWS architecture icons on the solution slides and sharks on the problem slides. Lots of sharks.

Here’s the finished deck. I regret nothing.

As I was rounding off the slides, the summit agenda was published with the Community Lounge sessions. It was real now!

2024 04 24 AWSSummitSession

Demo

I love demos and was keen to include one for my summit session.

Originally I wanted a live demo, but this needed a good internet connection. It was pointed out to me that an event with thousands of people might not have the best WiFi reception, leading to slow page loads at best and 404s at worst!

So I recorded a screen demo instead. From a technical standpoint, this protected the demo from platform outages, network failures and zero-day bugs. And from a delivery standpoint, a pre-recorded demo let me focus on communicating my message to the audience instead of potentially losing my place, mistyping words and overrunning the allocated demo time.

The demo used this workflow, executed by an EventBridge Schedule:

stepfunctions graph

The demo’s first versions involved building the workflow and schedule from scratch. This overran the time allocation and felt unfocused. Later versions began with a partly constructed workflow. This built on the slides and improved the demo’s flow. I was far happier with this version, which was ultimately the one I recorded.

I recorded the demo with OBS Studio – a free open-source video recording and live-streaming app. There’s a lot to OBS, and I found this GuideRealmVideos video helpful in setting up my recording environment:

Delivery

This section covers the rehearsal and delivery of my AWS summit session.

Rehearsal

With everything in place, it was time to practise!

I had less time to practise this compared to NSOD, so I used various strategies to maximise my time. I started by practising sections separately while refining my notes. This gave all sections equal attention and highlighted areas needing work.

Next, after my success with it last time, I did several full run-throughs using PowerPoint’s Speaker Coach. This went well and gave me confidence in the content and slide count.

2024 04 27 RehersalReportClip

The slide visuals worked so well that I could practise the opening ten minutes without the slides in front of me! This led to run-throughs while shopping, on public transport and even in the queue for the AWS Summit passes.

Probably got some weird looks for that. I still regret nothing.

Practising the demo was more challenging! While the slides were fine as long as I hit certain checkpoints, the demo’s pace was entirely pre-determined. I knew what the demo would do, but keeping in sync with my past self was tricky to master. I was fine as long as I could see my notes and the demo in real-time.

Finally, on the night before I did some last-minute practice runs with the hotel room’s TV:

PXL 20240423 1801344612

On The Day

My day started with being unable to find the ExCeL’s entrance! Great. But still better than a 05:15 Manchester train! I got my event pass and hunted for the Community Lounge, only to find my name in lights!

PXL 20240424 1202226372

The lounge itself was well situated, away from potentially distracting stands and walkways but still feeling like an important part of the summit.

PXL 20240424 0850558352

I spent some time adjusting to the space and battling my brewing anxiety. Then Matheus and Rebekah Kulidzan appeared and gave me some great and much-appreciated advice and encouragement! Next, I went for a wander with my randomised feel-good playlist that threw out some welcome bangers:

I watched Yan’s session, paying attention to his delivery and mannerisms alongside his session’s content. After he finished I powered on, signed in and miked up. The lounge setup was professional but not intimidating, and the AWS staff were helpful and attentive. Finally, at noon I went live!

IMG 3907
Photo by Thembile Ndlovu

The session went great! I had a good audience, kept my momentum and hit my section timings. I had a demo issue when my attempt to duplicate displays failed. Disaster was averted by playing the demo on the main screens only!

Finally, my half-hour ended and I stepped off the stage to applause, questions and an unexpected hug!

Looking Back

So what’s next?

I was happy with the amount of practice I did, and will continue putting time into this in the coming months. I’ve submitted my summit session to other events, and the more rehearsals I complete the higher my overall standard should get.

I also want to find a more reliable way of showing demos without altering Windows display settings. Changing these settings mid-presentation isn’t the robust solution I thought it was, so I want to find a feature or setting that’ll take care of that.

Finally, I plan to act on advice from Laurie Kirk. She suggested speaking about a day’s events on camera and then watching it back the following day. This highlights development areas and will get me used to speaking under observation.

Summary

In this post, I discussed my recent Building And Automating Data Pipelines session presented at 2024’s AWS Summit London.

When writing my post about the 2022 AWS Summit London event, I could never have known I’d find myself on the lineup a few years later! Tech communities do great jobs of driving people forward, and while this is usually seen through a technical lens the same is true for personal skills.

The AWS Community took this apprehensive, socially anxious shark and gave him time, a platform and an audience. These were fantastic gifts that I’m hugely grateful for and will always remember.

PXL 20240424 1606465872

If this post has been useful then the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Training & Community

New Stars Of Data Retrospective

I’m a speaker now! In this post, I write a retrospective review of my New Stars Of Data 6 session and overall experience.

Introduction

In July, I shared the news that I was speaking at the October 2023 New Stars Of Data event:

2023 08 11 NewStarsOfDataSchedule

In August and October, I wrote about my preparations and experiences leading up to the event. Since then, the big day has come and gone!

With New Stars Of Data 6 now in the history books, I wanted to write a final retrospective post for my series. Firstly, I’ll examine both the Sale Sizzler data and VS Code Data Wrangler as a companion post for the session. Then I’ll sum up how the final week of preparation went and draw the series to a close.

Separately, my mentor Olivier Van Steenlandt has written about his mentoring experience on his blog. Give it a read!

Sale Sizzlers

This section of my New Stars Of Data retrospective explains what the Sale Sizzlers are and examines the data generated by a typical event.

Sale Sizzler Events

The Sale Sizzlers are a 5k road race series organised by my running club – Sale Harriers Manchester. Every summer, four events take place at Wythenshawe Park at two-week intervals. The course is regarded as one of the fastest in North West England and attracts a wide range of participants from first-time racers to former Olympians.

SaleSizzler5KRoute

They began in 2002, the same year that Manchester hosted the Commonwealth Games. Since then, thousands of runners have participated in the name of enjoyment, charity and personal bests.

Sale Sizzler Data

Sale Sizzler administration has changed over the years in response to both popularity and technology. Initially, everything was paper-based from entries to results. Then, as the Internet became more established, some processes moved online.

Today, everything from runner entry to results distribution can be completely outsourced to third parties. Since 2016, Nifty Entries have handled Sale Sizzlers administration and published the results as CSVs on their platform. Nifty’s Sale Sizzler data privacy policy is available here.

I used the 2023 Sale Sizzler 1’s CSV for my demo. As of 2023, these CSVs contain the following columns:

Column | Type | Description
Position | INT | Runner’s overall finishing position.
Finish Time | TIME | Time from race start to runner crossing finish line.
Number | INT | Runner’s race number.
First Name | STRING | Runner’s first name.
Last Name | STRING | Runner’s last name.
Chip Time | TIME | Time from runner crossing start line to runner crossing finish line.
Club | STRING | Runner’s running club (if applicable).
Club Position | INT | Runner’s finishing position relative to their running club.
Gender | STRING | Runner’s gender.
Gender Position | INT | Runner’s finishing position relative to their gender.
Category | STRING | Runner’s age group.
Category Position | INT | Runner’s finishing position relative to their age group.
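As a rough illustration of that layout, the sketch below reads a results CSV into pandas with matching types. The two sample rows are made up for the example (they are not real results), the load_results helper is hypothetical, and the column names are taken from the list above, so a real export may differ in naming or casing.

```python
import io

import pandas as pd

# Two made-up rows in the column layout described above (not real results).
SAMPLE_CSV = """Position,Finish Time,Number,First Name,Last Name,Chip Time,Club,Club Position,Gender,Gender Position,Category,Category Position
1,00:15:13,101,Alex,Example,00:15:11,Example AC,1,Male,1,Senior Male (20-39),1
2,00:16:02,102,Sam,Sample,00:16:00,,,Female,1,Under 20,1
"""

INT_COLUMNS = ["Position", "Number", "Club Position", "Gender Position", "Category Position"]


def load_results(source):
    """Read a results CSV, keeping times as strings and positions as nullable integers."""
    df = pd.read_csv(source, dtype={"Finish Time": "string", "Chip Time": "string"})
    # "Int64" (capital I) is pandas' nullable integer type, so blank positions
    # stay missing instead of turning the whole column into floats.
    return df.astype({column: "Int64" for column in INT_COLUMNS})


results = load_results(io.StringIO(SAMPLE_CSV))
print(results.dtypes)
```

Keeping the times as strings at this stage matches how they arrive in the CSV; converting them to something numeric is a separate step.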

That’s all the required knowledge for the Sizzlers. Now let’s examine the catalyst for my session – Data Wrangler.

VS Code Data Wrangler

This section of my New Stars Of Data retrospective examines VS Code Data Wrangler’s features and the operations I used in my session demo.

Data Wrangler Features

The Data Wrangler Extension for Visual Studio Code was launched in March 2023. It is designed to help with data preparation, cleaning, and presentation, and has a library of built-in transformations and visualizations.

It offers features for quickly identifying and fixing errors, inconsistencies, and missing data. Data profiling, quality checks and formatting operations are also available.

Data Wrangler uses a no-code interface, and generates Python code behind the scenes using the pandas and regex open-source libraries. Transformations can be exported as Jupyter Notebooks, Python scripts and CSVs.

Data Wrangler Documentation

The Data Wrangler GitHub repo has excellent documentation. I’m not going to reproduce it here, because:

  • The repo deserves the traffic.
  • Data Wrangler is constantly being updated so the instructions can easily change.
  • The Readme is very well written and needs no improvement.

I will, however, highlight the following key areas:

The rest of this section examines the Data Wrangler operations I used in my demo.

Missing Value Operations

The first two operations in my demo removed the dataset’s 1123 missing values by:

  • Dropping missing Position values
  • Filling missing Club values

Most of the missing values belonged to runners who either didn’t start the race or didn’t finish it. These people had no finish time, which is a vital metric in my session and necessitated their removal from the dataset.

Removing these runners left 45 missing values in the Club column. These were runners unaffiliated to a running club. The fix this time was to replace empty values with Unaffiliated, leaving no missing values at all.

The Data Wrangler GUI uses a Git-like representation for the Fill Missing Values operation, where the red column is before the change and green is after:

2023 06 03 ClubMissingValueAfter

Wrangler generated this Python code to update the Club column:

# Replace missing values with "Unaffiliated" in column: 'Club'
df = df.fillna({'Club': "Unaffiliated"})
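Only the fill step's generated code is shown above. As a minimal sketch of both missing value operations together, assuming a tiny made-up DataFrame and using plain pandas calls (the code Data Wrangler actually generates for the drop step may look different):

```python
import pandas as pd

# Made-up rows for illustration: one non-finisher (no Position)
# and one finisher without a club.
df = pd.DataFrame(
    {
        "Position": [1, 2, None],
        "First Name": ["Alex", "Sam", "Jo"],
        "Club": ["Example AC", None, None],
    }
)

# Drop runners with no finishing position (did not start or did not finish).
df = df.dropna(subset=["Position"])

# Replace the remaining missing clubs, as in the generated code above.
df = df.fillna({"Club": "Unaffiliated"})

# Count whatever missing values are left across the whole DataFrame.
print(df.isna().sum().sum())  # → 0
```

The order matters here: dropping first means the fill only touches runners who actually finished.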

Column Creation Operations

Next, I wanted to create some columns using the New Column By Example operation. Firstly, Data Wrangler requests target columns and a creation pattern. Microsoft Flash Fill then automatically creates a column when a pattern is detected from the columns chosen.

I created two new columns by:

  • Combining First Name and Last Name to make Full Name.
  • Combining Gender and Category to make Gender Category.

Both these columns simplify reporting. The Full Name column is easier to read than the separate First and Last Name columns, and brings the Nifty data in line with other data producers like Run Britain. Additionally, using the Full Name column in Power BI tables takes less space than using both of its parent columns.

Having a Gender Category column is not only for quality of life, but also for clarity. Most of the Category values like U20 and V50 don’t reveal the runner’s gender. Conversely, Gender Category values like Female U20 and Male V50 are obvious, unambiguous and better than Category values alone.

This GIF from the demo shows how the Gender Category column is created:

NewColumnExample1

During this, Data Wrangler generated this Python code:

# Derive column 'Gender Category' from columns: 'Gender', 'Category'
# Transform based on the following examples:
#    Category    Gender    Output
# 1: "Under 20"  "Male" => "Male Under 20"
df.insert(12, "Gender Category", df["Gender"] + " " + df["Category"])

This works, but produces a slight issue with the Senior Female and Senior Male values. In the case of Senior Male, Flash Fill outputs the new value of Male Senior Male (20-39).

This is correct, but the Male duplication is undesirable. This is resolved by identifying an instance of this value and removing the second Male string:

NewColumnExample2

This updates the Python code to:

# Derive column 'Gender Category' from columns: 'Gender', 'Category'
# Transform based on the following examples:
#    Category               Gender    Output
# 1: "Under 20"             "Male" => "Male Under 20"
# 2: "Senior Male (20-39)"  "Male" => "Male Senior (20-39)"
df.insert(12, "Gender Category", df.apply(lambda row : row["Gender"] + " " + row["Category"].split(" ")[0] + row["Category"][row["Category"].rfind(" "):], axis=1))

And the replacement values for both genders become Female Senior (20-34) and Male Senior (20-39).
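To see what the updated expression does, here is a small sketch that wraps the same string logic in a helper and runs it on three sample values. The gender_category name is hypothetical, and the Senior Female (20-34) category value is an assumption inferred from the output described above.

```python
import pandas as pd


def gender_category(gender: str, category: str) -> str:
    """Mirror the generated expression: gender, then the category's first
    word, then everything from the category's last space onwards."""
    return gender + " " + category.split(" ")[0] + category[category.rfind(" "):]


df = pd.DataFrame(
    {
        "Gender": ["Male", "Male", "Female"],
        "Category": ["Under 20", "Senior Male (20-39)", "Senior Female (20-34)"],
    }
)
df["Gender Category"] = df.apply(
    lambda row: gender_category(row["Gender"], row["Category"]), axis=1
)

print(df["Gender Category"].tolist())
# → ['Male Under 20', 'Male Senior (20-39)', 'Female Senior (20-34)']
```

One thing to watch: the expression relies on the category containing at least one space. For a single-word value such as V50, rfind returns -1 and the slice picks up the final character, giving V500, so values like that would need their own handling.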

Bespoke Operations

Finally, I wanted to demonstrate how to use bespoke Python code within Data Wrangler. My first operation was to add a column identifying the event:

df['Event'] = 'Sale Sizzler 1' 

This creates an Event column containing Sale Sizzler 1 in each row.

My second was a little more involved. The Sale Sizzler finish times are represented as HH:MM:SS. Power BI can show these values as strings but can’t use them for calculations. A better option was to transform them to total seconds, because as integers they are far more versatile.

This transformation can be done in DAX, but every dataset refresh would recalculate the values. This is unnecessarily computationally expensive. As the finish times will never change, it makes sense to apply Roche’s Maxim of Data Transformation and transform them upstream of Power BI using Data Wrangler.

This avoids Power BI having to do unnecessary repeat work, and indeed removes the need for Power BI to calculate the values at all! This also allows both the data model and the visuals using the transformed data to load faster.

Here is my custom Python code:

df['Chip Time Seconds'] = df['Chip time'].apply(lambda x: int(x.split(':')[0]) * 3600 + int(x.split(':')[1]) * 60 + int(x.split(':')[2]))

This uses the split method and a lambda function to apply the following steps to each Chip Time value to calculate an equivalent Chip Time Seconds value:

  • Hours to seconds: capture the first number and multiply it by 3600.
  • Minutes to seconds: capture the second number and multiply it by 60.
  • Seconds: capture the third number.
  • Add all values together.

So with the example of a Chip Time value of 00:15:11:

  • 00 * 3600 = 0 seconds
  • 15 * 60 = 900 seconds
  • 11 seconds
  • 0 + 900 + 11 = 911 Chip Time Seconds
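The arithmetic above can be checked with a short sketch. It uses a hypothetical to_seconds helper that follows the same steps, plus pandas' to_timedelta as an alternative way to reach the same numbers. The sample times are made up, and this is an illustration rather than the code used in the session.

```python
import pandas as pd


def to_seconds(value: str) -> int:
    """Convert an HH:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


times = pd.Series(["00:15:11", "01:02:03"])  # made-up sample values

# Same steps as the lambda: split on ":" and weight each part.
by_hand = times.apply(to_seconds)

# Alternative: let pandas parse the strings as durations.
by_pandas = pd.to_timedelta(times).dt.total_seconds().astype(int)

print(by_hand.tolist())    # → [911, 3723]
print(by_pandas.tolist())  # → [911, 3723]
```

Both routes give integers. The to_timedelta version is shorter, but it raises on values it can't parse, so any non-time strings would need cleaning first.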

These integers can then be used to calculate averages, high performers and key influencers. The full demo is in the session recording that is included further down this post.

Session

This section of my New Stars Of Data retrospective is about my final preparations and the day itself.

Final Week

Before my final meeting with Olivier, he asked me to think about my plans both for the week of the event and the day itself. This was surprisingly hard! I’d spent so much time on the build-up that I hadn’t even considered this.

The final meetup was divided into taking stock of the journey to get to event week, and some final discussion over event expectations and etiquette. New Stars Of Data uses Microsoft Teams for delivery, which I have lots of experience with through work. Olivier made sure I knew when to turn up and what to do.

Following some thought and input from Olivier, I did my final rehearsals at the start of the week and did a final run-through on Wednesday. After that, I took Olivier’s advice and gave myself time to mentally prepare for the big day.

The Big Day!

I spent Friday morning doing house and garden jobs. Basically staying as far away from the laptop as possible to keep my anxiety low. At noon I sprang into action, setting up my streaming devices, checking my demos worked and confirming I could access the Teams channel. Then I walked Wolfie to tire him out before my session. It turned out that Wolfie had other ideas!

New Stars Of Data has fifteen-minute windows between sessions for speaker transitions, and during mine I chatted with the moderators, who helped me stay calm. Wolfie stayed quiet the whole time, then started barking two minutes into my session. Thankfully, I’d practised handling distractions!

The session felt like it flew by, and the demos went mostly as planned. One of the New Column By Example transformations in the Data Wrangler demo didn’t work as expected, erroring instead of giving the desired values.

This had happened during rehearsals, so I was prepared for the possibility of it failing again. To this end, I had pre-recorded a successful transformation and stored the Python code generated by the operation. I wasn’t able to show the recording due to time constraints, but used the Python code to show what the expected output should have been.

My session was recorded and is on the DataGrillen YouTube channel:

I uploaded my session files to my Community-Sessions GitHub repo. If that naming schema sounds ambitious, well…

Future Plans

So, having presented my first session, what next?

Well, I had always planned to take my foot off the gas a little after completing New Stars Of Data to appreciate the experience (and write this retrospective!). I’ve been working on it since June, and I wanted to have some time for consideration and reflection.

With respect to Racing Towards Insights, I have a couple of optimisations I’m considering. These include using a virtual machine for the Power BI demos to take the pressure off my laptop, examining options for a thirty-minute version of the session for other events and looking at applications for the Python code export function.

I’m also keen to find out how to avoid the New Column By Example error I experienced. To this end, I’ve raised an issue on the Data Wrangler GitHub repo and will see if I can narrow down the problem.

Additionally, I’ve had several positive conversations with people about submitting sessions for local user groups and community events, and have several ideas for blog topics and personal projects that could lend themselves to session abstracts. With the knowledge gained from Olivier’s mentorship, I can now start to think about what these abstracts might look like.

Summary

In this post, I wrote a retrospective review of my New Stars Of Data 6 session and overall experience. In closing, I’d like to thank the following community members for being part of my New Stars Of Data journey:

If this post has been useful, the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~