Categories
Training & Community

New Stars Of Data Retrospective

I’m a speaker now! In this post, I write a retrospective review of my New Stars Of Data 6 session and overall experience.

Table of Contents

Introduction

In July, I shared the news that I was speaking at the October 2023 New Stars Of Data event:

2023 08 11 NewStarsOfDataSchedule

In August and October, I wrote about my preparations and experiences leading up to the event. Since then, the big day has come and gone!

With New Stars Of Data 6 now in the history books, I wanted to write a final retrospective post for my series. Firstly, I’ll examine both the Sale Sizzler data and VS Code Data Wrangler as a companion post for the session. Then I’ll sum up how the final week of preparation went and draw the series to a close.

Separately, my mentor Olivier Van Steenlandt has written about his mentoring experience on his blog. Give it a read!

Sale Sizzlers

This section of my New Stars Of Data retrospective explains what the Sale Sizzlers are and examines the data generated by a typical event.

Sale Sizzler Events

The Sale Sizzlers are a 5k road race series organised by my running club – Sale Harriers Manchester. Every Summer, four events take place at Wythenshawe Park in two-week intervals. The course is regarded as one of the fastest in North West England and attracts a wide range of participants from first-time racers to former Olympians.

SaleSizzler5KRoute

They began in the same year as the 2002 Commonwealth Games (also held in Manchester). Since then, thousands of runners have participated in the name of enjoyment, charity and personal bests.

Sale Sizzler Data

Sale Sizzler administration has changed over the years in response to both popularity and technology. Initially, everything was paper-based from entries to results. Then, as the Internet became more established, some processes moved online.

Today, everything from runner entry to results distribution can be completely outsourced to third parties. Since 2016, Nifty Entries have handled Sale Sizzlers administration and published the results as CSVs on their platform. Nifty’s Sale Sizzler data privacy policy is available here.

I used the 2023 Sale Sizzler 1’s CSV for my demo. As of 2023, these CSVs contain the following columns:

PositionINTRunner’s overall finishing position.
Finish TimeTIMETime from race start to runner crossing finish line.
NumberINTRunner’s race number.
First NameSTRINGRunner’s first name
Last NameSTRINGRunner’s last name.
Chip TimeTIMETime from runner crossing start line to runner crossing finish line.
ClubSTRINGRunner’s running club (if applicable)
Club PositionINTRunner’s finishing position relative to their running club.
GenderSTRINGRunner’s gender.
Gender PositionINTRunner’s finishing position relative to their gender.
CategorySTRINGRunner’s age group.
Category PositionINTRunner’s finishing position relative to their age group.

That’s all the required knowledge for the Sizzlers. Now let’s examine the catalyst for my session – Data Wrangler.

VS Code Data Wrangler

This section of my New Stars Of Data retrospective examines VS Code Data Wrangler’s features and the operations I used in my session demo.

Data Wrangler Features

The Data Wrangler Extension for Visual Studio Code was launched in March 2023. It is designed to help with data preparation, cleaning, and presentation, and has a library of built-in transformations and visualizations.

It offers features for quickly identifying and fixing errors, inconsistencies, and missing data. Data profiling, quality checks and formatting operations are also available.

Data Wrangler uses a no-code interface, and generates Python code behind the scenes using the pandas and regex open-source libraries. Transformations can be exported as Jupyter Notebooks, Python scripts and CSVs.

Data Wrangler Documentation

The Data Wrangler GitHub repo has excellent documentation. I’m not going to reproduce it here, because:

  • The repo deserves the traffic.
  • Data Wrangler is constantly being updated so the instructions can easily change.
  • The Readme is very well written and needs no improvement.

I will, however, highlight the following key areas:

The rest of this section examines the Data Wrangler operations I used in my demo.

Missing Value Operations

The first two operations in my demo removed the dataset’s 1123 missing values by:

  • Dropping missing Position values
  • Filling missing Club values

Most of the missing values belonged to runners who either didn’t start the race or didn’t finish it. These people had no finish time, which is a vital metric in my session and necessitated their removal from the dataset.

Removing these runners left 45 missing values in the Club column. These were runners unaffiliated to a running club. The fix this time was to replace empty values with Unaffiliated, leaving no missing values at all.

The Data Wrangler GUI uses Git-like representation for the Fill Missing Values operation, where the red column is before the change and green is after:

2023 06 03 ClubMissingValueAfter

Wrangler generated this Python code to update the Club column:

# Replace missing values with "Unaffiliated" in column: 'Club'
df = df.fillna({'Club': "Unaffiliated"})

Column Creation Operations

Next, I wanted to create some columns using the New Column By Example operation. Firstly, Data Wrangler requests target columns and a creation pattern. Microsoft Flash Fill then automatically creates a column when a pattern is detected from the columns chosen.

I created two new columns by:

  • Combining First Name and Last Name to make Full Name.
  • Combining Gender and Category to make Gender Category.

Both these columns simplify reporting. The Full Name column is easier to read than the separate First and Last Name columns, and brings the Nifty data in line with other data producers like Run Britain. Additionally, using the Full Name column in Power BI tables takes less space than using both of its parent columns.

Having a Gender Category column is not only for quality of life, but also for clarity. Most of the Category values like U20 and V50 don’t reveal the runner’s gender. Conversely, Gender Category values like Female U20 and Male V50 are obvious, unambiguous and better than Category values alone.

This GIF from the demo shows how the Gender Category column is created:

NewColumnExample1

During this, Data Wrangler generated this Python code:

# Derive column 'Gender Category' from columns: 'Gender', 'Category'
# Transform based on the following examples:
#    Category    Gender    Output
# 1: "Under 20"  "Male" => "Male Under 20"
df.insert(12, "Gender Category", df["Gender"] + " " + df["Category"])

This works, but produces a slight issue with the Senior Female and Senior Male values. In the case of Senior Male, Flash Fill outputs the new value of Male Senior Male (20-39).

This is correct, but the Male duplication is undesirable. This is resolved by identifying an instance of this value and removing the second Male string:

NewColumnExample2

This updates the Python code to:

# Derive column 'Gender Category' from columns: 'Gender', 'Category'
# Transform based on the following examples:
#    Category               Gender    Output
# 1: "Under 20"             "Male" => "Male Under 20"
# 2: "Senior Male (20-39)"  "Male" => "Male Senior (20-39)"
df.insert(12, "Gender Category", df.apply(lambda row : row["Gender"] + " " + row["Category"].split(" ")[0] + row["Category"][row["Category"].rfind(" "):], axis=1))

And the replacement values for both genders become Female Senior (20-34) and Male Senior (20-39).

Bespoke Operations

Finally, I wanted to demonstrate how to use bespoke Python code within Data Wrangler. My first operation was to add a column identifying the event:

df['Event'] = 'Sale Sizzler 1' 

This creates an Event column containing Sale Sizzler 1 in each row.

My second was a little more involved. The Sale Sizzler finish times are represented as HH:MM:SS. Power BI can show these values as strings but can’t use them for calculations. A better option was to transform them to total seconds, because as integers they are far more versatile.

This transformation can be done in DAX, but every dataset refresh would recalculate the values. This is unnecessarily computationally expensive. As the finish times will never change, it makes sense to apply Roche’s Maxim of Data Transformation and transform them upstream of Power BI using Data Wrangler.

This avoids Power BI having to do unnecessary repeat work, and indeed removes the need for Power BI to calculate the values at all! This also allows both the data model and the visuals using the transformed data to load faster.

Here is my custom Python code:

df['Chip Time Seconds'] = df['Chip time'].apply(lambda x: int(x.split(':')[0])*3600+ int(x.split(':')[1])*60 +int(x.split(':')[2])) 

This uses the split method and a lambda function to apply the following steps to each Chip Time value to calculate an equivalent Chip Time Seconds value:

  • Hours to seconds: capture the first number and multiply it by 3600.
  • Minutes to seconds: capture the second number and multiply it by 60.
  • Seconds: capture the third number.
  • Add all values together

So with the example of a Chip Time value of 00:15:11:

  • 00 * 3600 = 0 seconds
  • 15 * 60 = 900 seconds
  • 11 seconds
  • 0 + 900 + 11 = 911 Chip Time Seconds

These integers can then be used to calculate averages, high performers and key influencers. The full demo is in the session recording that is included further down this post.

Session

This section of my New Stars Of Data retrospective is about my final preparations and the day itself.

Final Week

Before my final meeting with Olivier, he asked me to think about my plans both for the week of the event and the day itself. This was surprisingly hard! I’d spent so much time on the build-up that I hadn’t even considered this.

The final meetup was divided into taking stock of the journey to get to event week, and some final discussion over event expectations and etiquette. New Stars Of Data uses Microsoft Teams for delivery, which I have lots of experience with through work. Olivier made sure I knew when to turn up and what to do.

Following some thought and input from Olivier, I did my final rehearsals at the start of the week and did a final run-through on Wednesday. After that, I took Olivier’s advice and gave myself time to mentally prepare for the big day.

The Big Day!

I spent Friday morning doing house and garden jobs. Basically staying as far away from the laptop as possible to keep my anxiety low. At noon I sprung into action, setting up my streaming devices, checking my demos worked and confirming I could access the Teams channel. Then I walked Wolfie to tire him out before my session. It turned out that Wolfie had other ideas!

New Stars Of Data has fifteen-minute windows between sessions for speaker transitions, and during this I chatted with the moderators who helped me stay calm. Wolfie stayed quiet during the whole time, then started barking two minutes in. Thankfully, I’d practised handling distractions!

The session felt like it flew by, and the demos went mostly as planned. One of the New Column By Example transformations in the Data Wrangler demo didn’t work as expected, erroring instead of giving the desired values.

This had happened during rehearsals, so I was prepared for the possibility of it failing again. To this end, I pre-recorded a successful transformation and stored the Python code generated by the operation. I wasn’t able to show the recording due to time constraints, but used the Python code to show what the expected output should have been.

My session was recorded and is on the DataGrillen YouTube channel:

I uploaded my session files to my Community-Sessions GitHub repo. If that naming schema sounds ambitious, well…

Future Plans

So, having presented my first session, what next?

Well, I had always planned to take my foot off the gas a little after completing New Stars Of Data to appreciate the experience (and write this retrospective!). I’ve been working on it since June, and I wanted to have some time for consideration and reflection.

With respect to Racing Towards Insights, I have a couple of optimisations I’m considering. These include using a virtual machine for the Power BI demos to take the pressure off my laptop, examining options for a thirty-minute version of the session for other events and looking at applications for the Python code export function.

I’m also keen to find out how to avoid the New Column By Example error I experienced. To this end, I’ve raised an issue on the Data Wrangler GitHub repo and will see if I can narrow down the problem.

Additionally, I’ve had several positive conversations with people about submitting sessions for local user groups and community events, and have several ideas for blog topics and personal projects that could lend themselves to session abstracts. With the knowledge gained from Olivier’s mentorship, I can now start to think about what these abstracts might look like.

Summary

In this post, I wrote a retrospective review of my New Stars Of Data 6 session and overall experience. In closing, I’d like to thank the following community members for being part of my New Stars Of Data journey:

If this post has been useful, the button below has links for contact, socials, projects and sessions:

SharkLinkButton 1

Thanks for reading ~~^~~

Categories
Training & Community

New Stars Of Data 6 Preparations: July

In this post, I talk about my July preparations for the upcoming New Stars Of Data 6 event in October 2023. Yes – it’s now the middle of August. I’ve been very busy!

Table of Contents

Introduction

I’m speaking at the next New Stars Of Data event in October! I have a Sessionize profile now and everything!

2023 08 11 NewStarsOfDataSchedule

Back in December, I said that I wanted to try improving my presentation skills in 2023. I found out about New Stars Of Data when they advertised their May 2023 event, and when they opened the NSOD6 Call For Speakers I decided the time was right to have a go myself!

But wait. What’s New Stars Of Data?

Nom Sharks On What?

New Stars Of Data is an event focused on the tuition and promotion of new speakers in the Microsoft space. It is run by Ben Weissman and William Durkin, and is supported by a team of experienced speakers. There have been five events at the time of writing, with the sixth scheduled for October 27 2023.

NSOD open their Call For Speakers roughly every six months and announce it on their Twitter feed. There are four criteria that all applicants must meet:

  1. You have never spoken at a large, public event before (User Groups/Meetups do not rule you out!).
  2. Your presentation is on a topic in the Microsoft Data Platform world.
  3. Your presentation is in English.
  4. Your presentation will fill the allotted time (60 minutes).

Successful applicants have an experienced speaker assigned to them as a mentor. The mentor supports the newcomer through the process, coaching them in all aspects of creating and delivering their session.

Not sure if this is the case for every event, but I also received a cool New Stars Of Data t-shirt for taking part!

Creating My Session

In this section, I cover how I came up with my session’s topic and how I wrote my abstract.

Choosing The Topic

Once I decided to submit a session, I needed to decide what it would be about! While I specialise in AWS (amazonwebshark being a bit of a giveaway) I use many Microsoft products in a typical working week. Two of these are Power BI and Visual Studio Code. By chance, I was starting a new post when the Call For Speakers was announced. This post was about a recent VS Code extension called Data Wrangler.

Data Wrangler is a no-code data preparation and cleaning tool. It uses Python and the Pandas and Regex libraries to provide on-demand data operations, and uses Excel and Power Query technology to enable data profiling, data quality checks and the visualisation of data distributions.

Changes are presented in real-time, with a Git-like interface showing the original and updated data. This lends itself very well to demos! Data Wrangler also offers several export methods for the transformed data, one of which is CSV. This is perfect for Power BI visuals! The session was taking shape!

Finally, I needed a data source. This decision was easy, as I’ve already been using the race results from the 2023 Sale Sizzlers events for another project. The Sale Sizzlers are a series of four 5k running events that place over the Summer, the results of which are freely available as CSVs on the Nifty platform.

Having chosen a topic for the session, I needed to write an abstract for it. So what’s that?

Writing The Abstract

A session abstract is a brief summary of a session or talk. It typically provides an overview of the session topic, the services used and the key takeaways for attendees.

When writing mine, I took advice from a Brent Ozar post and a Johan Ludvig Brattås video. The result was an abstract that was technically accurate but felt a bit flat. So I turned to the tool du jour ChatGPT.

I supplied my abstract and asked ChatGPT for improvements. The results were…mixed, ranging from cliché city to word salad. Some highlights:

  • “…effortlessly assimilating the race results…”
  • “Reveling in the effortless efficiency…”
  • “…fervent sports aficionados…”
  • “Seize this opportunity to join forces in this compelling expedition of knowledge!”

None of which I could say with a straight face, so I turned those down.

I did like some of ChatGPT’s suggestions though. Ultimately, an abstract is as much an advert for a session as it is a description of one, so there needs to be some marketing and persuasiveness in there somewhere.

My finished abstract is on Sessionize and is included below:

In this session, I explore the capabilities of the Visual Studio Code Data Wrangler extension and Microsoft Power BI using real results from the Sale Sizzler 5k race series. I’ll uncover valuable insights through engaging visualisations and user-friendly and low-code data transformations.

This session will cover the following key steps:

– Getting started by setting up a Visual Studio Code environment and seamlessly importing the race results.

– Discovering the convenience of the Visual Studio Code Data Wrangler extension for effortlessly transforming and cleaning the race results

– Taking a closer look at the Python code generated by Data Wrangler to understand what’s happening behind the scenes.

– Loading the transformed race results into Power BI to generate informative visualisations and analyse trends.

Join me if you’re a data professional, a budding analyst or a sports enthusiast!

So ChatGPT wasn’t marking its own homework, I asked Google Bard what it thought of this abstract. It responded:

I think your abstract is very well-written and informative. It clearly states the topic of your session, the speaker’s qualifications, and the key takeaways for attendees.

Well, it made the robots happy. It also pleased Ben and William, as they accepted my session!

So who’s mentoring me?

My Mentor

My mentor is Olivier Van Steenlandt. Upon discovering this, I thought his name looked familiar. It turned out he’d written an Azure DevOps Pipelines post I’d seen the week prior! Small world!

Fotoshoot D1 33 681x1024 1

Olivier is a BI professional specialising in Microsoft. He has substantial experience with SQL Server, Power BI and Azure, and is currently a BI Team Lead.

Since his first session at Datagrillen 2022, Olivier has presented at several events internationally and is currently delivering a session about migrating from SSRS to Power BI Paginated Reports. I enjoy our conversations and he’s given me lots to think about!

Progress Update

In this section, I cover the specifics of my July preparations for my New Stars Of Data session.

I had my first meeting with Olivier in mid-July and got my first jobs! Jobs like “Start getting familiar with Zoomit” and “Please close some of your fifty million Chrome tabs”. I also agreed to start working on the start and end of the session and to decide on its general flow.

To begin, I consulted the New Stars Of Data Speaker Improvement Library and watched Steve Jones‘s “Creating a Slide Deck from an Idea” video. I then reached out to Steve, who happily supplied his example deck for me to review. Thanks for all your help Steve!

Next, I watched Rob Sewell‘s “How do you do that? Remote Presentations.” video. While I expected a video like Steve’s, Rob focused more on his equipment. I hadn’t even thought about this! In August I plan to test my laptop’s microphone and webcam to see what the output is like.

In my August meeting with Olivier, I demoed my starting and summary slides and showed my new-found Zoomit skills. I committed to some extra tasks besides equipment testing:

  • Finalise the start and end of the session.
  • Decide on and start producing the main section and demos.
  • Practise the session opening with a view to presenting it to Olivier in September.

Practising will work well with the equipment testing, as I’m going to have lots of disposable footage over the next few weeks…

Summary

In this post, I talked about my July preparations for the upcoming New Stars Of Data 6 event in October 2023. I’ll be posting further updates in the run-up to October, so watch this space!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Training & Community

T-SQL Tuesday #160: Microsoft OpenAI Wishlist

In this post, I host March 2023’s T-SQL Tuesday with an invitation concerning the ongoing Microsoft and OpenAI partnership.

tsql tuesday

Table of Contents

Introduction

Artificial Intelligence has been a big deal in recent months. One of the main drivers of this has been OpenAI, whose DALL-E 2 and ChatGPT services have seen extraordinary public interest and participation.

ChatGPT is currently the fastest-growing consumer application in history It reached 100 million users in its first two months, and has been integrated into numerous applications. One such example is the recent version of DBeaver that I tried out in my previous post.

Microsoft has been one of OpenAI’s most prominent supporters. In July 2019 Microsoft invested $1 billion in OpenAI and became their exclusive cloud provider.

In January 2023 Microsoft announced the latest phase of its multibillion-dollar investment partnership with OpenAI and the general availability of Azure OpenAI Service. Since then, Microsoft announced that it is building AI technology into Microsoft Bing, Edge and Microsoft 365.

T-SQL Tuesday Invitation

My invitation for this month’s T-SQL Tuesday is:

What is on your wishlist for the partnership between Microsoft and OpenAI?

This can include all Microsoft products and services, like:

When posting, please remember the T-SQL Tuesday rules:

  1. Publish contributions on Tuesday 14 March 2023.
  2. Include the T-SQL Tuesday Logo and have it link to this post.
  3. Please add your post’s link in this post’s comments.
  4. Use the #tsql2sday hashtag on social media.

My Wishlist

In this section, I’ll get the ball rolling with some things I’d like to see!

Power BI AI Assistant

Microsoft Power BI is a data visualization tool. It can interact with dozens of data sources and integrates with a wide range of applications. Power BI includes numerous visualizations and can produce professional, functional, and feature-rich dashboards.

However, Power BI is reliant on user skill levels. Like all data visualization tools, Power BI can create bad dashboards in the wrong hands. Like, really bad. Dashboards can suffer from several problems that make them useless at best and misleading at worst.

Enter ChatGPT. Power BI already offers AI Insights and AI Visuals. ChatGPT could join this family of tools, using an AI model trained on the works of visual design specialists like Edward Tufte, Stephen Few and Kate Strachnyi.

ChatGPT could offer features like:

  • Creating visuals and reports based on text prompts:
"Show me how my team's sales are performing this month"
>> Here is a bar chart showing team member performance for the current month.
  • Suggesting visuals and reports based on the data sources available.
  • Assessing dashboards for factors like colour blindness and colour symbolism.
  • Optimising existing dashboards:
"Improve my annual sales dashboard"
>> I have changed the pie chart showing 12 team members to a bar chart, as this will improve the visualization's legibility.

Azure IAC AI Assistant

IAC (Infrastructure As Code) has revolutionized the public cloud industry, bringing with it benefits like:

  • Automated, faster deployments.
  • Repeatable and consistent deployments.
  • Self-documenting infrastructure.

But IAC also presents challenges:

  • IAC scripts rely on the skills of the engineer writing them.
  • Configurations can drift or have unintended consequences.
  • It’s not easy to incorporate existing infrastructure.

ChatGPT could resolve many of these problems, turning infrastructure creation into a conversation. It could, for example:

  • Create infrastructure based on non-technical requests:
"Make me what I need to start a blog."
>> I have created a LAMP stack on a virtual machine in your default region.  Your access details are here:

Username: Username
Password: Password
  • Learn current infrastructure usage patterns and create optimisations for busy and quiet periods.
  • Spot potential conflicts and step in to prevent data loss or downtime.
ClippyProduction
Compliments of the IMGFlip Clippy Meme Generator
  • Make existing infrastructure faster, cheaper or more performant without the need for manual refactoring.
  • Resolve problems like high latency, failing connections and unexpected cost increases:
"Why is my web app generating errors?"
>> One of your virtual machines does not allow connection requests from CIDR range 10.01.10.01/28.  Do you want me to fix this?

"Yes please."
>> I have now amended virtual machine MYAPP001's Network Security Group to accept incomming connection requests from CIDR range 10.01.10.01/28.

Summary

In this post, I hosted March 2023’s T-SQL Tuesday with an invitation concerning the ongoing Microsoft and OpenAI partnership. I look forwards to reading everyone’s responses!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~