
Writing User Stories For An iTunes Dashboard

In this post, I use Agile Methodology to collect requirements and write user stories for my iTunes dashboard.


Introduction

In my recent posts, I’ve been building a basic data pipeline for an iTunes export file. So far I have:

  • Created a basic iTunes ETL with Python and AWS Data Wrangler that extracts data from an iTunes export file into a Pandas DataFrame.
  • Updated that ETL to produce a Parquet file, which I then uploaded to S3 and viewed using Athena.

Now I can think about analysing this data. I’m building a dashboard for this, as dashboards offer advantages including:

  • Dashboards communicate information quickly without having to run reports or write queries.
  • It is easy to sort and filter dashboards.
  • Using dashboard visuals doesn’t require knowledge of the maths and scripting behind them.
  • Dashboard visuals can interact with each other and respond to changes and selections.

Before I start thinking about my dashboard I should review the data. My preferred way of doing this is to create a data dictionary, so let’s start there.

Data Dictionary

In this section, I will explain what a data dictionary is and create one for the iTunes data in my Athena table.

Data Dictionary Introduction

A data dictionary is data about data. Just as a dictionary contains information on and definitions of words, a data dictionary describes the data in a system. The IBM Dictionary of Computing defines a data dictionary as:

“a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format”

IBM Dictionary of Computing

Typical attributes of data dictionaries include:

  • Data type
  • Description
  • Conditions: Is the data required? Does it have dependencies?
  • Default value
  • Minimum and maximum values

There are numerous online resources about data dictionaries. I like this video as a gentle introduction to the topic using real-world examples:

Now let’s apply this to my iTunes data.

iTunes Data Dictionary

Here, I have written a short data dictionary to give my data some context. I have divided the dictionary into data types and given a brief description of each field.

Some fields have two field names because they will be renamed on the dashboard:

  • album will become key: iTunes has no field for musical keys, so I use album to store them.
  • name will become title for clarity.
  • tracknumber will become bpm as, although iTunes does have a BPM field, this is not included in its export files.

Strings:

  • album / key: A track’s musical key.
  • artist: A track’s artist(s).
  • genre: A track’s music genre.
  • name / title: A track’s title and mix.

Integers:

  • myrating: A track’s rating in iTunes. Min 0, max 100, interval 20.
  • myratingint: A track’s rating as an integer. Min 0, max 5, interval 1.
  • plays: A track’s total play count.
  • tracknumber / bpm: A track’s tempo in beats per minute.
  • year: A track’s year of release.

DateTimes:

  • dateadded: The date & time a track was added to iTunes.
  • datemodified: The date & time a track was last modified in iTunes.
  • lastplayed: The date & time a track was last played in iTunes.
  • dateaddeddate: The date a track was added to iTunes.
  • datemodifieddate: The date a track was last modified in iTunes.
  • lastplayeddate: The date a track was last played in iTunes.

For context, this is a typical example of a track’s details in the iTunes GUI:


For completeness, dateadded and datemodified are recorded in the File tab.

Now that the data is defined, I can start thinking about the dashboard. But where do I start?

Beginning The Design Process

In this section, I talk about how I got my dashboard design off the ground.

I didn’t decide to write user stories for my iTunes dashboard straightaway. This post took a few versions to get right, so I wanted to spend some time here running through my learning process.

Learning From The Past

This isn’t the first time I’ve made a dashboard for my own use. However, I have some unused and abandoned dashboards that usually have at least one of these problems:

  • They lack clarity and/or vision, resulting in a confusing user experience.
  • The use of excessive tabs causes a navigational nightmare.
  • Visuals are either poorly or incorrectly chosen.
  • Tables include excessive amounts of data. Even with conditional formatting, insights are hard to find.

When I started designing my iTunes dashboard, I was keen to make something that would be useful and stand the test of time. Having read Information Dashboard Design by Stephen Few, I learned that the problems above are common in dashboard design.

In his book, Stephen gives guidance on design considerations and critiques of sample dashboards that are just as useful now as they were when the book was published in 2006.

So I was now more clued up on design techniques. But that’s only half of what I needed…

What About The User?

I searched online for dashboard design tips while waiting for the book to arrive. While I did get useful results, they didn’t help me answer questions like:

  • How do I identify a dashboard’s purpose and requirements?
  • How do I justify my design choices?
  • What do I measure to confirm that the dashboard adds value?

I was ultimately pointed in the direction of Agile Analytics: A Value-Driven Approach to Business Intelligence and Data Warehousing by Dr Ken Collier. In this book, Ken draws on his professional experience to link the worlds of Business Intelligence and Agile Methodology, including sections on project management, collaboration and user stories.

(I should point out that I’ve only read Chapter 1: Agile Analytics: Management Methods and Chapter 4: User Stories for BI Systems so far, but I’m working on it!)

Wait! I’ve used Agile before! Writing user stories for my iTunes dashboard sounds like a great idea!

So let’s talk about Agile.

Introducing Agile

In this section, I will introduce some Agile concepts and share resources that helped me understand the theory behind them.

Agile Methodology

Agile is a software development methodology focusing on building and delivering software in incremental and iterative steps. The 2001 Manifesto for Agile Software Development declares the core Agile values as:

  • Individuals and interactions over processes and tools.
  • Working software over comprehensive documentation.
  • Customer collaboration over contract negotiation.
  • Responding to change over following a plan.

Online Agile resources are plentiful. Organisations like Atlassian and Wrike have produced extensive Agile resources that are great for beginners and can coach the more experienced.

For simpler introductions, I like Agile In A Nutshell and this Development That Pays video:

Epics

Kanbanize defines Epics as:

“…large pieces of work that can be broken down into smaller and more manageable work items, tasks, or user stories.”

“What Are Epics in Agile? Definition, Examples, and Tracking” by Kanbanize

I found the following Epic resources helpful:

  • Finally, this Dejan Majkic video explains Epics at a high level:

User Stories

Digité defines a User Story as:

“…a short, informal, plain language description of what a user wants to do within a software product to gain something they find valuable.”

“User Stories: What They Are and Why and How to Use Them” by Digité

I found the following User Story resources helpful:

  • Firstly, Atlassian’s article about User Stories explains the theory behind them, introducing User Stories as part of a wider Agile Project Management section.
  • This femke.design video gives a great personable introduction to User Stories:
  • Finally, this Atlassian video assumes some knowledge of User Stories and has more of a training feel:

Personas

Wrike defines Personas as:

“…fictional characteristics of the people that are most likely to buy your product. Personas provide a detailed summary of your ideal customer including demographic traits such as location, age, job title as well as psychographic traits such as behaviors, feelings, needs, and challenges.”

“What Are Agile Personas?” by Wrike

  • This Atlassian video gives additional tips and advice on creating personas:

That’s all the theory for now! Let’s begin writing the user stories I’m going to use for my iTunes dashboard, starting by creating a persona.

My Persona

In this section, I will create the persona I’ll use to write the epic and user stories for my iTunes dashboard. But why bother?

Why Create A Persona?

Some sources consider personas to be optional while others prioritise them highly. I’m using one here for a few reasons:

  • Firstly, I’m my own customer here. Using a persona will make it easier to identify my requirements, and ringfence ‘engineer me’ from ‘user me’.
  • Secondly, the persona will help to focus the project. Once the persona’s goals have been defined, they will present a clear target to work towards.
  • Finally, creating a persona makes this post far easier to write! Pulling my requirements out of thin air doesn’t feel authentic, and writing about myself in the third person is WEIRD.

So who is this persona?

Introducing Biscuit

Meet my stakeholder, Biscuit:

That’s not what stakeholder means! – Ed

Being Biscuit

Biscuit is a sassy, opinionated shark that will tell anyone who listens about when he met Frank Sidebottom.

Biscuit likes listening to dance music. He has a collection of around 3500 tracks and uses iTunes smart playlists to listen to them. His collection ranges from Deep House at around 115 BPM to Drum & Bass at around 180 BPM.

Biscuit is a bit of a music geek. He found out about music theory when he saw key notations on his Anjunabeats record sleeves and wondered what they meant. He uses Mixed In Key to scan his collection, so each track has rich metadata.

Biscuit has various playlists depending on his mood, location and/or activity. He likes to choose between recent favourites and tunes he’s not heard for a while.

Biscuit doesn’t use music streaming services and doesn’t want to due to the internet requirement and the bitrates offered.

Biscuit’s Challenges

The current process of creating new smart playlists involves spending time looking through existing playlists and using trial and error. This usually takes at least an hour and doesn’t guarantee good results.

As the current process of generating new playlists is undesirable, existing and legacy playlists are being used more often. This is creating problems:

  • Tracks in those playlists get disproportionate plays and become stale, while tracks not in those playlists don’t get played for ages.
  • Underused playlists consume iTunes compute resources and iPhone storage space.
  • Time and money are being spent on adding new tracks to the collection as no simple process exists to identify lesser played tracks.

Biscuit’s Goals

Biscuit wants a quicker way to use his iTunes data for building smart playlists. Specifically, he wants to know about the tracks he plays most and least often so that future smart playlists can be focused accordingly.

Biscuit would like to visualise his iTunes data in a dashboard. The dashboard should show:

  • If any listening trends or behaviours exist.
  • Traits of disproportionately overplayed and underplayed tracks.

Biscuit also wants to know if the following factors influence how many times a track is played:

  • BPM: Is there a relationship between a track’s tempo and its play count?
  • Date Added: Recently added tracks will usually have fewer plays, but what about tracks that have been in the collection longer? Do tracks exist with older dateadded dates and low play counts?
  • Rating: The assumption would be that tracks with higher ratings will get played more. Is this correct?
  • Year: Are tunes produced in certain years played more than others? Are there periods of time that future smart playlists should target?

My Epic And User Stories

In this section, I will write the epic and user stories that I will use to design and create my iTunes dashboard.

Designing a dashboard would usually be a single user story within an epic – here, I’ve allocated a user story to each dashboard visual to help keep me focused, as time is currently tight and it can be challenging to find time for this!

Epic: iTunes Play Counts Dashboard

As a playlist builder, Biscuit wants to use a dashboard to analyse the play counts in his iTunes data so that he can simplify the process of creating new smart playlists.

ACCEPTANCE CRITERIA:

  • Biscuit can analyse play totals and see how they are distributed across the bpm, dateadded, rating and year fields.
  • Biscuit can use the dashboard for architecting new smart playlists instead of iTunes.
  • Biscuit can access the dashboard on PC and mobile.
  • The dashboard’s operational costs must be minimal.

User Story: Plays Visual

As a playlist builder, Biscuit wants to see play totals so that it is easier to review and manage his current play intervals.

ACCEPTANCE CRITERIA:

  • Biscuit can see and sort totals of individual plays.
  • Biscuit can see and sort totals of current play intervals.
  • The dashboard can be filtered by plays and current play intervals.
  • The dashboard must use the following play intervals:
    • P0: unplayed.
    • P01-P05: between 1 and 5 plays.
    • P06-P10: between 6 and 10 plays.
    • P11-P20: between 11 and 20 plays.
    • P21-P30: between 21 and 30 plays.
    • P31+: over 30 plays.
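
These acceptance criteria effectively define a binning rule. Purely as an illustration (this isn’t part of the dashboard itself), here is a minimal Pandas sketch of how the play intervals could be derived, assuming a DataFrame df containing the plays column from my Athena table; the helper function and the playinterval column name are hypothetical. The BPM intervals in the next user story could be built the same way with different bin edges and labels.

import pandas as pd

def add_play_interval(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: label each track with the play interval defined above."""
    bins = [-1, 0, 5, 10, 20, 30, float("inf")]
    labels = ["P0", "P01-P05", "P06-P10", "P11-P20", "P21-P30", "P31+"]
    df["playinterval"] = pd.cut(df["plays"], bins=bins, labels=labels)
    return df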

User Story: BPMs Visual

As a playlist builder, Biscuit wants to see how plays are distributed between track BPMs so that he can identify how BPMs influence which tracks are played most and least often.

ACCEPTANCE CRITERIA:

  • Biscuit can see relationships between BPMs and play totals at a high level.
  • Biscuit can both filter the visual and drill down for added precision.
  • The dashboard can be filtered by both BPMs and current BPM intervals.
  • The dashboard must use the following BPM intervals:
    • B000-B126: 126 BPM and under.
    • B127-B129: 127 BPM to 129 BPM.
    • B130-B133: 130 BPM to 133 BPM.
    • B134-B137: 134 BPM to 137 BPM.
    • B138-B140: 138 BPM to 140 BPM.
    • B141-B150: 141 BPM to 150 BPM.
    • B151+: 151 BPM and over.

User Story: Ratings Visual

As a playlist builder, Biscuit wants to see how plays are distributed between iTunes ratings so that he can identify how ratings influence which tracks are played most and least often.

ACCEPTANCE CRITERIA:

  • Biscuit can see relationships between ratings and play totals at a high level.
  • Biscuit can both filter the visual and drill down for added precision.
  • The dashboard can be filtered by rating.

User Story: Date Added Visual

As a playlist builder, Biscuit wants to see how plays are distributed relative to when tracks were added to the collection so that he can identify tracks with abnormally high and low play totals relative to how long they have been in the collection.

ACCEPTANCE CRITERIA:

  • Biscuit can see relationships between the year tracks were added to the collection and play totals at a high level.
  • Biscuit can both filter the visual and drill down for added precision.
  • The dashboard can be filtered by the years that tracks were added in.

User Story: Year Visual

As a playlist builder, Biscuit wants to see how plays are distributed relative to track production years so that he can identify how production years influence which tracks are played most and least often.

ACCEPTANCE CRITERIA:

  • Biscuit can see relationships between production years and play totals at a high level.
  • Biscuit can both filter the visual and drill down for added precision.
  • The dashboard can be filtered by production year.

Summary

In this post, I used Agile Methodology to collect requirements and wrote user stories for my iTunes dashboard.

I created a data dictionary to give context to my iTunes data, examined some high-level Agile concepts and used a persona to write five user stories that I will use to create my iTunes dashboard.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~


Ingesting iTunes Data Into AWS With Python And Athena

In this post, I will update my existing iTunes Python ETL to return a Parquet file, which I will then upload to S3 and view using Athena.


Introduction

In my last post, I made an ETL that exported data from a CSV into a Pandas DataFrame using AWS Data Wrangler. That post ended with the transformed data being saved locally as a new CSV.

It’s time to do something with that data! I want to analyse my iTunes data and look for trends and insights into my listening habits. I also want to access these insights in the cloud, as my laptop is a bit bulky and quite slow. Finally, I’d prefer to keep my costs to a minimum.

Here, I’ll show how AWS and Python can be used together to meet these requirements. Let’s start with AWS.

Amazon S3

In this section, I will update my S3 setup. I’ll create some new buckets and explain my approach.

New S3 Buckets

Currently, I have a single S3 bucket containing my iTunes Export CSV. Moving forward, this bucket will contain all of my unmodified source objects, otherwise known as raw data.

To partner the raw objects bucket, I now have an ingested objects bucket. This bucket will contain objects where the data has been transformed in some way. My analytics tools and Athena tables will point here for their data.

Speaking of Athena, the other new bucket will be used for Athena’s query results. Although Athena is serverless, it still needs a place to record queries and store results. Creating this bucket now will save time later on.

Having separate buckets for each of these functions isn’t a requirement, although it is something I prefer to do. Before moving on, I’d like to run through some of the benefits I find with this approach.

Advantages Of Multiple Buckets

Firstly, having buckets with clearly defined purposes makes navigation way easier. I always know where to find objects, and rarely lose track of or misplace them.

Secondly, having multiple buckets usually makes my S3 paths shorter. This doesn’t sound like much of a benefit upfront, but the S3 path textboxes in the AWS console are quite small, and using long S3 paths in the command line can be a pain.

Finally, I find security and access controls are far simpler to implement with a multi-bucket setup. Personally I prefer “You can’t come into this house/bucket” over “You can come into this house/bucket, but you can’t go into this room/prefix”. However, both S3 buckets and S3 prefixes can be used as IAM policy resources so there’s technically no difference.
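
To make that distinction concrete, here is a rough sketch of the two approaches expressed as IAM policy resources, written as a Python dictionary. The bucket and prefix names are placeholders, not my real resources.

# Illustrative only: scoping S3 object access at bucket level vs prefix level.
whole_bucket_objects = "arn:aws:s3:::datalake-raw-example/*"                # any object in the bucket
single_prefix_objects = "arn:aws:s3:::datalake-shared-example/iTunes/*"     # only objects under one prefix

read_statement = {
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": [whole_bucket_objects],  # or [single_prefix_objects] for the prefix approach
}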

That concludes the S3 section. Next, let’s set up Athena.

Amazon Athena

In this section, I’ll get Athena ready for use. I’ll show the process I followed and explain my key decisions. Let’s start with my reasons for choosing Athena.

Why Athena?

Plenty has been written about Athena’s benefits over the years. So instead of retreading old ground, I’ll discuss what makes Athena a good choice for this particular use case.

Firstly, Athena is cheap. The serverless nature of Athena means I only pay for what I query, scan and store, and I’ve yet to see a charge for Athena in the three years I’ve been an AWS customer.

Secondly, like S3, Athena’s security is managed by IAM. I can use IAM policies to control who and what can access my Athena data, and can monitor that access in CloudTrail. This also means I can manage access to Athena independently of S3.

Finally, Athena is highly available. Authorised calls to the service have a 99.9% Monthly Uptime Percentage SLA and Athena benefits from S3’s availability and durability. This allows 24/7 access to Athena data for users and applications.

Setting Up Athena

To start this section, I recommend reading the AWS Athena Getting Started documentation for a great Athena introduction. I’ll cover some basics here, but I can’t improve on the AWS documentation.

Athena needs three things to get off the ground:

  • An S3 path for Athena query results.
  • A database for Athena tables.
  • A table for interacting with S3 data objects.

I’ve already talked about the S3 path, so let’s move on to the database. A database in Athena is a logical grouping for the tables created in it. Here, I create a blog_amazonwebshark database using the following script:

CREATE DATABASE blog_amazonwebshark

Next, I enter the column names from my iTunes Export CSV into Athena’s Create Table form, along with appropriate data types for each column. In response, the form creates this Athena table:

The form adds several table properties to the table’s DDL. These, along with the data types, are expanded on in the Athena Create Table documentation.

Please note that I have removed the S3 path from the LOCATION property to protect my data. The actual Athena table is pointing at an S3 prefix in my ingested objects bucket that will receive my transformed iTunes data.
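
For reference, a similar table could also be registered programmatically instead of through the console form. This is only a sketch using AWS Data Wrangler’s Glue catalog helper: the S3 path is a placeholder, and the column list and types are my assumptions based on the ETL output described in this post rather than the form-generated DDL.

import awswrangler as wr

wr.catalog.create_parquet_table(
    database = "blog_amazonwebshark",
    table = "basic_itunes_python_etl",
    path = "s3://example-ingested-bucket/itunes/",   # placeholder for my ingested objects prefix
    columns_types = {
        "name": "string", "artist": "string", "album": "string", "genre": "string",
        "tracknumber": "int", "year": "int", "plays": "int", "myrating": "int",
        "myratingdigit": "int",
        "datemodified": "timestamp", "dateadded": "timestamp", "lastplayed": "timestamp",
        "datemodifieddate": "date", "dateaddeddate": "date", "lastplayeddate": "date"
    }
)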

Speaking of data, the form offers several choices of source data format including CSV, JSON and Parquet. I chose Parquet, but why do this when I’m already getting a CSV? Why create extra work?

Let me explain.

About Parquet

Apache Parquet is a file format that supports fast processing for complex data. It can essentially be seen as the next generation of CSV. Both formats have their place, but at scale CSV files have large file sizes and slow performance.

In contrast, Parquet files have built-in compression and indexing for rapid data location and retrieval. In addition, the data in Parquet files is organized by column, resulting in smaller sizes and faster queries.

This also results in Athena cost savings as Athena only needs to read the columns relevant to the queries being run. If the same data was in a CSV, Athena would have to read the entire CSV whether the data is needed or not.
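
The same column pruning is available when reading Parquet with AWS Data Wrangler. As a rough illustration (the S3 path is a placeholder and the column names are from my iTunes data), only the requested columns are fetched:

import awswrangler as wr

# Only the artist and plays columns are read from the Parquet file,
# unlike a CSV which would be scanned in full.
df = wr.s3.read_parquet(
    path = "s3://example-ingested-bucket/itunes/",
    columns = ["artist", "plays"]
)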

For further reading, Databricks have a great Parquet section in their glossary.

That’s everything for Athena. Now I need to update my Python scripts.

Python

In this section, I’ll make changes to my Basic iTunes ETL to include my new S3 and Athena resources and to replace the CSV output with a Parquet file. Let’s start with some variables.

New Python Variables

My first update is a change to ETL_ITU_Play_Variables.py, which contains my global variables. Originally there were two S3 global variables – S3_BUCKET containing the bucket name and S3_PREFIX containing the S3 prefix path leading to the raw data:

S3_BUCKET
S3_PREFIX

Now I have two buckets and two prefixes, so it makes sense to update the variable names. I now have two additional global variables, adding _RAW to the originals and _INGESTED to the new ones for clarity:

S3_BUCKET_RAW
S3_PREFIX_RAW

S3_BUCKET_INGESTED
S3_PREFIX_INGESTED
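
The ETL then builds its S3 paths from these variables. The updated path construction isn’t shown in this post, but based on the original s3_path variable it would presumably look something like this (a sketch; the names in the committed script may differ slightly):

import ETL_ITU_Play_Variables

s3_path_raw = f"s3://{ETL_ITU_Play_Variables.S3_BUCKET_RAW}/{ETL_ITU_Play_Variables.S3_PREFIX_RAW}"
s3_path_ingested = f"s3://{ETL_ITU_Play_Variables.S3_BUCKET_INGESTED}/{ETL_ITU_Play_Variables.S3_PREFIX_INGESTED}"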

Changing CSV To Parquet

The next change is to ETL_ITU_Play.py. The initial version converts a Pandas DataFrame to CSV using pandas.DataFrame.to_csv. I’m now replacing this with awswrangler.s3.to_parquet, which needs three parameters:

  • df: the DataFrame to write.
  • boto3_session: my boto3 session variable.
  • path: the S3 path in the ingested objects bucket.

Put together, it looks like this:

wr.s3.to_parquet(
    df = df,
    boto3_session = session,
    path = s3_path_ingested
)

Before committing my changes, I took the time to put the main workings of my ETL in a class. This provides a clean structure for my Python script and will make it easier to reuse in future projects.
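
The class itself isn’t reproduced here, so purely to illustrate the idea, a skeleton might look something like the following. The method and attribute names are hypothetical rather than taken from the committed script.

import awswrangler as wr

class ITunesETL:
    """Hypothetical skeleton: groups the ETL steps behind one class."""

    def __init__(self, session, s3_path_raw, s3_path_ingested):
        self.session = session
        self.s3_path_raw = s3_path_raw
        self.s3_path_ingested = s3_path_ingested

    def extract(self):
        # Read every CSV under the raw prefix into a DataFrame
        return wr.s3.read_csv(path = self.s3_path_raw, path_suffix = ".csv", boto3_session = self.session)

    def transform(self, df):
        # Column drops, renames, date handling and type casting would live here
        return df

    def load(self, df):
        # Write the transformed DataFrame to the ingested bucket as Parquet
        wr.s3.to_parquet(df = df, boto3_session = self.session, path = self.s3_path_ingested)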

That completes the changes. Let’s review what has been created.

Architecture

Here is an architectural diagram of how everything fits together:

Here is a breakdown of the processes involved:

  1. User runs the Python ETL script locally.
  2. Python reads the CSV object in datalake-raw S3 bucket.
  3. Python extracts data from CSV into a DataFrame and transforms several columns.
  4. Python writes the DataFrame to datalake-ingested S3 bucket as a Parquet file.
  5. Python notifies User of a successful run.
  6. User sends query to Athena.
  7. Athena reads data from datalake-ingested S3 bucket.
  8. Athena returns query results to User.

Testing

In this section, I will test my resources to make sure they work as expected. Bear in mind that this setup hasn’t been designed with production use in mind, so my testing is somewhat limited and would be insufficient for production deployment.

Testing Python

TEST: Upload a CSV to the datalake-raw S3 bucket, then run the Python script. The Python script must run successfully and print updates in the terminal throughout.

RESULT: I upload an iTunes Export CSV to the datalake-raw S3 bucket:

The Python script runs, printing the following output in the terminal:

Creating DataFrame.
DataFrame columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')
Deleting unnecessary DataFrame columns.
Renaming DataFrame columns.
Reformatting DateTime DataFrame columns.
Creating Date Columns From DateTime Columns.
Creating MyRatingDigit Column.
Replacing blank values to prevent IntCastingNaN errors.
Setting Data Types.
Creating Parquet file from DataFrame.
Processes complete.

Testing S3

TEST: After the Python script successfully runs, the datalake-ingested S3 bucket must contain an itunesdata.parquet object.

RESULT: Upon accessing the datalake-ingested S3 bucket, an itunesdata.parquet object is found:

(On an unrelated note, look at the size difference between the Parquet and CSV files!)

Testing Athena

TEST: When the datalake-ingested S3 bucket contains an itunesdata.parquet object, data from the iTunes Export CSV must be shown when the following Athena query is run:

SELECT * FROM basic_itunes_python_etl;

RESULT: Most of the Athena results match the iTunes Export data. However, the transformed dates did not match expectations:

This appears to be a formatting problem, as some parts of a date format are still visible.

To diagnose the problem I wanted to see how these columns were being stored in the Parquet file. I used mukunku’s ParquetViewer for this, which is described in the GitHub repo as:

…a quick and dirty utility that I created to easily view Apache Parquet files on Windows desktop machines.

It works very well!

Here is a screenshot of the data. The lastplayed column has dates and times, while the datemodifieddate column has dates only:

The cause of the problem becomes apparent when the date columns are viewed using the ISO 8601 format:

The date columns are all using timestamps, even when no times are included!

A potential fix would be to change the section of my Python ETL script that handles data types. Instead, I update the data types used in my Athena table from date:

  `datemodifieddate` date, 
  `dateaddeddate` date, 
  `lastplayeddate` date, 

To timestamp:

  `datemodifieddate` timestamp, 
  `dateaddeddate` timestamp, 
  `lastplayeddate` timestamp, 

This time, when I view my Athena table the values all appear as expected:
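
Had I taken the Python route instead, one option would have been the dtype parameter of awswrangler.s3.to_parquet, which casts columns to Athena types at write time. A sketch of that alternative, reusing the df, session and s3_path_ingested variables from the ETL script (this is not the fix I actually applied):

wr.s3.to_parquet(
    df = df,
    boto3_session = session,
    path = s3_path_ingested,
    dtype = {
        "datemodifieddate": "date",
        "dateaddeddate": "date",
        "lastplayeddate": "date"
    }
)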

Scripts

My ETL_ITU_Play.py file commit from 2022-08-08 can be viewed here:

ETL_ITU_Play.py on GitHub

My updated repo readme can be viewed here:

README.md on GitHub

Summary

In this post, I updated my existing iTunes Python ETL to return a Parquet file, which I then uploaded to S3 and viewed using Athena. I explained my reasoning for choosing S3, Athena and the Parquet file format, and I handled a data formatting issue.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~


Creating A Basic iTunes ETL With Python And AWS Data Wrangler

In this post I will use Python and AWS Data Wrangler to create a basic iTunes ETL that extracts data from an iTunes export file into a Pandas DataFrame.


Introduction

For many years I have enjoyed various forms of dance music. Starting with my first compilation CDs in 2000, I’ve since amassed a large collection of records, CDs and virtual media ranging from the late 80s to modern times.

I started using iTunes as my main media player in 2010. Since then I have built up a large database of iTunes metadata that includes various counts, ratings and timestamps.

Currently I use this data for a series of iTunes Smart Playlists. To derive further meaning from the data and to practise my Python skills, I want to extract this data from iTunes and analyse it using the various data tools at my disposal.

To get the ball rolling I’m going to build a basic iTunes ETL, which I will continue to develop over the coming months.

Let’s start by looking at the iTunes export process.

iTunes Export Files

I use iTunes 12.6.4.3. This isn’t by choice – iTunes 12.6.4.3 is the last version with a built-in App Store, allowing my battered old iPhone 3GS to live on in its second life as an iPod Touch:

Still works!

I mention this as newer versions of iTunes may be different, or may not offer an export feature at all. Why do I persist with this ageing setup? That…is a post for another time.

Every week I sync my Not-iPhone via iTunes, and then create an export of my master playlist:

iTunes doesn’t have many export options, and exports playlists as tab-delimited txt files by default:

To give myself an easier time for this post, I manually made the following changes to a recent iTunes export file:

  • Imported the txt file into Microsoft Excel.
  • Removed columns I didn’t want.
  • Saved the altered file as a csv.
  • Uploaded the csv to Amazon S3.

This Franken-File will be what I use to build my basic iTunes ETL. I understand there are ways of dealing with txt files in Python – I’ll be exploring this in future posts.
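
For what it’s worth, Pandas can read tab-delimited files directly, so something like the following might eventually remove the Excel step. This is an untested sketch: the file name is a placeholder, and iTunes exports are often UTF-16 encoded, so the encoding may need adjusting.

import pandas as pd

# Untested sketch: read the iTunes tab-delimited export without the Excel conversion.
df = pd.read_csv("itunes_export.txt", sep = "\t", encoding = "utf-16")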

Setup

Before starting to write any code, I have done the following:

Advisory

During this post, I will make several decisions that will be revisited in the coming months as my skills improve. I have taken steps to protect my AWS credentials (more on that shortly) but at this stage my basic iTunes ETL Python script is a work in progress and should not be used in a Production environment.

Creating Secure Variables

My first job is to create the variables I’m going to need. As these variables can compromise my AWS account in the wrong hands, I want to create them as securely as possible.

The topic of security is something I will be returning to in future posts. For now, I’m using a similar method to PowerShell’s Dot Sourcing in last month’s post.

Python’s import statement can import other Python scripts in the same way as modules. With this in mind, I create a new ETL_ITU_Play_Variables.py file for my variables.

Importing ETL_ITU_Play_Variables into my main script will allow Python to locate the variables and call them successfully:

import ETL_ITU_Play_Variables

aws_accesskey = ETL_ITU_Play_Variables.AWS_ACCESSKEY
aws_secret = ETL_ITU_Play_Variables.AWS_SECRET

Next I create a gitignore file and add ETL_ITU_Play_Variables.py to it. I can now use these variables in my local environment, safe in the knowledge that Git will not track ETL_ITU_Play_Variables and will not include it in any commits.

With that taken care of, I need two sets of variables.

Creating Authentication Variables

AWS authenticates every request before completing it. As none of my AWS resources are public, I need to provide credentials that have the necessary IAM permissions.

There are various ways to provide these credentials – in this case I’m using an AWS Access Key / Secret Key combination with a variable for each string:

aws_accesskey = 'accesskey123456789'
aws_secret = 'secretkey123456789'

As additional security, these keys belong to a new IAM user that only has permission to read S3 objects in the appropriate bucket.

I now need a way to pass these keys to AWS. I use the AWS SDK for Python (Boto3) for this, creating a session variable using boto3.session.Session:

session = boto3.session.Session(
    aws_access_key_id = aws_accesskey,
    aws_secret_access_key = aws_secret
)

Creating S3 Variables

Next, I create the S3 variables I need. I use s3_bucket for the bucket name and s3_prefix for the iTunes export csv's bucket prefix.

s3_bucket = 'example-my-bucket'
s3_prefix = 'Example/MyPath/'

I then use these variables to create s3_path for AWS Data Wrangler to use:

s3_path = f"s3://{s3_bucket}/{s3_prefix}"

Making The ETL

With my variables in place, I can start working on my basic iTunes ETL! AWS is now accepting my requests, so let’s start configuring AWS Data Wrangler.

Creating The DataFrame

AWS Data Wrangler is essentially Pandas on AWS, and the two tools share many commands. This DataEng Uncomplicated AWS Data Wrangler Overview does a great job of explaining the fundamentals:

I read the iTunes Export csv's contents by using awswrangler.s3.read_csv with the following parameters:

  • path: My s3_path variable.
  • path_suffix: The files I want to read, in this case .csv.
  • boto3_session: My session variable.

This reads all the csv files in the S3 path, which is fine for now.

df = wr.s3.read_csv(path = s3_path,
                    path_suffix = ".csv",
                    boto3_session = session
                    )

I can then print the columns in a DataFrame:

print (f'Dataframe columns are {df.columns}')
Dataframe columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')

Deleting Unnecessary Columns

Having seen the list of columns, I can identify some I don’t need. I can get rid of them using pandas.DataFrame.drop:

df = df.drop(columns=
    [
        'Time',
        'Bit Rate',
        'Skips',
        'Last Skipped',
        'Location'
    ]
)

Now, when I print the list of columns, the removed columns are no longer included:

print (f'Dataframe columns are now {df.columns}')
Dataframe columns are now Index(['Name', 'Artist', 'Album', 'Genre', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Plays', 'Last Played', 'My Rating'], dtype='object')

Renaming Columns

Next, I want to rename the columns. I use pandas.DataFrame.rename to map the current column names to the new ones:

df = df.rename(columns=
    {
        'Name' : 'name',
        'Artist' : 'artist',
        'Album' : 'album',
        'Genre' : 'genre',
        'Track Number' : 'tracknumber',
        'Year' : 'year',
        'Date Modified' : 'datemodified',
        'Date Added' : 'dateadded',
        'Plays' : 'plays',
        'Last Played' : 'lastplayed',
        'My Rating' : 'myrating'
    }
)

The columns are now changed to:

print (f'Dataframe columns are now named {df.columns}')
Dataframe columns are now named Index(['name', 'artist', 'album', 'genre', 'tracknumber', 'year', 'datemodified', 'dateadded', 'plays', 'lastplayed', 'myrating'], dtype='object')

Reformatting DateTime Columns

I now want to make sure that the dates in my DataFrame are stored in ISO 8601 format, as this will make them easier to work with and report against.

When I print the dateadded column as an example, the dates are not currently in this format:

print (f'Dataframe Date Added column is {df.dateadded}')
1       05/04/2021 13:29
2       26/01/2019 18:25
3       30/12/2016 17:34
4       12/12/2015 00:43

I can resolve this using the dayfirst and yearfirst arguments of pandas.to_datetime:

df['dateadded'] = pd.to_datetime(df['dateadded'],yearfirst=False,dayfirst=True)

This tells Pandas how to interpret the dates. In the case of 05/04/2021, dayfirst=True tells Pandas this is 5th April 2021, as opposed to 4th May 2021.

Pandas then parses the rest of my dates in the same way, giving me the formatting I want:

1      2021-04-05 13:29:00
2      2019-01-26 18:25:00
3      2016-12-30 17:34:00
4      2015-12-12 00:43:00

I repeat this for the datemodified and lastplayed columns.
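
In other words, the same call is applied to the other two DateTime columns – something along these lines:

df['datemodified'] = pd.to_datetime(df['datemodified'],yearfirst=False,dayfirst=True)
df['lastplayed'] = pd.to_datetime(df['lastplayed'],yearfirst=False,dayfirst=True)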

Creating Date Columns From DateTime Columns

I now want to create some new columns in my DataFrame.

The first of these new columns will mirror the values in the existing date columns. However, these columns will not contain the full timestamp – they will only contain the date instead. This will make it easier to aggregate my data.

To do this, I use pandas.Series.dt.date to create three new columns in the DataFrame:

df['datemodifieddate'] = df['datemodified'].dt.date
df['dateaddeddate'] = df['dateadded'].dt.date
df['lastplayeddate'] = df['lastplayed'].dt.date

The new columns retain the original date values and remove the unneeded time values:

print (f'Dataframe Date Added Date column is {df.dateaddeddate}')
1       2021-04-05
2       2019-01-26
3       2016-12-30
4       2015-12-12

Creating Simplified Rating Columns

I now want to add another column to the DataFrame to simplify reporting against a track’s rating. Ratings in iTunes export files appear in multiples of twenty:

  • 1 star = 20
  • 2 stars = 40
  • 3 stars = 60
  • 4 stars = 80
  • 5 stars = 100

In my current DataFrame, printing myrating produces this:

print (f'Dataframe My Rating is {df.myrating}')
1        40.0
2        40.0
3        60.0
4        80.0

This produces a disconnect between the data in the DataFrame and the data in the iTunes GUI. I would prefer to keep things simple by having a column where the rating value mirrors the iTunes GUI.

This can be added to my DataFrame by using a function. I define an itunes_rating function that will return an integer based on the value that is passed to it:

def itunes_rating(r):
    """Converts ratings in export file to familiar format"""
    if r == 20:
        return 1
    elif r == 40:
        return 2
    elif r == 60:
        return 3
    elif r == 80:
        return 4
    elif r == 100:
        return 5
    else:
        return 0

I then create a new myratingdigit column in my DataFrame by passing each value in the myrating column to the itunes_rating function and capturing the result:

df['myratingdigit'] = df['myrating'].apply(itunes_rating)

And when I print the new column, the results are as expected:

print (f'Dataframe My Rating Digit is {df.myratingdigit}')
1       2
2       2
3       3
4       4
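
As a side note on the design, the chain of elif statements could also be written as a dictionary lookup. This is an equivalent sketch rather than the version used in the script; blank or unexpected ratings fall back to 0 in the same way as the function’s else branch:

# Equivalent mapping approach (not the committed implementation).
RATING_MAP = {20: 1, 40: 2, 60: 3, 80: 4, 100: 5}
df['myratingdigit'] = df['myrating'].map(RATING_MAP).fillna(0).astype(int)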

Setting Data Types

Finally, I want to make sure the DataFrame is using the correct data types for each column. Pandas will usually infer data types correctly but doesn’t always get it right.

I can use pandas.DataFrame.dtypes to see the current data types in my DataFrame. At the moment they are:

name                        object
artist                      object
album                       object
genre                       object
tracknumber                  int64
year                         int64
datemodified        datetime64[ns]
dateadded           datetime64[ns]
plays                      float64
lastplayed          datetime64[ns]
myrating                   float64
datemodifieddate            object
dateaddeddate               object
lastplayeddate              object
myratingdigit                int64

Most of these are correct but some need changing. For example, plays will never have decimal places so should be int, and columns like datemodifieddate should be datetime64.

Pandas has several options for this, which are laid out in this helpful Stack Overflow thread. Here, I use astype to assign data types to my dataframe:

df = df.astype(
    {
        'name' : str,
        'artist' : str,
        'album' : str,
        'genre' : str,
        'tracknumber' : int,
        'year' : int,
        'datemodified' : datetime64,
        'dateadded' : datetime64,
        'plays' : int,
        'lastplayed' : datetime64,
        'myrating' : int,
        'datemodifieddate' : datetime64,
        'dateaddeddate' : datetime64,
        'lastplayeddate' : datetime64,
        'myratingdigit' : int
    }
)

Pandas uses NumPy datetime64 dtypes for working with time series data, so I import it at the top of my script:

from numpy import datetime64

Fixing A Casting Exception

Unfortunately, while testing the newly assigned dtypes I started getting an error:

Exception has occurred: IntCastingNaNError
Cannot convert non-finite values (NA or inf) to integer

This error means that at least one of the columns I’m trying to cast as int contains an empty value. An infinite value is possible, but unlikely due to the various integrity checks iTunes performs on its library.

To find the empty values, I create a second DataFrame using the data in the first, using pandas.DataFrame.isna and pandas.DataFrame.any to find any NA values:

df1 = df[df.isna().any(axis=1)]

Included within the resulting DataFrame were the following tracks:

3571	7 Hours (Original Mix)	Dan Stone	07A-Dm	...	2019-01-26	NaT	1

3575	8th Wonder (Espen & Stian Remix)	8 Wonders	04A-Fm	...	2019-01-26	NaT	1

Checking iTunes shows that these tracks have no plays:

iTunes represents no plays as an empty string as opposed to a zero. This is then extracted into the DataFrame as NA, causing the IntCastingNaN error.

To fix this, I use pandas.DataFrame.fillna to replace the empty fields with zero. Although only the plays column is generating the error, I apply fillna to all the columns being cast as int to prevent any future problems for the ETL:

df['tracknumber'] = df['tracknumber'].fillna(0)
df['year'] = df['year'].fillna(0)
df['plays'] = df['plays'].fillna(0)
df['myrating'] = df['myrating'].fillna(0)

The myratingdigit column doesn’t need this approach, since my itunes_rating function always returns zero if no conditions are met.

This time, printing the data types shows an acceptable list:

name                        object
artist                      object
album                       object
genre                       object
tracknumber                  int64
year                         int64
datemodified        datetime64[ns]
dateadded           datetime64[ns]
plays                        int64
lastplayed          datetime64[ns]
myrating                     int64
datemodifieddate    datetime64[ns]
dateaddeddate       datetime64[ns]
lastplayeddate      datetime64[ns]
myratingdigit                int64

Exporting The DataFrame As A CSV

This is as far as I’m going to take the DataFrame in this post. As a final check, I want to extract the DataFrame in some form to confirm its suitability for future work I have planned.

The quickest way to do this is with pandas.DataFrame.to_csv. This writes the entire DataFrame to a csv file. When I run:

df.to_csv('ETL-ITU.csv')

An ETL-ITU.csv file is created in the terminal’s working directory, which can be viewed and sandboxed as needed.

Scripts

My gitignore file commit from 2022-07-17 can be viewed here:

Basic_iTunes_Python_ETL .gitignore on GitHub

My ETL_ITU_Play.py file commit from 2022-07-17 can be viewed here:

ETL_ITU_Play.py on GitHub

A requirements.txt file has also been created to aid installation. The file commit from 2022-07-20 can be viewed here:

Basic_iTunes_Python_ETL requirements.txt on GitHub

Summary

In this post I used Python and AWS Data Wrangler to create a basic iTunes ETL that extracts data from an iTunes export file into a Pandas DataFrame. I have used various Python modules to extract and transform the data, and the data is now ready to be loaded to a staging area of my choosing.

Expect to see further posts on this in the coming months. This basic iTunes ETL probably won’t stay basic for long!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~