Data & Analytics

Creating A Basic iTunes ETL With Python And AWS Data Wrangler

In this post I will use Python and AWS Data Wrangler to create a basic iTunes ETL that extracts data from an iTunes export file into a Pandas DataFrame.


For many years I have enjoyed various forms of dance music. Starting with my first compilation CDs in 2000, I’ve since amassed a large collection of records, CDs and virtual media ranging from the late 80s to modern times.

I started using iTunes as my main media player in 2010. Since then I have built up a large database of iTunes metadata that includes various counts, ratings and timestamps.

Currently I use this data for a series of iTunes Smart Playlists. To derive further meaning from the data and to practise my Python skills, I want to extract this data from iTunes and analyse it using the various data tools at my disposal.

To get the ball rolling I’m going to build a basic iTunes ETL, which I will continue to develop over the coming months.

Let’s start by looking at the iTunes export process.

iTunes Export Files

I use an older version of iTunes. This isn’t by choice – it is the last version with a built-in App Store, allowing my battered old iPhone 3GS to live on in its second life as an iPod Touch:

Still works!

I mention this as newer versions of iTunes may be different, or may not offer an export feature at all. Why do I persist with this ageing setup? That…is a post for another time.

Every week I sync my Not-iPhone via iTunes, and then create an export of my master playlist:

iTunes doesn’t have many export options, and exports playlists as tab-delimited txt files by default:

To give myself an easier time for this post, I manually made the following changes to a recent iTunes export file:

  • Imported the txt file into Microsoft Excel.
  • Removed columns I didn’t want.
  • Saved the altered file as a csv.
  • Uploaded the csv to Amazon S3.

This Franken-File will be what I use to build my basic iTunes ETL. I understand there are ways of dealing with txt files in Python – I’ll be exploring this in future posts.
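As a rough idea of what handling the tab-delimited file directly could look like, here is a minimal sketch using Python's built-in csv module. The sample text and its three columns are hypothetical stand-ins for a real export, which has many more columns, and the helper name read_tab_delimited is an assumption made for illustration.

```python
import csv
import io

# Hypothetical two-track sample standing in for an iTunes export;
# a real export has many more columns.
sample_export = (
    "Name\tArtist\tPlays\n"
    "Example Track One\tExample Artist\t\n"
    "Example Track Two\tExample Artist\t12\n"
)


def read_tab_delimited(text):
    """Return a list of dicts, one per track, keyed by column header."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


tracks = read_tab_delimited(sample_export)
print(tracks[1]["Plays"])
```

Note that the empty Plays field on the first track comes through as an empty string, which is the same kind of gap that causes the casting problem later in this post.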


Before starting to write any code, I have done the following:


During this post, I will make several decisions that will be revisited in the coming months as my skills improve. I have taken steps to protect my AWS credentials (more on that shortly) but at this stage my basic iTunes ETL Python script is a work in progress and should not be used in a Production environment.

Creating Secure Variables

My first job is to create the variables I’m going to need. As these variables can compromise my AWS account in the wrong hands, I want to create them as securely as possible.

The topic of security is something I will be returning to in future posts. For now, I’m using a similar method to PowerShell’s Dot Sourcing in last month’s post.

Python’s import statement can import other Python scripts in the same way as modules. With this in mind, I create a new file for my variables.

Importing ETL_ITU_Play_Variables into my main script will allow Python to locate the variables and call them successfully:

import ETL_ITU_Play_Variables

aws_accesskey = ETL_ITU_Play_Variables.AWS_ACCESSKEY
aws_secret = ETL_ITU_Play_Variables.AWS_SECRET

Next I create a gitignore file and add my variables file to it. I can now use these variables in my local environment, safe in the knowledge that Git will not track ETL_ITU_Play_Variables and will not include it in any commits.

With that taken care of, I need two sets of variables.

Creating Authentication Variables

AWS authenticates every request before completing it. As none of my AWS resources are public, I need to provide credentials that have the necessary IAM permissions.

There are various ways to provide these credentials – in this case I’m using an AWS Access Key / Secret Key combination with a variable for each string:

aws_accesskey = 'accesskey123456789'
aws_secret = 'secretkey123456789'

As additional security, these keys belong to a new IAM user that only has permission to read S3 objects in the appropriate bucket.

I now need a way to pass these keys to AWS. I use the AWS SDK for Python (Boto3) for this, creating a session variable using boto3.session.Session:

import boto3

session = boto3.session.Session(
    aws_access_key_id = aws_accesskey,
    aws_secret_access_key = aws_secret
)

Creating S3 Variables

Next I create the S3 variables I need. I use s3_bucket for the bucket name and s3_prefix for the iTunes export csv’s bucket prefix.

s3_bucket = 'example-my-bucket'
s3_prefix = 'Example/MyPath/'

I then use these variables to create s3_path for AWS Data Wrangler to use:

s3_path = f"s3://{s3_bucket}/{s3_prefix}"

Making The ETL

With my variables in place, I can start working on my basic iTunes ETL! AWS is now accepting my requests, so let’s start configuring AWS Data Wrangler.

Creating The DataFrame

AWS Data Wrangler is essentially Pandas on AWS, and the two tools share many commands. This DataEng Uncomplicated AWS Data Wrangler Overview does a great job of explaining the fundamentals:

I read the iTunes Export csv’s contents by using awswrangler.s3.read_csv with the following parameters:

  • path: My s3_path variable.
  • path_suffix: The files I want to read, in this case .csv.
  • boto3_session: My session variable.

This reads all the csv files in the S3 path, which is fine for now.

import awswrangler as wr

df = wr.s3.read_csv(path = s3_path,
                    path_suffix = ".csv",
                    boto3_session = session)

I can then print the columns in a DataFrame:

print (f'Dataframe columns are {df.columns}')
Dataframe columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')

Deleting Unnecessary Columns

Having seen the list of columns, there are some I don’t need. I can get rid of them using pandas.DataFrame.drop:

df = df.drop(columns=[
        'Time', 'Bit Rate', 'Skips',
        'Last Skipped', 'Location'
    ])

Now, when I print the list of columns, the removed columns are no longer included:

print (f'Dataframe columns are now {df.columns}')
Dataframe columns are now Index(['Name', 'Artist', 'Album', 'Genre', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Plays', 'Last Played', 'My Rating'], dtype='object')
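One thing to be aware of is that drop raises a KeyError if a listed column is missing, which could happen if a future export has a different layout. A minimal sketch of one way to tolerate that, assuming a hypothetical one-row DataFrame and a deliberately absent column name, is to pass errors="ignore":

```python
import pandas as pd

# Hypothetical one-row frame using a few of the column names from the export.
df_example = pd.DataFrame({
    "Name": ["Example Track"],
    "Bit Rate": [320],
    "Last Skipped": [None],
    "Plays": [12],
})

# "Not A Column" is deliberately absent to show the behaviour.
unwanted = ["Bit Rate", "Last Skipped", "Not A Column"]

# errors="ignore" skips labels that are not present instead of raising KeyError.
df_example = df_example.drop(columns=unwanted, errors="ignore")

print(list(df_example.columns))
```

Whether silently skipping a missing column is desirable depends on the ETL. Letting the error surface may be the better choice if a changed export layout should stop the run.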

Renaming Columns

Next, I want to rename the columns. I use pandas.DataFrame.rename to map the current column names to the new ones:

df = df.rename(columns={
        'Name' : 'name',
        'Artist' : 'artist',
        'Album' : 'album',
        'Genre' : 'genre',
        'Track Number' : 'tracknumber',
        'Year' : 'year',
        'Date Modified' : 'datemodified',
        'Date Added' : 'dateadded',
        'Plays' : 'plays',
        'Last Played' : 'lastplayed',
        'My Rating' : 'myrating'
    })

The columns are now changed to:

print (f'Dataframe columns are now named {df.columns}')
Dataframe columns are now named Index(['name', 'artist', 'album', 'genre', 'tracknumber', 'year', 'datemodified', 'dateadded', 'plays', 'lastplayed', 'myrating'], dtype='object')

Reformatting DateTime Columns

I now want to make sure that the dates in my DataFrame are stored in ISO 8601 format, as this will make them easier to work with and report against.

When I print the dateadded column as an example, the dates are not currently in this format:

print (f'Dataframe Date Added column is {df.dateadded}')
1       05/04/2021 13:29
2       26/01/2019 18:25
3       30/12/2016 17:34
4       12/12/2015 00:43

I can resolve this using the dayfirst and yearfirst arguments of pandas.to_datetime:

df['dateadded'] = pd.to_datetime(df['dateadded'],yearfirst=False,dayfirst=True)

This tells Pandas how to interpret the dates. In the case of 05/04/2021, dayfirst=True tells Pandas this is 5th April 2021, as opposed to 4th May 2021.

Pandas then parses the rest of my dates in the same way, giving me the formatting I want:

1      2021-04-05 13:29:00
2      2019-01-26 18:25:00
3      2016-12-30 17:34:00
4      2015-12-12 00:43:00

I repeat this for the datemodified and lastplayed columns.
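Repeating the conversion could be done with a short loop rather than three separate lines. This is a minimal sketch assuming a hypothetical two-row DataFrame in the same day-first layout as the export. The column names match the renamed columns above, but the sample values are made up.

```python
import pandas as pd

# Hypothetical sample rows in the same day-first layout as the export.
df_dates = pd.DataFrame({
    "dateadded": ["05/04/2021 13:29", "26/01/2019 18:25"],
    "datemodified": ["06/04/2021 09:00", "27/01/2019 10:15"],
    "lastplayed": ["07/04/2021 21:45", None],
})

for column in ["dateadded", "datemodified", "lastplayed"]:
    # dayfirst=True reads 05/04/2021 as 5th April, not 4th May.
    df_dates[column] = pd.to_datetime(df_dates[column], dayfirst=True)

print(df_dates["dateadded"].iloc[0])
```

The missing lastplayed value becomes NaT, which is how Pandas represents an empty datetime.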

Creating Date Columns From DateTime Columns

I now want to create some new columns in my DataFrame.

The first of these new columns will mirror the values in the existing date columns. However, these columns will not contain the full timestamp – they will only contain the date instead. This will make it easier to aggregate my data.

To do this, I use pandas.Series.dt.date to create three new columns in the DataFrame:

df['datemodifieddate'] = df['datemodified'].dt.date
df['dateaddeddate'] = df['dateadded'].dt.date
df['lastplayeddate'] = df['lastplayed'].dt.date

The new columns retain the original date values and remove the unneeded time values:

print (f'Dataframe Date Added Date column is {df.dateaddeddate}')
1       2021-04-05
2       2019-01-26
3       2016-12-30
4       2015-12-12

Creating Simplified Rating Columns

I now want to add another column to the DataFrame to simplify reporting against a track’s rating. Ratings in iTunes export files appear in multiples of twenty:

  • 1 star = 20
  • 2 stars = 40
  • 3 stars = 60
  • 4 stars = 80
  • 5 stars = 100

In my current DataFrame, printing myrating produces this:

print (f'Dataframe My Rating is {df.myrating}')
1        40.0
2        40.0
3        60.0
4        80.0

This produces a disconnect between the data in the DataFrame and the data in the iTunes GUI. I would prefer to keep things simple by having a column where the rating value mirrors the iTunes GUI.

This can be added to my DataFrame by using a function. I define an itunes_rating function that will return an integer based on the value that is passed to it:

def itunes_rating(r):
    """Converts ratings in export file to familiar format"""
    if r == 20:
        return 1
    elif r == 40:
        return 2
    elif r == 60:
        return 3
    elif r == 80:
        return 4
    elif r == 100:
        return 5
    else:
        return 0

I then create a new myratingdigit column in my DataFrame by passing each value in the myrating column to the itunes_rating function and capturing the result:

df['myratingdigit'] = df['myrating'].apply(itunes_rating)

And when I print the new column, the results are as expected:

print (f'Dataframe My Rating Digit is {df.myratingdigit}')
1       2
2       2
3       3
4       4
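The same mapping could also be expressed as a dictionary lookup, which keeps the rating values in one place. This is an alternative sketch rather than the function used in this post. The names RATING_TO_STARS and itunes_rating_lookup are hypothetical, and the explicit NaN check is an assumption about how unrated tracks arrive.

```python
import math

# Export value -> star count, as listed above.
RATING_TO_STARS = {20: 1, 40: 2, 60: 3, 80: 4, 100: 5}


def itunes_rating_lookup(r):
    """Return the star count for an export rating, or 0 if unrecognised."""
    if isinstance(r, float) and math.isnan(r):
        # Treat a missing rating as zero stars.
        return 0
    return RATING_TO_STARS.get(r, 0)


print(itunes_rating_lookup(40.0))
```

It can be passed to apply in the same way as itunes_rating, since float values such as 40.0 match the integer keys.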

Setting Data Types

Finally, I want to make sure the DataFrame is using the correct data types for each column. Pandas will usually infer data types correctly but doesn’t always get it right.

I can use pandas.DataFrame.dtypes to see the current data types in my DataFrame. At the moment they are:

name                        object
artist                      object
album                       object
genre                       object
tracknumber                  int64
year                         int64
datemodified        datetime64[ns]
dateadded           datetime64[ns]
plays                      float64
lastplayed          datetime64[ns]
myrating                   float64
datemodifieddate            object
dateaddeddate               object
lastplayeddate              object
myratingdigit                int64

Most of these are correct but some need changing. For example, plays will never have decimal places so should be int, and columns like datemodifieddate should be datetime64.

Pandas has several options for this, which are laid out in this helpful Stack Overflow thread. Here, I use astype to assign data types to my dataframe:

df = df.astype({
        'name' : str,
        'artist' : str,
        'album' : str,
        'genre' : str,
        'tracknumber' : int,
        'year' : int,
        'datemodified' : datetime64,
        'dateadded' : datetime64,
        'plays' : int,
        'lastplayed' : datetime64,
        'myrating' : int,
        'datemodifieddate' : datetime64,
        'dateaddeddate' : datetime64,
        'lastplayeddate' : datetime64,
        'myratingdigit' : int
    })

Pandas uses NumPy datetime64 dtypes for working with time series data, so I import it at the top of my script:

from numpy import datetime64

Fixing A Casting Exception

Unfortunately, while testing the newly assigned dtypes I started getting an error:

Exception has occurred: IntCastingNaNError
Cannot convert non-finite values (NA or inf) to integer

This error means that at least one of the columns I’m trying to cast as int contains an empty value. An infinite value is possible, but unlikely due to the various integrity checks iTunes performs on its library.

To find the empty values, I create a second DataFrame using the data in the first, using pandas.DataFrame.isna and pandas.DataFrame.any to find any NA values:

df1 = df[df.isna().any(axis=1)]

Included within the resulting DataFrame were the following tracks:

3571	7 Hours (Original Mix)	Dan Stone	07A-Dm	...	2019-01-26	NaT	1

3575	8th Wonder (Espen & Stian Remix)	8 Wonders	04A-Fm	...	2019-01-26	NaT	1

Checking iTunes shows that these tracks have no plays:

iTunes represents no plays as an empty string as opposed to a zero. This is then extracted into the DataFrame as NA, causing the IntCastingNaNError.

To fix this, I use pandas.DataFrame.fillna to replace the empty fields with zero. Although only the plays column is generating the error, I apply fillna to all the columns being cast as int to prevent any future problems for the ETL:

df['tracknumber'] = df['tracknumber'].fillna(0)
df['year'] = df['year'].fillna(0)
df['plays'] = df['plays'].fillna(0)
df['myrating'] = df['myrating'].fillna(0)

The myratingdigit column doesn’t need this approach, since my itunes_rating function always returns zero if no conditions are met.
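The failure and the fix can be reproduced in isolation. This is a minimal sketch assuming a hypothetical three-value plays column, where the missing value mimics an unplayed track:

```python
import pandas as pd

# Hypothetical plays column: the missing value mimics an unplayed track.
plays = pd.Series([12.0, None, 3.0], name="plays")

try:
    plays.astype(int)
    cast_failed = False
except ValueError as error:
    # Recent pandas versions raise IntCastingNaNError, a ValueError subclass.
    cast_failed = True
    print(f"Cast failed: {error}")

# Replacing the missing value first lets the cast succeed.
plays_clean = plays.fillna(0).astype(int)
print(plays_clean.tolist())
```

Catching ValueError rather than the specific exception keeps the sketch working on older pandas versions too.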

This time, printing the data types shows an acceptable list:

name                        object
artist                      object
album                       object
genre                       object
tracknumber                  int64
year                         int64
datemodified        datetime64[ns]
dateadded           datetime64[ns]
plays                        int64
lastplayed          datetime64[ns]
myrating                     int64
datemodifieddate    datetime64[ns]
dateaddeddate       datetime64[ns]
lastplayeddate      datetime64[ns]
myratingdigit                int64

Exporting The DataFrame As A CSV

This is as far as I’m going to take the DataFrame in this post. As a final check, I want to extract the DataFrame in some form to confirm its suitability for future work I have planned.

The quickest way to do this is with pandas.DataFrame.to_csv, which writes the entire DataFrame to a csv file. When I run it, an ETL-ITU.csv file is created in the terminal’s working directory that can be viewed and sandboxed as needed.
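For illustration, here is a minimal sketch of a to_csv call on a hypothetical two-row DataFrame. It writes to an in-memory buffer so it has no side effects, and index=False is my assumption here rather than a setting confirmed by the script.

```python
import io
import pandas as pd

# Hypothetical two-row frame standing in for the transformed DataFrame.
df_out = pd.DataFrame({
    "name": ["Example Track", "Another Track"],
    "plays": [12, 0],
})

# Writing to an in-memory buffer keeps the sketch free of side effects;
# passing a filename such as "ETL-ITU.csv" would write to disk instead.
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)

print(buffer.getvalue())
```

Leaving index=False out would add the DataFrame index as an unnamed first column in the file.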


My gitignore file commit from 2022-07-17 can be viewed here:

Basic_iTunes_Python_ETL .gitignore on GitHub

My file commit from 2022-07-17 can be viewed here: on GitHub

A requirements.txt file has also been created to aid installation. The file commit from 2022-07-20 can be viewed here:

Basic_iTunes_Python_ETL requirements.txt on GitHub


In this post I used Python and AWS Data Wrangler to create a basic iTunes ETL that extracts data from an iTunes export file into a Pandas DataFrame. I have used various Python modules to extract and transform the data, and the data is now ready to be loaded to a staging area of my choosing.

Expect to see further posts on this in the coming months. This basic iTunes ETL probably won’t stay basic for long!

Thanks for reading ~~^~~

Internet Of Things & Robotics

Building A Raspberry Pi Zero Lean Green Machine

In this post I will set up my new Raspberry Pi Zero, wire up a moisture sensor and use Python to convert the sensor data into email alerts.


I recently wrote about setting up my Raspberry Pi 4. This isn’t the only Pi I have in the house though, as I put a Raspberry Pi Zero on my Christmas wishlist at about the same time and it turns out Santa was taking notes. Glad I got it before the current supply issues!

This post will flow differently from the last one as the Raspberry Pi Zero didn’t come as a kit. This time my board came in a plastic sleeve inside an unbranded white box. It came with GPIO pins attached (so I don’t have to learn how to solder properly yet!), but it didn’t come with a MicroSD card containing the Raspberry Pi OS. So let’s start there!

Installing Raspberry Pi OS 11 On A MicroSD Card

Raspberry Pis have no hard drives. Instead, they rely on memory cards for their storage. These memory cards can run a wide range of operating systems, the main limitations being the processor’s power and the operating system’s security.

For example, Windows 10 could be loaded onto a memory card but a Raspberry Pi Zero would have no chance of running it well. On the other hand, a Raspberry Pi might be able to run Windows 95 but it’d be a security nightmare.

Most use cases for a Raspberry Pi lend themselves to Raspberry Pi OS (previously called Raspbian) – a Debian-based operating system that is optimized for most Raspberry Pi hardware. Here, I’ll be installing it using the Raspberry Pi Imager.

Raspberry Pi Imager: Main Options

In my last post I mentioned that I’d need to take a look at the Raspberry Pi Imager. This software installs Raspberry Pi OS on MicroSD cards and replaces NOOBS. Versions are available for Windows, Mac and Ubuntu.

Firstly I need to choose an Operating System. There are a lot of choices here! There are different versions of the Raspberry Pi OS depending on whether I want Buster or Bullseye, and whether I want a desktop environment or not. Alternatively, I can choose to install a non-Raspberry Pi OS such as Ubuntu or Manjaro Linux.

Here the recommended setup is fine.

Next I need to choose my storage. I got a good deal on a multipack of Sandisk 64GB MicroSD cards, and will use one of those instead of overwriting the MicroSD card that came with my Labists Raspberry Pi 4 4GB Complete Starter Kit.

Raspberry Pi Imager: Advanced Options

After selecting the main options I can then access some advanced options to further customise my Raspberry Pi OS installation.

These Advanced Options are as follows:

Set Hostname

The hostname of a Raspberry Pi enables it to be addressed by a name as well as an IP address. It is how a Raspberry Pi identifies itself to other systems on a local network. By default, the hostname is set to raspberrypi, but as I have more than one Raspberry Pi I will change this to avoid confusion.

Arise, mako!

Enable SSH

This option is essentially the same as the one in the Raspberry Pi Configuration settings. Enabling this now will save me a job later on.

Set Username And Password

This is a recent addition. The default username and password for any new Pi are pi and raspberry respectively, but a default password carries obvious security problems if unchanged.

Knowing a username isn’t as risky, but leaving the default in place makes life easier for a potential hacker so changing it is in my best interests. And no – it’s not mako.

Configure Wireless LAN

This is a technical way of saying ‘Set Up Wifi’.

Set Locale Settings

This selects the timezone and keyboard layout. Both were already set to GB, so this might have used my laptop’s settings. No changes needed here.

Writing To The MicroSD Card

Finally I started writing to the card. Seven minutes later I had a card which, upon insertion into my Raspberry Pi Zero and after a couple of minutes finding its feet, gave me a working installation of the latest version of Raspberry Pi OS 11 to play with!

Setting Up The Moisture Sensor

Time for my Raspberry Pi Zero to get its first job! This comes in the form of the ModMyPi Soil Moisture Sensor Kit.

The moisture sensor board features both analogue and digital outputs. The digital output gives an On or Off signal when the soil moisture content is above a certain value. The value can be set or calibrated using the adjustable onboard potentiometer.

Let’s get building!

Wiring Up The Sensor

The sensor kit comes with a jumper wire for connecting the sensor spade to the comparator board. It doesn’t come with any wires to connect the board to the Raspberry Pi though! Some hastily ordered jumper wires later and I was back in business.

As a first-timer potentially talking to other first-timers, I will say that the process of connecting the jumper cables to the various pins is an anxious experience.

All my senses were telling me that the pins looked delicate, so I used minimal pressure with the jumper wires and wondered why they kept coming off. It turns out these pins were designed to cope with some abuse after all, so being heavy-handed is encouraged. Just give the wires a good push!

It is also important to connect the wires to the correct GPIO pins on the Raspberry Pi. I used the Pi4J Project’s GPIO diagram to make sure the correct wires were connected to the correct pins.

Checking The Python Script

ModMyPi offer a Python script for this sensor on their GitHub repo. Let’s run through it and see what it does.

Importing Libraries

The script uses three libraries, all of which are included by default with Raspberry Pi OS 11:

  • RPi.GPIO: for controlling the GPIO pins on the Raspberry Pi.
  • smtplib: sends emails via SMTP.
  • time: for a sleep function that is part of the script.

Sending Emails

A sendEmail function uses the smtplib library and a set of variables to send emails when the function is called. The script prints "Successfully sent email" for successes and "Error: unable to send email" for failures.

Monitoring GPIO Pin

A callback function uses the RPi.GPIO library and a pair of variables to monitor the GPIO pin that the sensor is connected to. When a change is registered, one of two emails is sent via the sendEmail function depending on whether the sensor’s output is On or Off.

To keep the script running, the time library is used to add a slight delay to an infinite loop. This stops all of the CPU being used by the script, which would leave none for Raspberry Pi OS.
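The overall shape of that monitoring logic can be sketched as follows. This is not the ModMyPi script itself. The pin number, the function names and the mapping between pin level and On or Off are all assumptions for illustration, and the RPi.GPIO import sits inside the function because the library is only available on a Raspberry Pi.

```python
import time

SENSOR_PIN = 17  # hypothetical BCM pin number; use the pin the sensor is wired to


def describe_state(pin_is_high):
    """Map the digital output level to a label (the mapping is an assumption)."""
    return "Sensor output: On" if pin_is_high else "Sensor output: Off"


def monitor(pin=SENSOR_PIN, poll_seconds=1):
    """Watch the pin for changes and report each one. Runs until interrupted."""
    import RPi.GPIO as GPIO  # imported here as it only exists on a Raspberry Pi

    def on_change(channel):
        # A real script would call its email function here instead of printing.
        print(describe_state(bool(GPIO.input(channel))))

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(pin, GPIO.IN)
    GPIO.add_event_detect(pin, GPIO.BOTH, callback=on_change, bouncetime=300)
    try:
        while True:
            time.sleep(poll_seconds)  # sleeping keeps the loop from hogging the CPU
    finally:
        GPIO.cleanup()
```

Calling monitor() on the Raspberry Pi starts the loop. The bouncetime value asks the library to ignore rapid repeat triggers, which may also help with the volume of alerts.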

Testing The Sensor

To check that the sensor readings were coming through correctly, I needed a way to make the sensor wet or dry quickly. Necessity is the mother of invention, so I came up with this:

A sliced up milk bottle with a small hole in the lid and wires coming out of it. What a time to be alive.

My first attempts showed that the sensor was working (LED off and LED on) but the emails were failing:

Python 3.9.2 (/usr/bin/python3)
>>> %Run
LED off
Error: unable to send email
LED on
Error: unable to send email

Troubleshooting: Statements Must Be Separated By Newlines Or Semicolons

An example of one of the uses of the print function in the script is:

print "LED off"

Visual Studio Code flags this as a problem, stating that Statements must be separated by newlines or semicolons.

This is down to differences between Python 2 and Python 3. Those differences are beyond the scope of my post, but the problem itself is easy to fix. As print is considered a function in Python 3, it requires parentheses to work correctly:

print ("LED off")

Troubleshooting: Blocked SMTP Port

The original script sets the smtp_port to 25. This wouldn’t be a problem if my Raspberry Pi Zero was sending the emails. However, here I’m using Google’s Gmail SMTP server to send emails instead.

TCP port 25 is frequently blocked by Internet Service Providers, including Google, as an anti-spam technique. ISPs prefer port 587 as it is more advanced and supports secure communication via Transport Layer Security (TLS).

TLS enables secure and trustworthy communication. This security requires some additional information to work properly though…

Troubleshooting: Missing SMTP Methods

This is the section of the sample script that handles sending emails:

smtpObj = smtplib.SMTP(smtp_host, smtp_port)
smtpObj.login(smtp_username, smtp_password) 
smtpObj.sendmail(smtp_sender, smtp_receivers, smtp_message)

  • The first line sends a request to the SMTP host on TCP port 25.
  • The second line provides a user name and password for that host.
  • Lastly, strings are given for the email sender, the recipients and what the email says.

In its current form this request will be rejected, as Gmail blocks TCP port 25. Initially, I just changed the port from 25 to 587 and ran the script again. This still didn’t work so I continued my research.

Having consulted Stack Overflow and Python’s smtplib documentation I realised that the script needed some additional methods. The sendEmail function needed two extra lines:

smtpObj = smtplib.SMTP(smtp_host, smtp_port)
smtpObj.ehlo()
smtpObj.starttls()
smtpObj.login(smtp_username, smtp_password)
smtpObj.sendmail(smtp_sender, smtp_receivers, smtp_message)

With the move to TCP port 587 and TLS, these new methods are needed to correctly introduce the script (and by extension the Raspberry Pi) to the Gmail SMTP server.

SMTP.ehlo opens communication between the script and Gmail. The script identifies itself via the Raspberry Pi’s fully qualified domain name, giving Gmail a way to identify the source of the request.

SMTP.starttls then asks Gmail if it supports TLS encryption. Gmail replies that it does, and all SMTP commands that follow are encrypted.
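Put together as a standalone function, the sequence might look like the sketch below. It swaps the sample script's raw message string and sendmail call for the standard library's EmailMessage and send_message, and the function names and example addresses are hypothetical.

```python
import smtplib
from email.message import EmailMessage


def build_alert(sender, recipient, subject, body):
    """Assemble a plain-text email message."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    return message


def send_alert(message, host, port, username, password):
    """Send a message over SMTP, upgrading the connection to TLS first."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.ehlo()        # identify this client to the server
        smtp.starttls()    # switch the connection to TLS
        smtp.ehlo()        # re-identify over the encrypted connection
        smtp.login(username, password)
        smtp.send_message(message)


# Hypothetical addresses for illustration only.
alert = build_alert("pi@example.com", "me@example.com",
                    "Moisture alert", "The sensor output has changed.")
print(alert["Subject"])
```

Only build_alert runs here. send_alert needs a reachable SMTP server and valid credentials, so it is defined but not called.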

That’ll work now, right? Right?!

Troubleshooting: Insufficient Authentication

Even after these changes I was still getting problems. A Stack Abuse article suggested enabling the Less Secure App Access setting of my Gmail account. It turns out that Google and I have something in common – neither of us is keen on plain-text passwords flying around the Internet.

Google had to find some sort of middle ground and came up with this setting:

It is disabled by default and can’t be enabled on accounts with active MFA. Google are actively in the process of removing this setting, and will no longer support it from the end of May 2022. But for now this should be enough to get the emails flowing.

Retesting The Sensor

I ran the script again and got the feedback I was hoping for:

Python 3.9.2 (/usr/bin/python3)
>>> %Run
LED off
Successfully sent email
LED on
Successfully sent email

And a string of emails in my inbox:


Next Steps

Although this approach does work, it isn’t ideal for several reasons and will stop working completely when Google pull the plug on the Less Secure App Access setting. There are a number of changes I want to make.

Use AWS For Emails Instead Of Python

Sending emails via Python caused the bulk of the problems here. The approach this script uses is not secure and will soon be unsupported by the third party it relies on.

I could set up the Raspberry Pi in such a way that it could send emails itself, but ideally I want the Raspberry Pi to be doing as little work as possible with as few credentials as possible.

Enter AWS. I’ve already used SNS a number of times for emails, and the various AWS IoT services offer several options for communication with my device. This would let me decouple the email functionality from the sensor functionality.

In addition, AWS can handle the security side of things. Instead of my Raspberry Pi having root credentials for a Google account, it can have an AWS IoT certificate that will only allow specific actions.

Disable Google Less Secure App Access Setting

If AWS are handling the emails then I don’t need to use the smtplib library anymore. Which means I don’t need to use the Gmail SMTP. Which means I can disable the Less Secure App Access setting!

Google is sending me security warnings about this on almost a weekly basis at the moment. They want this OFF. So I want this off too.

Control Email Frequency

As the earlier screenshot showed, I got hammered with emails by the script. I would prefer to get emails based on a series of events or periods of time instead of a blow-by-blow account. CloudWatch is an obvious candidate here, and AWS IoT Events may also be a contender.

Monitor Device Health

Finally, there’s the question of device health. Any number of problems could occur with the current setup. The sensor could stop working (soil is acidic after all). There could be a power cut. Wolfie could knock everything over. In all of these situations, nothing happens. The readings just stop.

With AWS I can monitor the flow of data. I can look for periods with no data, and set alarms based on those periods. I might have to adjust how the Python script reads the outputs from the sensor, but at least I’ll know that the sensor is working and that my plant is lush and green instead of brown and crisp.


In this post I have set up my new Raspberry Pi Zero, wired up a moisture sensor and used Python to convert the sensor data into email alerts. I have modernised the Python script and removed syntax errors, and have identified areas where I can improve the security, reliability and operational excellence of the overall solution.

Thanks for reading ~~^~~

Developing & Application Integration

Re-Runnable Strava API Calls Using Python

In this post I make my existing Python code re-runnable by enabling it to replace expired access tokens when it sends requests to the Strava API.

A couple of posts ago I wrote about authenticating Strava API calls. I ended up successfully requesting data using this Python code:

import requests

activities_url = "" 

header = {'Authorization': 'Bearer ' + "access_token"}
param = {'per_page': 200, 'page': 1}

my_dataset = requests.get(activities_url, headers=header, params=param).json()


Although successful, the uses for this code are limited as it stops working when the header’s access_token expires. Ideally the code should be able to function constantly once Strava grants initial authorisation, which is what I’m exploring here. Plus the last post was unclear in places so this one will hopefully tie up some loose ends.

Please note that I have altered or removed all sensitive codes and tokens in this post in the interests of security.

The Story So Far

First of all, some reminders. Strava uses OAuth 2.0 for authentication, and this is a typical OAuth 2.0 workflow:


I am sending GET requests to Strava via Get Activity. Strava’s documentation for this is as follows:


Finally, during my initial setup I created an API Application on the Strava site. Strava provided these details upon completion:


Authorizing My App To View Data

Strava’s Getting Started page explains that they require authentication via OAuth 2.0 for data requests and gives the following link for that process:[REPLACE_WITH_YOUR_CLIENT_ID]&response_type=code&redirect_uri=http://localhost/exchange_token&approval_prompt=force&scope=read

I must amend this URL as scope=read is insufficient for Get Activity requests. The end of the URL becomes scope=activity:read_all and the updated URL loads a Strava authorization screen:


Selecting Authorize gives the following response:


Where code=CODE9fbb is a single-use authorization code that I will use to create access tokens.
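Rather than copying the code out of the browser's address bar by hand, it could be pulled from the redirect URL with the standard library. In this minimal sketch only code=CODE9fbb and the localhost redirect address come from this post. The other query parameters in the example URL and the extract_auth_code name are assumptions.

```python
from urllib.parse import urlparse, parse_qs

# Assumed shape of the redirect; only code=CODE9fbb comes from the post.
redirect_url = (
    "http://localhost/exchange_token"
    "?state=&code=CODE9fbb&scope=read,activity:read_all"
)


def extract_auth_code(url):
    """Return the value of the 'code' query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    codes = query.get("code")
    return codes[0] if codes else None


print(extract_auth_code(redirect_url))
```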

Getting Tokens For API Requests

Next I will use CODE9fbb to request access tokens which Get Activity will accept. This is done via the following cURL request:

curl -X POST \
  -d client_id=APIAPP-CLIENTID \
  -d client_secret=APIAPP-SECRET \
  -d code=CODE9fbb \
  -d grant_type=authorization_code

Here, Client_ID and Client_Secret are from my API application, Code is the authorization code CODE9fbb and Grant_Type is what I’m asking for – Strava’s Authentication documentation states this must always be authorization_code for initial authentication.

Strava then responds to my cURL request with a refresh token, access token, and access token expiration date in Unix Epoch format:

{
  "token_type": "Bearer",
  "expires_at": 1642370007,
  "expires_in": 21600,
  "refresh_token": "REFRESHc8c4",
  "access_token": "ACCESS22e5"
}

Why two tokens? Access tokens expire six hours after they are created and must be refreshed to maintain access to the desired data. Strava uses the refresh tokens as part of the access token refresh process.
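Because expires_at is a Unix timestamp, checking whether a token needs refreshing is a simple comparison against the current time. This is a minimal sketch. The token_expired name and the 60 second safety margin are assumptions rather than anything Strava specifies.

```python
import time


def token_expired(expires_at, now=None, margin_seconds=60):
    """Return True if the token has expired or will within margin_seconds."""
    if now is None:
        now = time.time()
    return now >= expires_at - margin_seconds


# expires_at value taken from the example response above,
# checked as if the current time were one hour before expiry.
print(token_expired(1642370007, now=1642370007 - 3600))
```

The now parameter exists so the check can be exercised with a fixed time. In normal use it is left out and the current time is used.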

Writing The API Request Code

With the tokens now available I can start assembling the Python code for my Strava API requests. I will again be using Visual Studio Code here. I make a new Python virtual environment called StravaAPI by running py -3 -m venv StravaAPI, activate it using StravaAPI\Scripts\activate and run pip install requests to install the module I need. Finally I create an empty file in the StravaAPI virtual environment folder for the Python code.

Onto the code. The first part imports the requests module, declares some variables and sets up a request to refresh an expired access token as detailed in the Strava Authentication documentation:

# Import modules
import requests

# Set Variables
apiapp_clientid = "APIAPP-CLIENTID"
apiapp_secret = 'APIAPP-SECRET'
token_refresh = 'REFRESHc8c4'

# Requesting Access Token
url_oauth = ""
payload_oauth = {
    'client_id': apiapp_clientid,
    'client_secret': apiapp_secret,
    'refresh_token': token_refresh,
    'grant_type': "refresh_token",
    'f': 'json'
}

Note that this time the grant_type is refresh_token instead of authorization_code. These variables can then be used by the requests module to send a request to Strava’s API:

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)

This request is successful and returns existing tokens ACCESS22e5 and REFRESHc8c4 as they have not yet expired:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESS22e5', 'expires_at': 1642370008, 'expires_in': 20208, 'refresh_token': 'REFRESHc8c4'}

A warning is also presented here as my request is not secure:

InsecureRequestWarning: Unverified HTTPS request is being made to host ''. Adding certificate verification is strongly advised.

The warning includes a link to urllib3 documentation, which states:

Making unverified HTTPS requests is strongly discouraged, however, if you understand the risks and wish to disable these warnings, you can use disable_warnings()

As this code is currently in development, I import the urllib3 module and disable the warnings:

# Import modules
import requests
import urllib3

# Disable Insecure Request Warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

Next I extract the access token from Strava’s response into a new token_access variable and print that in the terminal as a process indicator:

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)

token_access = req_access_token.json()['access_token']
print("Access Token = {}\n".format(token_access))

So far the terminal’s output is:

Requesting Token...

Access Token = ACCESS22e5
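
The token extraction above assumes Strava always returns an access_token key. If the refresh request fails, indexing the response would raise a bare KeyError. A more defensive version might look like the sketch below, where the function name and the shape of the error response are assumptions for illustration.

```python
def extract_access_token(token_response):
    """Return the access token, or raise a descriptive error if it is missing."""
    try:
        return token_response["access_token"]
    except KeyError:
        # Report which keys did come back to help with debugging
        keys = sorted(token_response)
        raise ValueError(f"No access_token in response, got keys: {keys}") from None


if __name__ == "__main__":
    good = {"token_type": "Bearer", "access_token": "ACCESS22e5"}
    print("Access Token = {}\n".format(extract_access_token(good)))
```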

Let’s get some data! I’m making a call to Get Activities now, so I declare three variables to compose the request and include the token_access variable from earlier:

# Requesting Athlete Activities
url_api_activities = ""
header_activities = {'Authorization': 'Bearer ' + token_access}
param_activities = {'per_page': 200, 'page' : 1}

Then I use the requests module to send the request to Strava’s API:

print("Requesting Athlete Activities...\n")
dataset_activities = requests.get(url_api_activities, headers=header_activities, params=param_activities).json()

And receive data about several recent activities as JSON in return. Success! The full Python code is as follows:

# Import modules
import requests
import urllib3

# Disable Insecure Request Warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Set Variables
apiapp_clientid = "APIAPP-CLIENTID"
apiapp_secret = 'APIAPP-SECRET'
token_refresh = 'REFRESHc8c4'

# Requesting Access Token
url_oauth = ""
payload_oauth = {
    'client_id': apiapp_clientid,
    'client_secret': apiapp_secret,
    'refresh_token': token_refresh,
    'grant_type': "refresh_token",
    'f': 'json'
}

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)

token_access = req_access_token.json()['access_token']
print("Access Token = {}\n".format(token_access))

# Requesting Athlete Activities
url_api_activities = ""
header_activities = {'Authorization': 'Bearer ' + token_access}
param_activities = {'per_page': 200, 'page' : 1}
print("Requesting Athlete Activities...\n")
dataset_activities = requests.get(url_api_activities, headers=header_activities, params=param_activities).json()
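
The request above only asks for page 1 with 200 activities per page. To collect more than one page, the page parameter could be incremented until a page comes back empty. The following is a sketch under that assumption about how the last page is signalled; fetch_all_activities, fetch_page and max_pages are hypothetical names, and the page-fetching function is passed in so the loop can be tried without calling Strava.

```python
def fetch_all_activities(fetch_page, per_page=200, max_pages=50):
    """Collect activities page by page until an empty page is returned.

    fetch_page is any callable that takes a params dict and returns a
    list of activities. max_pages is a safety cap on the number of requests.
    """
    activities = []
    for page in range(1, max_pages + 1):
        batch = fetch_page({"per_page": per_page, "page": page})
        if not batch:
            break
        activities.extend(batch)
    return activities


if __name__ == "__main__":
    # Stand-in for the real API: two pages of fake activities, then nothing
    fake_pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    result = fetch_all_activities(lambda params: fake_pages.get(params["page"], []), per_page=2)
    print(len(result))
```

With the variables from the code above, fetch_page could be a small function that calls requests.get(url_api_activities, headers=header_activities, params=params).json().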

But Does It Work?

This only leaves the question of whether the code works when the access token expires. As a reminder, this was Strava’s original response:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESS22e5', 'expires_at': 1642370008, 'expires_in': 20208, 'refresh_token': 'REFRESHc8c4'}

Expiry 1642370008 is Sunday, 16 January 2022 21:53:28. I run the code at 22:05 and:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESSe0e7', 'expires_at': 1642392321, 'expires_in': 21559, 'refresh_token': 'REFRESHc8c4'}

A new access token! The new expiry 1642392321 is Monday, 17 January 2022 04:05:21. And when I run the code at 09:39:

{'token_type': 'Bearer', 'access_token': 'ACCESS74dd', 'expires_at': 1642433966, 'expires_in': 21600, 'refresh_token': 'REFRESHc8c4'}

A second new access token. All working fine! As long as my refresh token remains valid I can continue to get valid access tokens when they expire.
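
The expiry times quoted in this section can be reproduced with Python's datetime module. This is a small sketch that assumes the times are in UTC; epoch_to_utc is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_utc(expires_at):
    """Convert a Unix epoch timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(expires_at, tz=timezone.utc)


if __name__ == "__main__":
    # The three expires_at values from Strava's responses in this section
    for expiry in (1642370008, 1642392321, 1642433966):
        print(expiry, epoch_to_utc(expiry).isoformat())
```

The third value works out at 15:39:26 on 17 January 2022, six hours after the 09:39 run, which matches expires_in being 21600 seconds.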

Thanks for reading ~~^~~