To mark the occasion, I decided to examine the MySQL database that WordPress uses to run amazonwebshark, and see what it could tell me about the past twelve months.
In addition, I’ve been trying out DataCamp and have recently finished their Intermediate Python course. It introduced me to Matplotlib, and this post is a great chance to try out those new skills!
Timeline
I’ll start by answering a question. Why is amazonwebshark’s birthday on January 09 when my first post’s publication date is December 02 2021?
In response, here’s a brief timeline of how amazonwebshark came to be:
17 October 2021: Work starts on amazonwebshark as a WordPress.com free site.
November 2021: WordPress.com’s limitations make WordPress.org more attractive. I decide to write a blog post and see how things go.
While phpMyAdmin is great for basic maintenance, it’s not very convenient for data analysis:
There are no data tools like schema visualization or query plans.
It lacks scripting tools like IntelliSense or auto-complete.
Accessing it usually involves accessing the web host’s portal first.
Ideally, I’d prefer to access my database remotely using a SQL client instead. While this needs some additional config, SiteGround makes this very simple!
Remote Database Access
By default, SiteGround denies any remote access requests. Many people will never use this feature, so disabling it is a good security measure. Remote access is enabled in the SiteGround portal, which grants immediate remote access for specified IPs and hostnames.
Creating a dedicated database user for remote access isn’t strictly necessary, but the default user has unlimited access, which violates the principle of least privilege in this situation.
The following SiteGround video demonstrates enabling remote access:
Analysis Tools
In this section, I examine the various tools I’ll use for amazonwebshark’s 1st birthday data analysis.
DBeaver
DBeaver is a free database tool and SQL client. It is multi-platform, open-source and supports a variety of databases including Microsoft SQL Server, Amazon Athena and MySQL.
After mysql-connector-python is installed and imported, a connection to a MySQL database can be made using the mysql.connector.connect() function with the following arguments:
host: Hostname or IP address of the MySQL server.
database: MySQL database name.
user: User name used to authenticate with the MySQL server.
password: Password to authenticate the user with the MySQL server.
This opens a connection to the MySQL server and creates a connection object. I store this in the variable conn_mysql:
I have my credentials stored in a separate Python script that is imported as var. This means I can protect them via .gitignore until I get something better in place!
After that, I need a cursor for running my SQL queries. I create this using the cursor() method of my conn_mysql connection object and store the whole thing as cursor:
I’m now in a position to start running SQL queries from Python. I import a SQL query from my var script (in this case my Category query) and store it as query:
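Since mysql-connector-python follows Python's standard DB-API, the connect/cursor/execute flow above can be sketched with the stdlib sqlite3 module, which has the same shape. For the real database, swap `sqlite3.connect(":memory:")` for `mysql.connector.connect(host=..., database=..., user=..., password=...)`; the table and values here are placeholders, not the post's actual data:

```python
import sqlite3

# The real script uses mysql.connector.connect(host=..., database=...,
# user=..., password=...); sqlite3 shares the same DB-API shape, so an
# in-memory database stands in for the MySQL server here.
conn_mysql = sqlite3.connect(":memory:")

# Placeholder stand-in for the WordPress terms table queried later on.
conn_mysql.execute("CREATE TABLE terms (term_id INTEGER, name TEXT)")
conn_mysql.execute("INSERT INTO terms VALUES (1, 'Data & Analytics')")

# A cursor runs the SQL queries, as in the post.
cursor = conn_mysql.cursor()

# In the post, this query string is imported from the var script.
query = "SELECT term_id, name FROM terms"
cursor.execute(query)
rows = cursor.fetchall()
print(rows)  # [(1, 'Data & Analytics')]

conn_mysql.close()
```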
In this section, I begin amazonwebshark’s 1st birthday data analysis by writing a SQL query for amazonwebshark’s categories and analysing the results with Python.
Categories Analysis: SQL Query
For my Category SQL query, I’ll be using the terms and term_taxonomy tables:
WordPress has a taxonomy system for content organization. Individual taxonomy items are called terms, and they are stored in the terms table. Terms for amazonwebshark include Data & Analytics and Security & Monitoring.
The term_taxonomy table links a term_id to a taxonomy, giving context for each term. Common taxonomies are category, post_tag and nav_menu.
If I join these tables on term_id then I can map terms to taxonomies. In the following query results, the first two columns are from terms and the rest are from term_taxonomy:
My final query keeps the join, cleans up terms.name, returns all categories with at least one use and orders the results by term_taxonomy.count and terms.name:
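A sketch of what that query could look like, assuming WordPress's default `wp_` table prefix (the post's actual query isn't reproduced here):

```sql
SELECT
    TRIM(t.name) AS name,          -- cleaned-up terms.name
    tt.count
FROM wp_terms AS t
JOIN wp_term_taxonomy AS tt
    ON tt.term_id = t.term_id      -- map terms to taxonomies
WHERE tt.taxonomy = 'category'
    AND tt.count >= 1              -- at least one use
ORDER BY tt.count DESC, t.name;
```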
So how do I turn this into a graph? Firstly, I need to split results up into what will be my X-axis and Y-axis. For this, I create two empty lists called name and count:
I use barh to change the graph to a horizontal bar chart, and then swap the xlabel and ylabel strings around. This time the chart is far easier to read:
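Put together, the list-building and barh steps might look like this. The category names and counts below are placeholders, not the real query results:

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Placeholder rows standing in for the category query's output.
results = [("DevOps & Infrastructure", 12), ("Data & Analytics", 9),
           ("Security & Monitoring", 5)]

# Split the results into what will be the X-axis and Y-axis.
name = []
count = []
for row in results:
    name.append(row[0])
    count.append(row[1])

# barh makes the chart horizontal, so the axis labels swap too.
plt.barh(name, count)
plt.xlabel("Post Count")
plt.ylabel("Category")
plt.tight_layout()
plt.savefig("categories.png")
```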
Tags Analysis
In this section, I continue amazonwebshark’s 1st birthday data analysis by writing a SQL query for amazonwebshark’s tags and analysing the results with Python.
Tags Analysis: SQL Query
My Tags SQL query is almost the same as my Categories one. This time, my WHERE clause is filtering on post_tag:
I also planned to analyse my post views here. But, as I mentioned in my last post, some of this data is missing! So any chart will be wrong.
I haven’t had time to look at this yet, so stay tuned!
Summary
In this post, I celebrated amazonwebshark’s 1st birthday with an analysis of my site’s MySQL data using DBeaver, Python and Matplotlib.
I had fun researching and writing this! It can be tricky to find a data source that isn’t contrived or overused. Having access to the amazonwebshark database gives me data that I’m personally invested in, and an opportunity to practise writing MySQL queries.
I’ve also been able to improve my Python, and meaningfully experiment with Matplotlib to get charts that will be useful going forward. For example, I used the very first Tags chart to prune some unneeded tags from my WordPress taxonomy.
If this post has been useful, please feel free to follow me on the following platforms for future updates:
Please be as specific as possible with your examples and include your reasoning.
Good question!
In each section, I’ll use a different language. Firstly I’ll create a script, and then show a problem the script could encounter in production. Finally, I’ll show how a different approach can prevent that problem from occurring.
I’m limiting myself to three production code qualities to keep the post at a reasonable length, and so I can show some good examples.
Precision
In this section, I use T-SQL to show how precise code in production can save a data pipeline from unintended failure.
Now, let’s say that I have a data pipeline that uses data in amazonwebshark.2022.sharkspecies for transformations further down the pipeline.
No problem – I create a #tempsharks temp table and insert everything from amazonwebshark.2022.sharkspecies using SELECT *:
#tempsharks has five columns but amazonwebshark.2022.sharkspecies now has six. My script is now trying to insert all six sharkspecies columns into the temp table, causing the msg 213 error.
Doing Things Differently
The solution here is to replace row 21’s SELECT * with the precise columns to insert from amazonwebshark.2022.sharkspecies:
While amazonwebshark.2022.sharkspecies now has six columns, my script is only inserting five of them into the temp table:
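The fixed insert might look something like this; the five column names are hypothetical, since the real table definition isn't shown in the post:

```sql
-- Insert only the columns #tempsharks defines; the new
-- last_evaluated column in sharkspecies is deliberately skipped.
INSERT INTO #tempsharks (shark_id, common_name, latin_name, family, status)
SELECT shark_id, common_name, latin_name, family, status
FROM amazonwebshark.2022.sharkspecies;
```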
I can add the last_evaluated column into #tempsharks in future, but its absence in the temp table isn’t causing any immediate problems.
Works The Same In Other Environments
In this section, I use Python to show the value of production code that works the same in non-production.
Setting The Scene
Here I have a Python script that reads data from an Amazon S3 bucket using a boto3 session. I pass my AWS_ACCESSKEY and AWS_SECRET credentials in from a secrets manager, and create an s3bucket variable for the S3 bucket path:
When I deploy this script to my dev environment it works fine.
What’s The Problem?
When I deploy this script to production, s3bucket will still be s3://dev-bucket. The potential impact of this depends on the AWS environment setup:
Different AWS account for each environment:
dev-bucket doesn’t exist in Production. The script fails.
Same AWS account for all environments:
Production IAM roles might not have any permissions for dev-bucket. The script fails.
Production processes might start using a dev resource. The script succeeds but now data has unintentionally crossed environment boundaries.
Doing Things Differently
A solution here is to dynamically set the s3bucket variable based on the ID of the AWS account the script is running in.
I can get the AccountID using AWS STS. I’m already using boto3, so can use it to initiate an STS client with my AWS credentials.
STS then has a GetCallerIdentity action that returns the AWS AccountID linked to the AWS credentials. I capture this AccountID in an account_id variable, then use that to set s3bucket's value:
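A sketch of that approach, with the account IDs and bucket names as placeholders (the real values aren't in the post). The mapping is kept in a pure function so the fail-fast behaviour is easy to test without AWS credentials:

```python
def get_account_id():
    """Return the AWS AccountID for the current credentials via STS."""
    import boto3  # pip install boto3; imported lazily so the logic below runs offline

    sts = boto3.client("sts")
    return sts.get_caller_identity()["Account"]


# Placeholder account IDs; substitute the real dev/prod values.
BUCKETS_BY_ACCOUNT = {
    "111111111111": "s3://dev-bucket",
    "222222222222": "s3://prod-bucket",
}


def get_s3bucket(account_id):
    """Map an AccountID to its bucket, failing fast on unknown accounts."""
    try:
        return BUCKETS_BY_ACCOUNT[account_id]
    except KeyError:
        # Terminate rather than touch resources in an unexpected account.
        raise SystemExit(f"Unknown AWS account {account_id}; stopping.")
```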
For bonus points, I can terminate the script if the AWS AccountID isn’t defined. This prevents undesirable states if the script is run in an unexpected account.
Speaking of which…
Prevents Undesirable States
In this section, I use PowerShell to demonstrate how to stop production code from doing unintended things.
Setting The Scene
In June I started writing a PowerShell script to upload lossless music files from my laptop to one of my S3 buckets.
I worked on it in stages. This made it easier to script and test the features I wanted. By the end of Version 1, I had a script that dot-sourced its variables and wrote everything in my local folder $ExternalLocalSource to my S3 bucket $ExternalS3BucketName:
NOTE: There were several problems with Version 1, all of which were fixed in Version 2. In the interests of simplicity, I’ll focus on a single one here.
In this script, Write-S3Object will upload everything in the local folder $ExternalLocalSource to the S3 bucket $ExternalS3BucketName.
Problem is, the $ExternalS3BucketName S3 bucket isn’t for everything! It should only contain lossless music files!
At best, Write-S3Object will upload everything in the local folder to S3 whether it’s music or not.
At worst, if the script is pointing at a different folder it will start uploading everything there instead! PowerShell commonly defaults to C:\Windows, so this could cause all kinds of problems.
Doing Things Differently
I decided to limit the extensions that the PowerShell script could upload.
So now, if I attempt to upload an unacceptable .log file, PowerShell raises an exception and terminates the script:
**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log
Checking extensions are valid for each local file.
Unacceptable .log file found. Exiting.
**********************
While an acceptable .flac file will produce this message:
**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log
Checking extensions are valid for each local file.
Acceptable .flac file.
**********************
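The allow-list check behind those transcripts can be sketched in Python rather than PowerShell for illustration; the single-extension allow list is an assumption based on the post:

```python
import os

# Extensions the upload is allowed to touch; anything else stops the run.
ACCEPTABLE_EXTENSIONS = {".flac"}


def check_extension(filename):
    """Return a status message, raising on unacceptable extensions."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ACCEPTABLE_EXTENSIONS:
        raise ValueError(f"Unacceptable {extension} file found. Exiting.")
    return f"Acceptable {extension} file."


print(check_extension("track01.flac"))  # Acceptable .flac file.
```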
In this post, I responded to November 2022’s T-SQL Tuesday #156 Invitation and gave my thoughts on some production code qualities. I gave examples of each quality and showed how they could save time and prevent unintended problems in a production environment.
It’s time to do something with that data! I want to analyse my iTunes data and look for trends and insights into my listening habits. I also want to access these insights in the cloud, as my laptop is a bit bulky and quite slow. Finally, I’d prefer to keep my costs to a minimum.
Here, I’ll show how AWS and Python can be used together to meet these requirements. Let’s start with AWS.
Amazon S3
In this section, I will update my S3 setup. I’ll create some new buckets and explain my approach.
New S3 Buckets
Currently, I have a single S3 bucket containing my iTunes Export CSV. Moving forward, this bucket will contain all of my unmodified source objects, otherwise known as raw data.
To partner the raw objects bucket, I now have an ingested objects bucket. This bucket will contain objects where the data has been transformed in some way. My analytics tools and Athena tables will point here for their data.
Speaking of Athena, the other new bucket will be used for Athena’s query results. Although Athena is serverless, it still needs a place to record queries and store results. Creating this bucket now will save time later on.
Having separate buckets for each of these functions isn’t a requirement, although it is something I prefer to do. Before moving on, I’d like to run through some of the benefits I find with this approach.
Advantages Of Multiple Buckets
Firstly, having buckets with clearly defined purposes makes navigation way easier. I always know where to find objects, and rarely lose track of or misplace them.
Secondly, having multiple buckets usually makes my S3 paths shorter. This doesn’t sound like much of a benefit upfront, but the S3 path textboxes in the AWS console are quite small, and using long S3 paths in the command line can be a pain.
Finally, I find security and access controls are far simpler to implement with a multi-bucket setup. Personally I prefer “You can’t come into this house/bucket” over “You can come into this house/bucket, but you can’t go into this room/prefix”. However, both S3 buckets and S3 prefixes can be used as IAM policy resources so there’s technically no difference.
That concludes the S3 section. Next, let’s set up Athena.
Amazon Athena
In this section, I’ll get Athena ready for use. I’ll show the process I followed and explain my key decisions. Let’s start with my reasons for choosing Athena.
Why Athena?
Plenty has been written about Athena’s benefits over the years. So instead of retreading old ground, I’ll discuss what makes Athena a good choice for this particular use case.
Firstly, Athena is cheap. The serverless nature of Athena means I only pay for the data my queries scan (storage is billed through S3), and I’ve yet to see a charge for Athena in the three years I’ve been an AWS customer.
Secondly, like S3, Athena’s security is managed by IAM. I can use IAM policies to control who and what can access my Athena data, and can monitor that access in CloudTrail. This also means I can manage access to Athena independently of S3.
Finally, Athena is highly available. Authorised calls to the service have a 99.9% Monthly Uptime Percentage SLA and Athena benefits from S3’s availability and durability. This allows 24/7 access to Athena data for users and applications.
Setting Up Athena
To start this section, I recommend reading the AWS Athena Getting Started documentation for a great Athena introduction. I’ll cover some basics here, but I can’t improve on the AWS documentation.
Athena needs three things to get off the ground:
An S3 path for Athena query results.
A database for Athena tables.
A table for interacting with S3 data objects.
I’ve already talked about the S3 path, so let’s move on to the database. A database in Athena is a logical grouping for the tables created in it. Here, I create a blog_amazonwebshark database using the following script:
Next, I enter the column names from my iTunes Export CSV into Athena’s Create Table form, along with appropriate data types for each column. In response, the form creates this Athena table:
Please note that I have removed the S3 path from the LOCATION property to protect my data. The actual Athena table is pointing at an S3 prefix in my ingested objects bucket that will receive my transformed iTunes data.
Speaking of data, the form offers several choices of source data format including CSV, JSON and Parquet. I chose Parquet, but why do this when I’m already getting a CSV? Why create extra work?
Let me explain.
About Parquet
Apache Parquet is a file format that supports fast processing for complex data. It can essentially be seen as the next generation of CSV. Both formats have their place, but at scale CSV files have large file sizes and slow performance.
In contrast, Parquet files have built-in compression and indexing for rapid data location and retrieval. In addition, the data in Parquet files is organized by column, resulting in smaller sizes and faster queries.
This also results in Athena cost savings as Athena only needs to read the columns relevant to the queries being run. If the same data was in a CSV, Athena would have to read the entire CSV whether the data is needed or not.
That’s everything for Athena. Now I need to update my Python scripts.
Python
In this section, I’ll make changes to my Basic iTunes ETL to include my new S3 and Athena resources and to replace the CSV output with a Parquet file. Let’s start with some variables.
New Python Variables
My first update is a change to ETL_ITU_Play_Variables.py, which contains my global variables. Originally there were two S3 global variables – S3_BUCKET containing the bucket name and S3_PREFIX containing the S3 prefix path leading to the raw data:
Now I have two buckets and two prefixes, so it makes sense to update the variable names. I now have two additional global variables, adding _RAW to the originals and _INGESTED to the new ones for clarity:
The next change is to ETL_ITU_Play.py. The initial version converts a Pandas DataFrame to CSV using pandas.DataFrame.to_csv. I’m now replacing this with awswrangler.s3.to_parquet, which needs three parameters:
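The replacement call might look like this; the bucket and prefix names are placeholders, and the function isn't executed here since it needs live AWS credentials. Which three parameters the post means isn't shown, so df, path and dataset stand in as common ones:

```python
def write_ingested_parquet(df, bucket, prefix):
    """Write the transformed DataFrame to the ingested bucket as Parquet."""
    import awswrangler as wr  # pip install awswrangler; imported lazily

    # df is the Pandas DataFrame, path the target S3 object, and
    # dataset=False writes a single Parquet file rather than a
    # partitioned dataset.
    wr.s3.to_parquet(
        df=df,
        path=f"s3://{bucket}/{prefix}/itunesdata.parquet",
        dataset=False,
    )
```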
Before committing my changes, I took the time to put the main workings of my ETL in a class. This provides a clean structure for my Python script and will make it easier to reuse in future projects.
That completes the changes. Let’s review what has been created.
Architecture
Here is an architectural diagram of how everything fits together:
Here is a breakdown of the processes involved:
User runs the Python ETL script locally.
Python reads the CSV object in datalake-raw S3 bucket.
Python extracts data from CSV into a DataFrame and transforms several columns.
Python writes the DataFrame to datalake-ingested S3 bucket as a Parquet file.
Python notifies User of a successful run.
User sends query to Athena.
Athena reads data from datalake-ingested S3 bucket.
Athena returns query results to User.
Testing
In this section, I will test my resources to make sure they work as expected. Bear in mind that this setup hasn’t been designed with production use in mind, so my testing is deliberately limited and would be insufficient for a production deployment.
Testing Python
TEST: Upload a CSV to the datalake-raw S3 bucket, then run the Python script. The Python script must run successfully and print updates in the terminal throughout.
RESULT: I upload an iTunes Export CSV to the datalake-raw S3 bucket:
The Python script runs, printing the following output in the terminal:
Replacing blank values to prevent IntCastingNaN errors.
Setting Data Types.
Creating Parquet file from DataFrame.
Processes complete.
Testing S3
TEST: After the Python script successfully runs, the datalake-ingested S3 bucket must contain an itunesdata.parquet object.
RESULT: Upon accessing the datalake-ingested S3 bucket, an itunesdata.parquet object is found:
(On an unrelated note, look at the size difference between the Parquet and CSV files!)
Testing Athena
TEST: When the datalake-ingested S3 bucket contains an itunesdata.parquet object, data from the iTunes Export CSV must be shown when the following Athena query is run:
SELECT * FROM basic_itunes_python_etl;
RESULT: Most of the Athena results match the iTunes Export data. However, the transformed dates did not match expectations:
This appears to be a formatting problem, as some parts of a date format are still visible.
To diagnose the problem I wanted to see how these columns were being stored in the Parquet file. I used mukunku’s ParquetViewer for this, which is described in the GitHub repo as:
…a quick and dirty utility that I created to easily view Apache Parquet files on Windows desktop machines.
It works very well!
Here is a screenshot of the data. The lastplayed column has dates and times, while the datamodifieddate column has dates only:
The cause of the problem becomes apparent when the date columns are viewed using the ISO 8601 format:
The date columns are all using timestamps, even when no times are included!
A potential fix would be to change the section of my Python ETL script that handles data types. Instead, I update the data types used in my Athena table from date to timestamp:
This time, when I view my Athena table the values all appear as expected:
Scripts
My ETL_ITU_Play.py file commit from 2022-08-08 can be viewed here:
ETL_ITU_Play.py on GitHub
My updated repo readme can be viewed here:
README.md on GitHub
Summary
In this post, I updated my existing iTunes Python ETL to return a Parquet file, which I then uploaded to S3 and viewed using Athena. I explained my reasoning for choosing S3, Athena and the Parquet file format, and I handled a data formatting issue.
If this post has been useful, please feel free to follow me on the following platforms for future updates: