Tag: GitHub

Posts containing links to my GitHub account or discussing GitHub as a service.

GitHub is a provider of Internet hosting for software development and version control using Git.

More information can be found on the GitHub website.

Migrating amazonwebshark To SiteGround

Post author By Damien Jones
Post date December 29, 2022

In this post, I examine the process of migrating amazonwebshark to SiteGround and give an overview of the processes involved.

Introduction
Hosting Alternatives
- Amazon Lightsail
- SiteGround
Migrate What Exactly?
Data Migration
DNS Migration
Any Problems?
Future Plans
Summary

Introduction

When I started amazonwebshark I had to make some infrastructure decisions. I registered the domain name with Amazon Route 53, and then needed to choose a blog hosting platform.

In December 2021 I took advantage of a Bluehost offer and paid £31.90 for their Basic WordPress Hosting package. This included a variety of services including:

Provisioned server space for site resources.
Managed and automated WordPress & MySQL installations.
Web services like free SSL certificates and CDN caching.

My Bluehost renewal came through earlier this month, priced at £107.76. I’ve had great service from Bluehost and have no complaints, but that price was quite a leap. So, before I accepted it, I decided to do some research and see what my alternatives were.

Hosting Alternatives

In this section, I go through the results of my research into alternative hosting platforms.

Amazon Lightsail

I started by looking at Amazon Lightsail. Essentially, Lightsail is a simplified way of deploying AWS services like EC2, EBS and Elastic Load Balancing.

Lightsail pricing differs from most AWS services. Instead of the common Pay-As-You-Go pricing model, Lightsail has set monthly pricing. For example, a Linux server with similar memory, processing and storage to my Bluehost server currently costs $3.50 a month.

There is an important difference though. While Bluehost has teams of people responsible for tasks like server maintenance, database recovery, hard disk failures and security patches, with Lightsail the infrastructure would become my responsibility. I would save money over Bluehost, but at the cost of doing my own systems admin.

And the above list isn’t even exhaustive! It doesn’t include the setup and maintenance of services like CDN, SSL certificates and email accounts, all of which come with their own extra requirements and costs.

At this point, Bluehost was still on top.

SiteGround

SiteGround is in the same business as Bluehost. It offers a variety of hosting solutions for an array of use cases and has good standing in the industry.

SiteGround also had a great Black Friday offer this year! It was offering pretty much the same deal as Bluehost at £1.99 a month:

This is INSANELY cheap, especially considering how much all this infrastructure costs to run!

SiteGround has also developed a free WordPress plugin to automate migrations from other hosting platforms. While this isn’t unique to them, a combination of good reviews, extensive services, low hassle and a great price was more than enough to get me on board.

Migrate What Exactly?

Before continuing, I thought it best to go into a bit of detail about what exactly is being migrated. I’ve mentioned servers, databases and domains, but what gets moved where? And why?

Well, because I’m moving things around on the Internet, I need to talk about the Domain Name System (DNS).

Wait! Come back!

Explain DNS Like I’m 5

What follows is a very simple introduction to DNS. There’s far more to DNS than this, but that’s beyond the scope of this post.

Let’s say I want to phone The Shark Trust. I can’t type “The Shark Trust” into my handset – I need their phone number. So I open my phone book, turn to the S section and find The Shark Trust. Next to this entry is a phone number: 01752 672008. I type that number into the handset and get through to their office.

DNS is like the Internet’s phone book. Websites are held on servers, and the ‘phone numbers’ for those servers are IP addresses. When I request a website like amazonwebshark.com, my web browser needs to know the IP address for the server holding the site’s data, for example 34.91.95.18.

Explain DNS With Pictures

This WebDeasy diagram show DNS at a high level:

WebDeasy: How the Domain Name System (DNS) works – Basics

When a URL is entered into a web browser, a query is sent to a DNS server. Using the phone book analogy, the web browser is asking the DNS server for the amazonwebshark.com phone number.

DNS servers don’t have any IP addresses, but they know which ‘phone book’ to look in. These ‘phone books’ are called name servers. The DNS server finds and contacts the right name server, which matches the amazonwebshark.com domain name to an IP address.

The DNS server then returns this IP address to the web browser, which uses it to contact the server hosting the amazonwebshark.com resources.

In the diagram, the DNS-Server represents Route 53. Route 53 holds DNS records for the amazonwebshark.com domain name, and knows where to find the name servers that have the amazonwebshark.com IP address.

The webdeasy.de server represents the Bluehost name servers. These servers can answer a variety of DNS queries, and are considered the ground truth for initial site visits and browser caching.

amazonwebshark’s DNS Setup

At the start of December 2022 the amazonwebshark.com domain name was hosted by Route 53, with an NS record pointing at the Bluehost name servers:

The basic infrastructure looked like this, with outbound requests in blue and inbound responses in green:

And that’s it! To further explore DNS core concepts, this DNSimple comic is well worth a read and this Fireship video gives a solid, if a little more technical, account:

Data Migration

In this section, I start migrating my amazonwebshark data from Bluehost to SiteGround.

SiteGround has an automated migrator plugin that copies existing WordPress sites from other hosting platforms. And it’s very good! The process boils down to:

The process can also be seen in this Avada video:

The plugin copies all the amazonwebshark server files, scripts and database objects in a process that takes about five minutes. SiteGround then provides a temporary URL for testing and performance checks:

After completely migrating amazonwebshark to SiteGround, the next step involves telling the amazonwebshark domain where to find the new server. Time for some DNS!

DNS Migration

In this section, I update the amazonwebshark DNS records with the SiteGround name servers.

I repointed the existing amazonwebshark NS record from Bluehost to SiteGround by updating the values in Route 53 from this:

To this:

My change then needed to propagate through the Internet. Internet Service Providers update their records at different rates, so changes can take up to 72 hours to complete worldwide.

Free DNS checking tools like WhatIsMyDNS can perform global checks on a domain name’s IP address and DNS record information. The check below was done after around 30 hours, by which time most of the servers were returning SiteGround IPs:

Any Problems?

First, the good news. There was no downtime while migrating amazonwebshark to SiteGround! During the migration, DNS queries were resolved by either Bluehost’s or SiteGround’s name servers. Both platforms had amazonwebshark data, so both could answer DNS queries.

Additionally, as I set a change freeze on amazonwebshark until the migration was over, there was no lost or orphaned content.

I did lose some WPStatistics hit statistics data though. There is no data for December 03 and December 04:

This was my fault. The DNS propagation took longer than it should have because of a misunderstanding on my part!

So why was data lost? WPStatistics stores data in tables in the site’s MySQL database. When I first migrated my data on December 02, the Bluehost and SiteGround tables were the same. After that point, Bluehost continued to serve amazonwebshark until December 05, and wrote its statistics in the Bluehost MySQL tables.

It was only after I corrected my DNS mistake that SiteGround could start serving content and writing statistics on the SiteGround MySQL tables. So SiteGround didn’t record anything for December 03 and December 04, and as no additional data migration was done the statistics that Bluehost recorded never made it to the SiteGround tables.

I can recover this if I want to though. I took a full backup of my Bluehost data before ending the contract. That included a full backup of the Bluehost MySQL database with the WPStatistics tables. I’ll take a look at the tables at some point, see how the data is arranged and decide from there.

Future Plans

I’m considering moving amazonwebshark to a serverless architecture in 2023. While the migration was a success, servers still have inherent problems:

Servers can break or go offline.
They can be hacked.
They can be over or under-provisioned.

Serverless infrastructure could remove those pain points. I don’t use any WordPress enterprise features, and amazonwebshark could exist very well as an event-driven static website. Tools like Hugo and Jekyll are designed for the job and documented well, and people like Kendra Little and Chrissy LeMaire have successfully transitioned their blogs to serverless infrastructures.

The biggest challenge here isn’t architectural. If I moved to a serverless architecture, I would want something similar to the Yoast SEO analysis plugin. This plugin has really helped me improve my posts, and by extension has made them more enjoyable to write.

I’ve seen lots of serverless tooling for migrating resources and serving content, but not so much for SEO guidance and proofreading. Any amazonwebshark serverless migration would be contingent on finding something decent along these lines. After all, if the blog becomes a pain to write for then what’s the point?

Summary

In this post, I examined the process of migrating amazonwebshark to SiteGround and gave an overview of the processes involved.

I’m very happy with how things went overall! The heavy lifting was done for me, both companies were open and professional throughout and what could have been a daunting process was made very simple!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Tags Amazon Route53, Amazon Web Services, DNS, GitHub, Troubleshooting, WordPress

Developing & Application Integration

Production Code Qualities

Post author By Damien Jones
Post date November 8, 2022

In this post, I respond to November 2022’s T-SQL Tuesday #156 Invitation and give my thoughts on some production code qualities.

Introduction
Precision
Works The Same In Other Environments
Prevents Undesirable States
Summary

Introduction

This month, Tomáš Zíka’s T-SQL Tuesday invitation was as follows:

Which quality makes code production grade?

Please be as specific as possible with your examples and include your reasoning.

Good question!

In each section, I’ll use a different language. Firstly I’ll create a script, and then show a problem the script could encounter in production. Finally, I’ll show how a different approach can prevent that problem from occurring.

I’m limiting myself to three production code qualities to keep the post at a reasonable length, and so I can show some good examples.

Precision

In this section, I use T-SQL to show how precise code in production can save a data pipeline from unintended failure.

Setting The Scene

Consider the following SQL table:

USE [amazonwebshark]
GO

CREATE TABLE [2022].[sharkspecies](
	[shark_id] [int] IDENTITY(1,1) NOT NULL,
	[name_english] [varchar](100) NOT NULL,
	[name_scientific] [varchar](100) NOT NULL,
	[length_max_cm] [int] NULL,
	[url_source] [varchar](1000) NULL
)
GO

This table contains a list of sharks, courtesy of the Shark Foundation.

Now, let’s say that I have a data pipeline that uses data in amazonwebshark.2022.sharkspecies for transformations further down the pipeline.

No problem – I create a #tempsharks temp table and insert everything from amazonwebshark.2022.sharkspecies using SELECT *:

When this script runs in production, I get two tables with the same data:

What’s The Problem?

One day a new last_evaluated column is needed in the amazonwebshark.2022.sharkspecies table. I add the new column and backfill it with 2019:

ALTER TABLE [2022].sharkspecies
ADD last_evaluated INT DEFAULT 2019 WITH VALUES
GO

However, my script now fails when trying to insert data into #tempsharks:

(1 row affected)

(4 rows affected)

Msg 213, Level 16, State 1, Line 17
Column name or number of supplied values does not match table definition.

Completion time: 2022-11-02T18:00:43.5997476+00:00

#tempsharks has five columns but amazonwebshark.2022.sharkspecies now has six. My script is now trying to insert all six sharkspecies columns into the temp table, causing the msg 213 error.

Doing Things Differently

The solution here is to replace row 21’s SELECT * with the precise columns to insert from amazonwebshark.2022.sharkspecies:

While amazonwebshark.2022.sharkspecies now has six columns, my script is only inserting five of them into the temp table:

I can add the last_evaluated column into #tempsharks in future, but its absence in the temp table isn’t causing any immediate problems.

Works The Same In Other Environments

In this section, I use Python to show the value of production code that works the same in non-production.

Setting The Scene

Here I have a Python script that reads data from an Amazon S3 bucket using a boto3 session. I pass my AWS_ACCESSKEY and AWS_SECRET credentials in from a secrets manager, and create an s3bucket variable for the S3 bucket path:

When I deploy this script to my dev environment it works fine.

What’s The Problem?

When I deploy this script to production, s3bucket will still be s3://dev-bucket. The potential impact of this depends on the AWS environment setup:

Different AWS account for each environment:

dev-bucket doesn’t exist in Production. The script fails.

Same AWS account for all environments:

Production IAM roles might not have any permissions for dev-bucket. The script fails.
Production processes might start using a dev resource. The script succeeds but now data has unintentionally crossed environment boundaries.

Doing Things Differently

A solution here is to dynamically set the s3bucket variable based on the ID of the AWS account the script is running in.

I can get the AccountID using AWS STS. I’m already using boto3, so can use it to initiate an STS client with my AWS credentials.

STS then has a GetCallerIdentity action that returns the AWS AccountID linked to the AWS credentials. I capture this AccountID in an account_id variable, then use that to set s3bucket‘s value:

More details about get_caller_identity can be found in the AWS Boto3 documentation.

For bonus points, I can terminate the script if the AWS AccountID isn’t defined. This prevents undesirable states if the script is run in an unexpected account.

Speaking of which…

Prevents Undesirable States

In this section, I use PowerShell to demonstrate how to stop production code from doing unintended things.

Setting The Scene

In June I started writing a PowerShell script to upload lossless music files from my laptop to one of my S3 buckets.

I worked on it in stages. This made it easier to script and test the features I wanted. By the end of Version 1, I had a script that dot-sourced its variables and wrote everything in my local folder $ExternalLocalSource to my S3 bucket $ExternalS3BucketName:

#Load Variables Via Dot Sourcing
. .\EDMTracksLosslessS3Upload-Variables.ps1


#Upload File To S3
Write-S3Object -BucketName $ExternalS3BucketName -Folder $ExternalLocalSource -KeyPrefix $ExternalS3KeyPrefix -StorageClass $ExternalS3StorageClass

What’s The Problem?

NOTE: There were several problems with Version 1, all of which were fixed in Version 2. In the interests of simplicity, I’ll focus on a single one here.

In this script, Write-S3Object will upload everything in the local folder $ExternalLocalSource to the S3 bucket $ExternalS3BucketName.

Problem is, the $ExternalS3BucketName S3 bucket isn’t for everything! It should only contain lossless music files!

At best, Write-S3Object will upload everything in the local folder to S3 whether it’s music or not.

At worst, if the script is pointing at a different folder it will start uploading everything there instead! PowerShell commonly defaults to C:\Windows, so this could cause all kinds of problems.

Doing Things Differently

I decided to limit the extensions that the PowerShell script could upload.

Firstly, the script captures the extensions for each file in the local folder $ExternalLocalSource using Get-ChildItem and [System.IO.Path]::GetExtension:

$LocalSourceObjectFileExtensions = Get-ChildItem -Path $ExternalLocalSource | ForEach-Object -Process { [System.IO.Path]::GetExtension($_) }

Then it checks each extension using a ForEach loop. If an extension isn’t in the list, PowerShell reports this and terminates the script:

ForEach ($LocalSourceObjectFileExtension In $LocalSourceObjectFileExtensions) 

{
If ($LocalSourceObjectFileExtension -NotIn ".flac", ".wav", ".aif", ".aiff") 
{
Write-Output "Unacceptable $LocalSourceObjectFileExtension file found.  Exiting."
Start-Sleep -Seconds 10
Exit
}

So now, if I attempt to upload an unacceptable .log file, PowerShell raises an exception and terminates the script:

**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log

Checking extensions are valid for each local file.
Unacceptable .log file found.  Exiting.
**********************

While an acceptable .flac file will produce this message:

**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log

Checking extensions are valid for each local file.
Acceptable .flac file.
**********************

To see the code in full, as well as the other problems I solved, please check out my post from June.

Summary

In this post, I responded to November 2022’s T-SQL Tuesday #156 Invitation and gave my thoughts on some production code qualities. I gave examples of each quality and showed how they could save time and prevent unintended problems in a production environment.

Thanks to Tomáš for this month’s topic! My previous T-SQL Tuesday posts are here.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Tags Amazon Web Services, GitHub, Microsoft SQL Server, PowerShell, Python, SQL, T-SQL Tuesday, Troubleshooting

Data & Analytics

Ingesting iTunes Data Into AWS With Python And Athena

Post author By Damien Jones
Post date August 10, 2022
1 Comment on Ingesting iTunes Data Into AWS With Python And Athena

In this post, I will update my existing iTunes Python ETL to return a Parquet file, which I will then upload to S3 and view using Athena.

Introduction
Amazon S3
- New S3 Buckets
- Advantages Of Multiple Buckets
Amazon Athena
Python
- New Python Variables
- Changing CSV To Parquet
Architecture
Testing
Scripts
Summary

Introduction

In my last post, I made an ETL that exported data from a CSV into a Pandas DataFrame using AWS Data Wrangler. That post ended with the transformed data being saved locally as a new CSV.

It’s time to do something with that data! I want to analyse my iTunes data and look for trends and insights into my listening habits. I also want to access these insights in the cloud, as my laptop is a bit bulky and quite slow. Finally, I’d prefer to keep my costs to a minimum.

Here, I’ll show how AWS and Python can be used together to meet these requirements. Let’s start with AWS.

Amazon S3

In this section, I will update my S3 setup. I’ll create some new buckets and explain my approach.

New S3 Buckets

Currently, I have a single S3 bucket containing my iTunes Export CSV. Moving forward, this bucket will contain all of my unmodified source objects, otherwise known as raw data.

To partner the raw objects bucket, I now have an ingested objects bucket. This bucket will contain objects where the data has been transformed in some way. My analytics tools and Athena tables will point here for their data.

Speaking of Athena, the other new bucket will be used for Athena’s query results. Although Athena is serverless, it still needs a place to record queries and store results. Creating this bucket now will save time later on.

Having separate buckets for each of these functions isn’t a requirement, although it is something I prefer to do. Before moving on, I’d like to run through some of the benefits I find with this approach.

Advantages Of Multiple Buckets

Firstly, having buckets with clearly defined purposes makes navigation way easier. I always know where to find objects, and rarely lose track of or misplace them.

Secondly, having multiple buckets usually makes my S3 paths shorter. This doesn’t sound like much of a benefit upfront, but the S3 path textboxes in the AWS console are quite small, and using long S3 paths in the command line can be a pain.

Finally, I find security and access controls are far simpler to implement with a multi-bucket setup. Personally I prefer “You can’t come into this house/bucket” over “You can come into this house/bucket, but you can’t go into this room/prefix”. However, both S3 buckets and S3 prefixes can be used as IAM policy resources so there’s technically no difference.

That concludes the S3 section. Next, let’s set up Athena.

Amazon Athena

In this section, I’ll get Athena ready for use. I’ll show the process I followed and explain my key decisions. Let’s start with my reasons for choosing Athena.

Why Athena?

Plenty has been written about Athena’s benefits over the years. So instead of retreading old ground, I’ll discuss what makes Athena a good choice for this particular use case.

Firstly, Athena is cheap. The serverless nature of Athena means I only pay for what I query, scan and store, and I’ve yet to see a charge for Athena in the three years I’ve been an AWS customer.

Secondly, like S3, Athena’s security is managed by IAM. I can use IAM policies to control who and what can access my Athena data, and can monitor that access in CloudTrail. This also means I can manage access to Athena independently of S3.

Finally, Athena is highly available. Authorised calls to the service have a 99.9% Monthly Uptime Percentage SLA and Athena benefits from S3’s availability and durability. This allows 24/7 access to Athena data for users and applications.

Setting Up Athena

To start this section, I recommend reading the AWS Athena Getting Started documentation for a great Athena introduction. I’ll cover some basics here, but I can’t improve on the AWS documentation.

Athena needs three things to get off the ground:

An S3 path for Athena query results.
A database for Athena tables.
A table for interacting with S3 data objects.

I’ve already talked about the S3 path, so let’s move on to the database. A database in Athena is a logical grouping for the tables created in it. Here, I create a blog_amazonwebshark database using the following script:

CREATE DATABASE blog_amazonwebshark

Next, I enter the column names from my iTunes Export CSV into Athena’s Create Table form, along with appropriate data types for each column. In response, the form creates this Athena table:

The form adds several table properties to the table’s DDL. These, along with the data types, are expanded on in the Athena Create Table documentation.

Please note that I have removed the S3 path from the LOCATION property to protect my data. The actual Athena table is pointing at an S3 prefix in my ingested objects bucket that will receive my transformed iTunes data.

Speaking of data, the form offers several choices of source data format including CSV, JSON and Parquet. I chose Parquet, but why do this when I’m already getting a CSV? Why create extra work?

Let me explain.

About Parquet

Apache Parquet is a file format that supports fast processing for complex data. It can essentially be seen as the next generation of CSV. Both formats have their place, but at scale CSV files have large file sizes and slow performance.

In contrast, Parquet files have built-in compression and indexing for rapid data location and retrieval. In addition, the data in Parquet files is organized by column, resulting in smaller sizes and faster queries.

This also results in Athena cost savings as Athena only needs to read the columns relevant to the queries being run. If the same data was in a CSV, Athena would have to read the entire CSV whether the data is needed or not.

For further reading, Databricks have a great Parquet section in their glossary.

That’s everything for Athena. Now I need to update my Python scripts.

Python

In this section, I’ll make changes to my Basic iTunes ETL to include my new S3 and Athena resources and to replace the CSV output with a Parquet file. Let’s start with some variables.

New Python Variables

My first update is a change to ETL_ITU_Play_Variables.py, which contains my global variables. Originally there were two S3 global variables – S3_BUCKET containing the bucket name and S3_PREFIX containing the S3 prefix path leading to the raw data:

S3_BUCKET
S3_PREFIX

Now I have two buckets and two prefixes, so it makes sense to update the variable names. I now have two additional global variables, adding _RAW to the originals and _INGESTED to the new ones for clarity:

S3_BUCKET_RAW
S3_PREFIX_RAW

S3_BUCKET_INGESTED
S3_PREFIX_INGESTED

Changing CSV To Parquet

The next change is to ETL_ITU_Play.py. The initial version converts a Pandas DataFrame to CSV using pandas.DataFrame.to_csv. I’m now replacing this with awswrangler.s3.to_parquet, which needs three parameters:

df is my existing Pandas DataFrame.
boto3_session is the authentication session being used by awswrangler.s3.read_csv.
path is the full path for my _INGESTED S3 bucket.

Put together, it looks like this:

wr.s3.to_parquet(
    df = df,
    boto3_session = session,
    path = s3_path_ingested

Before committing my changes, I took the time to put the main workings of my ETL in a class. This provides a clean structure for my Python script and will make it easier to reuse in future projects.

That completes the changes. Let’s review what has been created.

Architecture

Here is an architectural diagram of how everything fits together:

Here is a breakdown of the processes involved:

User runs the Python ETL script locally.
Python reads the CSV object in datalake-raw S3 bucket.
Python extracts data from CSV into a DataFrame and transforms several columns.
Python writes the DataFrame to datalake-ingested S3 bucket as a Parquet file.
Python notifies User of a successful run.
User sends query to Athena.
Athena reads data from datalake-ingested S3 bucket.
Athena returns query results to User.

Testing

In this section, I will test my resources to make sure they work as expected. Bare in mind that this setup hasn’t been designed with production use in mind, so my testing is somewhat limited and would be insufficient for production deployment.

Testing Python

TEST: Upload a CSV to the datalake-raw S3 bucket, then run the Python script. The Python script must run successfully and print updates in the terminal throughout.

RESULT: I upload an iTunes Export CSV to the datalake-raw S3 bucket:

The Python script runs, printing the following output in the terminal:

Creating DataFrame.
DataFrame columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')
Deleting unnecessary DataFrame columns.
Renaming DataFrame columns.
Reformatting DateTime DataFrame columns.
Creating Date Columns From DateTime Columns.
Creating MyRatingDigit Column.
Replacing blank values to prevent IntCastingNaN errors.
Setting Data Types.
Creating Parquet file from DataFrame.
Processes complete.

Testing S3

TEST: After the Python script successfully runs, the datalake-ingested S3 bucket must contain an itunesdata.parquet object.

RESULT: Upon accessing the datalake-ingested S3 bucket, an itunesdata.parquet object is found:

(On an unrelated note, look at the size difference between the Parquet and CSV files!)

Testing Athena

TEST: When the datalake-ingested S3 bucket contains an itunesdata.parquet object, data from the iTunes Export CSV must be shown when the following Athena query is run:

SELECT * FROM basic_itunes_python_etl;

RESULT: Most of the Athena results match the iTunes Export data. However, the transformed dates did not match expectations:

This appears to be a formatting problem, as some parts of a date format are still visible.

To diagnose the problem I wanted to see how these columns were being stored in the Parquet file. I used mukunku’s ParquetViewer for this, which is described in the GitHub repo as:

…a quick and dirty utility that I created to easily view Apache Parquet files on Windows desktop machines.

It works very well!

Here is a screenshot of the data. The lastplayed column has dates and times, while the datamodifieddate column has dates only:

The cause of the problem becomes apparent when the date columns are viewed using the ISO 8601 format:

The date columns are all using timestamps, even when no times are included!

A potential fix would be to change the section of my Python ETL script that handles data types. Instead, I update the data types used in my Athena table from date:

  `datemodifieddate` date, 
  `dateaddeddate` date, 
  `lastplayeddate` date,

To timestamp:

  `datemodifieddate` timestamp, 
  `dateaddeddate` timestamp, 
  `lastplayeddate` timestamp,

This time, when I view my Athena table the values all appear as expected:

Scripts

My ETL_ITU_Play.py file commit from 2022-08-08 can be viewed here:

My updated repo readme can be viewed here:

Summary

In this post, I updated my existing iTunes Python ETL to return a Parquet file, which I then uploaded S3 and viewed using Athena. I explained my reasoning for choosing S3, Athena and the Parquet file format, and I handled a data formatting issue.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Tags Amazon Athena, Amazon S3, Amazon Web Services, Apache Parquet, AWS Data Wrangler / AWS SDK For Pandas, GitHub, iTunes, Music, Open Source, Project: iTunes Export Data Pipeline (2022), Python, SQL, Troubleshooting, Visual Studio Code

Table of Contents

Introduction

Hosting Alternatives

Amazon Lightsail

SiteGround

Migrate What Exactly?

Explain DNS Like I’m 5

Explain DNS With Pictures

amazonwebshark’s DNS Setup

Data Migration

DNS Migration

Any Problems?

Future Plans

Summary

Table of Contents

Introduction

Precision

Setting The Scene

What’s The Problem?

Doing Things Differently

Works The Same In Other Environments

Setting The Scene

What’s The Problem?

Doing Things Differently

Prevents Undesirable States

Setting The Scene

What’s The Problem?

Doing Things Differently

Summary

Table of Contents

Introduction

Amazon S3

New S3 Buckets

Advantages Of Multiple Buckets

Amazon Athena

Why Athena?

Setting Up Athena

About Parquet

Python

New Python Variables

Changing CSV To Parquet

Architecture

Testing

Testing Python

Testing S3

Testing Athena

Scripts

Summary