Categories
Developing & Application Integration

Uploading Music Files To Amazon S3 (PowerShell Mix)

In this post, I will upload lossless music files from my laptop to one of my Amazon S3 buckets using PowerShell.

Table of Contents

Introduction

For several months I’ve been going through some music from an old hard drive. These music files are currently on my laptop, and exist mainly as lossless .flac files.

For each file I’m doing the following:

  • Creating an .mp3 copy of each lossless file.
  • Storing the .mp3 file on my laptop.
  • Uploading a copy of the lossless file to S3 Glacier.
  • Transferring the original lossless file from my laptop to my desktop PC.

I usually do the uploads using the S3 console, and have been meaning to automate the process for some time. So I decided to write some code to upload files to S3 for me, in this case using PowerShell.

Prerequisites

Before starting to write my PowerShell script, I have done the following on my laptop:

Version 0: Functionality

Version 0 gets the basic functionality in place. No bells and whistles here – I just want to upload a file to an S3 bucket prefix, stored using the Glacier Flexible Retrieval storage class.

V0: Writing To S3

I am using the PowerShell Write-S3Object cmdlet to upload my files to S3. This cmdlet needs a couple of parameters to do what’s required:

  • -BucketName: The S3 bucket receiving the files.
  • -Folder: The folder on my laptop containing the files.
  • -KeyPrefix: The S3 bucket key prefix to assign to the uploaded objects.
  • -StorageClass: The S3 storage class to assign to the uploaded objects.

I create a variable for each of these so that my script is easier to read as I continue its development. I couldn’t find the inputs that the -StorageClass parameter uses in the Write-S3Object documentation. In the end, I found them in the S3 PutObject API Reference.

Valid inputs are as follows:

STANDARD | REDUCED_REDUNDANCY | STANDARD_IA | ONEZONE_IA | INTELLIGENT_TIERING | GLACIER | DEEP_ARCHIVE | OUTPOSTS | GLACIER_IR

V0: Code

V0BasicRedacted.ps1

#Set Variables
$LocalSource = "C:\Users\Files\"
$S3BucketName = "my-s3-bucket"
$S3KeyPrefix = "Folder\SubFolder\"
$S3StorageClass = "GLACIER"


#Upload File To S3
Write-S3Object -BucketName $S3BucketName -Folder $LocalSource -KeyPrefix $S3KeyPrefix -StorageClass $S3StorageClass
V0BasicRedacted.ps1 On GitHub

V0: Evaluation

Version 0 offers me the following benefits:

  • I don’t have to log onto the S3 console for uploads anymore.
  • Forgetting to specify Glacier Flexible Retrieval as the S3 storage class is no longer a problem. The script does this for me.
  • Starting an upload to S3 is now as simple as right-clicking the script and selecting Run With PowerShell from the Windows Context Menu.

Version 0 works great, but I’ll give away one of my S3 bucket names if I start sharing a non-redacted version. This has been known to cause security issues in severe cases. Ideally, I’d like to separate the variables from the Powershell commands, so let’s work on that next.

Version 1: Security

Version 1 enhances the security of my script by separating my variables from my PowerShell commands. To make this work without breaking things, I’m using the following features:

To take advantage of these features, I’ve made two new files in my repo:

  • Variables.ps1 for my variables.
  • V1Security.ps1 for my Write-S3Object command.

So let’s now talk about how this all works.

V1: Isolating Variables With Dot Sourcing

At the moment, my script is broken. Running Variables.ps1 will create the variables but do nothing with them. Running V1Security.ps1 will fail as the variables aren’t in that script anymore.

This is where Dot Sourcing comes in. Using Dot Sourcing lets PowerShell look for code in other places. Here, when I run V1Security.ps1 I want PowerShell to look for variables in Variables.ps1.

To dot source a script, type a dot (.) and a space before the script path. As both of my files are in the same folder, PowerShell doesn’t even need the full path:

. .\EDMTracksLosslessS3Upload-Variables.ps1

Now my script works again! But I still have the same problem – if Variables.ps1 is committed to GitHub at any point then my variables are still visible. How can I stop that?

This time it’s Git to the rescue. I need a .gitignore file.

V1: Selective Tracking With .gitignore

.gitignore is a way of telling Git what not to include in commits. Entering a file, folder or pattern into a repo’s .gitignore file tells Git not to track it.

When Visual Studio Code finds a .gitignore file, it helps out by making visual changes in response to the file’s contents. When I create a .gitignore file and add the following lines to it:

#Ignore PowerShell Files Containing Variables

EDMTracksLosslessS3Upload-V0Basic.ps1
EDMTracksLosslessS3Upload-Variables.ps1

Visual Studio Code’s Explorer tab will show those files as grey:

They won’t be visible at all in the Source Control tab:

And finally, when committed to GitHub the ignored files are not present:

Before moving on, I found this Steve Griffith .gitignore tutorial helpful in introducing the basics:

And this DevOps Journey tutorial helps show how .gitignore behaves within Visual Studio Code:

V1: Code

gitignore Version 1

#Ignore PowerShell Files Containing Variables

EDMTracksLosslessS3Upload-V0Basic.ps1
EDMTracksLosslessS3Upload-Variables.ps1

V1Security.ps1

#Load Variables
. .\EDMTracksLosslessS3Upload-Variables.ps1


#Upload File To S3
Write-S3Object -BucketName $S3BucketName -Folder $LocalSource -KeyPrefix $S3KeyPrefix -StorageClass $S3StorageClass
V1Security.ps1 On GitHub

VariablesBlank.ps1 Version 1

#Set Variables


#The local file path for objects to upload to S3
#E.g. "C:\Users\Files\"
$LocalSource =

#The S3 bucket to upload the objects to
#E.g. "my-s3-bucket"
$S3BucketName =

#The S3 bucket prefix / folder to upload the objects to (if applicable)
#E.g. "Folder\SubFolder\"
$S3KeyPrefix =

#The S3 Storage Class to upload to
#E.g. "GLACIER"
$S3StorageClass =
Version 1 VariablesBlank.ps1 On GitHub

V1: Evaluation

Version 1 now gives me the benefits of Version 0 with the following additions:

  • My variables and commands have now been separated.
  • I can now call Variables.ps1 from other scripts in the same folder, knowing the variables will be the same each time for each script.
  • I can use .gitignore to make sure Variables.ps1 is never uploaded to my GitHub repo.

The next problem is one of visibility. I have no way to know if my uploads have been successful. Or if they were duplicated. Nor do I have any auditing.

The S3 console gives me a summary at the end of each upload:

It would be great to have something similar with my script! In addition, some error handling and quality control checks would increase my confidence levels.

Let’s get to work!

Version 2: Visibility

Version 2 enhances the visibility of my script. The length of the script grows a lot here, so let’s run through the changes and I’ll explain what’s going on.

As a starting point, I copied V1Security.ps1 and renamed it to V2Visibility.ps1.

V2: Variables.ps1 And .gitignore Changes

Additions are being made to these files as a result of the Version 2 changes. I’ll mention them as they come up, but it makes sense to cover a few things up-front:

  • I added External to all variable names in Variables.ps1 to keep track of them in the script. For example, $S3BucketName is now $ExternalS3BucketName.
  • There are some additional local file paths in Variables.ps1 that I’m using for transcripts and some post-upload checks.
  • .gitignore now includes a log file (more on that shortly) and the Visual Studio Code debugging folder.

V2: Transcripts

The first change is perhaps the simplest. PowerShell has built-in cmdlets for creating transcripts:

  • Start-Transcript creates a record of all or part of a PowerShell session in a separate file.
  • Stop-Transcript stops a transcript that was started by the Start-Transcript cmdlet.

These go at the start and end of V2Visibility.ps1, along with a local file path for the EDMTracksLosslessS3Upload.log file I’m using to record everything.

Start-Transcript -Path $ExternalTranscriptPath -IncludeInvocationHeader

This new path is stored in Variables.ps1. In addition, EDMTracksLosslessS3Upload.log has been added to .gitignore.

V2: Check If There Are Any Files

Now the error handing begins. I want the script to fail gracefully, and I start by checking that there are files in the correct folder. First I count the files using Get-ChildItem and Measure-Object:

$LocalSourceCount = (Get-ChildItem -Path $ExternalLocalSource | Measure-Object).Count

And then stop the script running if no files are found:

If ($LocalSourceCount -lt 1) 
{
Write-Output "No Local Files Found.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit
}

There are a couple of cmdlets here that make several appearances in Version 2:

  • Start-Sleep suspends PowerShell activity for the time stated. This gives me time to read the output when I’m running the script using the context menu.
  • Exit causes PowerShell to completely stop everything it’s doing. In this case, there’s no point continuing as there’s nothing in the folder.

If files are found, PowerShell displays the count and carries on:

Else 
{
Write-Output "$LocalSourceCount Local Files Found"          
}

V2: Check If The Files Are Lossless

Next, I want to stop any file uploads that don’t belong in the S3 bucket. The bucket should only contain lossless music – anything else should be rejected.

To arrange this, I first capture the extensions for each file using Get-ChildItem and [System.IO.Path]::GetExtension:

$LocalSourceObjectFileExtensions = Get-ChildItem -Path $ExternalLocalSource | ForEach-Object -Process { [System.IO.Path]::GetExtension($_) }

Then I check each extension using a ForEach loop. If an extension isn’t in the list, PowerShell will report this and exit the script:

ForEach ($LocalSourceObjectFileExtension In $LocalSourceObjectFileExtensions) 

{
If ($LocalSourceObjectFileExtension -NotIn ".flac", ".wav", ".aif", ".aiff") 
{
Write-Output "Unacceptable $LocalSourceObjectFileExtension file found.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit
}

If the extension is in the list, PowerShell records this and checks the next one:

Else 
{
Write-Output "Acceptable $LocalSourceObjectFileExtension file."
}

So now, if I attempt to upload an unacceptable .log file, the transcript will say:

**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log

Checking extensions are valid for each local file.
Unacceptable .log file found.  Exiting.
**********************

Whereas an acceptable .flac file will produce:

**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log

Checking extensions are valid for each local file.
Acceptable .flac file.
**********************

And when uploading multiple files:

**********************
Transcript started, output file is C:\Files\EDMTracksLosslessS3Upload.log

Checking extensions are valid for each local file.
Acceptable .flac file.
Acceptable .wav file.
Acceptable .flac file.
**********************

V2: Check If The Files Are Already In S3

The next step checks if the files are already in S3. This might not seem like a problem, as S3 usually overwrites an object if it already exists.

Thing is, this bucket is replicated. This means it’s also versioned. As a result, S3 will keep both copies in this scenario. In the world of Glacier this doesn’t cost much, but it will distort the bucket’s S3 Inventory. This could lead to confusion when I check them with Athena. And if I can stop this situation with some automation then I might as well.

I’m going to use the Get-S3Object cmdlet to query my bucket for each file. For this to work, I need two things:

  • -BucketName: This is in Variables.ps1.
  • -Key

-Key is the object’s S3 file path. For example, Folder\SubFolder\Music.flac. As the files shouldn’t be in S3 yet, these keys shouldn’t exist. So I’ll have to make them using PowerShell.

I start by getting all the filenames I want to check using Get-ChildItem and [System.IO.Path]::GetFileName:

$LocalSourceObjectFileNames = Get-ChildItem -Path $ExternalLocalSource | ForEach-Object -Process { [System.IO.Path]::GetFileName($_) }

Now I start another ForEach loop. I make an S3 key for each filename by combining it with $ExternalS3KeyPrefix in Variables.ps1:

ForEach ($LocalSourceObjectFileName In $LocalSourceObjectFileNames) 

{
$LocalSourceObjectFileNameS3Key = $ExternalS3KeyPrefix + $LocalSourceObjectFileName 

Then I query S3 using Get-S3Object and my constructed S3 key, and capture the result in a variable:

$LocalSourceObjectFileNameS3Check = Get-S3Object -BucketName $ExternalS3BucketName -Key $LocalSourceObjectFileNameS3Key

Get-S3Object should return null as the object shouldn’t exist.

If this doesn’t happen then the object is already in the bucket. In this situation, PowerShell identifies the file causing the problem and then exits the script:

If ($null -ne $LocalSourceObjectFileNameS3Check) 
{
Write-Output "File already exists in S3 bucket: $LocalSourceObjectFileName.  Please review.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit

If the file isn’t found then PowerShell continues to run:

Else 
{
Write-Output "$LocalSourceObjectFileName does not currently exist in S3 bucket."
}

Assuming no files are found at this point, the log will read as follows:

Checking if local files already exist in S3 bucket.
Checking S3 bucket for Artist-Track-ExtendedMix.flac
Artist-Track-ExtendedMix.flac does not currently exist in S3 bucket.
Checking S3 bucket for Artist-Track-OriginalMix.flac
Artist-Track-OriginalMix.flac does not currently exist in S3 bucket.

V2: Uploading Files Instead Of Folders

Now to start uploading to S3!

In Version 2 I’ve altered how this is done. Previously my script’s purpose was to upload a folder to S3 using the PowerShell cmdlet Write-S3Object.

Version 2 now uploads individual files instead. There is a reason for this that I’ll go into shortly.

This means I have to change things around as Write-S3Object now needs different parameters:

  • Instead of telling the -Folder parameter where the local folder is, I now need to tell the -File parameter where each file is located.
  • Instead of telling the -KeyPrefix parameter where to store the uploaded objects in S3, I now need to tell the -Key parameter the full S3 path for each object.

I’ll do -Key first. I start by opening another ForEach loop, and create an S3 key for each file in the same way I did earlier:

$LocalSourceObjectFileNameS3Key = $ExternalS3KeyPrefix + $LocalSourceObjectFileName 

Next is -File. I make the local file path for each file using variables I’ve already created:

$LocalSourceObjectFilepath = $ExternalLocalSource + "\" + $LocalSourceObjectFileName

Then I begin uploads for each file using Write-S3Object with the new -File and -Key parameters instead of -Folder and -KeyPrefix:

Write-Output "Starting S3 Upload Of $LocalSourceObjectFileName"

Write-S3Object -BucketName $ExternalS3BucketName -File $LocalSourceObjectFilepath -Key $LocalSourceObjectFileNameS3Key -StorageClass $ExternalS3StorageClass

The main benefit of this approach is that, if something goes wrong mid-upload, the transcript will tell me which uploads were successful. Version 1’s script would only tell me that uploads had started, so in the event of failure I’d need to check the S3 bucket’s contents.

Speaking of failure, wouldn’t it be good to check that the uploads worked?

V2: Were The Uploads Successful?

For this, I’m still working in the ForEach loop I started for the uploads. After an upload finishes, PowerShell checks if the object is in S3 using the Get-S3Object command I wrote earlier:

Write-Output "Starting S3 Upload Check Of $LocalSourceObjectFileName"
      
$LocalSourceObjectFileNameS3Check = Get-S3Object -BucketName $ExternalS3BucketName -Key $LocalSourceObjectFileNameS3Key

This time I want the object to be found, so null is a bad result.

Next, I get PowerShell to do some heavy lifting for me. I’ve created a pair of new local folders called S3WriteSuccess and S3WriteFail. The paths for these are stored in Variables.ps1.

If my S3 upload check doesn’t find anything and returns null, PowerShell moves the file from the source folder to S3WriteFail using Move-Item:

If ($null -eq $LocalSourceObjectFileNameS3Check) 

{
Write-Output "S3 Upload Check FAIL: $LocalSourceObjectFileName.  Moving to local Fail folder"
Move-Item -Path $LocalSourceObjectFilepath -Destination $ExternalLocalDestinationFail
}

If the object is found, PowerShell moves the file to S3WriteSuccess:

Else 

{
Write-Output "S3 Upload Check Success: $LocalSourceObjectFileName.  Moving to local Success folder"
Move-Item -Path $LocalSourceObjectFilepath -Destination $ExternalLocalDestinationSuccess           
} 

The ForEach loop then repeats with the next file until all are processed.

So now, a failed upload produces the following log:

**********************
Beginning S3 Upload Checks On Following Objects: StephenJKroos-Micrsh-OriginalMix
S3 Upload Check: StephenJKroos-Micrsh-OriginalMix.flac
S3 Upload Check FAIL: StephenJKroos-Micrsh-OriginalMix.  Moving to local Fail folder
**********************
Windows PowerShell transcript end
**********************

While a successful S3 upload produces this one:

**********************
Beginning S3 Upload Checks On Following Objects: StephenJKroos-Micrsh-OriginalMix
S3 Upload Check: StephenJKroos-Micrsh-OriginalMix.flac
S3 Upload Check Success: StephenJKroos-Micrsh-OriginalMix.  Moving to local Success folder
**********************
Windows PowerShell transcript end
**********************

PowerShell then shows a final message before ending the transcript:

Write-Output "All files processed.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript

V2: Code

gitignore Version 2

###################
###### FILES ######
###################

#Powershell Transcript log
EDMTracksLosslessS3Upload.log

#PowerShell Files Containing Variables
EDMTracksLosslessS3Upload-V0Basic.ps1

#PowerShell Files Containing Variables
EDMTracksLosslessS3Upload-Variables.ps1


#####################
###### FOLDERS ######
#####################

#VSCode Debugging
.vscode/
Version 2.gitignore On GitHub

V2Visibility.ps1

##################################
####### EXTERNAL VARIABLES #######
##################################


#Load External Variables Via Dot Sourcing
. .\EDMTracksLosslessS3Upload-Variables.ps1

#Start Transcript
Start-Transcript -Path $ExternalTranscriptPath -IncludeInvocationHeader


###############################
####### LOCAL VARIABLES #######
###############################


#Get count of items in $ExternalLocalSource
#Get list of filenames in $ExternalLocalSource
$LocalSourceCount = (Get-ChildItem -Path $ExternalLocalSource | Measure-Object).Count

#Get list of extensions in $ExternalLocalSource
$LocalSourceObjectFileExtensions = Get-ChildItem -Path $ExternalLocalSource | ForEach-Object -Process { [System.IO.Path]::GetExtension($_) }

#Get list of filenames in $ExternalLocalSource
$LocalSourceObjectFileNames = Get-ChildItem -Path $ExternalLocalSource | ForEach-Object -Process { [System.IO.Path]::GetFileName($_) }


##########################
####### OPERATIONS #######
##########################


#Check there are files in local folder.
Write-Output "Counting files in local folder."

#If local folder less than 1, output this and stop the script.  
If ($LocalSourceCount -lt 1) 

{
Write-Output "No Local Files Found.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit
}

#If files are found, output the count and continue.
Else 

{
Write-Output "$LocalSourceCount Local Files Found"          
}


#Check extensions are valid for each file.
Write-Output " "
Write-Output "Checking extensions are valid for each local file."

ForEach ($LocalSourceObjectFileExtension In $LocalSourceObjectFileExtensions) 

{
#If any extension is unacceptable, output this and stop the script. 
If ($LocalSourceObjectFileExtension -NotIn ".flac", ".wav", ".aif", ".aiff") 

{
Write-Output "Unacceptable $LocalSourceObjectFileExtension file found.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit
}

#If extension is fine, output the extension for each file and continue.
Else 
{
Write-Output "Acceptable $LocalSourceObjectFileExtension file."
}
}


#Check if local files already exist in S3 bucket.
Write-Output " "
Write-Output "Checking if local files already exist in S3 bucket."

#Do following actions for each file in local folder
ForEach ($LocalSourceObjectFileName In $LocalSourceObjectFileNames) 

{
#Create S3 object key using $ExternalS3KeyPrefix and current object's filename
$LocalSourceObjectFileNameS3Key = $ExternalS3KeyPrefix + $LocalSourceObjectFileName 

#Create local filepath for each object for the file move
$LocalSourceObjectFilepath = $ExternalLocalSource + "\" + $LocalSourceObjectFileName

#Output that S3 upload check is starting
Write-Output "Checking S3 bucket for $LocalSourceObjectFileName"
      
#Attempt to get S3 object data using $LocalSourceObjectFileNameS3Key
$LocalSourceObjectFileNameS3Check = Get-S3Object -BucketName $ExternalS3BucketName -Key $LocalSourceObjectFileNameS3Key

#If local file found in S3, output this and stop the script.
If ($null -ne $LocalSourceObjectFileNameS3Check) 

{
Write-Output "File already exists in S3 bucket: $LocalSourceObjectFileName.  Please review.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
Exit
}

#If local file not found in S3, report this and continue.
Else 
{
Write-Output "$LocalSourceObjectFileName does not currently exist in S3 bucket."
}
}


#Output that S3 uploads are starting - count and file names
Write-Output " "
Write-Output "Starting S3 Upload Of $LocalSourceCount Local Files."
Write-Output "These files are as follows: $LocalSourceObjectFileNames"
Write-Output " "


#Do following actions for each file in local folder
ForEach ($LocalSourceObjectFileName In $LocalSourceObjectFileNames) 

{
#Create S3 object key using $ExternalS3KeyPrefix and current object's filename
$LocalSourceObjectFileNameS3Key = $ExternalS3KeyPrefix + $LocalSourceObjectFileName 

#Create local filepath for each object for the file move
$LocalSourceObjectFilepath = $ExternalLocalSource + "\" + $LocalSourceObjectFileName

#Output that S3 upload is starting
Write-Output "Starting S3 Upload Of $LocalSourceObjectFileName"

#Write object to S3 bucket
Write-S3Object -BucketName $ExternalS3BucketName -File $LocalSourceObjectFilepath -Key $LocalSourceObjectFileNameS3Key -StorageClass $ExternalS3StorageClass

#Output that S3 upload check is starting
Write-Output "Starting S3 Upload Check Of $LocalSourceObjectFileName"
      
#Attempt to get S3 object data using $LocalSourceObjectFileNameS3Key
$LocalSourceObjectFileNameS3Check = Get-S3Object -BucketName $ExternalS3BucketName -Key $LocalSourceObjectFileNameS3Key

#If $LocalSourceObjectFileNameS3Key doesn't exist in S3, move to local Fail folder.
If ($null -eq $LocalSourceObjectFileNameS3Check) 

{
Write-Output "S3 Upload Check FAIL: $LocalSourceObjectFileName.  Moving to local Fail folder"
Move-Item -Path $LocalSourceObjectFilepath -Destination $ExternalLocalDestinationFail
}

#If $LocalSourceObjectFileNameS3Key does exist in S3, move to local Success folder.
Else 
{
Write-Output "S3 Upload Check Success: $LocalSourceObjectFileName.  Moving to local Success folder"
Move-Item -Path $LocalSourceObjectFilepath -Destination $ExternalLocalDestinationSuccess           
}
}


#Stop Transcript
Write-Output " "
Write-Output "All files processed.  Exiting."
Start-Sleep -Seconds 10
Stop-Transcript
V2Visibility.ps1 On GitHub

VariablesBlank.ps1 Version 2

##################################
####### EXTERNAL VARIABLES #######
##################################

#The local file path for the transcript file
#E.g. "C:\Users\Files\"
$ExternalTranscriptPath =

#The local file path for objects to upload to S3
#E.g. "C:\Users\Files\"
$ExternalLocalSource =

#The S3 bucket to upload objects to
#E.g. "my-s3-bucket"
$ExternalS3BucketName =

#The S3 bucket prefix / folder to upload  objects to (if applicable)
#E.g. "Folder\SubFolder\"
$ExternalS3KeyPrefix =

#The S3 Storage Class to upload to
#E.g. "GLACIER"
$ExternalS3StorageClass =

#The local file path for moving successful S3 uploads to
#E.g. "C:\Users\Files\"
$ExternalLocalDestinationSuccess =

#The local file path for moving failed S3 uploads to
#E.g. "C:\Users\Files\"
$ExternalLocalDestinationFail =
Version 2 VariablesBlank.ps1 On GitHub

V2: Evaluation

Overall I’m very happy with how this all turned out! Version 2 took a script that worked with some supervision, and turned it into something I can set and forget.

The various checks now have my back if I select the wrong files or if my connection breaks. And, while the Get-S3Object checks mean that I’m making more S3 API calls, the increase won’t cause any bill spikes.

The following is a typical transcript that my script produces following a successful upload of two .flac files:

**********************
Transcript started, output file is C:\Users\Files\EDMTracksLosslessS3Upload.log
Counting files in local folder.
2 Local Files Found

Checking extensions are valid for each local file.
Acceptable .flac file.
Acceptable .flac file.

Checking if local files already exist in S3 bucket.
Checking S3 bucket for MarkOtten-Tranquility-OriginalMix.flac
MarkOtten-Tranquility-OriginalMix.flac does not currently exist in S3 bucket.
Checking S3 bucket for StephenJKroos-Micrsh-OriginalMix.flac
StephenJKroos-Micrsh-OriginalMix.flac does not currently exist in S3 bucket.

Starting S3 Upload Of 2 Local Files.
These files are as follows: MarkOtten-Tranquility-OriginalMix StephenJKroos-Micrsh-OriginalMix.flac

Starting S3 Upload Of MarkOtten-Tranquility-OriginalMix.flac
Starting S3 Upload Check Of MarkOtten-Tranquility-OriginalMix.flac
S3 Upload Check Success: MarkOtten-Tranquility-OriginalMix.flac.  Moving to local Success folder
Starting S3 Upload Of StephenJKroos-Micrsh-OriginalMix.flac
Starting S3 Upload Check Of StephenJKroos-Micrsh-OriginalMix.flac
S3 Upload Check Success: StephenJKroos-Micrsh-OriginalMix.flac.  Moving to local Success folder

All files processed.  Exiting.
**********************
Windows PowerShell transcript end
End time: 20220617153926
**********************

GitHub ReadMe

To round everything off, I’ve written a ReadMe for the repo. This is written in Markdown using the template at makeareadme.com, and the finished article is available here.

Summary

In this post, I created a script to upload lossless music files from my laptop to one of my Amazon S3 buckets using PowerShell.

I introduced automation to perform checks before and after each upload, and logged the outputs to a transcript. I then produced a repo for the scripts, accompanied by a ReadMe document.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Developing & Application Integration

Next-Level S3 Notifications With EventBridge

In this post I will use AWS managed services to enhance my S3 user experience with custom EventBridge notifications that are low cost, quick to set up and perform well at scale.

Table of Contents

Introduction

I’ve been restoring some S3 Glacier Flexible Retrieval objects lately. I use bulk retrievals to reduce costs – these finish within 5–12 hours. However, on a couple of occasions I’ve totally forgotten about them and almost missed the download deadline!

Having recently set up some alerting, I decided to make a similar setup that will trigger emails at key points in the retrieval process, using the following AWS services:

  • S3 for holding the objects and managing the retrieval process
  • EventBridge for receiving events from S3 and looking for patterns
  • SNS for sending notifications to me

The end result will look like this:

Let’s start with SNS.

SNS: The Notifier

I went into detail about Amazon Simple Notification Service (SNS) in my last post about making some security alerts so feel free to read that if some SNS terms are unfamiliar.

Here I want SNS to send me emails, so I start by making a new standard topic called s3-object-restore. I then create a new subscription with an email endpoint and link it to my new topic.

This completes my SNS setup. Next I need to make some changes to one of my S3 buckets.

S3: The Storage

Amazon S3 stores objects in buckets. The properties of a bucket can be customised to complement its intended purpose. For example, the Default Encryption property forces encryption on buckets containing sensitive objects. The Bucket Versioning property protects objects from accidental changes and deletes.

Here I’m interested in the Event Notifications property. This property sends notifications when certain events occur in the bucket. Examples of S3 events include uploads, deletes and, importantly for this use case, restore requests.

S3 can send events to a number of AWS services including, helpfully, EventBridge! This isn’t on by default but is easily enabled in the bucket’s properties:

My bucket will now send events to EventBridge. But what is EventBridge?

EventBridge: The Go-Between

Full disclosure. At first I wasn’t entirely sure what EventBridge was. The AWS description did little to change that:

I tend to uncomplicate topics by abstracting them. Here I found it helpful to think of EventBridge as a bus:

  • Busses provide high-capacity transport between bus stops. The bus is EventBridge.
  • Passengers use the bus to get to where they need to go. The passengers are events.
  • Bus stops are where passengers join or depart the bus. The bus stops are event sources and targets.

In the same way that a bus picks up passengers at one bus stop and drops them off at another, EventBridge receives events from a source and directs them to a target.

Much has been written about EventBridge’s benefits. Rather than spending the next few paragraphs copy/pasting, I will instead suggest the following for further reading:

In this use case, EventBridge’s main advantage is that it is decoupled from S3. This allows one EventBridge Rule to serve many S3 buckets. S3 can send notifications to SNS without EventBridge, but each bucket needs configuring separately so this quickly causes headaches with multiple buckets.

Currently my S3 bucket is already sending events to EventBridge, so let’s create an EventBridge rule for them.

EventBridge Rule: Setting A Pattern & Choosing A Source

Rules allow EventBridge to route events from a source to a target. After naming my new rule s3-object-restore, I need to choose what kind of rule I want:

  • Event Pattern: the rule will be triggered by an event.
  • Schedule: the rule will be triggered by a schedule.

I select Event Pattern. EventBridge then poses further questions to establish what events to look for:

  • Event Matching Pattern: Do I want to use EventBridge presets or write my own pattern?
  • Service Provider: Are the events coming from an AWS service or a third party?
  • Service Name: What service will be the source of events?

EventBridge will only present options relevant to the previous choices. For example, choosing AWS as Service Provider means that no third party services are available in Service Name.

My choices so far tell EventBrdige that S3 is the event source:

Next up is Event Type. As EventBridge knows the events are coming from S3, the options here are very specific:

I choose Amazon S3 Event Notification.

EventBridge now knows enough to create a rule, and offers the following JSON as an Event Pattern:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Access Tier Changed", "Object ACL Updated", "Object Created", "Object Deleted", "Object Restore Completed", "Object Restore Expired", "Object Restore Initiated", "Object Storage Class Changed", "Object Tags Added", "Object Tags Deleted"]
}

I’m only interested in restores, so I open the Specific Event(s) list and choose the three Object Restore events:

EventBridge then amends the event pattern to:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Restore Completed", "Object Restore Initiated", "Object Restore Expired"]
}

That’s it for the source. Now EventBridge needs to know what to do when it finds something!

EventBridge Rule: Choosing A Target & Configuring Inputs

One of EventBridge’s big selling points is how it interacts with targets. There are already numerous targets, and EventBridge rules can have more than one.

I select SNS Topic as a target then choose my s3-object-restore SNS topic from the list:

This alone is enough for EventBridge to interact with SNS. When I save this EventBridge rule and trigger it by running an S3 object restore, I receive this email:

Although this is technically a success, some factors aren’t ideal:

  • The formatting of the email is hard to read.
  • There’s a lot of information here, most of which is irrelevant.
  • It’s not immediately clear what this email is telling me.

To address this I can use EventBridge’s Configure Input feature to change what is sent to the target. This feature offers four options:

  • Matched Events: EventBridge passes all of the event text to the target. This is the default.
  • Part Of The Matched Event: EventBridge only sends part of the event text to the target.
  • Constant (JSON text): None of the event text is sent to the target. EventBridge sends user-defined JSON instead.
  • Input Transformer: EventBridge assigns lines of event text as variables, then uses those variables in a template.

Let’s look at the input transformer.

The AWS EventBridge user guide goes into detail about the input transformer and includes a good tutorial. Having consulted these resources, I start by getting the desired JSON from the initial email:

{
"detail-type":"Object Restore Initiated",
"source":"aws.s3",
"time":"2022-02-21T12:51:21Z",
"detail":
{
"bucket":{"name":"redacted"},
"object":{"key":"redacted"}
}
}

Then I convert the JSON into an Input Path:

{
"bucket":"$.detail.bucket.name",
"detail-type":"$.detail-type",
"object":"$.detail.object.key",
"source":"$.source",
"time":"$.time"
}

And finally specify an Input Template:

"<source> <detail-type> at <time>. Bucket: <bucket>. Object: <object>"

EventBridge checks input templates before accepting them, and will throw an error if the input template is invalid:

I update my EventBridge rule with the new Input Transformer configuration. Time to test it out!

Testing

When I trigger an S3 object restore I receive this email moments later:

I then receive a second email when the object is ready for download:

"aws.s3 Object Restore Completed at 2022-03-04T00:15:33Z. Bucket: REDACTED. Object: REDACTED"

And a final one when the object expires:

"aws.s3 Object Restore Expired at 2022-03-05T10:12:04Z. Bucket: REDACTED. Object: REDACTED"

Success!

Before moving on, let me share the results of an earlier test. My very first input path (not included here) contained some mistakes. The input template was valid but it couldn’t read the S3 event properly, so I ended up with this:

Something to bear in mind for future rules!

Cost Analysis

Before I wrap up, let’s run through the expected costs with this setup:

  • SNS: the first thousand email notifications SNS every month are included in the AWS Always Free tier, and I’m nowhere near that!
  • S3: There is no change for S3 passing events to EventBridge. Charges for object storage and retrieval are out of scope for this post.
  • EventBridge: All events published by AWS services are free.

There is no expected cost rise for this setup based on my current use.

Summary

In this post I’ve used EventBridge and SNS to produce free bespoke notifications at key points in the S3 object retrieval process. This offers me the following benefits:

  • Reassurance: I can choose the longer S3 retrieval offerings knowing that AWS will keep me updated on progress.
  • Convenience: I will know the status of retrievals without accessing the AWS console or using the CLI.
  • Cost: I am less likely to forget to download retrieved objects before expiry, and therefore less likely to need to retrieve those objects again.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Data & Analytics

Using Athena To Query S3 Inventory Parquet Objects

In this post I’ll be using Amazon Athena to query data created by the S3 Inventory service.

When I wrote about my first impressions of S3 Glacier Instant Retrieval last month, I noticed some of my S3 Inventory graphs showed figures I didn’t expect. I couldn’t remember many of the objects in the InMotion bucket, and didn’t know that some were in Standard! I went through the bucket manually and found the Standard objects, but still had other questions that I wasn’t keen on solving by hand.

So while I was on-call over Christmas I decided to take a closer look at Athena – the AWS serverless query service designed to analyse data in S3. I’ve used existing setups at work but this was my first time experiencing it from scratch, and I made use of the AWS documentation about querying Amazon S3 Inventory with Amazon Athena and the Andy Grimes blog “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena” to fill in the blanks.

We’ve Got a File On You

First I created an empty s3inventory Athena database. Then I created a s3inventorytable table using the script below, specifying the 2022-01-01 symlink.txt Hive object created by S3 Inventory as the data source:

CREATE EXTERNAL TABLE s3inventorytable(
         bucket string,
         key string,
         version_id string,
         is_latest boolean,
         is_delete_marker boolean,
         size bigint,
         last_modified_date bigint,
         e_tag string,
         storage_class string,
         is_multipart_uploaded boolean,
         replication_status string,
         encryption_status string,
         object_lock_retain_until_date bigint,
         object_lock_mode string,
         object_lock_legal_hold_status string,
         intelligent_tiering_access_tier string,
         bucket_key_status string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://[REDACTED]/hive/dt=2022-01-01-01-00/';

Then I ran a query to determine the storage classes in use in the InMotion bucket and the number of objects assigned to each:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results were as follows:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"

41 Standard objects?! I wasn’t sure what they were and so added object size into the query:

SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class
SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"

The zero size and subsequent investigations confirmed that the Standard objects were prefixes, and so presented no problems.

Next, I wanted to check for unwanted previous versions of objects using the following query:

SELECT key, size 
FROM "s3inventory"."s3inventorytable" 
WHERE is_latest = FALSE

This query returned another prefix, so again there were no actions needed:

SELECT key, size 
FROM "s3inventory"."s3inventorytable"

Further investigation found that this prefix also has no storage class assigned to it, as seen in the results above.

For Old Time’s Sake

I then wanted to see the youngest and oldest objects for each storage class, and ran the following query:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

What I got back was unexpected:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"

S3 Inventory stores dates as Unix Epoch Time, so I needed a function to transform the data to a human-legible format. Traditionally this would involve CAST or CONVERT, but as Athena uses Presto additional functions are available such as from_unixtime:

from_unixtime(unixtime) → timestamp

Returns the UNIX timestamp unixtime as a timestamp.

I updated the query to include this function:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date)),
MAX(from_unixtime(last_modified_date))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

This time the dates were human-legible but completely inaccurate:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
human

I then found a solution in Stack Overflow, where a user suggested converting a Unix Epoch Time value from microseconds to milliseconds. I applied this suggestion to my query by dividing the last modified dates by 1000:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date/1000)),
MAX(from_unixtime(last_modified_date/1000))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results after this looked far more reasonable:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
FINAL

And EpochConverter confirmed the human time was correct for the Deep Archive MIN(last_modified_date) Unix value of 1620147401000:

So there we go! An introduction to Athena and utilization of the data from S3 Inventory!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~