Next-Level S3 Notifications With EventBridge

In this post I will use AWS managed services to enhance my S3 user experience with custom EventBridge notifications that are low cost, quick to set up and perform well at scale.

Introduction

I’ve been restoring some S3 Glacier Flexible Retrieval objects lately. I use bulk retrievals to reduce costs – these finish within 5–12 hours. However, on a couple of occasions I’ve totally forgotten about them and almost missed the download deadline!
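
For anyone who hasn’t used them, bulk restores can be started from the console, the CLI or an SDK. Here’s a minimal boto3 sketch of the kind of request involved (bucket, key and duration are placeholders, not my actual setup):

import boto3

s3 = boto3.client("s3")

# Initiate a bulk (lowest-cost) restore of a Glacier Flexible Retrieval object
s3.restore_object(
    Bucket="my-archive-bucket",     # placeholder
    Key="backups/2021/photos.zip",  # placeholder
    RestoreRequest={
        "Days": 5,  # how long the restored copy stays available
        "GlacierJobParameters": {"Tier": "Bulk"},
    },
)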

Having recently set up some alerting, I decided to make a similar setup that will trigger emails at key points in the retrieval process, using the following AWS services:

  • S3 for holding the objects and managing the retrieval process
  • EventBridge for receiving events from S3 and looking for patterns
  • SNS for sending notifications to me

The end result will look like this:

Let’s start with SNS.

SNS: The Notifier

I went into detail about Amazon Simple Notification Service (SNS) in my last post about making some security alerts so feel free to read that if some SNS terms are unfamiliar.

Here I want SNS to send me emails, so I start by making a new standard topic called s3-object-restore. I then create a new subscription with an email endpoint and link it to my new topic.
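
I did all of this in the console, but the same SNS setup could be scripted. A rough boto3 equivalent (the email address is a placeholder, and the subscription still needs confirming from the email it sends):

import boto3

sns = boto3.client("sns")

# Create the standard topic and subscribe an email endpoint to it
topic = sns.create_topic(Name="s3-object-restore")
sns.subscribe(
    TopicArn=topic["TopicArn"],
    Protocol="email",
    Endpoint="me@example.com",  # placeholder address
)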

This completes my SNS setup. Next I need to make some changes to one of my S3 buckets.

S3: The Storage

Amazon S3 stores objects in buckets. The properties of a bucket can be customised to complement its intended purpose. For example, the Default Encryption property forces encryption on buckets containing sensitive objects. The Bucket Versioning property protects objects from accidental changes and deletes.

Here I’m interested in the Event Notifications property. This property sends notifications when certain events occur in the bucket. Examples of S3 events include uploads, deletes and, importantly for this use case, restore requests.

S3 can send events to a number of AWS services including, helpfully, EventBridge! This isn’t on by default but is easily enabled in the bucket’s properties:
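
The same switch can also be flipped programmatically if the console isn’t convenient; a hedged boto3 sketch with a placeholder bucket name:

import boto3

s3 = boto3.client("s3")

# Turn on "Send notifications to Amazon EventBridge" for the bucket
s3.put_bucket_notification_configuration(
    Bucket="my-archive-bucket",  # placeholder
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)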

My bucket will now send events to EventBridge. But what is EventBridge?

EventBridge: The Go-Between

Full disclosure. At first I wasn’t entirely sure what EventBridge was. The AWS description did little to change that:

I tend to uncomplicate topics by abstracting them. Here I found it helpful to think of EventBridge as a bus:

  • Buses provide high-capacity transport between bus stops. The bus is EventBridge.
  • Passengers use the bus to get to where they need to go. The passengers are events.
  • Bus stops are where passengers join or depart the bus. The bus stops are event sources and targets.

In the same way that a bus picks up passengers at one bus stop and drops them off at another, EventBridge receives events from a source and directs them to a target.

Much has been written about EventBridge’s benefits. Rather than spending the next few paragraphs copy/pasting, I will instead suggest the following for further reading:

In this use case, EventBridge’s main advantage is that it is decoupled from S3. This allows one EventBridge rule to serve many S3 buckets. S3 can send notifications to SNS without EventBridge, but each bucket needs configuring separately, which quickly becomes a headache with multiple buckets.

My S3 bucket is already sending events to EventBridge, so let’s create an EventBridge rule for them.

EventBridge Rule: Setting A Pattern & Choosing A Source

Rules allow EventBridge to route events from a source to a target. After naming my new rule s3-object-restore, I need to choose what kind of rule I want:

  • Event Pattern: the rule will be triggered by an event.
  • Schedule: the rule will be triggered by a schedule.

I select Event Pattern. EventBridge then poses further questions to establish what events to look for:

  • Event Matching Pattern: Do I want to use EventBridge presets or write my own pattern?
  • Service Provider: Are the events coming from an AWS service or a third party?
  • Service Name: What service will be the source of events?

EventBridge will only present options relevant to the previous choices. For example, choosing AWS as Service Provider means that no third party services are available in Service Name.

My choices so far tell EventBridge that S3 is the event source:

Next up is Event Type. As EventBridge knows the events are coming from S3, the options here are very specific:

I choose Amazon S3 Event Notification.

EventBridge now knows enough to create a rule, and offers the following JSON as an Event Pattern:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Access Tier Changed", "Object ACL Updated", "Object Created", "Object Deleted", "Object Restore Completed", "Object Restore Expired", "Object Restore Initiated", "Object Storage Class Changed", "Object Tags Added", "Object Tags Deleted"]
}

I’m only interested in restores, so I open the Specific Event(s) list and choose the three Object Restore events:

EventBridge then amends the event pattern to:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Restore Completed", "Object Restore Initiated", "Object Restore Expired"]
}
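
For anyone building the rule outside the console, the equivalent API call looks roughly like this (a boto3 sketch using the pattern above on the default event bus, not the exact steps I took):

import json
import boto3

events = boto3.client("events")

# Create the rule on the default event bus with the restore-only pattern
events.put_rule(
    Name="s3-object-restore",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": [
            "Object Restore Initiated",
            "Object Restore Completed",
            "Object Restore Expired",
        ],
    }),
    State="ENABLED",
)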

That’s it for the source. Now EventBridge needs to know what to do when it finds something!

EventBridge Rule: Choosing A Target & Configuring Inputs

One of EventBridge’s big selling points is how it interacts with targets. There are already numerous targets, and EventBridge rules can have more than one.

I select SNS Topic as a target then choose my s3-object-restore SNS topic from the list:
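
The console handles the wiring here, including the SNS access policy that lets EventBridge publish to the topic. Scripted, the equivalent is roughly this boto3 sketch (region and account ID are placeholders, and the topic policy would need adding separately):

import boto3

events = boto3.client("events")

# Attach the SNS topic as a target of the rule
events.put_targets(
    Rule="s3-object-restore",
    Targets=[
        {
            "Id": "s3-object-restore-sns",
            "Arn": "arn:aws:sns:eu-west-1:111111111111:s3-object-restore",  # placeholder
        }
    ],
)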

This alone is enough for EventBridge to interact with SNS. When I save this EventBridge rule and trigger it by running an S3 object restore, I receive this email:

Although this is technically a success, some factors aren’t ideal:

  • The formatting of the email is hard to read.
  • There’s a lot of information here, most of which is irrelevant.
  • It’s not immediately clear what this email is telling me.

To address this I can use EventBridge’s Configure Input feature to change what is sent to the target. This feature offers four options:

  • Matched Events: EventBridge passes all of the event text to the target. This is the default.
  • Part Of The Matched Event: EventBridge only sends part of the event text to the target.
  • Constant (JSON text): None of the event text is sent to the target. EventBridge sends user-defined JSON instead.
  • Input Transformer: EventBridge assigns lines of event text as variables, then uses those variables in a template.

Let’s look at the input transformer.

The AWS EventBridge user guide goes into detail about the input transformer and includes a good tutorial. Having consulted these resources, I start by getting the desired JSON from the initial email:

{
  "detail-type": "Object Restore Initiated",
  "source": "aws.s3",
  "time": "2022-02-21T12:51:21Z",
  "detail": {
    "bucket": {"name": "redacted"},
    "object": {"key": "redacted"}
  }
}

Then I convert the JSON into an Input Path:

{
  "bucket": "$.detail.bucket.name",
  "detail-type": "$.detail-type",
  "object": "$.detail.object.key",
  "source": "$.source",
  "time": "$.time"
}

And finally specify an Input Template:

"<source> <detail-type> at <time>. Bucket: <bucket>. Object: <object>"

EventBridge checks input templates before accepting them, and will throw an error if the input template is invalid:

I update my EventBridge rule with the new Input Transformer configuration. Time to test it out!

Testing

When I trigger an S3 object restore I receive this email moments later:

I then receive a second email when the object is ready for download:

"aws.s3 Object Restore Completed at 2022-03-04T00:15:33Z. Bucket: REDACTED. Object: REDACTED"

And a final one when the object expires:

"aws.s3 Object Restore Expired at 2022-03-05T10:12:04Z. Bucket: REDACTED. Object: REDACTED"

Success!

Before moving on, let me share the results of an earlier test. My very first input path (not included here) contained some mistakes. The input template was valid but it couldn’t read the S3 event properly, so I ended up with this:

Something to bear in mind for future rules!

Cost Analysis

Before I wrap up, let’s run through the expected costs with this setup:

  • SNS: the first thousand SNS email notifications every month are included in the AWS Always Free tier, and I’m nowhere near that!
  • S3: There is no charge for S3 passing events to EventBridge. Charges for object storage and retrieval are out of scope for this post.
  • EventBridge: All events published by AWS services are free.

There is no expected cost rise for this setup based on my current use.

Summary

In this post I’ve used EventBridge and SNS to produce free bespoke notifications at key points in the S3 object retrieval process. This offers me the following benefits:

  • Reassurance: I can choose the longer S3 retrieval offerings knowing that AWS will keep me updated on progress.
  • Convenience: I will know the status of retrievals without accessing the AWS console or using the CLI.
  • Cost: I am less likely to forget to download retrieved objects before expiry, and therefore less likely to need to retrieve those objects again.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Using Athena To Query S3 Inventory Parquet Objects

In this post I’ll be using Amazon Athena to query data created by the S3 Inventory service.

When I wrote about my first impressions of S3 Glacier Instant Retrieval last month, I noticed some of my S3 Inventory graphs showed figures I didn’t expect. I couldn’t remember many of the objects in the InMotion bucket, and didn’t know that some were in Standard! I went through the bucket manually and found the Standard objects, but still had other questions that I wasn’t keen on solving by hand.

So while I was on-call over Christmas I decided to take a closer look at Athena – the AWS serverless query service designed to analyse data in S3. I’ve used existing setups at work but this was my first time experiencing it from scratch, and I made use of the AWS documentation about querying Amazon S3 Inventory with Amazon Athena and the Andy Grimes blog “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena” to fill in the blanks.

We’ve Got a File On You

First I created an empty s3inventory Athena database. Then I created a s3inventorytable table using the script below, specifying the 2022-01-01 symlink.txt Hive object created by S3 Inventory as the data source:

CREATE EXTERNAL TABLE s3inventorytable(
         bucket string,
         key string,
         version_id string,
         is_latest boolean,
         is_delete_marker boolean,
         size bigint,
         last_modified_date bigint,
         e_tag string,
         storage_class string,
         is_multipart_uploaded boolean,
         replication_status string,
         encryption_status string,
         object_lock_retain_until_date bigint,
         object_lock_mode string,
         object_lock_legal_hold_status string,
         intelligent_tiering_access_tier string,
         bucket_key_status string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://[REDACTED]/hive/dt=2022-01-01-01-00/';

Then I ran a query to determine the storage classes in use in the InMotion bucket and the number of objects assigned to each:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results were as follows:

41 Standard objects?! I wasn’t sure what they were and so added object size into the query:

SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The zero size and subsequent investigations confirmed that the Standard objects were prefixes, and so presented no problems.

Next, I wanted to check for unwanted previous versions of objects using the following query:

SELECT key, size 
FROM "s3inventory"."s3inventorytable" 
WHERE is_latest = FALSE

This query returned another prefix, so again there were no actions needed:

Further investigation found that this prefix also has no storage class assigned to it, as seen in the results above.

For Old Time’s Sake

I then wanted to see the youngest and oldest objects for each storage class, and ran the following query:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

What I got back was unexpected:

S3 Inventory stores dates as Unix Epoch Time, so I needed a function to transform the data into a human-legible format. Traditionally this would involve CAST or CONVERT, but as Athena uses Presto, additional functions such as from_unixtime are available:

from_unixtime(unixtime) → timestamp

Returns the UNIX timestamp unixtime as a timestamp.

I updated the query to include this function:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date)),
MAX(from_unixtime(last_modified_date))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

This time the dates were human-legible but completely inaccurate:

I then found a solution on Stack Overflow, where a user suggested dividing the epoch values by 1000. S3 Inventory stores last_modified_date in milliseconds, while from_unixtime expects seconds, so I applied this suggestion to my query by dividing the last modified dates by 1000:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date/1000)),
MAX(from_unixtime(last_modified_date/1000))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results after this looked far more reasonable:

And EpochConverter confirmed the human time was correct for the Deep Archive MIN(last_modified_date) Unix value of 1620147401000:
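
The same check is only a couple of lines of Python if an online converter isn’t to hand:

from datetime import datetime, timezone

# The inventory value is in milliseconds, so divide by 1000 before converting
epoch_ms = 1620147401000
print(datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc))
# prints a UTC timestamp in early May 2021, matching EpochConverter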

So there we go! An introduction to Athena, and the S3 Inventory data put to good use!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

S3 Glacier Instant Retrieval: First Impressions

On 30/11/2021, AWS introduced S3 Glacier Instant Retrieval – a new archive storage class for S3 that operates alongside S3 Glacier (now renamed S3 Glacier Flexible Retrieval) and S3 Glacier Deep Archive. Their announcements can be seen here and here and a summary of all Glacier classes is available on the S3 Glacier product page.

I already use most of the S3 storage classes in my AWS accounts. Earlier in the year I got tired of my laptop backups needing to run overnight, so I made an S3 cross-account replication setup: whatever I upload to the AtRest bucket in my main account is replicated to the AtRest bucket in my backup account and stored as S3 Glacier Deep Archive. This way I have two versions of each object in different regions and different accounts, and although there are data transfer costs, they are offset by the savings from using S3 Glacier Deep Archive for the backup objects.
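
For illustration, the heart of that setup is a replication rule that overrides the destination storage class. Something along these lines (a boto3 sketch with placeholder bucket names, role and account IDs, not my exact configuration):

import boto3

s3 = boto3.client("s3")

# Replicate AtRest objects to the backup account, landing as Deep Archive
s3.put_bucket_replication(
    Bucket="main-atrest-bucket",  # placeholder
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication-role",  # placeholder
        "Rules": [
            {
                "ID": "atrest-to-backup-account",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::backup-atrest-bucket",  # placeholder
                    "Account": "222222222222",                      # placeholder
                    "AccessControlTranslation": {"Owner": "Destination"},
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    },
)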

Objects in my main account use different classes depending on their purpose. Before I upload any objects there I consider whether the object is in motion or at rest and what my access pattern for the object is likely to be, then choose a storage class accordingly. This is the current storage class distribution for all buckets in my main account according to S3 Storage Lens:

The arrival of S3 Glacier Instant Retrieval is of interest to me as it might offer cost savings and accessibility improvements over my current setup. So far my decisions over S3 storage classes have usually boiled down to trade-offs. For example:

  • For Object X I could use S3 Intelligent Tiering or S3 Infrequent Access. S3 Infrequent Access has a minimum storage duration of 30 days and has retrieval costs, but S3 Intelligent Tiering has a monitoring and automation charge per 1,000 objects, and each object spends its first 30 days in, and is charged as, S3 Standard. So if I know I’m not going to touch this object for at least a month, which class is most suitable?
  • For Object Y I could use S3 Glacier or S3 Glacier Deep Archive. Deep Archive costs less for storage, but its retrieval fees are higher than Glacier’s and its minimum storage duration is 180 days where Glacier’s is only 90 days. Plus I can get objects out of Glacier far quicker, as its standard retrieval time is 3 to 5 hours compared to Deep Archive’s standard of 12 hours. So could I afford to wait half a day for this object if I needed it? And how long do I see this object being around for?

Comparisons With Other S3 Storage Classes

So how does S3 Glacier Instant Retrieval compare to S3 Infrequent Access and S3 Glacier Flexible Retrieval? I loaded the S3 pricing site and had a look at various costs in eu-west-1 for S3 Infrequent Access (IFA), S3 Glacier Instant Retrieval (GIR) and S3 Glacier Flexible Retrieval (GFR), then used the S3 calculator to get some estimates based on my current S3 Storage Lens statistics and November 2021 bill.

Storage:

  • IFA $0.0125 per GB
  • GIR $0.004 per GB
  • GFR $0.0036 per GB

PUT, COPY, POST, LIST requests (per 1,000 requests):

  • IFA $0.01
  • GIR $0.02
  • GFR $0.33

GET, SELECT, and all other requests (per 1,000 requests):

  • IFA $0.001
  • GIR $0.01
  • GFR $0.0004

Data Retrieval requests (per 1,000 requests):

  • IFA N/A
  • GIR N/A
  • GFR $0.055 (Standard)

Data retrievals (per GB):

  • IFA $0.01
  • GIR $0.03
  • GFR $0.01 (Standard)

Estimated cost for storing 200GB per month (with an average object size of 4.4MB for Glacier Flexible Retrieval), making 24,265 PUT, COPY, POST, LIST requests and 10,402 GET, SELECT and all other requests, and retrieving 50GB per month (using 1 Standard request for Glacier Flexible Retrieval):

  • IFA $3.25
  • GIR $2.89
  • GFR $2.38
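
As a rough sanity check, the Instant Retrieval and Infrequent Access figures can be reproduced from the unit prices above (the Flexible Retrieval estimate involves extra pricing dimensions such as retrieval requests, so I haven’t reproduced it here):

# Rough sanity check of the S3 calculator estimates using the eu-west-1 prices above
storage_gb, retrieval_gb = 200, 50
put_requests, get_requests = 24265, 10402

gir = (storage_gb * 0.004             # storage
       + put_requests / 1000 * 0.02   # PUT, COPY, POST, LIST
       + get_requests / 1000 * 0.01   # GET, SELECT and others
       + retrieval_gb * 0.03)         # data retrieval

ifa = (storage_gb * 0.0125
       + put_requests / 1000 * 0.01
       + get_requests / 1000 * 0.001
       + retrieval_gb * 0.01)

print(round(gir, 2), round(ifa, 2))  # ~2.89 and ~3.25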

A couple other items of note:

  • S3 Glacier Instant Retrieval has a minimum billable object size of 128 KB, which it shares with S3 Standard Infrequent Access
  • S3 Glacier Instant Retrieval offers instant retrieval in milliseconds, which it also shares with S3 Standard Infrequent Access
  • S3 Glacier Instant Retrieval has a minimum storage duration of 90 days, which it shares with S3 Glacier Flexible Retrieval

What’s interesting in the cost estimates for me is how close S3 Glacier Instant Retrieval is to S3 Standard Infrequent Access. The major difference between the two classes that I can see is that, while S3 Glacier Instant Retrieval has a minimum storage duration of 90 days, the same period for S3 Standard Infrequent Access is only 30 days. If you delete an object before the end of a minimum storage duration period, you are charged for the full period specified. Depending on the size and number of the objects, this could get expensive if mismanaged. That said, AWS are offering S3 Glacier Instant Retrieval as being “For long-lived archive data accessed once a quarter with instant retrieval in milliseconds”, so there are no smoke and mirrors here.

Conclusions

Would I use S3 Glacier Instant Retrieval over S3 Glacier Flexible Retrieval or S3 Standard Infrequent Access? Definitely in my AtRest bucket. The S3 Storage Lens stats for that bucket show many objects in S3 Standard Infrequent Access, including all the old TV shows from the Internet Archive, because let’s face it – if you want to watch old TV you want to watch it now, not in 3 hours’ time </Glacier>. In this scenario S3 Glacier Instant Retrieval keeps the millisecond access and, although the retrieval cost is higher (GIR $0.03 per GB vs IFA $0.01 per GB), the cost of data storage is lower (GIR $0.004 per GB vs IFA $0.0125 per GB). So S3 Glacier Instant Retrieval looks like a winner there.

My InMotion bucket is a different story though. The objects here aren’t being retained permanently, and most of them are in S3 so that they don’t bring my laptop’s hard drive to its knees. If I’m looking at uploading objects here it’s usually with a question of “When will I deal with this?”, the answer to which is usually one of the following:

  • The next few weeks, in which case I’ll keep the object in OneDrive instead (What a TWIST)
  • Next month, in which case I’d put the object in S3 Standard Infrequent Access because of its 30-day minimum storage duration
  • “I don’t know”, in which case I’d put the object in S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive because their storage costs are less than S3 Glacier Instant Retrieval

As a side note, most of the objects in my InMotion bucket are S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive already, so it looks like my estimates from the start of the year were half decent!

Thanks for reading! ~~^~~