Categories
Architecture & Resilience

Can SQL Upgrades Be Avoided In The Cloud?

In this post I consider the February 2022 T-SQL Tuesday #147 Invitation “Upgrade Strategies” and look at the importance of upgrades in the cloud.

For this month’s T-SQL Tuesday, VoiceOfTheDBA’s invitation was as follows:

This month I want you to write about how you look at SQL Server upgrades. A few things you might think about:

Why we wait to upgrade?

Strategies for testing an upgrade

Smoke tests or other ways to verify the upgrade worked

Moving to the cloud to avoid upgrades

Using compatibility levels to upgrade an instance by not a database.

Checklists of things to use in planning

The time it takes to upgrade your environment

What you evaluate in making a decision to upgrade or not?

Anything else

Immediately I was drawn to “Moving to the cloud to avoid upgrades”. Some perceive the cloud as a ‘set it and forget it’ environment. The reality is that cloud services still require upgrades that can cause security vulnerabilities and data issues if left unchecked.

What follows are some SQL based observations from my experience to date. While it’s true this list is AWS specific, it isn’t AWS exclusive as Azure and GCP operate similar services with similar considerations.

EC2 Upgrades

When I create a new EC2 instance I can generally expect it to be running the latest build of my chosen OS. However, an instance that has been running for a while will soon find itself needing system updates like any other computer. Some updates offer performance improvements or new features and are essentially optional. Others fix security vulnerabilities and bugs and are non-negotiable.

If that instance is running my relational database of choice, that too will need a range of updates from the desirable to the critical. AWS views this as a customer responsibility, with the AWS Shared Responsibility Model including the following:

Customers that deploy an Amazon EC2 instance are responsible for management of the guest operating system (including updates and security patches), any application software or utilities installed by the customer on the instances, and the configuration of the AWS-provided firewall (called a security group) on each instance.

However the managed services are viewed differently:

For abstracted services, AWS operates the infrastructure layer, the operating system, and platforms, and customers access the endpoints to store and retrieve data. Customers are responsible for managing their data (including encryption options), classifying their assets, and using IAM tools to apply the appropriate permissions.

So what if I get AWS to run my database for me?

RDS Upgrades

Amazon Relational Database Service (RDS) offers managed relational databases including Microsoft SQL Server, MySQL and PostgreSQL. RDS still uses EC2 instances but here they are managed by AWS and are not accessible by the user. This means OS management is no longer a customer responsibility.

Upgrades to the database engine are still a factor though. AWS try to make this as painless as possible – upgrades can be done using the console, AWS CLI or the RDS API. This is still a manual process though, although it is possible to upgrade some minor engine versions automatically.

However, even on rails it’s still possible for an update to go wrong. AWS have a nine-point checklist for testing an upgrade that wouldn’t be out of place on-premises. AWS also encourage database snapshots and non-production testing. While RDS removes infrastructure complexity, the data is still the customer’s responsibility and needs the same care as ever.

Operational Upgrades

AWS constantly release new services intended to simplify workflows and reduce costs. Even when an organisation’s cloud setup is fully mature, it can still benefit from upgrading to these services.

When Athena debuted in 2016 it enabled the analysis of data in S3 using standard SQL. This removed the need for complex ETLs and data warehouses, and with Athena being serverless it was faster to set up and cheaper to operate than EC2, RDS or Redshift.

In 2020 Amazon announced new EBS GP3 volumes. GP3s have separate settings for performance and storage, and are recommended for applications like MySQL that need high performance at low costs. This meant organisations could save money by reducing their use of the more expensive IO1 volumes.

More recently, AWS announced a new S3 Glacier Instant Retrieval storage class in 2021. This made S3 less expensive for a range of use cases including SQL backup storage and data lake archival.

Conclusion

The Cloud offers numerous opportunities for individuals and organisations to develop, build and deploy quicker and easier. But upgrades are a fact of life in technology regardless of platform. The cloud is still a collection of computers, which still need to respond to changing requirements and threats.

A well maintained and fully upgraded cloud environment is reliable, scalable and secure. A poorly maintained one can, at best, be expensive, slow and unwieldy. At worst it can be unreliable, vulnerable and in breach of terms of service.

If you want to check the health of your AWS account, AWS offers their Trusted Advisor and Well-Architected Tool services. These give free architectural advice, security recommendations, cost optimisations and best practice guidance.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Developing & Application Integration

Re-Runnable Strava API Calls Using Python

In this post I make my existing Python code re-runnable by enabling it to replace expired access tokens when it sends requests to the Strava API.

A couple of posts ago I wrote about authenticating Strava API calls. I ended up successfully requesting data using this Python code:

import requests

activities_url = "https://www.strava.com/api/v3/athlete/activities" 

header = {'Authorization': 'Bearer ' + "access_token"}
param = {'per_page': 200, 'page': 1}

my_dataset = requests.get(activities_url, headers=header, params=param).json()

print(my_dataset)

Although successful, the uses for this code are limited as it stops working when the header’s access_token expires. Ideally the code should be able to function constantly once Strava grants initial authorisation, which is what I’m exploring here. Plus the last post was unclear in places so this one will hopefully tie up some loose ends.

Please note that I have altered or removed all sensitive codes and tokens in this post in the interests of security.

The Story So Far

First of all, some reminders. Strava uses OAuth 2.0 for authentication, and this is a typical OAuth 2.0 workflow:

OAuthAnOverview

I am sending GET requests to Strava via Get Activity. Strava’s documentation for this is as follows:

StravaDevelopers

Finally, during my initial setup I created an API Application on the Strava site. Strava provided these details upon completion:

StravaAPICredentials

Authorizing My App To View Data

Strava’s Getting Started page explains that they require authentication via OAuth 2.0 for data requests and gives the following link for that process:

http://www.strava.com/oauth/authorize?client_id=[REPLACE_WITH_YOUR_CLIENT_ID]&response_type=code&redirect_uri=http://localhost/exchange_token&approval_prompt=force&scope=read

I must amend this URL as scope=read is insufficient for Get Activity requests. The end of the URL becomes scope=activity:read_all and the updated URL loads a Strava authorization screen:

StravaAuthorization

Selecting Authorize gives the following response:

http://localhost/exchange_token?state=&code=CODE9fbb&scope=read,activity:read_all

Where code=CODE9fbb is a single-use authorization code that I will use to create access tokens.

Getting Tokens For API Requests

Next I will use CODE9fbb to request access tokens which Get Activity will accept. This is done via the following cURL request:

curl -X POST https://www.strava.com/api/v3/oauth/token \
  -d client_id=APIAPP-CLIENTID \
  -d client_secret=APIAPP-SECRET \
  -d code=CODE9fbb \
  -d grant_type=authorization_code

Here, Client_ID and Client_Secret are from my API application, Code is the authorization code CODE9fbb and Grant_Type is what I’m asking for – Strava’s Authentication documentation states this must always be authorization_code for initial authentication.

Strava then responds to my cURL request with a refresh token, access token, and access token expiration date in Unix Epoch format:

"token_type": "Bearer",
  "expires_at": 1642370007,
  "expires_in": 21600,
  "refresh_token": "REFRESHc8c4",
  "access_token": "ACCESS22e5",

Why two tokens? Access tokens expire six hours after they are created and must be refreshed to maintain access to the desired data. Strava uses the refresh tokens as part of the access token refresh process.

Writing The API Request Code

With the tokens now available I can start assembling the Python code for my Strava API requests. I will again be using Visual Studio Code here. I make a new Python virtual environment called StravaAPI by running py -3 -m venv StravaAPI, activate it using StravaAPI\Scripts\activate and run pip install requests to install the module I need. Finally I create an empty StravaAPI.py file in the StravaAPI virtual environment folder for the Python code.

Onto the code. The first part imports the requests module, declares some variables and sets up a request to refresh an expired access code as detailed in the Strava Authentication documentation:

# Import modules
import requests


# Set Variables
apiapp_clientid = "APIAPP-CLIENTID"
apiapp_secret = 'APIAPP-SECRET'
token_refresh = 'REFRESHc8c4'

# Requesting Access Token
url_oauth = "https://www.strava.com/oauth/token"
payload_oauth = {
	'client_id': apiapp_clientid,
	'client_secret': apiapp_secret,
	'refresh_token': token_refresh,
	'grant_type': "refresh_token",
	'f': 'json'
}

Note this time that the Grant_Type is refresh_token instead of authorization_code. These variables can then be used by the requests module to send a request to Strava’s API:

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)
print(req_access_token.json())

This request is successful and returns existing tokens ACCESS22e5 and REFRESHc8c4 as they have not yet expired:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESS22e5', 'expires_at': 1642370008, 'expires_in': 20208, 'refresh_token': 'REFRESHc8c4'}

A warning is also presented here as my request is not secure:

InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.strava.com'. Adding certificate verification is strongly advised.

The warning includes a link to urllib3 documentation, which states:

Making unverified HTTPS requests is strongly discouraged, however, if you understand the risks and wish to disable these warnings, you can use disable_warnings()

As this code is currently in development, I import the urllib3 module and disable the warnings:

# Import modules
import requests
import urllib3

# Disable Insecure Request Warnings
urllib3.disable_warnings()

Next I extract the access token from Strava’s response into a new token_access variable and print that in the terminal as a process indicator:

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)

token_access = req_access_token.json()['access_token']
print("Access Token = {}\n".format(token_access))

So far the terminal’s output is:

Requesting Token...

Access Token = ACCESS22e5

Let’s get some data! I’m making a call to Get Activities now, so I declare three variables to compose the request and include the token_access variable from earlier :

# Requesting Athlete Activities
url_api_activities = "https://www.strava.com/api/v3/athlete/activities"
header_activities = {'Authorization': 'Bearer ' + token_access}
param_activities = {'per_page': 200, 'page' : 1}

Then I use the requests module to send the request to Strava’s API:

print("Requesting Athlete Activities...\n")
dataset_activities = requests.get(url_api_activities, headers=header_activities, params=param_activities).json()
print(dataset_activities)

And receive data about several recent activities as JSON in return. Success! The full Python code is as follows:

# Import modules
import requests
import urllib3

# Disable Insecure Request Warnings
urllib3.disable_warnings()

# Set Variables
apiapp_clientid = "APIAPP-CLIENTID"
apiapp_secret = 'APIAPP-SECRET'
token_refresh = 'REFRESHc8c4'

# Requesting Access Token
url_oauth = "https://www.strava.com/oauth/token"
payload_oauth = {
	'client_id': apiapp_clientid,
	'client_secret': apiapp_secret,
	'refresh_token': token_refresh,
	'grant_type': "refresh_token",
	'f': 'json'
}

print("Requesting Token...\n")
req_access_token = requests.post(url_oauth, data=payload_oauth, verify=False)

token_access = req_access_token.json()['access_token']
print("Access Token = {}\n".format(token_access))

# Requesting Athlete Activities
url_api_activities = "https://www.strava.com/api/v3/athlete/activities"
header_activities = {'Authorization': 'Bearer ' + token_access}
param_activities = {'per_page': 200, 'page' : 1}
print("Requesting Athlete Activities...\n")
dataset_activities = requests.get(url_api_activities, headers=header_activities, params=param_activities).json()
print(dataset_activities)

But Does It Work?

This only leaves the question of whether the code works when the access code expires. As a reminder this was Strava’s original response:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESS22e5', 'expires_at': 1642370008, 'expires_in': 20208, 'refresh_token': 'REFRESHc8c4'}

Expiry 1642370008 is Sunday, 16 January 2022 21:53:28. I run the code at 22:05 and:

Requesting Token...

{'token_type': 'Bearer', 'access_token': 'ACCESSe0e7', 'expires_at': 1642392321, 'expires_in': 21559, 'refresh_token': 'REFRESHc8c4'}

A new access token! The new expiry 1642392321 is Monday, 17 January 2022 04:05:21. And when I run the code at 09:39:

{'token_type': 'Bearer', 'access_token': 'ACCESS74dd', 'expires_at': 1642433966, 'expires_in': 21600, 'refresh_token': 'REFRESHc8c4'}

A second new access code. All working fine! As long as my refresh token remains valid I can continue to get valid access tokens when they expire.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Data & Analytics

Using Athena To Query S3 Inventory Parquet Objects

In this post I’ll be using Amazon Athena to query data created by the S3 Inventory service.

When I wrote about my first impressions of S3 Glacier Instant Retrieval last month, I noticed some of my S3 Inventory graphs showed figures I didn’t expect. I couldn’t remember many of the objects in the InMotion bucket, and didn’t know that some were in Standard! I went through the bucket manually and found the Standard objects, but still had other questions that I wasn’t keen on solving by hand.

So while I was on-call over Christmas I decided to take a closer look at Athena – the AWS serverless query service designed to analyse data in S3. I’ve used existing setups at work but this was my first time experiencing it from scratch, and I made use of the AWS documentation about querying Amazon S3 Inventory with Amazon Athena and the Andy Grimes blog “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena” to fill in the blanks.

We’ve Got a File On You

First I created an empty s3inventory Athena database. Then I created a s3inventorytable table using the script below, specifying the 2022-01-01 symlink.txt Hive object created by S3 Inventory as the data source:

CREATE EXTERNAL TABLE s3inventorytable(
         bucket string,
         key string,
         version_id string,
         is_latest boolean,
         is_delete_marker boolean,
         size bigint,
         last_modified_date bigint,
         e_tag string,
         storage_class string,
         is_multipart_uploaded boolean,
         replication_status string,
         encryption_status string,
         object_lock_retain_until_date bigint,
         object_lock_mode string,
         object_lock_legal_hold_status string,
         intelligent_tiering_access_tier string,
         bucket_key_status string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://[REDACTED]/hive/dt=2022-01-01-01-00/';

Then I ran a query to determine the storage classes in use in the InMotion bucket and the number of objects assigned to each:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results were as follows:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"

41 Standard objects?! I wasn’t sure what they were and so added object size into the query:

SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class
SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"

The zero size and subsequent investigations confirmed that the Standard objects were prefixes, and so presented no problems.

Next, I wanted to check for unwanted previous versions of objects using the following query:

SELECT key, size 
FROM "s3inventory"."s3inventorytable" 
WHERE is_latest = FALSE

This query returned another prefix, so again there were no actions needed:

SELECT key, size 
FROM "s3inventory"."s3inventorytable"

Further investigation found that this prefix also has no storage class assigned to it, as seen in the results above.

For Old Time’s Sake

I then wanted to see the youngest and oldest objects for each storage class, and ran the following query:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

What I got back was unexpected:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"

S3 Inventory stores dates as Unix Epoch Time, so I needed a function to transform the data to a human-legible format. Traditionally this would involve CAST or CONVERT, but as Athena uses Presto additional functions are available such as from_unixtime:

from_unixtime(unixtime) → timestamp

Returns the UNIX timestamp unixtime as a timestamp.

I updated the query to include this function:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date)),
MAX(from_unixtime(last_modified_date))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

This time the dates were human-legible but completely inaccurate:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
human

I then found a solution in Stack Overflow, where a user suggested converting a Unix Epoch Time value from microseconds to milliseconds. I applied this suggestion to my query by dividing the last modified dates by 1000:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date/1000)),
MAX(from_unixtime(last_modified_date/1000))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results after this looked far more reasonable:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
FINAL

And EpochConverter confirmed the human time was correct for the Deep Archive MIN(last_modified_date) Unix value of 1620147401000:

So there we go! An introduction to Athena and utilization of the data from S3 Inventory!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~