Categories
Data & Analytics

Connecting Athena To Power BI With Simba Athena

In this post, I use Simba Athena to create a secure connection between my iTunes data in Amazon Athena and Microsoft Power BI.

Table of Contents

Introduction

In my recent posts, I’ve been transforming an iTunes Export CSV file using Python and AWS.

Firstly, in July I built a Python ETL that extracts data from my iTunes CSV into a Pandas DataFrame and transforms some columns.

Next, I updated my ETL script at the start of August. It now uploads the changed data to S3 as a Parquet file. Then I made my data available in an Athena table so I could use some of Athena’s benefits:

  • My data now has high availability at low cost.
  • My data can be queried faster from Athena than from the CSV.
  • I can limit what data is accessed, as opposed to all-or-nothing.

Now I want to start analysing my data. There are many business intelligence (BI) tools available to help me with this. I will be using the latest version of Power BI on my Windows 10 laptop.

But wait. If Power BI is on my laptop and my data is in Athena, how can Power BI access my data? Do I need to make my AWS resources publically accessible? Do I need to download the data to my laptop?

Fortunately not! Welcome to the world of data connectors. Meet Simba Athena.

Simba Athena

In this section, I will look at how Simba Athena bridges the gap between my locally-installed BI tool and my data in AWS.

What Is Simba Athena?

Simba Athena is an Open Database Connectivity (ODBC) driver built for Athena. The history of Simba dates back to 1992 when Simba Technologies co-developed the first standards-based ODBC driver with Microsoft. Magnitude acquired Simba Technologies in 2016.

Simba offers numerous data connectors that all work in roughly the same way:

Relating this diagram to Athena and Power BI:

  • The user sends a query to Power BI.
  • Power BI passes the query to Simba Athena via the ODBC Device Manager.
  • Simba Athena queries Athena and gets the results.
  • Simba Athena passes the results to Power BI via the ODBC Device Manager.
  • Power BI shows the results to the user.

Features Of Simba Athena

Simba Athena has several features that make it a great partner for Athena:

  • Simba Athena works with Windows, macOS and Linux. Just as Athena supports multiple operating systems, Simba Athena is also OS-agnostic.
  • Numerous applications support Simba Athena including Excel, Tableau and Power BI.

Speaking of Power BI…

Microsoft Power BI

In this section, I will examine Power BI and explain why I chose to use it.

What Is Power BI?

Microsoft Power BI is a data visualization solution with a primary focus on BI. At the time of writing, Power BI’s main components are:

  • Power BI Desktop: a free locally-installed application designed for connecting to, transforming, and visualizing data.
  • Power BI Service: a cloud-based SaaS supporting the deployment and sharing of dashboards.
  • Power BI Mobile: a mobile app platform for Windows, iOS, and Android devices.

So what makes Power BI a good choice here?

Choosing Power BI

My decision to use Power BI came down to three factors:

  • Prior Experience. I’ve used Power BI many times over the years, and have become very familiar with it. This will let me deliver results quickly.
  • Support: Both Microsoft and AWS have rich documentation for Simba Athena. This gives me confidence in setting it up and reduces the chance of any blockers.
2022 Gartner Magic Quadrant for Analytics and Business Intelligence Platforms

So now I’ve talked about Simba Athena and Power BI, let’s get them working together.

Setting Up Simba Athena

In this section, I will install and configure Simba Athena on my laptop. I will then attempt to extract data from Athena using Power BI.

The remainder of this post will focus on the Windows version of Simba Athena. AWS offers download links for Windows, Linux and macOS, and provides installation instructions in the Simba Athena Documentation.

Downloading Simba Athena

The first step is to download the Simba Athena ODBC driver provided by AWS. The options vary depending on platform and bitness.

The installation process mainly focuses on the end-user license agreement and destination folder selection. Once Simba Athena is installed, it can be configured.

Configuring Simba Athena

Simba Athena’s configuration settings are available via the Windows ODBC Data Source Administrator. This can be found in the Start Bar’s Windows Administrative Tools folder, or by running a Windows search for ODBC.

Accessing this and selecting the System DSN tab shows Simba Athena as a System Data Source:

2022-08-28-SimbaSystemDSN

From here, selecting Configure shows a setup screen with a few familiar fields:

Of these, Catalog, Schema and Workgroup are pre-populated with Athena defaults and Metadata Retrieval Method is set to Auto.

That leaves the Data Source Name and Description to identify the data source, and the AWS Region containing the Athena data.

In Output Options, I can state my S3 Output Location and Encryption Options. The output location is Athena’s Query Result Location, and the encryption options should mirror the S3 bucket’s encryption settings.

If the S3 Output Location is left blank, this will cause an error when Power BI tries to connect to Athena:

Details: "ODBC: ERROR [HY000] [Simba][Athena] (1040) An error has been thrown from the AWS Athena client. Athena Error No: 130, HTTP Response Code: 400, Exception Name: InvalidRequestException, Error Message: outputLocation is null or empty 

Simba Athena’s remaining settings are out of scope for this post, although there’s one I definitely need to mention – Authentication Options:

This is how Simba Athena authenticates its requests to AWS. As mentioned earlier, there are several options here. Depending on the authentication type selected, Simba Athena can store Access Keys, Session Tokens, TenantIDs and any other required credentials.

That’s all the Simba Athena configuration I’m going to do here. For full details on all of Simba Athena’s features, please refer to the Simba Athena Documentation.

Now let’s use Simba Athena to get Athena and Power BI talking to each other!

Using Simba Athena

The Athena documentation has a great section about using the Athena Power BI connector. After launching Power BI and selecting Amazon Athena as a data source, Power BI will need to know which DSN to use.

This is the Simba Athena DSN in the System DSN tab:

The Navigator screen then shows my Athena data catalog, my blog_amazonwebshark database, and my basic_itunes_python_etl table with a sample of the data it contains:

That’s everything! My basic_itunes_python_etl Athena table is now available in Power BI.

Summary

In this post, I used Simba Athena to create a secure connection between my iTunes data in Amazon Athena and Microsoft Power BI.

This post was originally part of a larger post that is still being written. But after I’d finished my Simba Athena section it made sense to have a separate post for it!

Finally, in other news, this post’s featured image is a DALL·E 2 creation. This was by far the best image it gave me for pixel art baby lion and shark – I’m sure it’ll improve soon!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Data & Analytics

Ingesting iTunes Data Into AWS With Python And Athena

In this post, I will update my existing iTunes Python ETL to return a Parquet file, which I will then upload to S3 and view using Athena.

Table of Contents

Introduction

In my last post, I made an ETL that exported data from a CSV into a Pandas DataFrame using AWS Data Wrangler. That post ended with the transformed data being saved locally as a new CSV.

It’s time to do something with that data! I want to analyse my iTunes data and look for trends and insights into my listening habits. I also want to access these insights in the cloud, as my laptop is a bit bulky and quite slow. Finally, I’d prefer to keep my costs to a minimum.

Here, I’ll show how AWS and Python can be used together to meet these requirements. Let’s start with AWS.

Amazon S3

In this section, I will update my S3 setup. I’ll create some new buckets and explain my approach.

New S3 Buckets

Currently, I have a single S3 bucket containing my iTunes Export CSV. Moving forward, this bucket will contain all of my unmodified source objects, otherwise known as raw data.

To partner the raw objects bucket, I now have an ingested objects bucket. This bucket will contain objects where the data has been transformed in some way. My analytics tools and Athena tables will point here for their data.

Speaking of Athena, the other new bucket will be used for Athena’s query results. Although Athena is serverless, it still needs a place to record queries and store results. Creating this bucket now will save time later on.

Having separate buckets for each of these functions isn’t a requirement, although it is something I prefer to do. Before moving on, I’d like to run through some of the benefits I find with this approach.

Advantages Of Multiple Buckets

Firstly, having buckets with clearly defined purposes makes navigation way easier. I always know where to find objects, and rarely lose track of or misplace them.

Secondly, having multiple buckets usually makes my S3 paths shorter. This doesn’t sound like much of a benefit upfront, but the S3 path textboxes in the AWS console are quite small, and using long S3 paths in the command line can be a pain.

Finally, I find security and access controls are far simpler to implement with a multi-bucket setup. Personally I prefer “You can’t come into this house/bucket” over “You can come into this house/bucket, but you can’t go into this room/prefix”. However, both S3 buckets and S3 prefixes can be used as IAM policy resources so there’s technically no difference.

That concludes the S3 section. Next, let’s set up Athena.

Amazon Athena

In this section, I’ll get Athena ready for use. I’ll show the process I followed and explain my key decisions. Let’s start with my reasons for choosing Athena.

Why Athena?

Plenty has been written about Athena’s benefits over the years. So instead of retreading old ground, I’ll discuss what makes Athena a good choice for this particular use case.

Firstly, Athena is cheap. The serverless nature of Athena means I only pay for what I query, scan and store, and I’ve yet to see a charge for Athena in the three years I’ve been an AWS customer.

Secondly, like S3, Athena’s security is managed by IAM. I can use IAM policies to control who and what can access my Athena data, and can monitor that access in CloudTrail. This also means I can manage access to Athena independently of S3.

Finally, Athena is highly available. Authorised calls to the service have a 99.9% Monthly Uptime Percentage SLA and Athena benefits from S3’s availability and durability. This allows 24/7 access to Athena data for users and applications.

Setting Up Athena

To start this section, I recommend reading the AWS Athena Getting Started documentation for a great Athena introduction. I’ll cover some basics here, but I can’t improve on the AWS documentation.

Athena needs three things to get off the ground:

  • An S3 path for Athena query results.
  • A database for Athena tables.
  • A table for interacting with S3 data objects.

I’ve already talked about the S3 path, so let’s move on to the database. A database in Athena is a logical grouping for the tables created in it. Here, I create a blog_amazonwebshark database using the following script:

CREATE DATABASE blog_amazonwebshark

Next, I enter the column names from my iTunes Export CSV into Athena’s Create Table form, along with appropriate data types for each column. In response, the form creates this Athena table:

The form adds several table properties to the table’s DDL. These, along with the data types, are expanded on in the Athena Create Table documentation.

Please note that I have removed the S3 path from the LOCATION property to protect my data. The actual Athena table is pointing at an S3 prefix in my ingested objects bucket that will receive my transformed iTunes data.

Speaking of data, the form offers several choices of source data format including CSV, JSON and Parquet. I chose Parquet, but why do this when I’m already getting a CSV? Why create extra work?

Let me explain.

About Parquet

Apache Parquet is a file format that supports fast processing for complex data. It can essentially be seen as the next generation of CSV. Both formats have their place, but at scale CSV files have large file sizes and slow performance.

In contrast, Parquet files have built-in compression and indexing for rapid data location and retrieval. In addition, the data in Parquet files is organized by column, resulting in smaller sizes and faster queries.

This also results in Athena cost savings as Athena only needs to read the columns relevant to the queries being run. If the same data was in a CSV, Athena would have to read the entire CSV whether the data is needed or not.

For further reading, Databricks have a great Parquet section in their glossary.

That’s everything for Athena. Now I need to update my Python scripts.

Python

In this section, I’ll make changes to my Basic iTunes ETL to include my new S3 and Athena resources and to replace the CSV output with a Parquet file. Let’s start with some variables.

New Python Variables

My first update is a change to ETL_ITU_Play_Variables.py, which contains my global variables. Originally there were two S3 global variables – S3_BUCKET containing the bucket name and S3_PREFIX containing the S3 prefix path leading to the raw data:

S3_BUCKET
S3_PREFIX

Now I have two buckets and two prefixes, so it makes sense to update the variable names. I now have two additional global variables, adding _RAW to the originals and _INGESTED to the new ones for clarity:

S3_BUCKET_RAW
S3_PREFIX_RAW

S3_BUCKET_INGESTED
S3_PREFIX_INGESTED

Changing CSV To Parquet

The next change is to ETL_ITU_Play.py. The initial version converts a Pandas DataFrame to CSV using pandas.DataFrame.to_csv. I’m now replacing this with awswrangler.s3.to_parquet, which needs three parameters:

Put together, it looks like this:

wr.s3.to_parquet(
    df = df,
    boto3_session = session,
    path = s3_path_ingested

Before committing my changes, I took the time to put the main workings of my ETL in a class. This provides a clean structure for my Python script and will make it easier to reuse in future projects.

That completes the changes. Let’s review what has been created.

Architecture

Here is an architectural diagram of how everything fits together:

Here is a breakdown of the processes involved:

  1. User runs the Python ETL script locally.
  2. Python reads the CSV object in datalake-raw S3 bucket.
  3. Python extracts data from CSV into a DataFrame and transforms several columns.
  4. Python writes the DataFrame to datalake-ingested S3 bucket as a Parquet file.
  5. Python notifies User of a successful run.
  6. User sends query to Athena.
  7. Athena reads data from datalake-ingested S3 bucket.
  8. Athena returns query results to User.

Testing

In this section, I will test my resources to make sure they work as expected. Bare in mind that this setup hasn’t been designed with production use in mind, so my testing is somewhat limited and would be insufficient for production deployment.

Testing Python

TEST: Upload a CSV to the datalake-raw S3 bucket, then run the Python script. The Python script must run successfully and print updates in the terminal throughout.

RESULT: I upload an iTunes Export CSV to the datalake-raw S3 bucket:

The Python script runs, printing the following output in the terminal:

Creating DataFrame.
DataFrame columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')
Deleting unnecessary DataFrame columns.
Renaming DataFrame columns.
Reformatting DateTime DataFrame columns.
Creating Date Columns From DateTime Columns.
Creating MyRatingDigit Column.
Replacing blank values to prevent IntCastingNaN errors.
Setting Data Types.
Creating Parquet file from DataFrame.
Processes complete.

Testing S3

TEST: After the Python script successfully runs, the datalake-ingested S3 bucket must contain an itunesdata.parquet object.

RESULT: Upon accessing the datalake-ingested S3 bucket, an itunesdata.parquet object is found:

(On an unrelated note, look at the size difference between the Parquet and CSV files!)

Testing Athena

TEST: When the datalake-ingested S3 bucket contains an itunesdata.parquet object, data from the iTunes Export CSV must be shown when the following Athena query is run:

SELECT * FROM basic_itunes_python_etl;

RESULT: Most of the Athena results match the iTunes Export data. However, the transformed dates did not match expectations:

This appears to be a formatting problem, as some parts of a date format are still visible.

To diagnose the problem I wanted to see how these columns were being stored in the Parquet file. I used mukunku’s ParquetViewer for this, which is described in the GitHub repo as:

…a quick and dirty utility that I created to easily view Apache Parquet files on Windows desktop machines.

It works very well!

Here is a screenshot of the data. The lastplayed column has dates and times, while the datamodifieddate column has dates only:

The cause of the problem becomes apparent when the date columns are viewed using the ISO 8601 format:

The date columns are all using timestamps, even when no times are included!

A potential fix would be to change the section of my Python ETL script that handles data types. Instead, I update the data types used in my Athena table from date:

  `datemodifieddate` date, 
  `dateaddeddate` date, 
  `lastplayeddate` date, 

To timestamp:

  `datemodifieddate` timestamp, 
  `dateaddeddate` timestamp, 
  `lastplayeddate` timestamp, 

This time, when I view my Athena table the values all appear as expected:

Scripts

My ETL_ITU_Play.py file commit from 2022-08-08 can be viewed here:

ETL_ITU_Play.py on GitHub

My updated repo readme can be viewed here:

README.md on GitHub

Summary

In this post, I updated my existing iTunes Python ETL to return a Parquet file, which I then uploaded S3 and viewed using Athena. I explained my reasoning for choosing S3, Athena and the Parquet file format, and I handled a data formatting issue.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Data & Analytics

Using Athena To Query S3 Inventory Parquet Objects

In this post I’ll be using Amazon Athena to query data created by the S3 Inventory service.

When I wrote about my first impressions of S3 Glacier Instant Retrieval last month, I noticed some of my S3 Inventory graphs showed figures I didn’t expect. I couldn’t remember many of the objects in the InMotion bucket, and didn’t know that some were in Standard! I went through the bucket manually and found the Standard objects, but still had other questions that I wasn’t keen on solving by hand.

So while I was on-call over Christmas I decided to take a closer look at Athena – the AWS serverless query service designed to analyse data in S3. I’ve used existing setups at work but this was my first time experiencing it from scratch, and I made use of the AWS documentation about querying Amazon S3 Inventory with Amazon Athena and the Andy Grimes blog “Manage and analyze your data at scale using Amazon S3 Inventory and Amazon Athena” to fill in the blanks.

We’ve Got a File On You

First I created an empty s3inventory Athena database. Then I created a s3inventorytable table using the script below, specifying the 2022-01-01 symlink.txt Hive object created by S3 Inventory as the data source:

CREATE EXTERNAL TABLE s3inventorytable(
         bucket string,
         key string,
         version_id string,
         is_latest boolean,
         is_delete_marker boolean,
         size bigint,
         last_modified_date bigint,
         e_tag string,
         storage_class string,
         is_multipart_uploaded boolean,
         replication_status string,
         encryption_status string,
         object_lock_retain_until_date bigint,
         object_lock_mode string,
         object_lock_legal_hold_status string,
         intelligent_tiering_access_tier string,
         bucket_key_status string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://[REDACTED]/hive/dt=2022-01-01-01-00/';

Then I ran a query to determine the storage classes in use in the InMotion bucket and the number of objects assigned to each:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results were as follows:

SELECT storage_class, count(*) 
FROM "s3inventory"."s3inventorytable"

41 Standard objects?! I wasn’t sure what they were and so added object size into the query:

SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class
SELECT storage_class, count(*), sum(size)
FROM "s3inventory"."s3inventorytable"

The zero size and subsequent investigations confirmed that the Standard objects were prefixes, and so presented no problems.

Next, I wanted to check for unwanted previous versions of objects using the following query:

SELECT key, size 
FROM "s3inventory"."s3inventorytable" 
WHERE is_latest = FALSE

This query returned another prefix, so again there were no actions needed:

SELECT key, size 
FROM "s3inventory"."s3inventorytable"

Further investigation found that this prefix also has no storage class assigned to it, as seen in the results above.

For Old Time’s Sake

I then wanted to see the youngest and oldest objects for each storage class, and ran the following query:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

What I got back was unexpected:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"

S3 Inventory stores dates as Unix Epoch Time, so I needed a function to transform the data to a human-legible format. Traditionally this would involve CAST or CONVERT, but as Athena uses Presto additional functions are available such as from_unixtime:

from_unixtime(unixtime) → timestamp

Returns the UNIX timestamp unixtime as a timestamp.

I updated the query to include this function:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date)),
MAX(from_unixtime(last_modified_date))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

This time the dates were human-legible but completely inaccurate:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
human

I then found a solution in Stack Overflow, where a user suggested converting a Unix Epoch Time value from microseconds to milliseconds. I applied this suggestion to my query by dividing the last modified dates by 1000:

SELECT storage_class, 
MIN(from_unixtime(last_modified_date/1000)),
MAX(from_unixtime(last_modified_date/1000))
FROM "s3inventory"."s3inventorytable"
GROUP BY storage_class
ORDER BY storage_class

The results after this looked far more reasonable:

SELECT storage_class, 
MIN(last_modified_date), 
MAX(last_modified_date) 
FROM "s3inventory"."s3inventorytable"
FINAL

And EpochConverter confirmed the human time was correct for the Deep Archive MIN(last_modified_date) Unix value of 1620147401000:

So there we go! An introduction to Athena and utilization of the data from S3 Inventory!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~