Categories
Data & Analytics

Ingesting iTunes Data Into AWS With Python And Athena

In this post, I will update my existing iTunes Python ETL to return a Parquet file, which I will then upload to S3 and view using Athena.

Introduction

In my last post, I made an ETL that extracted data from a CSV into a Pandas DataFrame using AWS Data Wrangler. That post ended with the transformed data being saved locally as a new CSV.

It’s time to do something with that data! I want to analyse my iTunes data and look for trends and insights into my listening habits. I also want to access these insights in the cloud, as my laptop is a bit bulky and quite slow. Finally, I’d prefer to keep my costs to a minimum.

Here, I’ll show how AWS and Python can be used together to meet these requirements. Let’s start with AWS.

Amazon S3

In this section, I will update my S3 setup. I’ll create some new buckets and explain my approach.

New S3 Buckets

Currently, I have a single S3 bucket containing my iTunes Export CSV. Moving forward, this bucket will contain all of my unmodified source objects, otherwise known as raw data.

To partner the raw objects bucket, I now have an ingested objects bucket. This bucket will contain objects where the data has been transformed in some way. My analytics tools and Athena tables will point here for their data.

Speaking of Athena, the other new bucket will be used for Athena’s query results. Although Athena is serverless, it still needs a place to record queries and store results. Creating this bucket now will save time later on.
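If you're following along, the buckets can be created with a few lines of boto3. A minimal sketch, using placeholder bucket names and the EU Ireland region:

import boto3

s3 = boto3.client("s3")

# Placeholder bucket names: S3 bucket names are globally unique, so use your own.
for bucket in ["my-datalake-raw", "my-datalake-ingested", "my-athena-query-results"]:
    s3.create_bucket(
        Bucket=bucket,
        CreateBucketConfiguration={"LocationConstraint": "eu-west-1"}
    )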

Having separate buckets for each of these functions isn’t a requirement, although it is something I prefer to do. Before moving on, I’d like to run through some of the benefits I find with this approach.

Advantages Of Multiple Buckets

Firstly, having buckets with clearly defined purposes makes navigation way easier. I always know where to find objects, and rarely lose track of or misplace them.

Secondly, having multiple buckets usually makes my S3 paths shorter. This doesn’t sound like much of a benefit upfront, but the S3 path textboxes in the AWS console are quite small, and using long S3 paths in the command line can be a pain.

Finally, I find security and access controls are far simpler to implement with a multi-bucket setup. Personally, I prefer “You can’t come into this house/bucket” over “You can come into this house/bucket, but you can’t go into this room/prefix”. However, both S3 buckets and S3 prefixes can be used as IAM policy resources, so this is a matter of preference rather than capability.

That concludes the S3 section. Next, let’s set up Athena.

Amazon Athena

In this section, I’ll get Athena ready for use. I’ll show the process I followed and explain my key decisions. Let’s start with my reasons for choosing Athena.

Why Athena?

Plenty has been written about Athena’s benefits over the years. So instead of retreading old ground, I’ll discuss what makes Athena a good choice for this particular use case.

Firstly, Athena is cheap. The serverless nature of Athena means I only pay for what I query, scan and store, and I’ve yet to see a charge for Athena in the three years I’ve been an AWS customer.

Secondly, like S3, Athena’s security is managed by IAM. I can use IAM policies to control who and what can access my Athena data, and can monitor that access in CloudTrail. This also means I can manage access to Athena independently of S3.

Finally, Athena is highly available. Authorised calls to the service have a 99.9% Monthly Uptime Percentage SLA and Athena benefits from S3’s availability and durability. This allows 24/7 access to Athena data for users and applications.

Setting Up Athena

To start this section, I recommend reading the AWS Athena Getting Started documentation for a great Athena introduction. I’ll cover some basics here, but I can’t improve on the AWS documentation.

Athena needs three things to get off the ground:

  • An S3 path for Athena query results.
  • A database for Athena tables.
  • A table for interacting with S3 data objects.

I’ve already talked about the S3 path, so let’s move on to the database. A database in Athena is a logical grouping for the tables created in it. Here, I create a blog_amazonwebshark database using the following script:

CREATE DATABASE blog_amazonwebshark

Next, I enter the column names from my iTunes Export CSV into Athena’s Create Table form, along with appropriate data types for each column. In response, the form creates this Athena table:

The form adds several table properties to the table’s DDL. These, along with the data types, are expanded on in the Athena Create Table documentation.

Please note that I have removed the S3 path from the LOCATION property to protect my data. The actual Athena table is pointing at an S3 prefix in my ingested objects bucket that will receive my transformed iTunes data.

Speaking of data, the form offers several choices of source data format including CSV, JSON and Parquet. I chose Parquet, but why do this when I’m already getting a CSV? Why create extra work?

Let me explain.

About Parquet

Apache Parquet is a file format that supports fast processing for complex data. It can essentially be seen as the next generation of CSV. Both formats have their place, but at scale CSV files have large file sizes and slow performance.

In contrast, Parquet files have built-in compression and indexing for rapid data location and retrieval. In addition, the data in Parquet files is organized by column, resulting in smaller sizes and faster queries.

This also results in Athena cost savings as Athena only needs to read the columns relevant to the queries being run. If the same data was in a CSV, Athena would have to read the entire CSV whether the data is needed or not.
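As a quick illustration of the size difference (a toy example, not part of the ETL itself):

import os

import pandas as pd

# A toy DataFrame standing in for the iTunes data.
df = pd.DataFrame({"artist": ["Artist"] * 100_000, "plays": range(100_000)})

# to_parquet needs a Parquet engine such as pyarrow installed.
df.to_csv("sample.csv", index=False)
df.to_parquet("sample.parquet", index=False)

print(os.path.getsize("sample.csv"), os.path.getsize("sample.parquet"))

The Parquet file should come out at a fraction of the size of the CSV.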

For further reading, Databricks have a great Parquet section in their glossary.

That’s everything for Athena. Now I need to update my Python scripts.

Python

In this section, I’ll make changes to my Basic iTunes ETL to include my new S3 and Athena resources and to replace the CSV output with a Parquet file. Let’s start with some variables.

New Python Variables

My first update is a change to ETL_ITU_Play_Variables.py, which contains my global variables. Originally there were two S3 global variables – S3_BUCKET containing the bucket name and S3_PREFIX containing the S3 prefix path leading to the raw data:

S3_BUCKET
S3_PREFIX

Now I have two buckets and two prefixes, so it makes sense to update the variable names. There are now four global variables: the originals gain a _RAW suffix, and the two new variables take an _INGESTED suffix for clarity:

S3_BUCKET_RAW
S3_PREFIX_RAW

S3_BUCKET_INGESTED
S3_PREFIX_INGESTED
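
Put together, ETL_ITU_Play_Variables.py now looks something like this sketch (the values are placeholders rather than my actual buckets and prefixes):

# Sketch of the updated ETL_ITU_Play_Variables.py - placeholder values, not my actual paths
S3_BUCKET_RAW = "datalake-raw"
S3_PREFIX_RAW = "itunes/raw/"
S3_BUCKET_INGESTED = "datalake-ingested"
S3_PREFIX_INGESTED = "itunes/ingested/"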

Changing CSV To Parquet

The next change is to ETL_ITU_Play.py. The initial version converts a Pandas DataFrame to CSV using pandas.DataFrame.to_csv. I’m now replacing this with awswrangler.s3.to_parquet, which needs three parameters:

  • df: the DataFrame to write.
  • boto3_session: the boto3 session to use for the upload.
  • path: the destination S3 path in the ingested objects bucket.

Put together, it looks like this:

wr.s3.to_parquet(
    df = df,
    boto3_session = session,
    path = s3_path_ingested
)

Before committing my changes, I took the time to put the main workings of my ETL in a class. This provides a clean structure for my Python script and will make it easier to reuse in future projects.
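
As a rough sketch of what that class-based structure can look like (the method names here are illustrative, not necessarily those used in the actual commit):

import awswrangler as wr
import boto3
import pandas as pd

class ITunesETL:
    def __init__(self, session: boto3.session.Session):
        self.session = session

    def extract(self, s3_path_raw: str) -> pd.DataFrame:
        # Read the raw iTunes Export CSV from S3 into a DataFrame.
        return wr.s3.read_csv(path=s3_path_raw, boto3_session=self.session)

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Drop, rename, reformat and retype columns as in the original script.
        return df

    def load(self, df: pd.DataFrame, s3_path_ingested: str) -> None:
        # Write the transformed DataFrame to the ingested bucket as a Parquet file.
        wr.s3.to_parquet(df=df, boto3_session=self.session, path=s3_path_ingested)

Each stage can then be called in order from a main block, keeping the extract, transform and load steps cleanly separated.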

That completes the changes. Let’s review what has been created.

Architecture

Here is an architectural diagram of how everything fits together:

Here is a breakdown of the processes involved:

  1. User runs the Python ETL script locally.
  2. Python reads the CSV object in datalake-raw S3 bucket.
  3. Python extracts data from CSV into a DataFrame and transforms several columns.
  4. Python writes the DataFrame to datalake-ingested S3 bucket as a Parquet file.
  5. Python notifies User of a successful run.
  6. User sends query to Athena.
  7. Athena reads data from datalake-ingested S3 bucket.
  8. Athena returns query results to User.
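
Incidentally, steps 6 to 8 can also be performed from Python using AWS Data Wrangler. A minimal sketch, reusing the database and table names from this post:

import awswrangler as wr
import boto3

session = boto3.Session()

# Run the query through Athena and return the results as a Pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM basic_itunes_python_etl",
    database="blog_amazonwebshark",
    boto3_session=session
)
print(df.head())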

Testing

In this section, I will test my resources to make sure they work as expected. Bear in mind that this setup hasn’t been designed with production use in mind, so my testing is somewhat limited and would be insufficient for production deployment.

Testing Python

TEST: Upload a CSV to the datalake-raw S3 bucket, then run the Python script. The Python script must run successfully and print updates in the terminal throughout.

RESULT: I upload an iTunes Export CSV to the datalake-raw S3 bucket:

The Python script runs, printing the following output in the terminal:

Creating DataFrame.
DataFrame columns are Index(['Name', 'Artist', 'Album', 'Genre', 'Time', 'Track Number', 'Year', 'Date Modified', 'Date Added', 'Bit Rate', 'Plays', 'Last Played', 'Skips', 'Last Skipped', 'My Rating', 'Location'], dtype='object')
Deleting unnecessary DataFrame columns.
Renaming DataFrame columns.
Reformatting DateTime DataFrame columns.
Creating Date Columns From DateTime Columns.
Creating MyRatingDigit Column.
Replacing blank values to prevent IntCastingNaN errors.
Setting Data Types.
Creating Parquet file from DataFrame.
Processes complete.

Testing S3

TEST: After the Python script successfully runs, the datalake-ingested S3 bucket must contain an itunesdata.parquet object.

RESULT: Upon accessing the datalake-ingested S3 bucket, an itunesdata.parquet object is found:

(On an unrelated note, look at the size difference between the Parquet and CSV files!)

Testing Athena

TEST: When the datalake-ingested S3 bucket contains an itunesdata.parquet object, data from the iTunes Export CSV must be shown when the following Athena query is run:

SELECT * FROM basic_itunes_python_etl;

RESULT: Most of the Athena results match the iTunes Export data. However, the transformed dates did not match expectations:

This appears to be a formatting problem, as some parts of a date format are still visible.

To diagnose the problem I wanted to see how these columns were being stored in the Parquet file. I used mukunku’s ParquetViewer for this, which is described in the GitHub repo as:

…a quick and dirty utility that I created to easily view Apache Parquet files on Windows desktop machines.

It works very well!

Here is a screenshot of the data. The lastplayed column has dates and times, while the datemodifieddate column has dates only:

The cause of the problem becomes apparent when the date columns are viewed using the ISO 8601 format:

The date columns are all using timestamps, even when no times are included!

A potential fix would be to change the section of my Python ETL script that handles data types. Instead, I update the data types used in my Athena table from date:

  `datemodifieddate` date, 
  `dateaddeddate` date, 
  `lastplayeddate` date, 

To timestamp:

  `datemodifieddate` timestamp, 
  `dateaddeddate` timestamp, 
  `lastplayeddate` timestamp, 

This time, when I view my Athena table the values all appear as expected:
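
For reference, the alternative Python-side fix would have involved converting the timestamp columns to plain dates before writing the Parquet file. A hedged sketch, using the column names from the Athena DDL rather than anything from the actual script:

import pandas as pd

def to_date_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    # Storing datetime.date objects makes Parquet use a DATE logical type,
    # which Athena can then read with date columns.
    for column in columns:
        df[column] = pd.to_datetime(df[column]).dt.date
    return df

# e.g. to_date_columns(df, ["datemodifieddate", "dateaddeddate", "lastplayeddate"])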

Scripts

My ETL_ITU_Play.py file commit from 2022-08-08 can be viewed here:

ETL_ITU_Play.py on GitHub

My updated repo readme can be viewed here:

README.md on GitHub

Summary

In this post, I updated my existing iTunes Python ETL to return a Parquet file, which I then uploaded to S3 and viewed using Athena. I explained my reasoning for choosing S3, Athena and the Parquet file format, and I handled a data formatting issue.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Security & Monitoring

Unexpected CloudWatch In The Billing Area

In this post I will investigate an unexpected CloudWatch charge on my April 2022 AWS bill, and explain how to interpret the bill and find the resources responsible.

Introduction

My April 2022 AWS bill has arrived. The total wasn’t unusual – £4.16 is a pretty standard charge for me at the moment, most of which is S3. Then I took a closer look at the services and found an unexpected cost for CloudWatch, which is usually zero.

But not this month:

While $0.30 isn’t bank-breaking, it is unexpected and worth investigating. More importantly, nothing should be running in EU London! And there were no CloudWatch charges at all on my March 2022 bill. So what’s going on here?

Let’s start with the bill itself.

The April 2022 Bill

Looking at the bill, the rows with unexpected CloudWatch charges all mention alarms. Since nothing else has generated any charges, let’s take a closer look at all of the rows referring to alarms.

$0.00 Per Alarm Metric Month – First 10 Alarm Metrics – 10.000 Alarms

The AWS Always Free Tier includes ten CloudWatch alarms.

$0.10 Per Alarm Metric Month (Standard Resolution) – EU (Ireland) – 2.000002 Alarms

In EU Ireland, each standard resolution alarm after the first ten costs $0.10. The bill says there are twelve alarms in EU Ireland – ten of these are free and the other two cost $0.10 each – $0.20 in total.

$0.10 Per Alarm Metric Month (Standard Resolution) – EU (London) – 1.000001 Alarms

CloudWatch standard resolution alarms also cost $0.10 in EU London. As all my free alarms are seemingly in EU Ireland, the one in EU London costs a further $0.10.

So the bill is saying I have thirteen alarms – twelve in EU Ireland and one in EU London. Let’s open CloudWatch and see what’s going on there.

CloudWatch Alarm Dashboard

It seems I have thirteen CloudWatch alarms. Interesting, because I could only remember the four security alarms I set up in February.

CloudWatch says otherwise. This is my current EU Ireland CloudWatch dashboard:

Closer inspection finds eight alarms with names like:

  • TargetTracking-table/Rides-ProvisionedCapacityHigh-a53f2f67-9477-45a6-8197-788d2c7462b3
  • TargetTracking-table/Rides-ProvisionedCapacityLow-a36cf02f-7b3c-4fb0-844e-cf3d03fa80a9

Two of these are constantly In Alarm, and all have Last State Update values on 2022-03-17. The alarm names led me to suspect that DynamoDB was involved, and this was confirmed by viewing the Namespace and Metric Name values in the details of one of the alarms:

At this point I had an idea of what was going on. To be completely certain, I wanted to check my account history for 2022-03-17. That means a trip to CloudTrail!

CloudTrail Event History

CloudTrail’s Event History shows the last 90 days of management events. I entered a date range of 2022-03-17 00:00 > 2022-03-18 00:01 into the search filter, and it didn’t take long to start seeing some familiar-looking Resource Names:

Alongside the TargetTracking-table resource names linked to monitoring.amazonaws.com, there are also rows on the same day for other Event Sources including:

  • dynamodb.amazonaws.com
  • apigateway.amazonaws.com
  • lambda.amazonaws.com
  • cognito-idp.amazonaws.com

I now know with absolute certainty where the unexpected CloudWatch alarms came from. Let me explain.

Charge Explanations

So far I’ve reviewed my bills, found the CloudWatch alarms and established what was happening in my account when they were added. Now I’ll explain how this all led to charges on my bill.

The $0.20 EU Ireland Charge

When I was recently studying for the Developer Associate certification, I followed an AWS tutorial on how to Build a Serverless Web Application with AWS Lambda, Amazon API Gateway, AWS Amplify, Amazon DynamoDB, and Amazon Cognito. This was to top up my serverless knowledge before the exam.

The third module involves creating a DynamoDB table for the application, which I provisioned with auto-scaling for read and write capacity:

These auto-scaling policies rely on CloudWatch alarms to function, as demonstrated by some of the alarm conditions:

The DynamoDB auto-scaling created eight CloudWatch alarms. Four for Read Capacity Units:

  • ConsumedReadCapacityUnits > 42 for 2 datapoints within 2 minutes
  • ConsumedReadCapacityUnits < 30 for 15 datapoints within 15 minutes
  • ProvisionedReadCapacityUnits > 1 for 3 datapoints within 15 minutes
  • ProvisionedReadCapacityUnits < 1 for 3 datapoints within 15 minutes

And four for Write Capacity Units:

  • ConsumedWriteCapacityUnits > 42 for 2 datapoints within 2 minutes
  • ConsumedWriteCapacityUnits < 30 for 15 datapoints within 15 minutes
  • ProvisionedWriteCapacityUnits > 1 for 3 datapoints within 15 minutes
  • ProvisionedWriteCapacityUnits < 1 for 3 datapoints within 15 minutes

These eight alarms joined the existing four. The first ten were free, leaving two accruing charges.

This also explains why two alarms are always In Alarm – the criteria for scaling in are being met but the DynamoDB table can’t scale down any further.

I could have avoided this situation by destroying the resources after finishing the tutorial, which the final module covers. Instead, I decided to keep everything around so I could take a proper look under the hood.

No resources accrued any charges in March, so I left everything in place during April. I’ll go into why there was nothing on the March bill shortly, but first…

The $0.10 EU London Charge

Remember when I said that I shouldn’t be running anything in EU London? Turns out I was!

I found a very old CloudWatch alarm from 2020 that had been there ever since. Never alerting, so I didn’t know it was there. Included in the Always Free tier, so never costing me anything or triggering an AWS Budget alert. Appearing on my bill, but always as a free entry, so never drawing attention.

When I exceeded my ten free CloudWatch alarms, the one in EU London became chargeable for the first time. A swift delete later and that particular problem is no more.

No CloudWatch Charge On The March 2022 Bill

That only leaves the question of why there were no CloudWatch charges on my March 2022 bill, despite there being thirteen alarms on my account for almost half of that month:

I wanted to understand what was going on, so I reached out to AWS Support.

In what must have been a first for them, I asked why no money had been billed for CloudWatch in March:

On my April 2022 bill I was charged $0.30 for CloudWatch. $0.20 in Ireland and $0.10 in London. I understand why.

What I want to understand is why I didn’t see a charge for them on my March 2022 bill. The alerts were added to the account on March 17th, so from that moment on I had thirteen alerts which is three over the free tier.

Can I get confirmation on why they don’t appear on March but do on April please?

I soon received a reply from AWS Support that explained the events in full:

…although you enabled all 13 Alarms in March, the system only calculated a pro-rated usage value, since the Alarms were only enabled on 17th March. The pro-rated Alarm usage values only amounted to 7.673 Alarms in the EU (Ireland) region, and 1.000003 Alarms in the EU (London) region.

The total pro-rated Alarm usage calculated for March (8.673003 Alarms) is thus within the 10 Alarm Free Tier threshold and thus incurred no charges, whereas in April the full 13 Alarm usage came into play for the entire month…
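
As a rough sanity check of that figure (the exact hours AWS used aren't shown, so the durations below are assumptions):

# Back-of-the-envelope pro-rating for EU Ireland in March 2022.
DAYS_IN_MARCH = 31
full_month_alarms = 4     # the original security alarms, active all month
new_alarms = 8            # DynamoDB auto-scaling alarms, added on 17 March
days_active = 14.25       # assumed: roughly the remainder of the month

alarm_usage = full_month_alarms + new_alarms * (days_active / DAYS_IN_MARCH)
print(round(alarm_usage, 3))    # ~7.677 alarm-months, close to AWS's 7.673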

To summarise, I hadn’t been charged for the alarms in March because they’d only been on my account for almost half a month. Thanks for the help folks!

Summary

In this post I investigated an unexpected CloudWatch charge on my April 2022 AWS bill. I showed what the bill looked like, demonstrated how to find the resources generating the charges and explained how those resources came to be on my AWS account.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Categories
Internet Of Things & Robotics

Building A Raspberry Pi Zero Lean Green Machine

In this post I will set up my new Raspberry Pi Zero, wire up a moisture sensor and use Python to convert the sensor data into email alerts.

Introduction

I recently wrote about setting up my Raspberry Pi 4. This isn’t the only Pi I have in the house though, as I put a Raspberry Pi Zero on my Christmas wishlist at about the same time and it turns out Santa was taking notes. Glad I got it before the current supply issues!

This post will flow differently from the last one as the Raspberry Pi Zero didn’t come as a kit. This time my board came in a plastic sleeve inside an unbranded white box. It came with GPIO pins attached (so I don’t have to learn how to solder properly yet!), but it didn’t come with a MicroSD card containing the Raspberry Pi OS. So let’s start there!

Installing Raspberry Pi OS 11 On A MicroSD Card

Raspberry Pis have no hard drives. Instead, they rely on memory cards for their storage. These memory cards can run a wide range of operating systems, the main limitations being the processor’s power and the operating system’s security.

For example, Windows 10 could be loaded onto a memory card but a Raspberry Pi Zero would have no chance of running it well. On the other hand, a Raspberry Pi might be able to run Windows 95 but it’d be a security nightmare.

Most use cases for a Raspberry Pi lend themselves to Raspberry Pi OS (previously called Raspbian) – a Debian-based operating system that is optimized for most Raspberry Pi hardware. Here, I’ll be installing it using the Raspberry Pi Imager.

Raspberry Pi Imager: Main Options

In my last post I mentioned that I’d need to take a look at the Raspberry Pi Imager. This software installs Raspberry Pi OS on MicroSD cards and replaces NOOBS. Versions are available for Windows, Mac and Ubuntu.

Firstly I need to choose an Operating System. There are a lot of choices here! There are different versions of the Raspberry Pi OS depending on whether I want Buster or Bullseye, and whether I want a desktop environment or not. Alternatively, I can choose to install a non-Raspberry Pi OS such as Ubuntu or Manjaro Linux.

Here the recommended setup is fine.

Next I need to choose my storage. I got a good deal on a multipack of Sandisk 64GB MicroSD cards, and will use one of those instead of overwriting the MicroSD card that came with my Labists Raspberry Pi 4 4GB Complete Starter Kit.

Raspberry Pi Imager: Advanced Options

After selecting the main options I can then access some advanced options to further customise my Raspberry Pi OS installation.

These Advanced Options are as follows:

Set Hostname

The hostname of a Raspberry Pi enables it to be addressed by a name as well as an IP address. It is how a Raspberry Pi identifies itself to other systems on a local network. By default, the hostname is set to raspberrypi, but as I have more than one Raspberry Pi I will change this to avoid confusion.

Arise, mako!

Enable SSH

This option is essentially the same as the one in the Raspberry Pi Configuration settings. Enabling this now will save me a job later on.

Set Username And Password

This is a recent addition. The default username and password for any new Pi are pi and raspberry respectively, but a default password carries obvious security problems if unchanged.

Knowing a username isn’t as risky, but leaving the default in place makes life easier for a potential hacker so changing it is in my best interests. And no – it’s not mako.

Configure Wireless LAN

This is a technical way of saying ‘Set Up Wifi’.

Set Locale Settings

This selects the timezone and keyboard layout. Both were already set to GB, so this might have used my laptop’s settings. No changes needed here.

Writing To The MicroSD Card

Finally I started writing to the card. Seven minutes later I had a card which, upon insertion into my Raspberry Pi Zero and after a couple of minutes finding its feet, gave me a working installation of the latest version of Raspberry Pi OS 11 to play with!

Setting Up The Moisture Sensor

Time for my Raspberry Pi Zero to get its first job! This comes in the form of the ModMyPi Soil Moisture Sensor Kit.

The moisture sensor board features both analogue and digital outputs. The digital output gives an On or Off signal when the soil moisture content is above a certain value. The value can be set or calibrated using the adjustable onboard potentiometer.

Let’s get building!

Wiring Up The Sensor

The sensor kit comes with a jumper wire for connecting the sensor spade to the comparator board. It doesn’t come with any wires to connect the board to the Raspberry Pi though! Some hastily ordered jumper wires later and I was back in business.

As a first-timer potentially talking to other first-timers, I will say that the process of connecting the jumper cables to the various pins is a nerve-wracking experience.

All my senses were telling me that the pins looked delicate, so I used minimal pressure with the jumper wires and wondered why they kept coming off. It turns out these pins were designed to cope with some abuse after all, so being heavy-handed is encouraged. Just give the wires a good push!

It is also important to connect the wires to the correct GPIO pins on the Raspberry Pi. I used the Pi4J Project’s GPIO diagram to make sure the correct wires were connected to the correct pins.

Checking The Python Script

ModMyPi offer a Python script for this sensor on their GitHub repo. Let’s run through it and see what it does.

Importing Libraries

The script uses three libraries, all of which are included by default with Raspberry Pi OS 11:

  • RPi.GPIO: controls the GPIO pins on the Raspberry Pi.
  • smtplib: sends emails via SMTP.
  • time: provides the sleep function used in the script.

Sending Emails

A sendEmail function uses the smtplib library and a set of variables to send emails when the function is called. The script prints "Successfully sent email" for successes and "Error: unable to send email" for failures.

Monitoring GPIO Pin

A callback function uses the RPi.GPIO library and a pair of variables to monitor the GPIO pin that the sensor is connected to. When a change is registered, one of two emails is sent via the sendEmail function depending on whether the sensor’s output is On or Off.

To keep the script running, the time library is used to add a slight delay to an infinite loop. This stops the script from consuming all of the CPU, which would leave none for Raspberry Pi OS.
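
Putting those pieces together, here is a simplified sketch of the overall structure. It is my own illustration rather than the exact ModMyPi script; the pin number is assumed and the email call is stubbed out:

import time

import RPi.GPIO as GPIO

SENSOR_PIN = 17   # assumed GPIO pin (BCM numbering); match this to your own wiring

def send_email(body):
    # Placeholder: the real script sends emails via smtplib, as covered below.
    print(body)

def callback(channel):
    # Called by RPi.GPIO whenever the sensor's digital output changes state.
    # Which state means wet or dry depends on the potentiometer calibration.
    if GPIO.input(channel):
        print("LED off")
        send_email("Moisture sensor output changed: LED off")
    else:
        print("LED on")
        send_email("Moisture sensor output changed: LED on")

GPIO.setmode(GPIO.BCM)
GPIO.setup(SENSOR_PIN, GPIO.IN)
GPIO.add_event_detect(SENSOR_PIN, GPIO.BOTH, callback=callback, bouncetime=300)

# Infinite loop with a slight delay so the script doesn't consume all of the CPU.
while True:
    time.sleep(1)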

Testing The Sensor

To check that the sensor readings were coming through correctly, I needed a way to make the sensor wet or dry quickly. Necessity is the mother of invention, so I came up with this:

A sliced up milk bottle with a small hole in the lid and wires coming out of it. What a time to be alive.

My first attempts showed that the sensor was working (LED off and LED on) but the emails were failing:

Python 3.9.2 (/usr/bin/python3)
>>> %Run moisture.py
LED off
Error: unable to send email
LED on
Error: unable to send email

Troubleshooting: Statements Must Be Separated By Newlines Or Semicolons

An example of one of the uses of the print function in the script is:

print "LED off"

Visual Studio Code flags this as a problem, stating that Statements must be separated by newlines or semicolons.

This is down to differences between Python 2 and Python 3. Those differences are beyond the scope of my post, but the problem itself is easy to fix. As print is considered a function in Python 3, it requires parentheses to work correctly:

print ("LED off")

Troubleshooting: Blocked SMTP Port

The original script sets the smtp_port to 25. This wouldn’t be a problem if my Raspberry Pi Zero were sending the emails itself. However, here I’m using Google’s Gmail SMTP server to send them instead.

TCP port 25 is frequently blocked as an anti-spam measure, both by Internet Service Providers and by mail providers like Google. Port 587 is the preferred submission port, as it supports secure communication via Transport Layer Security (TLS).

TLS enables secure and trustworthy communication. This security requires some additional information to work properly though…

Troubleshooting: Missing SMTP Methods

This is the section of the sample script that handles sending emails:

smtpObj = smtplib.SMTP(smtp_host, smtp_port)
smtpObj.login(smtp_username, smtp_password)
smtpObj.sendmail(smtp_sender, smtp_receivers, smtp_message)

  • The first line opens a connection to smtp.gmail.com on TCP port 25.
  • The second line provides a username and password for smtp.gmail.com.
  • Lastly, strings are given for the email sender, the recipients and what the email says.

In its current form this request will be rejected, as Gmail blocks TCP port 25. Initially, I just changed the port from 25 to 587 and ran the script again. This still didn’t work so I continued my research.

Having consulted Stack Overflow and Python’s smtplib documentation I realised that the script needed some additional methods. The sendEmail function needed two extra lines:

smtpObj = smtplib.SMTP(smtp_host, smtp_port)
smtpObj.ehlo()
smtpObj.starttls()
smtpObj.login(smtp_username, smtp_password)
smtpObj.sendmail(smtp_sender, smtp_receivers, smtp_message) 

With the move to TCP port 587 and TLS, these new methods are needed to correctly introduce the script (and by extension the Raspberry Pi) to the Gmail SMTP server.

SMTP.ehlo opens communication between the script and Gmail. The script identifies itself via the Raspberry Pi’s fully qualified domain name, giving Gmail a way to identify the source of the request.

SMTP.starttls then asks Gmail if it supports TLS encryption. Gmail replies that it does, and all SMTP commands that follow are encrypted.

That’ll work now, right? Right?!

Troubleshooting: Insufficient Authentication

Even after these changes I was still getting problems. A Stack Abuse article suggested enabling the Less Secure App Access setting of my Gmail account. It turns out that Google and I have something in common – neither of us is keen on plain-text passwords flying around the Internet.

Google had to find some sort of middle ground and came up with this setting:

It is disabled by default and can’t be enabled on accounts with active MFA. Google are actively in the process of removing this setting, and will no longer support it from the end of May 2022. But for now this should be enough to get the emails flowing.

Retesting The Sensor

I ran the script again and got the feedback I was hoping for:

Python 3.9.2 (/usr/bin/python3)
>>> %Run moisture.py
LED off
Successfully sent email
LED on
Successfully sent email

And a string of emails in my inbox:

Success!

Next Steps

Although this approach does work, it isn’t ideal for several reasons and will stop working completely when Google pull the plug on the Less Secure App Access setting. There are a number of changes I want to make.

Use AWS For Emails Instead Of Python

Sending emails via Python caused the bulk of the problems here. The approach this script uses is not secure and will soon be unsupported by the third party it relies on.

I could set up the Raspberry Pi in such a way that it could send emails itself, but ideally I want the Raspberry Pi to be doing as little work as possible with as few credentials as possible.

Enter AWS. I’ve already used SNS a number of times for emails, and the various AWS IoT services offer several options for communication with my device. This would let me decouple the email functionality from the sensor functionality.

In addition, AWS can handle the security side of things. Instead of my Raspberry Pi having root credentials for a Google account, it can have an AWS IoT certificate that will only allow specific actions.

Disable Google Less Secure App Access Setting

If AWS are handling the emails then I don’t need to use the smtplib library anymore. Which means I don’t need to use the Gmail SMTP. Which means I can disable the Less Secure App Access setting!

Google is sending me security warnings about this on almost a weekly basis at the moment. They want this OFF. So I want this off too.

Control Email Frequency

As the earlier screenshot showed, I got hammered with emails by the script. I would prefer to get emails based on a series of events or periods of time instead of a blow-by-blow account. CloudWatch is an obvious candidate here, and AWS IoT Events may also be a contender.

Monitor Device Health

Finally, there’s the question of device health. Any number of problems could occur with the current setup. The sensor could stop working (soil is acidic after all). There could be a power cut. Wolfie could knock everything over. In all of these situations, nothing happens. The readings just stop.

With AWS I can monitor the flow of data. I can look for periods with no data, and set alarms based on those periods. I might have to adjust how the Python script reads the outputs from the sensor, but at least I’ll know that the sensor is working and that my plant is lush and green instead of brown and crisp.

Summary

In this post I have set up my new Raspberry Pi Zero, wired up a moisture sensor and used Python to convert the sensor data into email alerts. I have modernised the Python script and removed syntax errors, and have identified areas where I can improve the security, reliability and operational excellence of the overall solution.

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~