
DBeaver OpenAI ChatGPT Integration

In this post, I install the new DBeaver OpenAI ChatGPT integration, try it out with some prompts and examine my OpenAI usage costs.


Introduction

DBeaver is a free database tool and SQL client. It is multi-platform, open-source and supports a variety of databases including Microsoft SQL Server, Amazon Athena and MySQL.

In February 2023, the latest version of DBeaver’s Community Edition was released. Version 22.3.4’s release notes start with:

ChatGPT integration for smart completion and code generation (as optional extension).

ChatGPT has been in the news a lot since its launch in November 2022. It has earned widespread praise and scorn, has been embraced in some places and banned in others and has attracted billions of dollars of investment.

So how does ChatGPT fare in this setting? Let’s find out!

Preparation

In this section, I install and configure DBeaver’s OpenAI ChatGPT integration.

DBeaver’s OpenAI ChatGPT integration is not automatically installed with Version 22.3.4. This makes sense – it’s currently quite new and many users and organisations may not want to install it (for reasons I’ll go into later). This approach keeps DBeaver’s footprint as small as possible and lets people make up their own minds.

DBeaver’s GitHub wiki covers the installation process, which amounts to:

  • Open DBeaver’s integration installer.
  • Select the DBeaver AI (GPT) integration library.
  • Select and install DBeaver AI (GPT) Support.
[Screenshot: DBeaver plugin installer]

Following this, the DBeaver OpenAI ChatGPT integration must be configured. This process needs an active API token from the OpenAI platform.

OpenAI account holders can generate tokens in the API Keys section of their profile. Afterwards, this token must be pasted into the API Token field in DBeaver’s preferences:

[Screenshot: DBeaver GPT preferences]

The DBeaver OpenAI ChatGPT integration will now work. The other settings are out of scope for this post and are covered in DBeaver’s wiki.

Separately, I found I was getting SQL 08S01 timeout errors during my first attempts to send ChatGPT prompts. I ended up increasing my database connection’s connectionTimeout value to resolve this.

ChatGPT Data Sharing

There is an important fact to consider when using DBeaver’s OpenAI ChatGPT integration:

DBeaver needs to send database metadata to the OpenAI platform. 

To clarify, this means that DBeaver sends a list of the selected database’s table names and column names to OpenAI.

DBeaver doesn’t hide this fact away. Their wiki is very explicit, and the DBeaver OpenAI ChatGPT integration states this when used for the first time.

For my purposes, this isn’t a problem. My amazonwebshark MySQL database contains nothing confidential or sensitive, and the tables and columns of a standard WordPress database are well documented.

But I wouldn’t currently use this with databases I don’t own. Although the shared metadata is limited to table names and column names, these can still be sensitive and potentially misused. I’m sure OpenAI have data security measures in place, but as this technology is still quite new I currently prefer to be cautious.

Tests

In this section, I test the DBeaver OpenAI ChatGPT integration with prompts referring to my amazonwebshark MySQL database. Before starting, I should define the concepts of prompts and completions:

  • Prompts are the instructions given to ChatGPT.
  • Completions are ChatGPT’s response to prompts.

Test 1: Category Usage

For my first test, I wanted to answer a question similar to one from my last post:

How many amazonwebshark blog categories have been used more than twice?

I answered the question with this SQL query:

SELECT t.name, tt.count
FROM term_taxonomy AS tt
INNER JOIN terms AS t 
	ON tt.term_id = t.term_id
WHERE tt.taxonomy = 'category'
	AND tt.count > 2

Which returned these results:

name | count
Data & Analytics | 6
Me | 4
Developing & Application Integration | 8
Security & Monitoring | 3

Now it’s ChatGPT’s turn! My first prompt for ChatGPT was:

Find all categories used more than 2 times.

GPT’s completion was:

SELECT COUNT(*) AS count, category
FROM statistics_historical
GROUP BY category
HAVING count > 2;

Although this query looks fine, it doesn’t return any results! In fact the whole table has nothing in it:

SELECT COUNT(*) FROM statistics_historical

Name | Value
COUNT(*) | 0

ChatGPT interpreted my prompt literally and searched my database metadata for tables with a category column. This search only returned one table, which is used for archiving purged report data: statistics_historical.

ChatGPT didn’t know that the table was empty, and so had no way to know its query wasn’t appropriate. This was my fault – my prompt needed more work.

Test 2: Category Usage Take 2

For my second test, I tried to re-engineer my prompt to give ChatGPT more insight into my data. My question remained:

“How many amazonwebshark blog categories have been used more than twice?”

My second prompt was phrased to give ChatGPT some guidance on what to look for:

Get all names of terms with taxonomy of category and count greater than 2

This time, GPT’s completion was:

SELECT name, count 
FROM terms 
INNER JOIN term_taxonomy 
	ON terms.term_id = term_taxonomy.term_id 
WHERE term_taxonomy.taxonomy = 'category' 
	AND term_taxonomy.count > 2

This was almost identical to my own SQL query:

SELECT t.name, tt.count
FROM term_taxonomy AS tt
INNER JOIN terms AS t 
	ON tt.term_id = t.term_id
WHERE tt.taxonomy = 'category'
	AND tt.count > 2

And the results were the same in both cases:

name | count
Data & Analytics | 6
Me | 4
Developing & Application Integration | 8
Security & Monitoring | 3

Test 3: Page Views

For my final test I tried a different approach. Instead of writing a SQL query myself, I gave ChatGPT a prompt and then used a WPStatistics report to compare results.

My question is now:

Which amazonwebshark posts have had the highest views in the past week?

I know which table has the data needed for this, so I included it along with some vital columns in my ChatGPT prompt:

Get the uris for the past week from statistics_pages. Sum all the counts. Show the top 5 in descending order.

GPT’s completion was:

SELECT uri, SUM(count) AS total_count
FROM statistics_pages
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY uri
ORDER BY total_count DESC
LIMIT 5;

This was impressive! ChatGPT used the table and columns I specified, filtered the dates correctly and limited the results as instructed.

The query results were:

uri | total_count
/connecting-athena-to-power-bi-with-simba-athena/ | 31
/writing-user-stories-for-an-itunes-dashboard/ | 24
/using-athena-to-query-s3-inventory-parquet-objects/ | 22
/s3-glacier-instant-retrieval-first-impressions/ | 10
/open-source-documentation-with-read-the-docs/ | 7

I compared these with my WPStatistics Pages report:

[Screenshot: WP Statistics top pages chart]

And the results were the same!

[Screenshot: WP Statistics top pages table]

Pricing

In this section, I examine OpenAI’s usage and billing services.

Tokens

OpenAI uses the concept of tokens to calculate pricing. In their words:

The GPT family of models process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.

OpenAI Tokenizer site

And:

“You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.”

OpenAI Pricing site

OpenAI currently offers several tools for determining pricing, including the Tokenizer and Pricing pages quoted above.

OpenAI also operates a free tier. New users get $18 worth of free tokens. These expire after 3 months, after which accounts can be upgraded to paid plans operating on a pay-as-you-go basis. At this point, accounts with no billing information can still be accessed but no further API requests will be granted.

Usage

Here is my OpenAI usage for 2023-02-15:

[Screenshot: OpenAI usage for 2023-02-15]

And 2023-02-16:

[Screenshot: OpenAI usage for 2023-02-16]

To understand this better, let’s examine the last row that relates to Test 3. The values stated are:

  • 3284 prompt tokens.
  • 61 completion tokens.

The Completion value is easy to verify. Entering ChatGPT’s Test 3 completion into the tokenizer site scores it at 61 tokens:

[Screenshot: OpenAI tokenizer output]

Now let’s examine the Prompt value. When I enter my Test 3 prompt into the tokenizer site, it only scores it at 25 tokens. I wanted to see where the other 3259 tokens came from, so I needed to see what DBeaver sent to ChatGPT.

Currently, my OpenAI account doesn’t keep a record of what their APIs send and receive. However, I can log what DBeaver sends to OpenAI! This is done by enabling the Write GPT queries to debug log option in DBeaver’s ChatGPT settings.

It turns out that DBeaver includes DDL statements for all of the tables and columns in my amazonwebshark MySQL database in every ChatGPT prompt. Even in cases like Test 3 where I tell ChatGPT which table to use!

This isn’t unreasonable, as ChatGPT’s information retention is limited and it can’t access past conversations to inform its responses. But it does explain the high prompt values per request. The OpenAI tokenizer puts Test 3’s full prompt at 8293 characters. And remember – for English text, roughly every 4 characters is 1 token!
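
To illustrate the overhead, here is a minimal sketch of the kind of per-table DDL that ends up in a prompt, based on the well-documented WordPress terms table. The exact statement DBeaver emits may differ, so treat this as an approximation:

-- Illustrative only: one table's worth of the schema metadata included in each prompt.
-- Column definitions follow the standard WordPress terms table and are approximate.
CREATE TABLE terms (
	term_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
	name VARCHAR(200) NOT NULL,
	slug VARCHAR(200) NOT NULL,
	term_group BIGINT NOT NULL DEFAULT 0,
	PRIMARY KEY (term_id)
);

Multiply a statement like this by every table in the database and the characters add up quickly, even when the question itself is only a sentence long.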

Addendum: DBeaver version 22.3.5 added metadata scope configuration to their ChatGPT smart completion.

In terms of size, my amazonwebshark MySQL database has around 20 tables with modest column counts. Larger databases with dozens of tables and potentially hundreds of columns could quickly rack up high usage fees, so use caution in these situations.
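
Before pointing the integration at a bigger database, it is worth gauging how much schema metadata each prompt would carry. Here is a minimal sketch using MySQL’s information_schema; the schema name is a placeholder and should be replaced with the database being queried:

-- Rough gauge of how many tables and columns DBeaver would describe per prompt.
-- 'amazonwebshark' is a placeholder schema name.
SELECT COUNT(DISTINCT TABLE_NAME) AS table_count,
	COUNT(*) AS column_count
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'amazonwebshark';

A large result here is a good hint that prompt tokens, and therefore costs, will be high.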

Billing

OpenAI has features to help with billing. Their site stores payment methods and shows invoices from the past 12 months. In addition, preferences such as extra email addresses for billing notifications and a business tax ID can be specified.

Accounts can also have usage limits, triggering notification emails when reached. These work alongside the usage tracking dashboard to help monitor spending.

So how much have I spent here?

By default, DBeaver uses the DaVinci model. This is currently listed at $0.0200 per 1000 tokens. If I take the 30,057 total tokens I’ve used and price them at $0.0200 per 1000 tokens, I get $0.60114.
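
As a quick sanity check of that figure, the same arithmetic can be run as a one-line query (the price is the one listed above and may change):

-- Total tokens multiplied by the listed price per 1,000 tokens.
SELECT 30057 * (0.0200 / 1000) AS estimated_usd;  -- 0.60114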

In GBP that’s about £0.50 (or about 5p per query) which my free tier will cover.

Summary

In this post, I installed the new DBeaver OpenAI ChatGPT integration, tried it out with some prompts and examined the OpenAI usage costs.

Overall I was very impressed with what I found! ChatGPT works really well with a well-engineered prompt. It understood what I wanted and worked quickly. Would I pay for it? I think so, provided I kept an eye on the prompts being sent and the resulting usage costs.

In a future version, I’d like to be able to control what database metadata is included in a prompt. This would let me make sure that ChatGPT wasn’t getting sensitive or unnecessary metadata, and the smaller prompts would reduce my API usage costs.

Addendum: DBeaver version 22.3.5 added metadata scope configuration to their ChatGPT smart completion.

This is all very exciting though! I’m interested to discover how else the DBeaver OpenAI ChatGPT integration can help me out, and fascinated to see future developments!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~


Microsoft AI-900: Artificial Fintelligence

In this post, I talk about my recent experience with the Microsoft AI-900 certification and the resources I used to study for it.


Introduction

On 04 November 2022, I earned the Microsoft Certified Azure AI Fundamentals certification. I’ve had my eye on the AI-900 since passing the SC-900 over Summer. Last month I found the time to sit down with it properly! This is my fourth Microsoft certification, joining my other badges on Credly.

Firstly, I’ll talk about my motivation for studying for the Microsoft AI-900. Then I’ll talk about the resources I used and how they fitted into my learning plan.

Motivation

In this section, I’ll talk about my reasons for studying for the Microsoft AI-900.

Increased Effectiveness

A common Data Engineering task is extracting data. This usually involves structured data, which have well-defined data models that help to organise and map the data available.

Sources of structured data include:

  • CSV data extracts.
  • Excel spreadsheets.
  • SQL database tables.

Increasingly, insights are being sought from unstructured data. This is harder to extract, as unstructured data aren’t arranged according to preset data models or schemas.

Examples of unstructured data sources include:

  • Inbound correspondence.
  • Recorded calls.
  • Social media activity.

Historically, extracting unstructured data needed special equipment, complex software and dedicated personnel. In recent years, public cloud providers have produced Artificial Intelligence and Machine Learning services aimed at quickly and easily extracting unstructured data.

In the case of Microsoft Azure, these include services such as Computer Vision, Form Recognizer and QnA Maker.

Knowing that these tools exist and understanding their use cases will help me create future data pipelines and ETL processes for unstructured data sources. This will add value to the data and will make me a more effective Data Engineer.

And on that note…

Skill Diversification

Recently I was introduced to the idea of T-shaped skills in a CollegeInfoGeek article by Ransom Patterson. Ransom summarises a T-shaped person as having:

…deep knowledge/skills in one area and a broad base of general supporting knowledge/skills.

“The T-Shaped Person: Building Deep Expertise AND a Wide Knowledge Base”
Ransom Patterson on CollegeInfoGeek
[Image: T-shaped skills diagram]

Ransom’s article made me realise that I’ve been developing T-shaped skills for a while. I’ve then applied these skills back to my Data Engineering role. For example:

My studying for the AI-900 is a continuation of this. This isn’t me saying “I want to be a Machine Learning Engineer now!” This is me seeing a topic, being interested in it and examining how it could be useful for my personal and professional interests.

Multi-Cloud Fluency

This kind of follows on from T-shaped skills.

Earlier in 2022, Forrest Brazeal examined the benefits of multi-cloud fluency, and built a case summarised in one of his tweets:

This applies to the data world pretty well, as many public cloud services can interact with each other across vendor boundaries.

For example:

With multi-cloud fluency, decisions can be made based on using the best services for the job as opposed to choosing services based on vendor or familiarity alone.

This GuyInACube video gives an example of this using the Microsoft Power BI Service:

To connect the Power BI Service to an AWS data source, a data gateway needs to be running on an EC2 instance to handle authentication. This introduces server costs and network management.

Conversely, data stored in Azure (Azure SQL Database in the video) can be accessed by other Azure services with a single click. As a multi-cloud fluent Data Engineer in this scenario, I now have options where previously there was only one choice.

Improved multi-cloud fluency means I can use AWS for some jobs and Azure for others, in the same way that I use Windows for some jobs and Linux for others. It’s about having the knowledge and skills to choose the best tools for the job.

Resources

In this section, I’ll talk about the resources I used to study for the Microsoft AI-900.

John Savill

John Savill’s Technical Training YouTube channel started in 2008. Since then he’s created a wide range of videos from deep dives to weekly updates. In addition, he has numerous playlists for many Microsoft certifications including the AI-900.

Having watched John’s SC-900 video I knew I was in good hands. John has a talent for simple, straightforward discussions of important topics. His AI-900 video was the first resource I used when starting to study, and the last resource I used before taking the exam.

Exceptional work as usual John!

Microsoft Learn

[Image: Microsoft Learn logo]

Microsoft Learn was my main study resource for the AI-900. It has a lot going for it! The content is up to date, the structure makes it easy to dip in and out and the knowledge checks and XP system keep the momentum up.

To start, I attended one of Microsoft’s Virtual Training Days. The courses are free, and their AI Fundamentals course currently provides a free certification voucher when finished. Microsoft Product Manager Loraine Lawrence presented the course and it was a great introduction to the various Azure AI services.

Complementing this, Microsoft Learn has a free learning path with six modules tailored for the AI-900 exam. These modules are well-organised and communicate important knowledge without being too complex.

The modules include supporting labs for learning reinforcement. The labs are well documented and use the Azure Portal, Azure Cloud Shell and Git to build skills and real experience.

I didn’t end up using the labs due to time constraints, but someone else had me covered on that front…

Andrew Brown

Andrew Brown is the CEO of ExamPro. He has numerous freeCodeCamp videos, including his free AI-900 one.

I’ve used some of Andrew’s AWS resources before and found this to be of his usual high standard. The video is four hours long, with dozens of small lectures that are time-stamped in the video description. This made it easy to replay sections during my studies.

Andrew also includes two hours of him using Azure services like Computer Vision, Form Recognizer and QnA Maker. This partnered with the Microsoft Learn material very well and helped me understand and visualise topics I wasn’t 100% on.

Summary

In this post, I talked about my recent experience with the Microsoft AI-900 certification and the resources I used to study for it. I can definitely use the skills I’ve picked up moving forwards, and the certification is some great self-validation!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~