Artificial Intelligence has been a big deal in recent months. One of the main drivers of this has been OpenAI, whose DALL-E 2 and ChatGPT services have seen extraordinary public interest and participation.
Microsoft has been one of OpenAI’s most prominent supporters. In July 2019 Microsoft invested $1 billion in OpenAI and became their exclusive cloud provider.
However, Power BI is reliant on user skill levels. Like all data visualization tools, Power BI can create bad dashboards in the wrong hands. Like, really bad. Dashboards can suffer from several problems that make them useless at best and misleading at worst.
"Improve my annual sales dashboard"
>> I have changed the pie chart showing 12 team members to a bar chart, as this will improve the visualization's legibility.
Azure IAC AI Assistant
IAC (Infrastructure As Code) has revolutionized the public cloud industry, bringing with it benefits like:
Automated, faster deployments.
Repeatable and consistent deployments.
Self-documenting infrastructure.
But IAC also presents challenges:
IAC scripts rely on the skills of the engineer writing them.
It’s not easy to incorporate existing infrastructure.
ChatGPT could resolve many of these problems, turning infrastructure creation into a conversation. It could, for example:
Create infrastructure based on non-technical requests:
"Make me what I need to start a blog."
>> I have created a LAMP stack on a virtual machine in your default region. Your access details are here:
Username: Username
Password: Password
Learn current infrastructure usage patterns and create optimisations for busy and quiet periods.
Spot potential conflicts and step in to prevent data loss or downtime.
Make existing infrastructure faster, cheaper or more performant without the need for manual refactoring.
Resolve problems like high latency, failing connections and unexpected cost increases:
"Why is my web app generating errors?"
>> One of your virtual machines does not allow connection requests from CIDR range 10.1.10.0/28. Do you want me to fix this?
"Yes please."
>> I have now amended virtual machine MYAPP001's Network Security Group to accept incoming connection requests from CIDR range 10.1.10.0/28.
Summary
In this post, I hosted March 2023’s T-SQL Tuesday with an invitation concerning the ongoing Microsoft and OpenAI partnership. I look forward to reading everyone’s responses!
So how does ChatGPT fare in this setting? Let’s find out!
Preparation
In this section, I install and configure DBeaver’s OpenAI ChatGPT integration.
DBeaver’s OpenAI ChatGPT integration is not automatically installed with Version 22.3.4. This makes sense – it’s currently quite new and many users and organisations may not want to install it (for reasons I’ll go into later). This approach keeps DBeaver’s footprint as small as possible and lets people make up their own minds.
Following this, the DBeaver OpenAI ChatGPT integration must be configured. This process needs an active API token from the OpenAI platform.
OpenAI account holders can generate tokens in the API Keys section of their profile. Afterwards, this token must be pasted into the API Token field in DBeaver’s preferences:
The DBeaver OpenAI ChatGPT integration will now work. The other settings are out of scope for this post and are covered in DBeaver’s wiki.
Separately, I found I was getting SQL 08S01 timeout errors during my first attempts to send ChatGPT prompts. I ended up increasing my database connection’s connectionTimeout value to resolve this.
ChatGPT Data Sharing
There is an important fact to consider when using DBeaver’s OpenAI ChatGPT integration:
DBeaver needs to send database metadata to the OpenAI platform.
To clarify, this means that DBeaver sends a list of the selected database’s table names and column names to OpenAI.
DBeaver doesn’t hide this fact away. Their wiki is very explicit, and the DBeaver OpenAI ChatGPT integration states this when used for the first time.
For my purposes, this isn’t a problem. My amazonwebshark MySQL database contains nothing confidential or sensitive, and the tables and columns of a standard WordPress database are well documented.
But I wouldn’t currently use this with databases I don’t own. Although the shared metadata is limited to table names and column names, these can still be sensitive and potentially misused. I’m sure OpenAI have data security measures in place, but as this technology is still quite new I currently prefer to be cautious.
Tests
In this section, I test the DBeaver OpenAI ChatGPT integration with prompts referring to my amazonwebshark MySQL database. Before starting, I should define two concepts: a prompt is the text sent to the model, and a completion is the text the model generates in response.
For my first test, I wanted to answer a question similar to one from my last post:
How many amazonwebshark blog categories have been used more than twice?
I answered the question with this SQL query:
SELECT t.name, tt.count
FROM term_taxonomy AS tt
INNER JOIN terms AS t
ON tt.term_id = t.term_id
WHERE tt.taxonomy = 'category'
AND tt.count > 2
Which returned these results:
name                                     count
Data & Analytics                         6
Me                                       4
Developing & Application Integration     8
Security & Monitoring                    3
Now it’s ChatGPT’s turn! My first prompt for ChatGPT was:
Find all categories used more than 2 times.
GPT’s completion was:
SELECT COUNT(*) AS count, category
FROM statistics_historical
GROUP BY category
HAVING count > 2;
Although this query looks fine, it doesn’t return any results! In fact the whole table has nothing in it:
SELECT COUNT(*) FROM statistics_historical
Name         Value
COUNT(*)     0
ChatGPT interpreted my prompt literally and searched my database metadata for tables with a category column. This search only returned one table, which is used for archiving purged report data: statistics_historical.
ChatGPT didn’t know that the table was empty, and so had no way to know its query wasn’t appropriate. This was my fault – my prompt needed more work.
Test 2: Category Usage Take 2
For my second test, I tried to re-engineer my prompt to give ChatGPT more insight into my data. My question remained:
“How many amazonwebshark blog categories have been used more than twice?”
My second prompt was phrased to give ChatGPT some guidance on what to look for:
Get all names of terms with taxonomy of category and count greater than 2
This time, GPT’s completion was:
SELECT name, count
FROM terms
INNER JOIN term_taxonomy
ON terms.term_id = term_taxonomy.term_id
WHERE term_taxonomy.taxonomy = 'category'
AND term_taxonomy.count > 2
This was almost identical to my own SQL query:
SELECT t.name, tt.count
FROM term_taxonomy AS tt
INNER JOIN terms AS t
ON tt.term_id = t.term_id
WHERE tt.taxonomy = 'category'
AND tt.count > 2
And the results were the same in both cases:
name                                     count
Data & Analytics                         6
Me                                       4
Developing & Application Integration     8
Security & Monitoring                    3
Test 3: Page Views
For my final test I tried a different approach. Instead of writing a SQL query myself, I gave ChatGPT a prompt and then used a WP Statistics report to compare results.
My question is now:
Which amazonwebshark posts have had the highest views in the past week?
I know which table has the data needed for this, so I included it along with some vital columns in my ChatGPT prompt:
Get the uris for the past week from statistics_pages. Sum all the counts. Show the top 5 in descending order.
GPT’s completion was:
SELECT uri, SUM(count) AS total_count
FROM statistics_pages
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY uri
ORDER BY total_count DESC
LIMIT 5;
This was impressive! ChatGPT used the table and columns I specified, filtered the dates correctly and limited the results as instructed.
In this section, I examine OpenAI’s usage and billing services.
Tokens
OpenAI uses the concept of tokens to calculate pricing. In their words:
“The GPT family of models process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.”
“You can think of tokens as pieces of words used for natural language processing. For English text, 1 token is approximately 4 characters or 0.75 words. As a point of reference, the collected works of Shakespeare are about 900,000 words or 1.2M tokens.”
OpenAI also operates a free tier. New users get $18 worth of free tokens. These expire after 3 months, after which accounts can be upgraded to paid plans operating on a pay-as-you-go basis. At this point, accounts with no billing information can still be accessed but no further API requests will be granted.
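To make the prompt and completion token counts concrete, here is a minimal sketch of calling the completions API directly with the openai Python library. This is my own illustration rather than anything DBeaver exposes, and the API key, model and prompt values below are placeholders:

import openai

openai.api_key = "sk-..."  # placeholder - real keys come from the API Keys section of your OpenAI profile

# Placeholder prompt - DBeaver builds a much larger prompt behind the scenes
response = openai.Completion.create(
    model="text-davinci-003",
    prompt="Get all names of terms with taxonomy of category and count greater than 2",
    max_tokens=256
)

# Every response reports how many tokens the prompt and completion consumed
usage = response["usage"]
print(usage["prompt_tokens"], usage["completion_tokens"], usage["total_tokens"])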
Usage
Here is my OpenAI usage for 2023-02-15:
And 2023-02-16:
To understand this better, let’s examine the last row that relates to Test 3. The values stated are:
3284 prompt tokens.
61 completion tokens.
The Completion value is easy to verify: entering ChatGPT’s Test 3 completion into the tokenizer site scores it at 61 tokens.
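The count can also be checked locally. Here’s a small sketch using OpenAI’s tiktoken library, which is my own addition rather than something DBeaver needs:

import tiktoken

# text-davinci-003 uses the p50k_base encoding
encoding = tiktoken.encoding_for_model("text-davinci-003")

completion = """SELECT uri, SUM(count) AS total_count
FROM statistics_pages
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 7 DAY)
GROUP BY uri
ORDER BY total_count DESC
LIMIT 5;"""

print(len(encoding.encode(completion)))  # token count for ChatGPT's Test 3 completion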
Now let’s examine the Prompt value. When I enter my Test 3 prompt into the tokenizer site, it only scores it at 25 tokens. I wanted to see where the other 3259 tokens came from, so I needed to see what DBeaver sent to ChatGPT.
Currently, my OpenAI account doesn’t keep a record of what their APIs send and receive. However, I can log what DBeaver sends to OpenAI! This is done by enabling the Write GPT queries to debug log option in DBeaver’s ChatGPT settings.
It turns out that DBeaver includes DDL statements for all of the tables and columns in my amazonwebshark MySQL database in every ChatGPT prompt. Even in cases like Test 3 where I tell ChatGPT which table to use!
This isn’t unreasonable, as ChatGPT’s information retention is limited and it can’t access past conversations to inform its responses. But it does explain the high prompt token counts per request. The OpenAI tokeniser says Test 3’s full prompt is 8293 characters. And remember – roughly every 4 characters costs 1 token!
Addendum: DBeaver version 22.3.5 added metadata scope configuration to their ChatGPT smart completion.
In terms of size, my amazonwebshark MySQL database has around 20 tables with modest column counts. Larger databases with dozens of tables and potentially hundreds of columns could quickly rack up high usage fees, so use caution in these situations.
Billing
OpenAI has features to help with billing. Their site stores payment methods and shows invoices from the past 12 months. Preferences such as extra email addresses for billing notifications and a business tax ID can also be saved.
Accounts can also have usage limits, triggering notification emails when reached. These work alongside the usage tracking dashboard to help monitor spending.
So how much have I spent here?
By default, DBeaver uses the DaVinci model. This is currently listed at $0.0200 per 1000 tokens. If I take the 30,057 total tokens I’ve used and price them at $0.0200 per 1000 tokens, I get $0.60114.
In GBP that’s about £0.50 (or about 5p per query) which my free tier will cover.
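As a quick sanity check, here’s the same sum in Python:

total_tokens = 30057
price_per_1000 = 0.02                          # text-davinci-003 list price in USD per 1000 tokens
cost_usd = total_tokens / 1000 * price_per_1000
print(round(cost_usd, 5))                      # 0.60114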
Summary
In this post, I installed the new DBeaver OpenAI ChatGPT integration, tried it out with some prompts and examined the OpenAI usage costs.
Overall I was very impressed with what I found! ChatGPT works really well with a well-engineered prompt. It understood what I wanted and worked quickly. Would I pay for it? I think so, provided I kept an eye on the prompts being sent and the resulting usage costs.
In a future version, I’d like to be able to control what database metadata is included in a prompt. This would let me make sure that ChatGPT wasn’t getting sensitive or unnecessary metadata, and the smaller prompts would reduce my API usage costs.
Addendum: DBeaver version 22.3.5 added metadata scope configuration to their ChatGPT smart completion.
This is all very exciting though! I’m interested to discover how else the DBeaver OpenAI ChatGPT integration can help me out, and fascinated to see future developments!
To mark the occasion, I decided to examine the MySQL database that WordPress uses to run amazonwebshark, and see what it could tell me about the past twelve months.
In addition, I’ve been trying out DataCamp and have recently finished their Intermediate Python course. It introduced me to Matplotlib, and this post is a great chance to try out those new skills!
Timeline
I’ll start by answering a question. Why is amazonwebshark’s birthday on January 09 when my first post’s publication date is December 02 2021?
In response, here’s a brief timeline of how amazonwebshark came to be:
17 October 2021: Work starts on amazonwebshark as a WordPress.com free site.
November 2021: WordPress.com’s limitations make WordPress.org more attractive. I decide to write a blog post and see how things go.
While phpMyAdmin is great for basic maintenance, it’s not very convenient for data analysis:
There are no data tools like schema visualization or query plans.
It lacks scripting tools like IntelliSense or auto-complete.
Accessing it usually means logging into the web host’s portal first.
Ideally, I’d prefer to access my database remotely using a SQL client instead. While this needs some additional config, SiteGround makes this very simple!
Remote Database Access
By default, SiteGround denies all remote access requests. Many people will never use this feature, so having it disabled by default is a sensible security measure. Remote access is enabled in the SiteGround portal, which grants immediate access for specified IPs and hostnames.
I also created a separate database user for remote connections. This isn’t strictly necessary, but the default user has unlimited access and would violate the principle of least privilege in this situation.
The following SiteGround video demonstrates enabling remote access:
Analysis Tools
In this section, I examine the various tools I’ll use for amazonwebshark’s 1st birthday data analysis.
DBeaver
DBeaver is a free database tool and SQL client. It is multi-platform, open-source and supports a variety of databases including Microsoft SQL Server, Amazon Athena and MySQL.
After mysql-connector-python is installed and imported, a connection to a MySQL database can be made using the mysql.connector.connect() function with the following arguments:
host: Hostname or IP address of the MySQL server.
database: MySQL database name.
user: User name used to authenticate with the MySQL server.
password: Password to authenticate the user with the MySQL server.
This opens a connection to the MySQL server and creates a connection object. I store this in the variable conn_mysql:
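In code, that looks something like the sketch below, where the var constant names are placeholders for my real credential variables:

import mysql.connector
import var  # separate script holding my credentials (see below)

# Open a connection to the MySQL server and store the connection object
conn_mysql = mysql.connector.connect(
    host=var.MYSQL_HOST,
    database=var.MYSQL_DATABASE,
    user=var.MYSQL_USER,
    password=var.MYSQL_PASSWORD
)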
I have my credentials stored in a separate Python script that is imported as var. This means I can protect them via .gitignore until I get something better in place!
After that, I need a cursor for running my SQL queries. I create this using the cursor() method of my conn_mysql connection object and store the whole thing as cursor:
cursor = conn_mysql.cursor()
I’m now in a position to start running SQL queries from Python. I import a SQL query from my var script (in this case my Category query) and store it as query:
query = var.QUERY_CATEGORY
I then run my query using the execute() method. I get the results as a list of tuples using the fetchall() method, which I store as results:
cursor.execute(query)
results = cursor.fetchall()
Finally, I disconnect my cursor and MySQL connection with the close() method:
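cursor.close()       # release the cursor
conn_mysql.close()   # close the connection to the MySQL server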
In this section, I begin amazonwebshark’s 1st birthday data analysis by writing a SQL query for amazonwebshark’s categories and analysing the results with Python.
Categories Analysis: SQL Query
For my Category SQL query, I’ll be using the terms and term_taxonomy tables:
WordPress has a taxonomy system for content organization. Individual taxonomy items are called terms, and they are stored in the terms table. Terms for amazonwebshark include Data & Analytics and Security & Monitoring.
The term_taxonomy table links a term_id to a taxonomy, giving context for each term. Common taxonomies are category, post_tag and nav_menu.
If I join these tables on term_id, I can map terms to taxonomies, with the first two columns of the joined results coming from terms and the rest from term_taxonomy.
My final query keeps the join, cleans up terms.name, returns all categories with at least one use and orders the results by term_taxonomy.count and terms.name:
SELECT
REPLACE (t.name, '&amp;', '&') AS name,
tt.count
FROM
term_taxonomy AS tt
INNER JOIN terms AS t ON
tt.term_id = t.term_id
WHERE
tt.taxonomy = 'category'
AND tt.count > 0
ORDER BY
tt.count ASC ,
t.name DESC
In future, I’ll need to limit the results of this query to a specific time window. Here, I want all the results so no further filtering is needed.
Categories Analysis: Python Script
Currently, my results variable contains the results of the var.QUERY_CATEGORY SQL query. Running print(results) shows a list of tuples, with each tuple holding a category name and its usage count.
So how do I turn this into a graph? Firstly, I need to split results up into what will be my X-axis and Y-axis. For this, I create two empty lists called name and count:
name = []
count = []
After that, I populate the lists by looping through results. For each tuple, the first item is appended to name and the second is appended to count:
for result in results:
name.append(result[0])
count.append(result[1])
When I print the lists now, name and count contain the names and counts from the SQL query in the same order as the original results.
My first chart, a standard vertical bar chart, was hard to read, so I use barh to change the graph to a horizontal bar chart and swap the xlabel and ylabel strings around. This time the chart is far easier to read.
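For reference, here’s a minimal sketch of that chart code, assuming matplotlib.pyplot is imported as plt and using the name and count lists built above (the label text is mine):

import matplotlib.pyplot as plt

plt.barh(name, count)        # horizontal bars: one per category
plt.xlabel('Count')          # usage counts along the X-axis
plt.ylabel('Category Name')  # category names along the Y-axis
plt.show()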
Tags Analysis
In this section, I continue amazonwebshark’s 1st birthday data analysis by writing a SQL query for amazonwebshark’s tags and analysing the results with Python.
Tags Analysis: SQL Query
My Tags SQL query is almost the same as my Categories one. This time, my WHERE clause is filtering on post_tag:
SELECT
REPLACE (t.name, '&amp;', '&') AS name,
tt.count
FROM
term_taxonomy AS tt
INNER JOIN terms t ON
tt.term_id = t.term_id
WHERE
tt.taxonomy = 'post_tag'
AND tt.count > 0
ORDER BY
tt.count ASC ,
t.name DESC
I also planned to analyse my post views here. But, as I mentioned in my last post, some of this data is missing! So any chart will be wrong.
I haven’t had time to look at this yet, so stay tuned!
Summary
In this post, I celebrated amazonwebshark’s 1st birthday with an analysis of my site’s MySQL data using DBeaver, Python and Matplotlib.
I had fun researching and writing this! It can be tricky to find a data source that isn’t contrived or overused. Having access to the amazonwebshark database gives me data that I’m personally invested in, and an opportunity to practise writing MySQL queries.
I’ve also been able to improve my Python, and meaningfully experiment with Matplotlib to get charts that will be useful going forward. For example, I used the very first Tags chart to prune some unneeded tags from my WordPress taxonomy.