Tag: Open Source

SQL Workbench: On-Demand DuckDB-Wasm With No Bill

Post author By Damien Jones
Post date June 27, 2024

In this post, I try Tobias Müller‘s free SQL Workbench tool powered by DuckDB-Wasm. And maybe make one or two gratuitous duck puns.

Introduction
About DuckDB
- DuckDB
- DuckDB-Wasm
SQL Workbench
Summary

Introduction

Hopefully like many people in tech, I have a collection of articles, repos, projects and whatnot saved under the banner of “That sounds cool / looks interesting / feels useful – I should check that out at some point.” Sometimes things even come off that list…

DuckDB was on my list, and went to the top when I heard about Tobias Müller’s free SQL Workbench tool powered by DuckDB-Wasm. What I heard was very impressive and pushed me to finally examine DuckDB up close.

So what is DuckDB?

About DuckDB

This section examines DuckDB and DuckDB-Wasm – core components of SQL Workbench.

DuckDB

DuckDB is an open-source SQL Online Analytical Processing (OLAP) database management system. It is intended for analytical workloads, and common use cases include in-process analytics, exploration of large datasets and machine learning model prototyping. DuckDB was released in 2019 and reached v1.0.0 on June 03 2024.

DuckDB is an embedded database that runs within a host process. It is lightweight, portable and requires no separate server. It can be embedded into applications, scripts, and notebooks without bespoke infrastructure.

DuckDB uses a columnar storage format and supports standard SQL features like complex queries, joins, aggregations and window functions. It can be used with programming languages like Python and R.

In this Python example, DuckDB:

Creates an in-memory database
Defines a users table.
Inserts data into users.
Runs a query to fetch results.

Python

import duckdb

# Create a new DuckDB database in memory
con = duckdb.connect(database=':memory:')

# Create a table and insert some data
con.execute("""
CREATE TABLE users (
    user_id INTEGER,
    user_name VARCHAR,
    age INTEGER
);
""")

con.execute("INSERT INTO users VALUES (1, 'Alice', 30), (2, 'Bob', 25), (3, 'Charlie', 35)")

# Run a query
results = con.execute("SELECT * FROM users WHERE age > 25").fetchall()
print(results)

DuckDB-Wasm

Launched in 2021, DuckDB WebAssembly (Wasm) is a version of the DuckDB database that has been compiled to run in WebAssembly. This lets DuckDB run in web browser processes and other environments with WebAssembly support.

So what’s WebAssembly? Don’t worry – I didn’t know either and so deferred to an expert:

Since all data processing happens in the browser, DuckDB-Wasm brings new benefits to DuckDB. In-browser dashboards and data analysis tools with DuckDB-Wasm enabled can operate without any server-side processing, improving their speed and security. If data is being sent somewhere then DuckDB-Wasm can handle ETL operations first, reducing both processing time and cost.

Running in-browser SQL queries also removes the need for setting up database servers and IDEs, reducing patching and maintenance (but don’t forget about the browser!) while increasing availability and convenience for remote work and educational purposes.

Like DuckDB, it is open-source and the code is on GitHub. A DuckDB Web Shell is available at shell.duckdb.org and there are also more technical details on DuckDB’s blog.

SQL Workbench

This section examines SQL Workbench and tries out some of its core features.

About SQL Workbench

Tobias Müller produced SQL Workbench in January 2024. He integrated DuckDB-Wasm into an AWS serverless static site and built an online interactive tool for querying data, showing results and generating visualizations.

SQL Workbench is available at sql-workbench.com. Note that it doesn’t currently support mobile browsers. Tobias has also written a great tutorial post that includes:

and loads more content in addition that I won’t reproduce here. So go and give Tobias some traffic!

Layout

The core SQL Workbench components are:

The Object Explorer section shows databases, schema and tables:

The Tools section shows links, settings and a drag-and-drop section for adding files (more on that later):

The Query Pane for writing SQL, which preloads with the below script upon each browser refresh:

And finally, the Results Pane shows query results and visuals:

Next, let’s try it out!

Sample Queries

SQL Workbench opens with a pre-loaded script that features:

Instructions:

SQL

-- WELCOME TO THE ONLINE SQL WORKBENCH!
-- To run a SQL query in your browser, select the query text and press:
-- CTRL + Enter (Windows/Linux) / CMD + Enter (Mac OS)

-- See https://duckdb.org/docs/sql/introduction for more info about DuckDB SQL syntax

Example queries using Parquet files:

SQL

-- Remote Parquet scans:
SELECT * FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet' LIMIT 1000;

SELECT avg(c_acctbal) FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/customer.parquet';

SELECT count(*)::int as aws_service_cnt FROM 'https://raw.githubusercontent.com/tobilg/aws-iam-data/main/data/parquet/aws_services.parquet';

SELECT * FROM 'https://raw.githubusercontent.com/tobilg/aws-edge-locations/main/data/aws-edge-locations.parquet';

SELECT cloud_provider, sum(ip_address_cnt)::int as cnt FROM 'https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.parquet' GROUP BY cloud_provider;

SELECT * FROM 'https://raw.githubusercontent.com/tripl-ai/tpch/main/parquet/lineitem/part-0.parquet';

Example query using a CSV file:

SQL

-- Remote CSV scan
SELECT * FROM read_csv_auto('https://raw.githubusercontent.com/tobilg/public-cloud-provider-ip-ranges/main/data/providers/all.csv');

These queries are designed to show SQL Workbench’s capabilities and the style of SQL queries it can run (expanded on further here). They can also be used for experimentation, so here goes!

Firstly, I’ve added an ORDER BY to the first Parquet query to make sure the first 1000 orders are returned:

SQL

-- Show first 1000 orders for Parquet file
SELECT * 
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'
ORDER BY o_orderkey
LIMIT 1000;

Secondly, I’ve used the COUNT and CAST functions to count the total orders and return the value as an integer:

SQL

-- Show number of orders for Parquet file
SELECT CAST(COUNT(o_orderkey) AS INT) AS orderkeycount 
FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet';

Finally, this query captures all 1996 orders in a CTE and uses it to calculate the 1996 order price totals grouped by order priority:

SQL

-- Show 1996 order totals in priority order
WITH cte_orders1996 AS (
    SELECT o_orderkey, o_custkey, o_totalprice, o_orderdate, o_orderpriority
    FROM 'https://shell.duckdb.org/data/tpch/0_01/parquet/orders.parquet'
    WHERE YEAR(o_orderdate) = 1996
)

SELECT SUM(o_totalprice) AS totalpriorityprice, o_orderpriority 
FROM cte_orders1996 
GROUP BY o_orderpriority 
ORDER BY o_orderpriority;

Now let’s try some of my data!

File Imports

SQL Workbench can also query data stored locally (as well as remote data which is out of this post’s scope). For this section I’ll be using some Parquet files generated by my WordPress bronze data orchestration process: posts.parquet and statistics_pages.parquet.

After dragging each file onto the assigned section they quickly appear as new tables:

These tables show column names and inferred data types when expanded:

2024 06 26 SQLWorkbenchSchemaWordPressExpanded

Queries can be written manually or by right-clicking the tables in the Object Explorer. A statistics_pages SELECT * test query shows the expected results:

As posts.parquet‘s ID primary key column matches statistics_pages.parquet‘s ID foreign key column, the two files can be joined on this column using an INNER JOIN:

SQL

SELECT 
  p.post_title,
  sp.type, 
  sp.date as viewdate,
  sp.count
FROM 
  'statistics_pages.parquet' AS sp
  INNER JOIN 'posts.parquet' AS p ON sp.id = p.id
 ORDER BY sp.date

I can use this query to start making visuals!

Visuals

Next, let’s examine SQL Workbench’s visualisation capabilities. To help things along, I’ve written a new query showing the cumulative view count for my Using Athena To Query S3 Inventory Parquet Objects post (ID 92):

SQL

 SELECT 
  sp.date AS viewdate,
  SUM(sp.count) OVER (PARTITION BY sp.id ORDER BY sp.date) AS cumulative_views
FROM 
  'statistics_pages.parquet' AS sp
  INNER JOIN 'posts.parquet' AS p ON sp.id = p.id
WHERE p.id = 92
ORDER BY sp.date;

(On reflection the join wasn’t necessary for this output, but anyway… – Ed)

Tobias has written visualisation instructions here so I’ll focus more on how my visual turns out. After the query finishes running, several visual types can be selected:

2024 06 26 SQLWorkbenchVisualizationsOptions

Selecting a visual quickly renders it in the browser. Customisation options are also available. That said, in my case the first chart was mostly fine!

2024 06 26 SQLWorkbenchVisualizationsLineChart

All I need to do is move viewdata (bottom right) to the Group By bin (top right) to add dates to the X-axis.

Exports & Cleanup

Finally, let’s examine SQL Workbench’s export options. And there’s plenty to choose from!

The CSV and JSON options export the query results in their respective formats. The Arrow option exports query results in Apache Arrow format, which is designed for high-speed in-memory data processing and so compliments DuckDB very well. Dremio wrote a great Arrow technical guide for those curious.

The HTML option produces an interactive graph with tooltips and configuration options. Handy for sharing and I’m sure there’ll be a way to use the backend code for embedding if I take a closer look. The PNG has me covered in the meantime, immediately producing this image:

athenainventorycumulativeCropped — *(Image cropped for visibility – Ed)*

Finally there’s the Config.JSON option, which exports the visual’s settings for version control and reproducibility:

JSON

{"version":"2.10.0","plugin":"Y Line","plugin_config":{},"columns_config":{},"settings":true,"theme":"Pro Light","title":"Using Athena To Query S3 Inventory Parquet Objects Cumulative Views","group_by":["viewdate"],"split_by":[],"columns":["cumulative_views"],"filter":[],"sort":[],"expressions":{},"aggregates":{}}

All done! So how do I remove my data from SQL Workbench? I…refresh the site! Because SQL Workbench uses DuckDB, my data has only ever been stored locally and in memory. This means the data only persists until the SQL Workbench page is closed or reloaded.

Summary

In this post, I tried Tobias Müller’s free SQL Workbench tool powered by DuckDB-Wasm.

SQL Workbench and DuckDB are both very impressive! The results they produce with no servers or data movement are remarkable, and their tooling greatly simplifies working with modern data formats.

DuckDB makes lots of sense as an AWS Lambda layer, and DuckDB-Wasm simplifies in-browser analytics at a time when being data-driven and latency-averse are constantly gaining importance. I don’t feel like I’ve even scratched the surface of SQL Workbench either, so go and check it out!

Phew – no duck puns. No one has time for that quack. If this post has been useful then the button below has links for contact, socials, projects and sessions:

Thanks for reading ~~^~~

Tags Apache Parquet, DuckDB, GitHub, Open Source, SQL, WebAssembly, WordPress amazonwebshark MySQL Database

Architecture & Resilience

amazonwebshark’s Abandoned 2019 AWS Architecture

Post author By Damien Jones
Post date January 9, 2024

In this post, I respond to January 2024’s T-SQL Tuesday #170 Invitation by examining amazonwebshark’s abandoned 2019 AWS architecture.

Introduction
How awshark2019 Was Built
- Hugo Static Site Generator
- S3 Static Site
Why awshark2019 Failed
What I Learned
Summary

Introduction

amazonwebshark is two years old today!

I wrote an analysis post last year, and when deciding on the second birthday’s topic I saw this month’s T-SQL Tuesday invitation from Reitse Eskens:

“What projects did you abandon but learn a lot from?”

One immediately sprang to mind! Since this T-SQL Tuesday falls on amazonwebshark’s second birthday, it seemed a good time to evaluate it.

Rewind to 2019. I was new to AWS and was studying towards their Certified Cloud Practitioner certification. To that end, I set up an AWS account and tried several tutorials including an S3 static website.

After earning the certification, I kept the site going to continue my learning journey. I made the site into a blog and chose a snappy (Groan – Ed) name…amazonwebshark. In fact, that site is still around!

I’ll start by looking at the site architecture, then examine what went wrong and end with how it influenced the current amazonwebshark site. For the rest of this post, I’ll refer to amazonwebshark 2019 as awshark2019 and the current version as awshark2021.

How awshark2019 Was Built

In this section, I examine the architecture behind awshark2019.

Hugo Static Site Generator

Hugo is an open-source static site generator written in the Go programming language. Go is known for its efficiency and performance, making Hugo’s build process very fast.

Hugo’s content files are written in Markdown which enables easy post creation and formatting. These Markdown posts are then converted to static HTML files at build time. The built site has a file system structure and can be deployed to platforms like traditional web servers, content delivery networks (CDNs), and cloud storage services.

Speaking of which…

S3 Static Site

awshark2019 has been operating out of a public S3 bucket since its creation:

This won’t be a particularly technical section, as the AWS documentation and tutorial are already great resources for this S3 feature. So let’s talk about the benefits of static sites instead:

Since static websites consist of pre-built HTML, CSS, and JavaScript files, they load quickly and can scale rapidly.

Static websites are inherently more secure and maintainable because there’s no server-side code execution, database vulnerabilities or plugin updates.

All site processing is done before deployment, so the only ongoing cost is for storage. awshark2019 weighs in at around 4MB, so in the four years it has been live this has been essentially free.

So far this all sounds good. What went wrong?

Why awshark2019 Failed

In this section, I examine awshark2019’s problems. Why was the 2019 architecture abandoned?

Unclear Objectives

Firstly, awshark2019 had no clear purpose.

In my experience, good blogs have their purpose nailed down. It could be automation, data, biscuits…anything as long as it becomes consistent and plays to the creator’s strengths.

With awshark2019, some posts are about S3 Static Sites and Billing Alerts. These are good topics to explore. However, almost half of the posts are about creating the site and are in a web design category. But the blog isn’t about web design, and I’ve never been a web designer!

Rounding things off, the About page is…the Hugo default. So who is the site for? If I, as the blog creator, don’t know that then what chance does anyone else have?

Poor Content

Secondly, as awshark2019’s objectives were unclear the content was…not very good. The topic choices are disjointed, some of the posts are accidental documentation rehashes and ultimately there’s little value.

Let’s take the example of Adding An Elastic IP To An Amazon Linux EC2 Instance. The post explores the basics, shows the AWS console changes and mentions costs. This is fine, but there’s not much else here. If I wrote this post today, I’d define a proper use case and explore the problem more by pinging the instance’s IP before and after a stoppage. This shows the problem instead of telling it.

Another post examines Setting Up A Second AWS Account With AWS Organizations. There’s more here than the IP address post, but there’s no context. What am I doing with the second account? Why does my use case support the use of AWS Organisations? What problems is it helping me solve?

There’s nothing in these posts that I can’t get from the AWS documentation and no new insights for readers.

Awkward To Publish

Finally, awshark2019 was too complex to publish. More accurately, Hugo’s deployment process wasn’t the problem. The way I was doing it was.

Hugo sites can be deployed in several ways. These centre around putting files and folders in a location accessible by the deployment service. So far so good.

But instead of automating this process, I had a horrible manual workflow of creating and testing the site locally, and then manually overwriting the existing S3 objects. This quickly got so tedious that I eventually ran out of enthusiasm.

What I Learned

In this section, I examine what I learned from the abandoned 2019 architecture when creating awshark2021.

Decide On Scope

My first key awshark2021 decision was the blog’s purpose.

While ‘Welcome To My Blog’ posts are something of a cliche, I took the time to write Introducing amazonwebshark as a standard to hold myself to:

By writing about my experiences I can check and confirm my understanding of new topics, give myself points of reference for future projects and exam revision, evidence my development where necessary and help myself out in the moments when my imposter syndrome sees an opportunity to strike.
Introducting amazonwebshark: What Is amazonwebshark For?

awshark2021 took as much admin away as possible, letting me explore topics and my curiosity instead. amazonwebshark was, and is, a place for me to:

Try things
Make mistakes
Improve myself
Be creative

While this is firstly a technology and cloud computing blog, I allow myself some freedom (for example the Me category) as long as the outcome is potentially useful. To this end, I’ve also written about life goals, problem-solving and public speaking.

Add Value

Secondly, let’s examine the posts themselves.

I probably average about eight hours of writing per post. I want to get the most out of that time investment, so I try to ensure my posts add value to their subject matter. There’s no set process for this, as value can take many forms like:

Examining how I apply services to my situation or use case.
Raising awareness of topics with low coverage.
Detailing surprising or unexpected event handling.

My attitude has always been that I’m not here to tell people how and why to do things. I’m here to tell people how and why I did things. Through this process, I can potentially help others in the technology community while also helping myself.

Post introspection and feedback have led to improvements in my working practises like:

Improving SQL code readability with CTEs.
Avoiding sensitive variable disclosure with AWS Parameter Store.
Preventing code duplication with Object Oriented Programming.

It could be argued that amazonwebshark is a big ongoing peer review. It’s made me a better engineer and has hopefully helped others out too.

Keep It Simple

Finally, let’s discuss architecture.

awshark2021 is a WordPress blog, currently hosted on Hostinger servers. While this architecture isn’t free and has tradeoffs, it offers a fast, reliable deployment path managed by organisations specialising in this field.

This is exactly what I wanted for awshark2021:

…my main focus was to get the ball rolling and get something online. I’ve wanted to start a blog for some time, but have run into problems like knowledge gaps, time pressures and running out of enthusiasm.
Introducing amazonwebshark: Why Didn’t You Use AWS For Hosting?

I enjoy writing, so my priority is there. If I begin seriously considering a serverless amazonwebshark, one of the core tests will be the deployment process. For now, the managed services I’m paying for meet my needs and let me focus on creativity over admin.

Summary

In this post, I responded to January 2024’s T-SQL Tuesday #170 Invitation by examining amazonwebshark’s abandoned 2019 AWS architecture.

It’s unfair to blame the architecture. Rather, my implementation of it was at fault. awshark2019 was a good idea but suffered from poor and over-ambitious architectural decisions. I’ve considered deleting it. But if nothing else it reminds me of a few things:

I won’t always get it right first time.
It doesn’t have to be perfect.
Enjoy the process.

awshark2019’s lessons have allowed awshark2021 to reach two years. Happy birthday!

If this post has been useful, the button below has links for contact, socials, projects and sessions:

Thanks for reading ~~^~~

Tags Amazon S3, Amazon Web Services, Open Source, T-SQL Tuesday

Data & Analytics

WordPress MySQL Database Tables Deep Dive

Post author By Damien Jones
Post date May 6, 2023

In this post, I do a deep dive into some of the amazonwebshark WordPress MySQL database tables following the journey of a recent post.

Introduction
WordPress Database 101
wp_posts
- Table Purpose
- Post Journey
wp_term_relationships
- Table Purpose
- Post Journey
wp_term_taxonomy
- Table Purpose
- Post Journey
wp_terms
- Table Purpose
- Post Journey
wp_statistics_pages
- Table Purpose
- Post Journey
Summary

Introduction

In January I used Python and Matplotlib to create some visualisations using the WordPress amazonwebshark MySQL database.

Since then I’ve been doing a lot with Power BI at work, so I’ve created a Power BI connection to the amazonwebshark database to reacquaint myself with some features and experiment with a familiar dataset.

I talked about doing a views analysis in January’s post. While some of the 2022 data is missing, I can still accurately analyse 2023 data. I plan to measure:

Total views for each post.
Total views for each category.

I’ll use this post to examine some of the MySQL tables, and link back to it in future analysis posts.

Let’s begin with a brief WordPress database overview.

WordPress Database 101

In this section, I take a high-level view of a typical WordPress database and identify the tables I’ll need.

There’s plenty of great documentation online about typical WordPress installations. I’m particularly keen on The Ultimate Developer’s Guide to the WordPress Database by DeliciousBrains, which includes an in-depth tour of the various tables.

As for table relationships, this WordPress ERD shows object names, primary keys and relationship types:

I’ll be concentrating on these WordPress tables:

And the wp_statistics_pages table used by WPStatistics.

I’ll examine each table in the context of a recent post: DBeaver OpenAI ChatGPT Integration.

wp_posts

In this section of my WordPress database deep dive, I examine the most important WordPress database table: wp_posts.

Table Purpose

WordPress uses wp_posts to manage a site’s content. Each row in the table is an event relating to a piece of content, like a post, page or attachment. Examples of these events in the context of a blog post are:

Creating A New Draft: A new row is created with a post_status of draft. This row is the parent of all future activity for the blog post.

Updating A Draft: A new row is created with details of the update. The new row’s post_parent is set to the initial post’s ID.

Publishing A Draft: The initial row’s post_status is changed to publish, and the post_date is changed to the publication date. WordPress finds revisions to the post by filtering rows with a post_parent matching the initial row’s ID.

Post Journey

Let’s start by finding DBeaver OpenAI ChatGPT Integration‘s parent row, which is its earliest record. The following query finds rows where the post_title is DBeaver OpenAI ChatGPT Integration, then orders by ID and returns the first result.

SELECT 
  id, 
  post_date, 
  post_title, 
  post_status, 
  post_name, 
  post_parent, 
  post_type 
FROM 
  `wp_posts` 
WHERE 
  post_title = 'DBeaver OpenAI ChatGPT Integration' 
ORDER BY 
  id 
LIMIT 
  1

Note that I order by ID, not post_date. The publication process changes the parent post’s post_date, so I must use ID to find the earliest post.

This record is returned:

Name	Value
ID	1902
post_date	2023-02-19 20:28:22
post_title	DBeaver OpenAI ChatGPT Integration
post_status	publish
post_name	dbeaver-openai-chatgpt-integration
post_parent	0
post_type	post

So the DBeaver OpenAI ChatGPT Integration parent row is ID 1902. I can use this to count the number of changes to this post by searching for wp_posts rows with a post_parent of 1902:

SELECT 
  COUNT(*) 
FROM 
  `wp_posts`
WHERE 
  post_parent = 1902

81 rows are returned:

Name    |Value|
--------+-----+
COUNT(*)|81   |

Now let’s examine these results more closely.

In the following query, I get all rows relating to DBeaver OpenAI ChatGPT Integration and then group the results by:

Date the post was made (using the MySQL DATE function to remove the time values for more meaningful aggregation).
Status of the post.
Post’s parent post.
Type of post.

I also count the rows that match each group and order the results by ID to preserve the event order:

SELECT 
  COUNT(*) AS ID_count, 
  DATE(post_date) AS post_date, 
  post_status, 
  post_parent, 
  post_type 
FROM 
  `wp_posts`
WHERE 
  ID = 1902 
  OR post_parent = 1902 
GROUP BY 
  DATE(post_date), 
  post_status, 
  post_parent, 
  post_type 
ORDER BY 
  ID

The query results are below. A couple of things to note:

The first two columns show what happens when a post is published. Row 1 is ID 1902 as it has no post_parent, and it has a post_status of publish and a post_date of 2023-02-19.
Row 2 is the first revision of ID 1902, and it has a post_status of inherit and a post_date of 2023-02-15. This is why I order by ID instead of post_date – ordering by post_date would show the revisions before the parent post in the results.
There are various post_type valves – revisions are text updates and attachments are image updates.

ID_count	post_date	post_status	post_parent	post_type
1	2023-02-19	publish	0	post
1	2023-02-15	inherit	1902	revision
19	2023-02-16	inherit	1902	revision
7	2023-02-16	inherit	1902	attachment
24	2023-02-17	inherit	1902	revision
1	2023-02-17	inherit	1902	attachment
7	2023-02-18	inherit	1902	revision
21	2023-02-19	inherit	1902	revision
1	2023-02-26	inherit	1902	revision

Spotlighting some of these results for context:

On 2023-02-16 there were 19 text revisions and 7 images attached. I save a lot!
On 2023-02-19 there were 21 text revisions and then the post was published.
There was a further text revision on 2023-02-26 in response to a DBeaver software update.

That’s enough about wp_posts for now. Next, let’s start examining how WordPress groups content.

wp_term_relationships

In this section, I examine the first of the WordPress taxonomy tables: wp_term_relationships.

Table Purpose

wp_term_relationships stores information about the relationship between posts and their associated taxonomy terms (More on taxonomies in the next section). WordPress uses it as a bridge table between wp_posts and the various taxonomy tables.

Post Journey

In this query, I join wp_term_relationships to wp_posts on object_id (this is ID in wp_posts), then find the rows where either wp_posts.id or wp_posts.post_parent is 1902:

SELECT 
  yjp.ID, 
  DATE(yjp.post_date) AS post_date, 
  yjp.post_type, 
  yjp.post_status,
  yjtr.object_id, 
  yjtr.term_taxonomy_id 
FROM 
  `wp_posts` AS yjp 
  INNER JOIN `wp_term_relationships` AS yjtr 
    ON yjtr.object_id = yjp.ID 
WHERE 
  yjp.ID = 1902 
  OR yjp.post_parent = 1902

wp_term_relationships only contains published posts, so the only rows returned concern the parent ID 1902:

ID	post_date	post_type	post_status	object_id	term_taxonomy_id
1902	2023-02-19	post	publish	1902	2
1902	2023-02-19	post	publish	1902	69
1902	2023-02-19	post	publish	1902	71
1902	2023-02-19	post	publish	1902	74
1902	2023-02-19	post	publish	1902	76
1902	2023-02-19	post	publish	1902	77

The query returned six distinct wp_term_relationships.term_taxonomy_id values. My next step is to establish what these IDs relate to.

wp_term_taxonomy

In this section, I examine the table that groups term_taxonomy_id values into taxonomy types: wp_term_taxonomy.

Table Purpose

WordPress uses the wp_term_taxonomy table to store the taxonomy data for terms. Taxonomies in WordPress are used to group posts and custom post types together. Examples of WordPress taxonomies are category, post_tag and nav_menu.

Post Journey

In this query, I add a new join to the previous query, joining wp_term_taxonomy to wp_term_relationships on term_taxonomy_id. Some of the wp_posts columns have been removed from the query to save space.

SELECT 
  yjp.ID,  
  yjtr.term_taxonomy_id, 
  yjtt.taxonomy
FROM 
  `wp_posts` AS yjp 
  INNER JOIN `wp_term_relationships` AS yjtr 
    ON yjtr.object_id = yjp.ID 
  INNER JOIN `wp_term_taxonomy` AS yjtt 
    ON yjtr.term_taxonomy_id = yjtt.term_taxonomy_id 
WHERE 
  yjp.ID = 1902 
  OR yjp.post_parent = 1902

These results give some content to the previous results. I can now see that wp_posts.id 1902 has one category and five tags.

ID	term_taxonomy_id	taxonomy
1902	2	category
1902	69	post_tag
1902	71	post_tag
1902	74	post_tag
1902	76	post_tag
1902	77	post_tag

To get the names of the categories and tags, I must bring one more table into play…

wp_terms

In this section of my WordPress database deep dive, I examine the table that holds the names and details of the taxonomy terms used on amazonwebshark: wp_terms.

Table Purpose

The wp_terms table stores all of the terms that are used across all taxonomies on a WordPress site. Each row represents a single term, and the columns in the table contain information about that term, including name and ID.

Post Journey

In this query, I add another join to the previous query, joining wp_terms to wp_term_taxonomy on term_id.

SELECT 
  yjp.ID, 
  yjtr.term_taxonomy_id, 
  yjtt.taxonomy,
  yjt.name 
FROM 
  `wp_posts` AS yjp 
  INNER JOIN `wp_term_relationships` AS yjtr 
    ON yjtr.object_id = yjp.ID 
  INNER JOIN `wp_term_taxonomy` AS yjtt 
    ON yjtr.term_taxonomy_id = yjtt.term_taxonomy_id 
  INNER JOIN `wp_terms` AS yjt 
    ON yjtt.term_id = yjt.term_id 
WHERE 
  yjp.ID = 1902 
  OR yjp.post_parent = 1902

The results now identify the category and each of the five tags by name:

ID	term_taxonomy_id	taxonomy	name
1902	2	category	AI & Machine Learning
1902	69	post_tag	WordPress
1902	71	post_tag	DBeaver
1902	74	post_tag	MySQL
1902	76	post_tag	OpenAI
1902	77	post_tag	ChatGPT

This is a perfect match for the post’s taxonomy in the WordPress portal:

So that’s the categories. What about the views?

wp_statistics_pages

In this final section, I examine the WPStatistics table that holds view counts: wp_statistics_pages.

Table Purpose

WPStatistics uses wp_statistics_pages to store data about page views. Each row shows a URI’s total views on the date specified.

WPStatistics documentation isn’t as in-depth as WordPress, so here are the table’s DDL and column descriptions:

CREATE TABLE `1yJ_statistics_pages` (
  `page_id` bigint(20) NOT NULL AUTO_INCREMENT,
  `uri` varchar(190) NOT NULL,
  `type` varchar(180) NOT NULL,
  `date` date NOT NULL,
  `count` int(11) NOT NULL,
  `id` int(11) NOT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `date_2` (`date`,`uri`),
  KEY `url` (`uri`),
  KEY `date` (`date`),
  KEY `id` (`id`),
  KEY `uri` (`uri`,`count`,`id`)
)

Table Name	Description
page_id	Primary key. Unique identifier for the table.
uri	Uniform Resource Identifier used to access a page.
type	uri type: `home` / `page` / `post`
date	Date the uri was viewed
count	uri total views on the specified date
id	uri ID in `wp_posts.ID`

Post Journey

As wp_statistics_pages.id is the same as wp_posts.id, I can use id 1902 in a query knowing it will still refer to DBeaver OpenAI ChatGPT Integration.

For example, this query counts the number of rows in wp_statistics_pages relating to id 1902:

SELECT 
  COUNT(*) 
FROM 
  `wp_statistics_pages` 
WHERE 
  id = 1902

COUNT(*)|
--------+
      14|

I can also calculate how many visits DBeaver OpenAI ChatGPT Integration has received by using SUM on all wp_statistics_pages.count values for id 1902:

SELECT 
  SUM(yjsp.count) 
FROM 
  `wp_statistics_pages` AS yjsp
WHERE 
  yjsp.id = 1902

SUM(count)|
----------+
        40|

So the page currently has 40 views. I can see how these views are made up by selecting and ordering by wp_statistics_pages.date:

SELECT 
  yjsp.date, 
  yjsp.count 
FROM 
  `wp_statistics_pages` AS yjsp
WHERE 
  yjsp.id = 1902 
ORDER BY 
  yjsp.date

date	count
2023-02-19	1
2023-02-20	5
2023-02-21	1
2023-02-22	4
…	…
2023-03-07	6
2023-03-08	3
2023-03-09	2
2023-03-10	1

I can also join wp_posts to wp_statistics_pages on their id columns, bridging the gap between the WPStatistics table and the standard WordPress tables:

SELECT 
  yjsp.date, 
  yjsp.count, 
  yjp.post_title 
FROM 
  `wp_statistics_pages` AS yjsp 
  INNER JOIN `wp_posts` AS yjp 
    ON yjsp.id = yjp.id 
WHERE 
  yjsp.id = 1902 
ORDER BY 
  yjsp.date

date	count	post_title
2023-02-19	1	DBeaver OpenAI ChatGPT Integration
2023-02-20	5	DBeaver OpenAI ChatGPT Integration
2023-02-21	1	DBeaver OpenAI ChatGPT Integration
2023-02-22	4	DBeaver OpenAI ChatGPT Integration
…	…	…
2023-03-07	6	DBeaver OpenAI ChatGPT Integration
2023-03-08	3	DBeaver OpenAI ChatGPT Integration
2023-03-09	2	DBeaver OpenAI ChatGPT Integration
2023-03-10	1	DBeaver OpenAI ChatGPT Integration

Summary

In this post, I did a deep dive into some of the amazonwebshark WordPress MySQL database tables following the journey of a recent post.

I’ve used this post to present the journey a typical post goes through in the WordPress database. Future posts will use this knowledge and the WordPress database as a data source for various dashboards, scripting and processes. Watch this space!

If this post has been useful, please feel free to follow me on the following platforms for future updates:

Thanks for reading ~~^~~

Tags DBeaver, MySQL, Open Source, SQL, WordPress, WordPress amazonwebshark MySQL Database

Table of Contents

Introduction

About DuckDB

DuckDB

DuckDB-Wasm

SQL Workbench

About SQL Workbench

Layout

Sample Queries

File Imports

Visuals

Exports & Cleanup

Summary

Table of Contents

Introduction

How awshark2019 Was Built

Hugo Static Site Generator

S3 Static Site

Why awshark2019 Failed

Unclear Objectives

Poor Content

Awkward To Publish

What I Learned

Decide On Scope

Add Value

Keep It Simple

Summary

Table of Contents

Introduction

WordPress Database 101

wp_posts

Table Purpose

Post Journey

wp_term_relationships

Table Purpose

Post Journey

wp_term_taxonomy

Table Purpose

Post Journey

wp_terms

Table Purpose

Post Journey

wp_statistics_pages

Table Purpose

Post Journey

Summary