By John Doe March 28, 2023
GPT can write SQL. If we can ask our questions in plain English and have GPT write the SQL to answer them, we could save a lot of the time we currently spend in enterprise process management and data analysis software.
Let’s try it out with a simple query.
> "Get all the customers who have not made any order"

```sql
SELECT customers.*
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
WHERE orders.customer_id IS NULL;
```
Hmm, doesn’t look great.
- I guess GPT thought our question could use some qualification, so it added a predicate and explained it in English
- GPT doesn’t know about our tables, so it’s making up its own
Note that GPT confidently returns an incorrect answer even though it lacks almost all the necessary information to do so correctly. GPT has no notion of asking a follow-up question.
In OpenAI’s interface, there’s a `temperature` parameter that controls how free-wheeling GPT should be in its response. From the docs: "What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic."
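A minimal sketch of how a request like this might be assembled, assuming the `openai` Python package as it existed at the time (the pre-1.0 `openai.Completion.create` interface); the model name, prompt wording, and `build_request` helper are illustrative, not from the article:

```python
# Assemble parameters for an OpenAI completion request with a low temperature.
# Assumption: the pre-1.0 `openai` package and an OPENAI_API_KEY env variable.
import os

def build_request(question: str, temperature: float = 0.0) -> dict:
    """Build the request parameters; temperature=0 is the most deterministic."""
    return {
        "model": "text-davinci-003",          # illustrative model choice
        "prompt": f'Write a SQL query to answer: "{question}"',
        "temperature": temperature,            # 0 (focused) .. 2 (free-wheeling)
        "max_tokens": 256,
    }

params = build_request("Get all the customers", temperature=0.0)

# Uncomment to actually call the API:
# import openai
# openai.api_key = os.environ["OPENAI_API_KEY"]
# response = openai.Completion.create(**params)
# print(response.choices[0].text.strip())
```

Turning the temperature all the way down makes repeated runs of the same prompt return the same (or nearly the same) SQL, which is what we want here.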
Let’s try again with the temperature cooled down:
> "Get all the customers"

```sql
SELECT * FROM customers;
```
Very cool and very square. But GPT still doesn’t know the schema or the actual table name, so this query won’t run.
We need to tell GPT more about the data we’re working with: which schemas have which tables, and which columns those tables have.
The only mechanism GPT offers for this is the prompt, so we’ll have to do what the kids call prompt engineering: a fancy phrase for cramming as much information as possible into a limited amount of space.
We can generate some terse, plain English descriptions of our tables to add to the prompt:
```
Schema STRIPE has tables:
  CHARGE with columns [ID, AMOUNT, CREATED, CURRENCY, CUSTOMER_ID, ...]
  CUSTOMER with columns [ID, NAME, CREATED, SHIPPING_ADDRESS_STATE, ...]
  ...
```
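A description like this is easy to generate mechanically. Here’s a minimal sketch, assuming we already have the schema as a plain dict (the `describe_schema` helper is hypothetical; the table and column names are the ones from the article):

```python
# Render a compact, prompt-friendly description of one schema.
def describe_schema(schema: str, tables: dict[str, list[str]]) -> str:
    """Turn {table: [columns]} into terse English suitable for a GPT prompt."""
    lines = [f"Schema {schema} has tables:"]
    for table, columns in tables.items():
        lines.append(f"  {table} with columns [{', '.join(columns)}]")
    return "\n".join(lines)

prompt_context = describe_schema("STRIPE", {
    "CHARGE": ["ID", "AMOUNT", "CREATED", "CURRENCY", "CUSTOMER_ID"],
    "CUSTOMER": ["ID", "NAME", "CREATED", "SHIPPING_ADDRESS_STATE"],
})
print(prompt_context)
```

The terse single-line-per-table format matters: every character of schema description is a character we can’t spend on anything else in the prompt.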
After adding the schema information to the prompt, running the same query with the new context, GPT returns:
> "Get all the customers"

```sql
SELECT * FROM STRIPE.CUSTOMER;
```
Looking good and running successfully for the first time! Let’s try another:
> "Get all charges from customers created in the past year"

```sql
SELECT c.ID, c.AMOUNT, c.CREATED, c.CURRENCY, c.CUSTOMER_ID
FROM CHARGE c
INNER JOIN CUSTOMER cu ON c.CUSTOMER_ID = cu.ID
WHERE cu.CREATED > date('now', '-1 year');
```
I’ve noticed GPT prefers to enumerate columns even when `*` would do.
This looks good but does not run.
- I’m not sure where `date('now', '-1 year')` is valid syntax, but it’s not in PostgreSQL, so we should specify the database
- The schema needs to be specified before each table reference
- GPT thinks the `CREATED` column is a date (reasonable), but this is actually a VARCHAR that needs to be cast. Ideally we’d tell GPT the type of every column.
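Column types can be pulled straight out of the database rather than typed by hand. The article’s warehouse is Postgres-flavored, but as a self-contained sketch we can use the stdlib `sqlite3` module and its `PRAGMA table_info` (the `describe_table` helper and the in-memory table are illustrative):

```python
# Introspect a table and emit a description that includes each column's
# declared type, so GPT can see that CREATED is a VARCHAR, not a date.
import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> str:
    """Render 'TABLE with columns [NAME TYPE, ...]' from the live schema."""
    # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    rendered = ", ".join(f"{name} {ctype}" for _, name, ctype, *_ in cols)
    return f"{table} with columns [{rendered}]"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMER (ID INTEGER, NAME VARCHAR, CREATED VARCHAR)")
description = describe_table(conn, "CUSTOMER")
print(description)  # CUSTOMER with columns [ID INTEGER, NAME VARCHAR, CREATED VARCHAR]
```

Against Postgres the same idea would query `information_schema.columns` instead; either way, the types land in the prompt and GPT can add the casts itself.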
Let’s clarify this in the prompt and try another query:
> "Get the three users from the China who have paid the most money since February 2022"

```sql
SELECT c.name, c.email, SUM(ch.amount) AS total_amount
FROM STRIPE.CUSTOMER c
INNER JOIN STRIPE.CHARGE ch ON c.id = ch.customer_id
WHERE c.shipping_address_country = 'China'
  AND date_part(MONTH, CAST(ch.created AS DATE)) >= 2
  AND date_part(YEAR, CAST(ch.created AS DATE)) >= 2022
GROUP BY c.name, C.email
ORDER BY total_amount DESC
LIMIT 3;
```
SQL nitpicks aside, this is incredible. My vague, poorly written question about my own business was answered instantly by a machine. We live in exciting times.