"Elegance is not optional"
- Richard A. O'Keefe
goal: Let's not make reading SQL harder than it already is. This means keeping lines short, and trading aggressively compact code for regularity and ease of modification. All code, even analysts' SQL is read more often than it is written.
SQL is often written by non programmers. Power Business Users, Analysts, Data Scientists & Project Owners often claim to know / understand SQL. And they're right. SQL is an extremely friendly language. It's easy to pick up due to its natural language like syntax, and its extremely concise & expressive. The diverse background of all the practitioners of SQL yield a very complex world where very few things are standardized. Just as there are a lot of really bad javascript & php out there, there are a lot of badly written sql queries in existence.
Expert Programmers who normally work in their favourite language have to deal with SQL whenever they interact with databases. They bring the flavours of their favourite language into SQL. Good intentioned as this may be, it further pollutes SQL style.
A fairly common complaint is that 'SQL is easy to write, but hard to read later'. This is not true for good SQL.
The same reasons that make snippets of code in other languages good also hold for SQL
- meaningful variable names
- short(ish) lines
- good indentation and whitespace
- good use of line breaks and continuation
- useful comments
- efficient
- modularity
Keep it under 100 characters. Fewer than 80 characters is preferable, but not always possible. Greater than 100 should not be accepted. Try to break complex expressions into multiple lines.
Most complex expressions will have natural newline points at the (infix) operators.
CASE
WHEN
users.email LIKE '%gmail.com'
OR users.email LIKE '%hotmail.com'
OR users.email LIKE '%yahoo.com'
THEN 'free-domain'
ELSE 'company-domain'
END domain_type
CASE WHEN users.email LIKE '%gmail.com' OR users.email LIKE '%hotmail.com' OR users.email LIKE '%yahoo.com'
THEN 'free-domain'
ELSE 'company-domain' END domain_type
Regular expressions (at least in postgresql) are comparable in speed to LIKE
pattern matching. Here, we can shorten the lines, which adds to developer comfort when scanning this code. It also reduces the number of branches that the database has to execute. Instead of
evaluating 3 conditions and combining them with OR
. The database only evaluates one deterministic pattern.
CASE
WHEN users.email ~ '[gmail|hotmail|yahoo]\.com$'
THEN 'free-domain'
ELSE 'company-domain'
END domain_type
Even though it is common to put all the conditions in the same line, it is much more readable if broken up into multiple lines, and the operands are aligned rather than the operator with the preceding operand.
, orders.subtotal
- orders.discount
+ orders.tax
+ orders.shipping AS payable
, orders.subtotal -
orders.discount +
orders.tax +
orders.shipping AS payable
, orders.subtotal
- orders.discount
+ orders.tax
+ orders.shipping AS payable
, orders.subtotal - orders.discount + orders.tax + orders.shipping AS payable
- Alias long table names.
- Specify the table alias (or full name) when referring to columns. This avoids future name collisions when more joins are added, and makes diffs easier to read.
- Aliases are meant to be short & descriptive (3 ~ 5 characters is good, IMO).
- 1 letter aliases are to be avoided in committed code. Okay for one off queries
AS
is optional when declaring aliases or labels, because the alias should be the last word in a fairly short line, it should be easy to spot. Be consistent with the internal guideline.
SELECT
user_id
, COUNT(id) bookings
FROM reservation
WHERE _dump IS NULL
AND complete = TRUE
AND created_at >= '2016-01-01'
Adding a join would likely confuse the database due to field name collisions. Its common for tables to have id
, user_id
and created_at
column names.
SELECT
r.user_id
, COUNT(r.id) bookings
FROM reservation r
WHERE r._dump IS NULL
AND r.complete = TRUE
AND r.created_at >= '2016-01-01'
GROUP BY 1
The alias is clear in this context, but joining to cancelled_reservation
with the same aliasing convention yields the abbreviation cr
. Joined again to credit_card
may produce cc
. At which point this convention starts to show its lackings.
SELECT
books.user_id
, COUNT(books.id) bookings
FROM reservation books
WHERE books._dump IS NULL
AND books.complete = TRUE
AND books.created_at >= '2016-01-01'
GROUP BY 1
-
If there's a naming / aliasing convention set up, adhere to it. e.g. If every body agrees that
u
is an unambiguous alias forusers
table, then its okay, I guess, even though it is contrary to the advice given above. -
Avoid SQL reserved words, i.e. don't name a table or column
user
, or timestamp. -
Avoid spaces, or anything else that would require one to put surrounding
"
around table or column names. -
One convention requires that table & column names & their aliases to be
lower case
withunder_score
separating words. Some people usePascalCase
orcamelCase
. Other conventions state that tables should bePascalCase
while columnscamelCase
. Adhere to conventions your team has used in the past. The important thing is that there should be a convention! -
Stick with either plurals (e.g.
users
) or singular (e.g.reservation
) for table names. The plural form indicates that the table is a collecion of many records, while the singular refers to the type of entity being stored. Note that in this document I mix & match. This is because the examples are inspired by a database I have worked on that was designed / implemented by someone else. Don't be that person. -
All Keys used in equi-joins, should have the common suffix. This helps identify relationships easily to later developers.
_id
(orId
) is a classic choice you can't go wrong with (e.g.user_id
). It's like ordering chocolate ice cream. -
Try to make your primary key names guessable. Both
users.id
andusers.user_id
are great candidates for the primary key of theusers
table. If you go with 1 approach, be consistent across all your tables. Obvious exclusions to this rule are association proxy tables or calendar tables, which may have composite keys that follow a natural name other thanentity_name_id
, e.g. the primay key for the calendar table might bedate
. Using the same column name for the two keys across a relationship allows usingNATURAL JOIN
and specifying the join condition asUSING (col1, col2)
. While you might not use them daily, they come in handy when doing aFULL [OUTER] JOIN
. -
Use a suffix
_at
,_hour
,_date
, ... for datetime columns at their respective granularities. If an entitycreated_at
is specified in another table, be sure to name itentity_created_at
. -
users
is a hotly overloaded term. What does it even mean? Unless there is 1 organizational definition ofuser
, avoidusers
in favour of something more descriptive. In a typical data-warehouse,users
will often be a table integrated from various external systems. The transformation layer should strive to rename & unify the varioususers
tables to logical (dimensional) groups. -
Column names should be descriptive. Avoid shortening column names to save a few characters. People will read column names many more times than they will type them up.
-
these tables will often have 1 row for a period grain (such as day, week, month), and columns will record the value of a quantity at period-start & period-end. I've found it most clear when the measurement columns are named with suffix
_at_start
and_at_end
, e.g.count_distinct_users_at_start
&count_distinct_users_at_end
. It is particularly important when a summary measurement is the result ofcount(distinct <field>)
, because it is no longer guaranteed to be additive afterwards. -
some of these tables will be rollup tables (or cube tables). But any table generated as a result of
group by ... rollup / grouping sets / cube
should use a suffix to identify as such, so, if a table produces rolled up summary stats, in a following step it can be identified so that the total is not aggregated again leading to inaccuracies.
Put a Blank Line between each clause, as in above example.
Indentation is useful to show where expressions belong. Proper indentation should also be employed to break down a long expression (> 100 characters) over multiple lines.
All the expressions conjugated with a boolean operator (OR
| AND
) should align.
users.created_at > '2016-01-01'
AND users.role = 'admin'
users.created_at > '2016-01-01'
AND users.role = 'admin'
users.created_at > '2016-01-01'
AND users.old = 'admin'
This way it is easier to scan the expressions, and spot where expressions are conjugated using the AND
and OR
operators.
users.old = 'admin'
is an expression, it should be viewed in its entirety, so the extra spaces before the =
have no utility. And what will you do if the next expression has users.organization_id
? All the filter expressions will need laborious additions of spaces. This will screw up the diff. DON'T BE STUPID
Use brackets whenever there are both AND
and OR
present in the conjugate. Not everyone remembers operator precedence.
users.created_at > '2016-01-01'
AND ( users.role = 'admin'
OR users.role = 'moderator')
users.created_at > '2016-01-01'
AND (
users.role = 'admin'
OR users.role = 'moderator'
)
users.created_at > '2016-01-01'
AND (users.role = 'admin' OR users.role = 'moderator')
users.created_at > '2016-01-01'
AND users.role = 'admin'
OR users.role = 'moderator'
These show up in SELECT
, GROUP BY
, LIMIT
clauses most often. The comma can also show up
occassionally in the FROM
clause, WINDOW
expressions, and queries with multiple CTEs.
users.id
, users.email
, book.created_at
, book.checkin_at
users.id,
users.email,
book.created_at,
book.checkin_at
Comma omission is one of the most common errors seen during interactive querying. The comma-at-start style makes it easy to spot missing commas, because all the commas are aligned and the space makes it stick out.
If one needs to add a new field to GOOD
, it causes a 1 line diff, But a 2 line diff to BAD
. The extra diff is due to the ,
at the end of book.checkin_at
. The same applies if we remove the last field book.checkin_at
from the list. This
is also useful during interactive querying. Most queries are built iteratively as expressions are added, each time an expression is added / removed the programmer only needs to modify 1 line.
Here, the indentation is used to show membership to a clause, and each clause sets its own level of indentation. Or you can write the clause elements starting at the next line. Then, according to the comma separated list guidelines, the first line should get 4 spaces, and the subsequent ones 2, then a comma, then a space then the clause element.
SELECT Block: First field is indented 4 spaces, 2nd field onwards indented with 2 spaces
SELECT
users.id
, users.email
SELECT users.id
, users.email
FROM Block: A new join on a new line, each join condition on a separate line, indented to align with the JOIN
.
SELECT
users.email
, book.created_at
, book.checkin_at
, inv.created_at invoice_at
FROM reservation book
JOIN users
ON users.id = book.user_id
LEFT JOIN invoice inv
ON inv.reservation_id = book.id
AND inv.is_paid = FALSE
WHERE book.checkin_at IS NOT NULL
Notice the first ON
aligns with JOIN
, while the 2nd aligns with LEFT JOIN
. In both cases, they're aligning with their respective clauses
The FROM
, JOIN
, LEFT JOIN
are all left aligned. This block also left aligns
with SELECT
.
Alternatively specify the tables & each join condition on a separate line indented by 6 spaces
FROM
reservation book
JOIN
users
ON users.id = book.user_id
LEFT JOIN
invoice inv
ON inv.reservation_id = book.id
AND inv.is_paid = FALSE
WHERE Block: With just 1 condition, place it on the same line as above, or put it indented (6 spaces) on the next line, if there are more conditions, put the expressions on separate lines below, and follow the advice on conjugating Boolean Operators.
WHERE book.checkin_at IS NOT NULL
WHERE book.checkin_at IS NOT NULL
AND book.is_complete = TRUE
WHERE
book.checkin_at IS NOT NULL
AND book.is_complete = TRUE
WHERE book.checkin_at IS NOT NULL
AND book.is_complete = TRUE
WHERE
book.checkin_at IS NOT NULL
AND book.is_complete = TRUE
... but you wouldn't make these mistake because you've read the section on Boolean Operators
GROUP BY Block: If using field positions, comma separate the numbers on 1 line. If using field expressions or names, the Comma Separated List rules apply.
GROUP BY 1, 2, 3, 4
GROUP BY
users.email
, book.created_at
, book.checkin_at
, inv.created_at
GROUP BY users.email
, book.created_at
, book.checkin_at
, inv.created_at
GROUP BY users.email, book.created_at, book.checkin_at, inv.created_at
ORDER BY Block, LIMIT Block: Same rules as GROUP BY
Block
HAVING Block: Same rules as WHERE
clause.
Use comments as appropriate. Debuging / Code review is not possible if the reviewer does not know the intent. Big queries should add a blurb at the top explaining what it is meant to return. Committed SQL is often sitting embedded inside other language. This puts a lot of context switching overload on the reviewer going through the file. Comments are super because the reviewer can get the context more easily.
Always add a comment if:
-
A custom user defined function is being used.
-
An
anti-join
orset-difference
is being used, as this spans both theFROM
clause &WHERE
clause. In this case, annotate the join, specifiying what condition SHOULD NOT be removed from theWHERE
clause. -
todo: more examples of tricky situations
Avoid Sub Queries at all costs. If you DB supports CTEs (WITH
queries), use them, otherwise try to get by with
a temporary table.
Indent the SELECT
query within a CTE. With multiple CTEs, the comma rules apply.
WITH
first_books AS (
SELECT
books.user_id
, MIN(book.created_at) book_at
FROM reservation books
GROUP BY
books.user_id
)
SELECT
TO_CHAR(first_books.book_at, 'YYYY-MM') first_book_month
, COUNT(*) bookers
FROM first_book
GROUP BY
TO_CHAR(first_books.book_at, 'YYYY-MM')
WITH
first_books AS (
SELECT
book.user_id
, MIN(books.created_at) book_at
, 'first_book'::varchar event_type
FROM reservation books
GROUP BY
books.user_id
)
, first_sesions AS (
SELECT
sess.user_id
, MIN(sess.created_at) sess_at
, 'first_session'::varchar event_type
FROM sessions sess
GROUP BY
sess.user_id
)
, month_books AS (
SELECT
TO_CHAR(first_book.book_at, 'YYYY-MM') event_month
, event_type
, COUNT(*) event_count
FROM first_book
GROUP BY 1, 2
)
, month_sess AS (
SELECT
TO_CHAR(first_book.book_at, 'YYYY-MM') event_month
, event_type
, COUNT(*) event_count
FROM first_book
GROUP BY 1, 2
)
SELECT * FROM month_books
UNION ALL
SELECT * FROM month_sess
Define the window expression at the end of the query if your DB allows it. Postgresql does.
Putting them inline in the SELECT
clause decreases legibility, due to long lines.
When comparing two fields, if one field accepts NULL
, try to use IS DISTINCT FROM
or IS NOT DISTINCT FROM
. If you are not fortunate enough to work on a database that support the syntax above, wrap the fields in a COALESCE
function with sensible default before comparing. Avoid doing this, and you'll have to debug one of the trickiest bugs in SQL. The significance of a NULL in a given situation should be evaluated and then COALESCE()'d appropriately to improve code readability)
SELECT NULL = TRUE; -- returns NULL
SELECT NULL != TRUE; -- returns NULL
SELECT NULL != FALSE; -- returns NULL
SELECT NULL IS NULL IS NULL; -- returns FALSE
In particular, the practice of allowing NULL
in a Boolean column should be frowned upon.
How do we know what is the meaning behind a specific null value? It cannot be done systematically and it yields too many quid pro quos.
There is actually a study going on by the University of Edinburgh on the interpretation of nulls:
Please consider taking part (if it hasn't ended already)... for humanity.
-
Put all dimensions at the top of the
SELECT
list, and all aggregates (or windowed fields) at the bottom.This makes it easy to evaluate what is doing the grouping and what is being aggregated, allows one to copy-paste into the
GROUP BY
clause, or build it easily by specifying position. PROTIP: you can count the number of dimensions by subtracting the line numbers between the first & last dimension -
Make every effort to keep queries short. Other people don't like reading your long code. Longer code leads to more errors.
-
It's great that you can write complex where clauses. You know what's better? Altering your schemas so that incomprehensible
(condition 1 AND condition 2 AND (condition 3 OR condition 4)) OR NOT (condition 5 AND condition 1)
expression is not needed. -
Don't repeat yourself. Convert frequently used CTEs into views.
-
Styles change over time. Fix old / bad queries when you see them.
Got something to add?
-
Open an issue.
-
Discuss
-
Submit PR / Merge PR
-
Use this doc to win a flame war with the programmer next to you
-
Profit!!!