By John Doe December 30, 2024
Summary: In this article, we’ll look at how you can tell PostgreSQL how you want to have strings sorted.
Table of Contents
Introduction
PostgreSQL collation determines how string comparison is performed in the database. It affects sorting and equality checks, which are crucial for queries involving text data. Understanding collation is essential for ensuring that your database behaves as expected, especially in multi-language applications.
Collation refers to a set of rules that determine how data is sorted and compared. In PostgreSQL, collation can be defined at the database, table, or column level. This flexibility allows developers to tailor the behavior of string comparisons to meet specific application requirements.
Example
Looking at this test instance, this is what it currently has:
postgres=# \l
List of databases
Name | Owner | Encoding | Locale Provider | Collate | Ctype | ICU Locale | ICU Rules | Access privileges
-----------+----------+----------+-----------------+-------------+-------------+-------------+-----------+-----------------------
postgres | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | |
template0 | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | | =c/postgres +
| | | | | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | icu | en_US.UTF-8 | en_US.UTF-8 | en-US-x-icu | | =c/postgres +
| | | | | | | | postgres=CTc/postgres
(3 rows)
Checking how strings are sorted in this setup can easily be verified by creating a small table with a couple of rows:
CREATE TABLE t ( a text );
INSERT INTO t VALUES ('a'),('A'),('ii'),('II'),('*'),('1.1'),('-');
SELECT * FROM t;
a
-----
a
A
ii
II
*
1.1
-
(7 rows)
Sorting this right now gives the following result:
SELECT * FROM t ORDER BY 1;
a
-----
*
-
1.1
a
A
ii
II
(7 rows)
Special characters come first, the number afterwards and characters last, lower case before upper case. What happens if we want to sort this with a English locale? As mentioned in the article Locales and encodings: If you want to use the locales provided by glibc, then those need to be installed on the operating system:
$ locale -a | grep en_US
en_US.utf8
SELECT * FROM t ORDER BY a COLLATE "en_US.utf8";
a
-----
*
-
1.1
a
A
ii
II
(7 rows)
This gives exactly the same result. What about “C” or “POSIX”?
SELECT * FROM t ORDER BY a COLLATE "POSIX";
a
-----
*
-
1.1
A
II
a
ii
(7 rows)
Now upper case characters sort before lower characters and the order is not anymore the same. You can also use the ICU collations listed in pg_collation:
SELECT * FROM t ORDER BY a COLLATE "zh-Hans-x-icu";
a
-----
-
*
1.1
a
A
ii
II
(7 rows)
To summarize this: Depending on how you want to have your strings sorted, you need to use the correct collation for this. Not all languages follow the same rules and if you want to support multiple languages things might become a bit more tricky.