Introduction
DuckDB has been updated to version 1.2.0.
It seems that various features have been added, so I tried out some of the new functionality mentioned in the official blog.
Reference
https://duckdb.org/2025/02/05/announcing-duckdb-120.html
Codename for this Update
The codename for this release is "Histrionicus."
This refers to the Harlequin Duck (scientific name: Histrionicus histrionicus) that lives in cold rivers in North America, Greenland, Iceland, and Eastern Russia.
The Japanese name is "Shinorigamo," and they apparently fly to coastal areas of Hokkaido and the Tohoku region in winter.
I looked at some photos, but since I've always lived in the western part of Japan, I don't think I've ever seen one in person. 🦆
Common Python Code Setup
I'm using the following code to enable DuckDB in Python (the usual setup):
import duckdb
# Create an in-memory DuckDB database
con = duckdb.connect(database=':memory:')
① RANDOM() now outputs far more distinct random numbers
In this update, the random output has been changed from 32-bit to 64-bit.
So the previous limit:
32-bit maximum: 4,294,967,295 variations
has become:
64-bit maximum: 18,446,744,073,709,551,615 variations
I tried the following code to test this. I generated 5 billion random numbers (note: this takes about 10 minutes).
# Generate a large number of random numbers
query = """
SELECT
COUNT(DISTINCT random_value) AS unique_count,
MIN(random_value) AS min_value,
MAX(random_value) AS max_value
FROM (
SELECT CAST(RANDOM() * (2 ** 63) AS BIGINT) AS random_value
FROM range(1, 5000000000)
) AS subquery
"""
# Execute the query and get results
result = con.sql(query)
print(result)
max_32bit = (2 ** 32) - 1
print(f"32-bit maximum: {max_32bit}")
max_64bit = (2 ** 64) - 1
print(f"64-bit maximum: {max_64bit}")
Results:
┌──────────────┬────────────┬─────────────────────┐
│ unique_count │ min_value  │      max_value      │
│    int64     │   int64    │        int64        │
├──────────────┼────────────┼─────────────────────┤
│   4999999093 │ 3023944772 │ 9223372031664709632 │
└──────────────┴────────────┴─────────────────────┘
32-bit maximum: 4294967295
64-bit maximum: 18446744073709551615
Looking at the output, the unique count (4,999,999,093) indeed exceeds the 32-bit limit of 4,294,967,295 possible values, which would have been impossible before the change.
(I'm not completely sure this is exactly what the update means, though.)
② Map return values changed from lists to single values
SELECT map(['k'], ['v'])['k'];
This creates a map with key k and value v, so looking it up with key 'k' should return v.
When running this code previously, the result was apparently:
Old:
['v']
With the update, it's now:
New:
┌──────────────────────────────────────────────────────────┐
│ "map"(main.list_value('k'), main.list_value('v'))['k']   │
│                         varchar                          │
├──────────────────────────────────────────────────────────┤
│ v                                                        │
└──────────────────────────────────────────────────────────┘
This could be a significant change for anyone who has been using map, since the return value has changed from a list to a single value.
③ Adding primary keys to existing tables
It seems you can now add primary keys to tables you've already created!
import pandas as pd

# Sample data to load into DuckDB
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 45]
})
# Create table (initially without primary key)
con.sql("CREATE OR REPLACE TABLE tbl AS SELECT * FROM data")
result_df = con.sql("SELECT * FROM tbl")
print(result_df)
con.sql("ALTER TABLE tbl ADD PRIMARY KEY (id);")
con.sql("INSERT INTO tbl VALUES (5, 'Frank', 50)")
After adding a primary key to the existing table, if you try to insert data with ID 5 which already exists:
con.sql("INSERT INTO tbl VALUES (5, 'Frank', 50)")
duckdb.duckdb.ConstraintException: Constraint Error: Duplicate key "id: 5" violates primary key constraint.
It's rejected due to the primary key constraint that prevents duplicate values.
Also, previously you couldn't insert data with the same ID as a record you had deleted, but that has been improved and is now possible:
con.sql("""
DELETE FROM tbl WHERE id = 1;
""")
con.sql("""
INSERT INTO tbl VALUES (1, 'Dareka', 100);
""")
print(con.sql("SELECT * FROM tbl"))
The insertion succeeds as shown below:
┌───────┬─────────┬───────┐
│  id   │  name   │  age  │
│ int64 │ varchar │ int64 │
├───────┼─────────┼───────┤
│     2 │ Bob     │    30 │
│     3 │ Charlie │    35 │
│     4 │ David   │    40 │
│     5 │ Eve     │    45 │
│     1 │ Dareka  │   100 │
└───────┴─────────┴───────┘
④ 🦆🦆🦆
Surprisingly, you can now use 🦆 as a separator when reading CSV files!
This means you can use characters up to 4 bytes long (the duck emoji is a single code point that takes 4 bytes in UTF-8) as separators.
So if you have a CSV file with this data:
a🦆b
hello🦆world
And run this code:
print(con.sql("SELECT * FROM read_csv('csv_file1.csv',sep = 'π¦')"))
You get this result:
┌─────────┬─────────┐
│    a    │    b    │
│ varchar │ varchar │
├─────────┼─────────┤
│ hello   │ world   │
└─────────┴─────────┘
It splits nicely.
However, if you create a CSV like:
a🇯🇵b
hello🇯🇵world
And run:
print(con.sql("SELECT * FROM read_csv('csv_file1.csv', sep = '🇯🇵')"))
You get this error:
duckdb.duckdb.InvalidInputException: Invalid Input Error: The delimiter option cannot exceed a size of 4 bytes.
This is because 🇯🇵 exceeds 4 bytes (it's actually two regional-indicator code points of 4 bytes each, so 8 bytes in total) and can't be used as a separator. I found it interesting to think about emoji byte sizes, which I'd never considered before.
I also found it remarkable that the English original used the word "emoji." Looking it up, emoji were first used in 1999 in NTT DoCoMo's i-mode mobile phone service. I've been using emoji since the feature-phone era but didn't know this history. They were later standardized globally and spread worldwide as "emoji" around 2011.
⑤ strict_mode to only read CSVs that follow RFC 4180
There are many types of CSV files, including haphazard ones like:
1,2,3
a,b,c,d,e,f,g
h
i,j,k,l
With the update, you can now enforce reading only CSVs that follow the strict CSV standard (RFC 4180).
If you want to read non-compliant CSVs, you can use:
FROM read_csv('rfc_4180-defiant.csv', strict_mode = false);
This could be useful for filtering out messy CSVs when you want to use reliable data sources.
⑥ SQL column and table name aliases can now be set at the beginning without using AS
Normally in SQL, you use AS for aliases:
SELECT
    some_long_and_winding_expression AS e1,
    t2.a_column_name AS e2
FROM
    long_schema.some_long_table_name AS t1,
    short_s.tbl AS t2;
With this update, you can now set column names like this (which is indeed more readable):
SELECT
    e1: some_long_and_winding_expression,
    e2: t2.a_column_name
FROM
    t1: long_schema.some_long_table_name,
    t2: short_s.tbl;
There are many other updates, and more details will apparently be covered in upcoming blog posts, which I'm looking forward to.