Introduction

DuckDB has been updated to version 1.2.0. πŸ™Œ

It seems that various features have been added, so I tried out some of the new functionality mentioned in the official blog.

Reference

https://duckdb.org/2025/02/05/announcing-duckdb-120.html

Codename for this Update

The codename for this release is "Histrionicus".

This refers to the Harlequin Duck (scientific name: Histrionicus histrionicus) that lives in cold rivers in North America, Greenland, Iceland, and Eastern Russia.

The Japanese name is "Shinorigamo," and they apparently fly to coastal areas of Hokkaido and the Tohoku region in winter.

I looked at some pictures, but since I've always lived in the western part of Japan, I don't think I've ever seen one. πŸ¦†

Common Python Code Setup

I'm using the following code to enable DuckDB in Python (the usual setup):

import duckdb

# Create an in-memory DuckDB database
con = duckdb.connect(database=':memory:')

β‘  RANDOM() now produces far more distinct random values

In this update, the random output has been changed from 32-bit to 64-bit.

So the previous ceiling:

32-bit maximum: 4,294,967,295 possible values

has become:

64-bit maximum: 18,446,744,073,709,551,615 possible values

I tried the following code to test this, generating roughly 5 billion random numbers (note: this takes about 10 minutes).

# Generate a large number of random numbers
query = """
SELECT 
    COUNT(DISTINCT random_value) AS unique_count,
    MIN(random_value) AS min_value,
    MAX(random_value) AS max_value
FROM (
    SELECT CAST(RANDOM() * (2 ** 63) AS BIGINT) AS random_value
    FROM range(1, 5000000000)
) AS subquery
"""

# Execute the query and get results
result = con.sql(query)
print(result)

max_32bit = (2 ** 32) - 1
print(f"32-bit maximum: {max_32bit}")
max_64bit = (2 ** 64) - 1
print(f"64-bit maximum: {max_64bit}")

Results:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ unique_count β”‚ min_value  β”‚      max_value      β”‚
β”‚    int64     β”‚   int64    β”‚        int64        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   4999999093 β”‚ 3023944772 β”‚ 9223372031664709632 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

32-bit maximum: 4294967295
64-bit maximum: 18446744073709551615

Looking at the output, the number of distinct values indeed exceeds the 32-bit limit of about 4.3 billion.

(I'm not completely sure if this is the meaning of the update.)

β‘‘ Map return values changed from lists to single values

SELECT map(['k'], ['v'])['k'];

This builds a map with key 'k' and value 'v', then looks up key 'k'; the result should be v.

When running this code previously, the result was apparently:

Old:

['v']

With the update, it's now:

New:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ "map"(main.list_value('k'), main.list_value('v'))['k'] β”‚
β”‚                        varchar                         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ v                                                      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

This could be a breaking change for anyone who has been using map, since the lookup now returns a single value instead of a list.
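To see the new behavior from Python, here's a minimal sketch (the looked_up alias is my own):

# The lookup now returns the value itself rather than a one-element list
print(con.sql("SELECT map(['k'], ['v'])['k'] AS looked_up"))

# fetchone() accordingly yields the bare string 'v'
value = con.execute("SELECT map(['k'], ['v'])['k']").fetchone()[0]
print(value)  # 'v'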

β‘’ Adding primary keys to existing tables

It seems you can now add primary keys to tables you've already created!

import pandas as pd

data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "age": [25, 30, 35, 40, 45]
})

# Create table (initially without primary key)
con.sql("CREATE OR REPLACE TABLE tbl AS SELECT * FROM data")

print(con.sql("SELECT * FROM tbl"))

# Add a primary key to the existing table
con.sql("ALTER TABLE tbl ADD PRIMARY KEY (id);")

After adding the primary key, if you try to insert a row with ID 5, which already exists:

con.sql("INSERT INTO tbl VALUES (5, 'Frank', 50)")
duckdb.duckdb.ConstraintException: Constraint Error: Duplicate key "id: 5" violates primary key constraint.

It's rejected due to the primary key constraint that prevents duplicate values.

Also, previously you couldn't insert a row with the same ID as one you had deleted, but that has been fixed and now works:

con.sql("""
DELETE FROM tbl WHERE id = 1;
""")
con.sql("""
INSERT INTO tbl VALUES (1, 'Dareka',100);
""")
print(con.sql("SELECT * FROM tbl"))

The insertion succeeds as shown below:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚  id   β”‚  name   β”‚  age  β”‚
β”‚ int64 β”‚ varchar β”‚ int64 β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚     2 β”‚ Bob     β”‚    30 β”‚
β”‚     3 β”‚ Charlie β”‚    35 β”‚
β”‚     4 β”‚ David   β”‚    40 β”‚
β”‚     5 β”‚ Eve     β”‚    45 β”‚
β”‚     1 β”‚ Dareka  β”‚   100 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

β‘£ πŸ¦†πŸ¦†πŸ¦†

Surprisingly, you can now use πŸ¦† as a separator when reading CSV files!

This means you can now use delimiters of up to 4 bytes (the duck emoji is exactly 4 bytes in UTF-8).

So if you have a CSV file with this data:

aπŸ¦†b
helloπŸ¦†world
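To reproduce this, the file can be written from Python first (a small sketch; the file name just matches the query below):

# Create the duck-delimited CSV used in the next query
with open('csv_file1.csv', 'w', encoding='utf-8') as f:
    f.write('aπŸ¦†b\n')
    f.write('helloπŸ¦†world\n')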

And run this code:

print(con.sql("SELECT * FROM read_csv('csv_file1.csv',sep = 'πŸ¦†')"))

You get this result:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    a    β”‚    b    β”‚
β”‚ varchar β”‚ varchar β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ hello   β”‚ world   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

It splits nicely.

However, if you create a CSV like:

aπŸ‡―πŸ‡΅b
helloπŸ‡―πŸ‡΅world

And run:

print(con.sql("SELECT * FROM read_csv('csv_file1.csv',sep = 'πŸ‡―πŸ‡΅')"))

You get this error:

duckdb.duckdb.InvalidInputException: Invalid Input Error: The delimiter option cannot exceed a size of 4 bytes.

This is because πŸ‡―πŸ‡΅ exceeds 4 bytes (it's actually two 4-byte code points, 8 bytes in total) and can't be used as a separator. I found it interesting to think about emoji byte sizes, which I'd never considered before.
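The byte counts are easy to confirm from Python with UTF-8 encoding:

print(len('πŸ¦†'.encode('utf-8')))   # 4: one code point, U+1F986
print(len('πŸ‡―πŸ‡΅'.encode('utf-8')))  # 8: two regional-indicator code points, 4 bytes each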

I also found it striking that the official blog used the English word "emoji," which originally comes from Japanese. Looking it up, emoji were first used in 1999 in NTT DoCoMo's i-mode mobile phone service. I've been using emoji since the feature-phone era but didn't know this history. They were later standardized internationally and spread worldwide as "emoji" around 2011.

β‘€ strict_mode to only read CSVs that follow RFC 4180

There are many types of CSV files, including haphazard ones like:

1,2,3
a,b,c,d,e,f,g
h
i,j,k,l

With the update, you can now enforce reading only CSVs that follow the strict CSV standard (RFC 4180).

If you want to read non-compliant CSVs, you can use:

FROM read_csv('rfc_4180-defiant.csv', strict_mode = false);

This could be useful for filtering out messy CSVs when you want to use reliable data sources.
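To try it end to end, here's a minimal sketch (the file name echoes the one above; I'm assuming strict_mode defaults to true in 1.2.0, as the announcement suggests):

# Write the haphazard CSV shown above
with open('rfc_4180-defiant.csv', 'w') as f:
    f.write('1,2,3\na,b,c,d,e,f,g\nh\ni,j,k,l\n')

# With the default strict mode, this read should fail on the ragged rows:
# con.sql("FROM read_csv('rfc_4180-defiant.csv')")

# Relaxing strict_mode lets DuckDB attempt a best-effort parse
print(con.sql("FROM read_csv('rfc_4180-defiant.csv', strict_mode = false)"))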

β‘₯ SQL column and table aliases can now be written up front without AS

Normally in SQL, you use AS for aliases:

SELECT 
    some_long_and_winding_expression AS e1,
    t2.a_column_name AS e2
FROM
    long_schema.some_long_table_name AS t1,
    short_s.tbl AS t2;

With this update, you can now put the alias first, for both columns and tables (which is indeed more readable):

SELECT 
    e1: some_long_and_winding_expression,
    e2: t2.a_column_name 
FROM
    t1: long_schema.some_long_table_name,
    t2: short_s.tbl;
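The prefix syntax also works from Python, for example against the tbl table created earlier (a quick sketch; the alias names are my own):

print(con.sql("SELECT person: name, years: age FROM t: tbl"))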

There are many other updates, and more details will apparently be written in a blog post in the near future, which I'm looking forward to. πŸ™Œ