DuckDB: The In-Process Analytics Database Revolutionizing Data Processing

Gaurav Kumar
8 min read · Jun 12, 2024

Introduction

DuckDB is an in-process SQL OLAP (Online Analytical Processing) database management system designed to support fast analytical queries on large datasets. With its robust, high-performance capabilities, DuckDB has become a popular choice among data analysts and data scientists. In this article, we’ll explore the features, advantages, and limitations of DuckDB, provide code examples, and compare it with SQLite and PostgreSQL.

What is DuckDB?

DuckDB is an open-source, embeddable SQL database engine designed specifically for analytical workloads. Unlike traditional OLTP (Online Transaction Processing) databases, DuckDB focuses on optimizing queries for reading large datasets rather than writing numerous small transactions. Its design enables efficient execution of complex analytical queries directly within the user’s application.

Key Features of DuckDB

  1. In-Process Execution: DuckDB runs within the same process as your application, eliminating the need for a separate database server. This reduces the complexity of deployment and enhances performance by minimizing data transfer overhead.
  2. Columnar Storage: DuckDB uses columnar storage, which is optimal for read-heavy analytical queries. This allows for efficient data compression and faster query execution.
  3. Vectorized Execution: DuckDB uses vectorized query execution, processing data in chunks (vectors) rather than row-by-row. This approach leverages modern CPU architectures, improving cache utilization and execution speed.
  4. SQL Support: DuckDB provides a comprehensive SQL interface, supporting standard SQL queries, including complex joins, aggregations, and window functions (a short window-function sketch follows this list).
  5. Integration with Analytical Tools: DuckDB integrates seamlessly with popular data analysis tools and languages such as Python, R, and Jupyter notebooks, making it an excellent choice for data scientists.
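
To make the SQL support above concrete, here is a minimal sketch of a window-function query using the Python API introduced later in this article; the names and values are illustrative:

import duckdb

# Rank rows by salary within each department using a window function.
duckdb.sql("""
    SELECT *, rank() OVER (PARTITION BY dept ORDER BY salary DESC) AS dept_rank
    FROM (VALUES ('Alice', 'Eng', 100000),
                 ('Bob',   'Eng',  90000),
                 ('Carol', 'HR',   80000)) AS t(name, dept, salary)
""").show()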

Performance Insights

In practice, DuckDB often outperforms SQLite for analytical queries due to its columnar storage and vectorized execution. Compared to PostgreSQL, DuckDB’s in-process execution can provide faster results for certain workloads, especially when the overhead of client-server communication is significant.
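
If you want a rough sense of this on your own machine, the following sketch times the same group-by aggregation in DuckDB and in pandas; the data is synthetic, and actual numbers depend heavily on data size and hardware:

import time

import duckdb
import pandas as pd

# One million rows of synthetic data; substitute your own for a real test.
df = pd.DataFrame({"g": [i % 100 for i in range(1_000_000)], "v": 1.5})

t0 = time.perf_counter()
duckdb.sql("SELECT g, avg(v) FROM df GROUP BY g").fetchall()  # DuckDB scans df directly
t1 = time.perf_counter()
df.groupby("g")["v"].mean()  # the equivalent pandas aggregation
t2 = time.perf_counter()

print(f"DuckDB: {t1 - t0:.3f}s, pandas: {t2 - t1:.3f}s")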

DuckDB vs. SQLite vs. PostgreSQL

DuckDB

  • Type: In-process, OLAP-focused
  • Storage: Columnar
  • Use Case: Analytical queries, data science, and embedded analytics
  • Strengths: Fast query execution, easy integration, in-process simplicity
  • Weaknesses: Not suitable for OLTP, limited concurrency

SQLite

  • Type: In-process, general-purpose
  • Storage: Row-based
  • Use Case: Lightweight, embedded applications, mobile apps
  • Strengths: Simplicity, lightweight, widely used
  • Weaknesses: Limited to smaller datasets, not optimized for complex analytical queries

PostgreSQL

  • Type: Client-server, general-purpose
  • Storage: Row-based (with optional columnar store extensions like Citus)
  • Use Case: High concurrency, complex transactions, large-scale applications
  • Strengths: Robust feature set, extensibility, high concurrency support
  • Weaknesses: More complex setup and maintenance, larger footprint

Deep Dive into DuckDB

Architecture

DuckDB’s architecture is designed to facilitate high-performance analytical queries within an embedded environment. Here’s a closer look at its architectural components:

  1. Columnar Storage: DuckDB stores data in a columnar format, which is optimal for read-heavy operations typical in analytical queries. This structure allows for better compression rates and faster access to the relevant data, reducing the amount of data read from disk or memory.
  2. Vectorized Execution Engine: DuckDB processes data in vectors (chunks of columns) instead of rows. This vectorized execution leverages CPU cache more effectively, leading to significant performance improvements, particularly for operations on large datasets (see the query-plan sketch after this list).
  3. In-Memory Processing: While DuckDB can work with data on disk, it excels when data fits into memory, providing very low-latency query responses. The in-memory execution minimizes I/O operations, which are often a bottleneck in database performance.
  4. In-Process Architecture: DuckDB runs within the same process as the application using it. This design choice eliminates the need for inter-process communication (IPC), reducing overhead and simplifying deployment.
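
One way to see this architecture at work is to ask DuckDB for a physical query plan. In the sketch below (the table and data are illustrative), explain() returns the operator pipeline as a string; each operator processes data one vector at a time:

import duckdb

con = duckdb.connect()
# A small table built with the range() table function
con.sql("CREATE TABLE t AS SELECT range AS i, range % 10 AS g FROM range(1000)")
# Print the physical plan, e.g. a hash group-by sitting on top of a table scan
print(con.sql("SELECT g, sum(i) AS total FROM t GROUP BY g").explain())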

Use Cases

DuckDB is well-suited for a variety of analytical and data science use cases:

  1. Data Analysis: Analysts can use DuckDB to quickly analyze large datasets, perform aggregations, and generate insights without setting up a complex database infrastructure.
  2. Data Science and Machine Learning: DuckDB integrates seamlessly with Python and R, making it an excellent choice for data scientists who need to preprocess large datasets, join multiple tables, and perform exploratory data analysis (EDA) within their notebooks or scripts.
  3. ETL Pipelines: DuckDB can be used in Extract, Transform, Load (ETL) processes to transform and aggregate data before loading it into a data warehouse or another storage system (a minimal ETL sketch follows this list).
  4. Embedded Analytics: Applications that require embedded analytics capabilities can benefit from DuckDB’s in-process nature, enabling sophisticated analytics features without a separate database server.
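
As an illustration of the ETL case, here is a minimal sketch that reads a raw CSV file, aggregates it, and writes the result to Parquet in a single SQL statement; the file and column names are hypothetical:

import duckdb

# Extract from a (hypothetical) raw CSV, transform with an aggregation,
# and load the result into a Parquet file for downstream systems.
duckdb.sql("""
    COPY (
        SELECT customer_id, sum(amount) AS total_amount
        FROM read_csv_auto('raw_orders.csv')
        GROUP BY customer_id
    ) TO 'orders_by_customer.parquet' (FORMAT PARQUET)
""")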

Practical Example with DuckDB

To illustrate DuckDB’s capabilities, let’s walk through a practical example using Python.

Installation

Installing DuckDB is straightforward. You can use pip for Python:

pip install duckdb

Creating a Database and Table

import duckdb
# Connect to DuckDB (creates an in-memory database)
con = duckdb.connect()
# Create a sample table
con.execute("""
    CREATE TABLE employees (
        id INTEGER,
        name VARCHAR,
        department VARCHAR,
        salary INTEGER
    )
""")

Inserting Data

con.execute("""
INSERT INTO employees VALUES
(1, 'Alice', 'Engineering', 100000),
(2, 'Bob', 'HR', 60000),
(3, 'Charlie', 'Engineering', 120000),
(4, 'David', 'Marketing', 70000)
""")

Querying Data

# Simple SELECT query
result = con.execute("SELECT * FROM employees").fetchall()
print(result)

Output:

[(1, 'Alice', 'Engineering', 100000), (2, 'Bob', 'HR', 60000), (3, 'Charlie', 'Engineering', 120000), (4, 'David', 'Marketing', 70000)]

Analytical Query

# Aggregation query to find the average salary by department
avg_salary = con.execute("""
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
""").fetchall()
print(avg_salary)

Output:

[('Engineering', 110000.0), ('Marketing', 70000.0), ('HR', 60000.0)]

Basic API Usage

The most straightforward way to run SQL queries with DuckDB is the duckdb.sql() function.

import duckdb
duckdb.sql("SELECT 42").show()

This will run queries using an in-memory database that is stored globally inside the Python module. The result of the query is returned as a Relation. A relation is a symbolic representation of the query. The query is not executed until the result is fetched or requested to be printed to the screen.

Relations can be referenced in subsequent queries by storing them inside variables, and using them as tables. This way queries can be constructed incrementally.

import duckdb
r1 = duckdb.sql("SELECT 42 AS i")
duckdb.sql("SELECT i * 2 AS k FROM r1").show()

Data Input

DuckDB can ingest data from a wide variety of formats, both on-disk and in-memory. See the DuckDB data ingestion documentation for more information.

import duckdb
duckdb.read_csv("example.csv")                # read a CSV file into a Relation
duckdb.sql("SELECT * FROM 'example.csv'")     # directly query a CSV file
duckdb.read_parquet("example.parquet")        # read a Parquet file into a Relation
duckdb.sql("SELECT * FROM 'example.parquet'") # directly query a Parquet file
duckdb.read_json("example.json")              # read a JSON file into a Relation
duckdb.sql("SELECT * FROM 'example.json'")    # directly query a JSON file

DataFrames

DuckDB can directly query Pandas DataFrames, Polars DataFrames and Arrow tables. Note that these are read-only, i.e., editing these tables via INSERT or UPDATE statements is not possible.

import duckdb
# directly query a Pandas DataFrame
import pandas as pd
pandas_df = pd.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM pandas_df")
# directly query a Polars DataFrame
import polars as pl
polars_df = pl.DataFrame({"a": [42]})
duckdb.sql("SELECT * FROM polars_df")
# directly query a pyarrow table
import pyarrow as pa
arrow_table = pa.Table.from_pydict({"a": [42]})
duckdb.sql("SELECT * FROM arrow_table")

Result Conversion

DuckDB supports converting query results efficiently to a variety of formats.

import duckdb
duckdb.sql("SELECT 42").fetchall() # Python objects
duckdb.sql("SELECT 42").df() # Pandas DataFrame
duckdb.sql("SELECT 42").pl() # Polars DataFrame
duckdb.sql("SELECT 42").arrow() # Arrow Table
duckdb.sql("SELECT 42").fetchnumpy() # NumPy Arrays

Writing Data to Disk

DuckDB supports writing Relation objects directly to disk in a variety of formats. The COPY statement can be used to write data to disk using SQL as an alternative.

import duckdb
duckdb.sql("SELECT 42").write_parquet("out.parquet") # Write to a Parquet file
duckdb.sql("SELECT 42").write_csv("out.csv") # Write to a CSV file
duckdb.sql("COPY (SELECT 42) TO 'out.parquet'") # Copy to a Parquet file

Connection Options

Applications can open a new DuckDB connection via the duckdb.connect() method.

Using an In-Memory Database

When using DuckDB through duckdb.sql(), it operates on an in-memory database, i.e., no tables are persisted on disk. Invoking the duckdb.connect() method without arguments returns a connection, which also uses an in-memory database:

import duckdb
con = duckdb.connect()
con.sql("SELECT 42 AS x").show()

Persistent Storage

Calling duckdb.connect(dbname) creates a connection to a persistent database. Any data written to that connection is persisted and can be reloaded by reconnecting to the same file, both from Python and from other DuckDB clients.

import duckdb
# create a connection to a file called 'file.db'
con = duckdb.connect("file.db")
# create a table and load data into it
con.sql("CREATE TABLE test (i INTEGER)")
con.sql("INSERT INTO test VALUES (42)")
# query the table
con.table("test").show()
# explicitly close the connection
con.close()
# Note: connections are also closed implicitly when they go out of scope

You can also use a context manager to ensure that the connection is closed:

import duckdb
with duckdb.connect("file.db") as con:
    con.sql("CREATE TABLE test (i INTEGER)")
    con.sql("INSERT INTO test VALUES (42)")
    con.table("test").show()
    # the context manager closes the connection automatically

Configuration

The duckdb.connect() method accepts a config dictionary in which configuration options can be specified. For example, to limit DuckDB to a single thread:

import duckdb
con = duckdb.connect(config={'threads': 1})

Connection Object and Module

The connection object and the duckdb module can be used interchangeably – they support the same methods. The only difference is that when using the duckdb module a global in-memory database is used.
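
For example, these two snippets run the same query; the first uses the global in-memory database, the second an explicit connection:

import duckdb

# module-level call: runs against the global in-memory database
duckdb.sql("SELECT 42 AS x").show()

# the same call on an explicit connection object
con = duckdb.connect()
con.sql("SELECT 42 AS x").show()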

Using Connections in Parallel Python Programs

The DuckDBPyConnection object is not thread-safe. If you would like to write to the same database from multiple threads, create a cursor for each thread with the DuckDBPyConnection.cursor() method.
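
A minimal sketch of that pattern follows; the database file, table, and values are illustrative:

import duckdb
import threading

con = duckdb.connect("file.db")
con.sql("CREATE TABLE IF NOT EXISTS items (i INTEGER)")

def insert_value(parent_con, value):
    # each thread works through its own cursor on the shared connection
    cur = parent_con.cursor()
    cur.execute("INSERT INTO items VALUES (?)", [value])
    cur.close()

threads = [threading.Thread(target=insert_value, args=(con, i)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(con.sql("SELECT count(*) FROM items").fetchone())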

Loading and Installing Extensions

DuckDB’s Python API provides functions for installing and loading extensions, equivalent to running the INSTALL and LOAD SQL commands, respectively. An example that installs and loads the spatial extension follows:

import duckdb
con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

To load unsigned extensions, use the config = {"allow_unsigned_extensions": "true"} argument to the duckdb.connect() method.
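
For example:

import duckdb

# allow loading extensions that have not been signed by the DuckDB team
con = duckdb.connect(config={"allow_unsigned_extensions": "true"})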

Conclusion

DuckDB represents a significant advancement in the realm of analytical databases, offering high performance and ease of use for data analysis tasks. While it excels in scenarios involving complex analytical queries and large datasets, it is not a one-size-fits-all solution. For general-purpose applications requiring high concurrency and transactional integrity, PostgreSQL remains a robust choice. Meanwhile, SQLite continues to be a reliable option for lightweight, embedded applications. Understanding the strengths and weaknesses of each database system will help you choose the right tool for your specific needs.
