
ClickHouse vs PostgreSQL for Analytics: When to Add an OLAP Database

Benchmark identical analytical queries on ClickHouse and PostgreSQL at 1M to 1B rows. Learn why columnar storage delivers 10-100x speedups, how to sync data with Debezium CDC, and when PostgreSQL is enough.

By Abhishek Patel · 12 min read


PostgreSQL Is Not an Analytics Database

PostgreSQL is the best general-purpose relational database available. It handles transactions, JOINs, constraints, and ACID compliance better than almost anything else. But run a GROUP BY aggregation across 100 million rows with a few window functions, and you'll wait minutes -- sometimes longer. The problem isn't PostgreSQL. The problem is that row-oriented storage is fundamentally wrong for analytical workloads.

ClickHouse is a columnar OLAP database purpose-built for exactly these queries. It routinely delivers 10-100x speedups on aggregations, time-series rollups, and GROUP BY queries over large datasets. This guide benchmarks identical analytical queries on both databases at increasing scale, explains the architectural reasons behind the performance gap, covers the operational cost of running a second database, and helps you decide whether adding ClickHouse is worth the complexity.

What Is an OLAP Database?

Definition: An OLAP (Online Analytical Processing) database is optimized for read-heavy analytical queries -- aggregations, GROUP BY, window functions, and scans over large datasets. Unlike OLTP (Online Transaction Processing) databases such as PostgreSQL, OLAP systems sacrifice single-row lookup speed and transactional guarantees in favor of columnar storage, vectorized execution, and compression that make analytical queries orders of magnitude faster.

PostgreSQL is an OLTP database. ClickHouse is an OLAP database. They aren't competitors -- they solve different problems. The question is whether your analytical workload has grown large enough to justify adding a second database to your stack.

Why Columnar Storage Beats Row Storage for Analytics

PostgreSQL stores data row by row. When you run SELECT AVG(amount) FROM orders WHERE created_at > '2025-01-01', PostgreSQL reads every column of every matching row from disk -- even though you only need amount and created_at. With 20 columns per row, you're reading 10x more data than necessary.

ClickHouse stores data column by column. The same query reads only the amount and created_at columns. On a table with 20 columns, that's roughly 90% less I/O. Add aggressive compression (similar values in a column compress far better than mixed types in a row) and vectorized execution (processing columns in batches of thousands of values using SIMD instructions), and the performance gap compounds.
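
You can observe this I/O difference directly. A quick sketch, assuming the orders table exists on both systems: PostgreSQL's EXPLAIN (ANALYZE, BUFFERS) reports how many buffers the row-store scan touched, while ClickHouse records bytes read per query in system.query_log.

```sql
-- PostgreSQL: buffer/I/O counts for the row-store scan
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(amount) FROM orders WHERE created_at > '2025-01-01';

-- ClickHouse: bytes actually read by recent queries
SELECT query, read_rows, formatReadableSize(read_bytes) AS bytes_read
FROM system.query_log
WHERE type = 'QueryFinish'
ORDER BY event_time DESC
LIMIT 5;
```

On a wide table, the ClickHouse figure should be roughly the size of the two queried columns, not the whole table.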

| Property | Row Storage (PostgreSQL) | Column Storage (ClickHouse) |
| --- | --- | --- |
| Disk reads for analytical queries | Reads all columns per row | Reads only queried columns |
| Compression ratio | 2-4x typical | 10-20x typical |
| Point lookups (WHERE id = X) | Fast (index scan) | Slow (full column scan) |
| INSERT single row | Fast | Inefficient (batch preferred) |
| UPDATE / DELETE | Full MVCC support | Async mutations, eventually consistent |
| Transactions | Full ACID | No multi-statement transactions |
| Vectorized execution | No | Yes (SIMD) |

Benchmark: Identical Queries at 1M to 1B Rows

All benchmarks run on identical hardware: 8 vCPUs, 32 GB RAM, NVMe SSD. PostgreSQL 16 with default configuration plus tuned work_mem (256 MB), shared_buffers (8 GB), and parallel query workers enabled. ClickHouse 24.x with default settings. Data is a synthetic e-commerce orders table with 20 columns including timestamps, amounts, categories, and customer IDs.
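
For reference, the PostgreSQL tuning described above can be applied roughly like this. A sketch only: the parallel-worker count is an illustrative value, not a setting taken from the benchmark.

```sql
-- Benchmark-style PostgreSQL tuning (sketch)
ALTER SYSTEM SET shared_buffers = '8GB';               -- takes effect after a restart
ALTER SYSTEM SET work_mem = '256MB';                   -- per sort/hash operation
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;  -- illustrative; enables parallel scans
SELECT pg_reload_conf();                               -- reloads settings that don't need a restart
```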

Query 1: Simple Aggregation

SELECT
    COUNT(*) AS total_orders,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';

| Rows | PostgreSQL | ClickHouse | Speedup |
| --- | --- | --- | --- |
| 1M | 180 ms | 12 ms | 15x |
| 10M | 1.8 s | 45 ms | 40x |
| 100M | 18 s | 210 ms | 86x |
| 1B | 3+ min | 1.8 s | 100x+ |

Query 2: GROUP BY with Sorting

SELECT
    category,
    DATE_TRUNC('month', created_at) AS month,
    COUNT(*) AS order_count,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY category, month
ORDER BY revenue DESC
LIMIT 50;

| Rows | PostgreSQL | ClickHouse | Speedup |
| --- | --- | --- | --- |
| 1M | 320 ms | 18 ms | 18x |
| 10M | 3.4 s | 65 ms | 52x |
| 100M | 35 s | 380 ms | 92x |
| 1B | 6+ min | 3.2 s | 112x+ |

Query 3: Window Function

SELECT
    customer_id,
    created_at,
    amount,
    SUM(amount) OVER (
        PARTITION BY customer_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM orders
WHERE customer_id IN (SELECT customer_id FROM orders GROUP BY customer_id HAVING COUNT(*) > 10)
ORDER BY customer_id, created_at
LIMIT 1000;

| Rows | PostgreSQL | ClickHouse | Speedup |
| --- | --- | --- | --- |
| 1M | 450 ms | 35 ms | 13x |
| 10M | 5.2 s | 120 ms | 43x |
| 100M | 55 s | 620 ms | 89x |
| 1B | 10+ min | 5.5 s | 109x+ |

Query 4: Time-Series Rollup

-- PostgreSQL
SELECT
    DATE_TRUNC('hour', created_at) AS hour,
    region,
    COUNT(*) AS orders,
    PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY amount) AS p95_amount
FROM orders
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY hour, region
ORDER BY hour;

-- ClickHouse equivalent
SELECT
    toStartOfHour(created_at) AS hour,
    region,
    count() AS orders,
    quantile(0.95)(amount) AS p95_amount
FROM orders
WHERE created_at >= now() - INTERVAL 7 DAY
GROUP BY hour, region
ORDER BY hour;

| Rows | PostgreSQL | ClickHouse | Speedup |
| --- | --- | --- | --- |
| 1M | 520 ms | 22 ms | 24x |
| 10M | 5.8 s | 85 ms | 68x |
| 100M | 62 s | 450 ms | 138x |
| 1B | 12+ min | 3.8 s | 189x+ |

Where PostgreSQL Wins

ClickHouse dominates analytical queries, but PostgreSQL is faster for OLTP workloads:

| Query Type | PostgreSQL | ClickHouse | Winner |
| --- | --- | --- | --- |
| Point lookup (WHERE id = X) | 0.1 ms | 10-50 ms | PostgreSQL (100x+) |
| Single-row INSERT | 0.5 ms | N/A (batch only) | PostgreSQL |
| UPDATE single row | 0.5 ms | Async mutation | PostgreSQL |
| Transaction (multi-statement) | Full ACID | Not supported | PostgreSQL |
| Foreign key enforcement | Built-in | Not supported | PostgreSQL |

This is precisely why you run both databases rather than replacing one with the other. PostgreSQL handles your application's transactional workload. ClickHouse handles analytics, dashboards, and reporting.
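
Because single-row inserts are inefficient in ClickHouse, writes to the analytics side should be batched. A minimal sketch (the values are illustrative; production batches are typically thousands of rows per INSERT):

```sql
-- ClickHouse: many rows per INSERT statement, not one
INSERT INTO orders (order_id, customer_id, created_at, amount, category, region) VALUES
    (1001, 42, '2025-06-01 10:00:00', 59.90, 'books', 'EU'),
    (1002, 43, '2025-06-01 10:00:01', 12.50, 'music', 'US'),
    (1003, 44, '2025-06-01 10:00:02', 99.00, 'games', 'EU');
```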

ClickHouse Architecture: Why It's Fast

ClickHouse's performance comes from several architectural decisions that trade OLTP capability for analytical speed:

MergeTree Engine Family

MergeTree is the default table engine. Data is written in sorted parts that are periodically merged in the background (similar to LSM trees). The sort order is defined by the ORDER BY clause at table creation time, which determines the primary index. Choosing the right sort key is the single most important performance decision in ClickHouse.

-- Good: ORDER BY matches your most common WHERE/GROUP BY patterns
CREATE TABLE orders (
    order_id UInt64,
    customer_id UInt64,
    created_at DateTime,
    category LowCardinality(String),
    region LowCardinality(String),
    amount Decimal(10, 2),
    quantity UInt16,
    status LowCardinality(String)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(created_at)
ORDER BY (category, region, created_at)
SETTINGS index_granularity = 8192;

-- ReplacingMergeTree: handles upserts by keeping the latest version
CREATE TABLE customers (
    customer_id UInt64,
    name String,
    email String,
    updated_at DateTime
)
ENGINE = ReplacingMergeTree(updated_at)
ORDER BY customer_id;
-- Deduplication happens during merges, not immediately
-- Use FINAL keyword to get deduplicated results:
-- SELECT * FROM customers FINAL WHERE customer_id = 12345;

Pro tip: Put low-cardinality columns first in the ORDER BY. With ORDER BY (status, region, created_at), 5 statuses and 10 regions yield only 50 key-prefix combinations, so ClickHouse can skip large contiguous ranges of granules when you filter on status or region. If you put created_at first, the primary index is far less effective for those filters.
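
To check how well a sort key serves a given filter, ClickHouse can report which index granules a query would actually read. A sketch against the orders table defined above:

```sql
-- Shows partition and primary-key pruning: granules selected vs. total
EXPLAIN indexes = 1
SELECT count()
FROM orders
WHERE category = 'books' AND region = 'EU' AND created_at >= '2025-06-01';
```

If the selected-granule count is close to the total, the sort key is not helping that filter.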

Materialized Views for Pre-Aggregation

ClickHouse materialized views are triggers that transform data on INSERT -- they don't store copies of the source table. This makes them ideal for maintaining pre-aggregated rollup tables:

-- Source table receives raw events
-- Materialized view maintains hourly aggregates automatically
CREATE MATERIALIZED VIEW orders_hourly_mv
ENGINE = SummingMergeTree()
ORDER BY (category, region, hour)
AS SELECT
    category,
    region,
    toStartOfHour(created_at) AS hour,
    count() AS order_count,
    sum(amount) AS total_revenue
FROM orders
GROUP BY category, region, hour;

-- Query the pre-aggregated view instead of scanning raw data
SELECT hour, sum(order_count), sum(total_revenue)
FROM orders_hourly_mv
WHERE hour >= today() - 7
GROUP BY hour
ORDER BY hour;

Operational Reality: Running Two Databases

Adding ClickHouse to your stack means running a second database, and that carries real costs:

  • Data synchronization -- you need to get data from PostgreSQL into ClickHouse. This is the hardest part.
  • Schema management -- two sets of migrations, two sets of schemas to maintain.
  • Monitoring -- ClickHouse has its own metrics, its own failure modes, its own tuning parameters.
  • Team knowledge -- your team needs to learn ClickHouse SQL dialects, engine types, and operational best practices.

CDC Sync with Debezium

The most robust way to replicate PostgreSQL data into ClickHouse is Change Data Capture (CDC) using Debezium. Debezium reads PostgreSQL's WAL (write-ahead log) and streams changes to Kafka, which ClickHouse consumes:

PostgreSQL (WAL) --> Debezium --> Kafka --> ClickHouse (Kafka engine or clickhouse-kafka-connect)

A Debezium connector configuration for this pipeline, registered with Kafka Connect:
{
  "name": "pg-orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "pg-primary.internal",
    "database.port": "5432",
    "database.dbname": "app",
    "database.user": "debezium",
    "table.include.list": "public.orders,public.customers",
    "topic.prefix": "cdc",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_orders",
    "publication.name": "dbz_publication",
    "snapshot.mode": "initial"
  }
}
-- ClickHouse: consume from Kafka topic
CREATE TABLE orders_kafka_queue (
    order_id UInt64,
    customer_id UInt64,
    created_at DateTime,
    amount Decimal(10, 2),
    category String,
    region String
)
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'cdc.public.orders',
    kafka_group_name = 'clickhouse-orders',
    kafka_format = 'JSONEachRow';

-- Materialized view moves data from Kafka queue to MergeTree
CREATE MATERIALIZED VIEW orders_kafka_mv TO orders AS
SELECT * FROM orders_kafka_queue;

Warning: CDC pipelines add latency. Expect 1-10 seconds of delay between a PostgreSQL write and the data appearing in ClickHouse. For real-time dashboards, this is usually acceptable. For use cases requiring strict consistency, query PostgreSQL directly. Note also that Debezium's default output wraps each row in a change envelope (before/after/op metadata); to consume it with kafka_format = 'JSONEachRow' as shown, you typically add Debezium's ExtractNewRecordState transform so the topic carries flattened rows.

Alternatives to ClickHouse

ClickHouse isn't the only option for analytical workloads. Here's how the alternatives compare:

| Database | Type | Best For | Trade-off |
| --- | --- | --- | --- |
| TimescaleDB | PostgreSQL extension | Time-series on existing PG | Still row-based under the hood; slower than true columnar at scale |
| DuckDB | Embedded columnar | Single-node analytics, data science | In-process only; no server mode for concurrent users |
| Apache Druid | Distributed OLAP | Real-time event analytics at massive scale | Complex to operate; requires ZooKeeper, deep storage, and multiple node types |
| Apache Pinot | Distributed OLAP | User-facing real-time analytics | Similar complexity to Druid; less community adoption |
| StarRocks | Distributed OLAP | MySQL-compatible analytics | Newer ecosystem; smaller community than ClickHouse |

DuckDB deserves special mention. If your analytical workload runs on a single machine and you don't need concurrent access, DuckDB gives you columnar performance without any infrastructure. It can query Parquet files directly, runs embedded in your application, and handles datasets up to a few hundred gigabytes on a single node. For data exploration and one-off analytics, DuckDB is often better than setting up a ClickHouse cluster.
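
As a sketch of that workflow (the Parquet filename is hypothetical), DuckDB can aggregate an exported file directly, with no server and no load step:

```sql
-- DuckDB: query a Parquet export in place
SELECT category, count(*) AS orders, sum(amount) AS revenue
FROM read_parquet('orders_export.parquet')
GROUP BY category
ORDER BY revenue DESC;
```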

Decision Framework: When to Add ClickHouse

Use this framework to decide whether ClickHouse is worth the operational overhead:

  1. Under 10M rows with simple aggregations? Stay on PostgreSQL. Add appropriate indexes. Tune work_mem and enable parallel query. This handles most early-stage analytics needs.
  2. 10M-100M rows, queries under 5 seconds acceptable? Try TimescaleDB for time-series data or materialized views in PostgreSQL for pre-computed aggregates. Still no need for a second database.
  3. 100M+ rows, dashboards need sub-second response? This is ClickHouse territory. The query performance difference is dramatic enough to justify the operational complexity.
  4. 1B+ rows, multiple concurrent dashboard users? ClickHouse is the clear choice. PostgreSQL physically cannot scan this much data fast enough for interactive analytics.
  5. Need ad-hoc analytics without infrastructure? Use DuckDB locally. Export data to Parquet and query it directly. No servers, no sync pipelines.

The threshold isn't just about row count. It's about query patterns. If your analytics are pre-computed via cron jobs and materialized views, PostgreSQL can serve pre-aggregated results at any scale. ClickHouse shines when users need to run arbitrary ad-hoc queries over raw data -- slicing by different dimensions, drilling down into time ranges, filtering by combinations of attributes that you can't predict in advance.

Frequently Asked Questions

Can ClickHouse replace PostgreSQL entirely?

No. ClickHouse lacks ACID transactions, single-row updates, foreign keys, and efficient point lookups. It's designed for batch inserts and analytical reads. Your application's transactional workload -- user registrations, order processing, inventory updates -- must stay on PostgreSQL or another OLTP database. Run both databases, each handling the workload it was designed for.

How do I sync data from PostgreSQL to ClickHouse?

The most reliable method is CDC (Change Data Capture) using Debezium. Debezium reads PostgreSQL's WAL and streams row-level changes to Kafka, which ClickHouse consumes via its Kafka table engine. For simpler setups, you can use periodic batch exports with COPY TO and clickhouse-client --query="INSERT INTO ... FORMAT CSV". CDC gives you near-real-time sync; batch exports are simpler but introduce minutes to hours of latency.
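
A minimal sketch of the batch approach described above (the file path and one-day window are illustrative):

```sql
-- PostgreSQL: export recent orders to CSV
-- (COPY TO a file requires server-side access; use \copy in psql otherwise)
COPY (
    SELECT * FROM orders
    WHERE created_at >= now() - interval '1 day'
) TO '/tmp/orders_delta.csv' WITH (FORMAT csv);

-- ClickHouse: load the same file, e.g.
--   clickhouse-client --query="INSERT INTO orders FORMAT CSV" < /tmp/orders_delta.csv
```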

What is the MergeTree engine in ClickHouse?

MergeTree is ClickHouse's primary table engine. It stores data in sorted parts (defined by the ORDER BY clause) and periodically merges them in the background. The sort order serves as a sparse primary index, allowing ClickHouse to skip irrelevant data blocks during queries. Variants include ReplacingMergeTree (deduplication), SummingMergeTree (automatic pre-aggregation), and AggregatingMergeTree (custom aggregate functions). Choosing the right ORDER BY is the most impactful performance tuning decision.

Is DuckDB a better alternative to ClickHouse?

DuckDB is a better choice when you need single-user, single-machine analytics -- data science workflows, ETL scripts, or ad-hoc exploration. It runs embedded (no server), handles Parquet files natively, and delivers columnar performance without infrastructure. ClickHouse is the better choice when you need a multi-user analytics server with concurrent queries, real-time data ingestion, and datasets spanning terabytes across distributed nodes.

How much faster is ClickHouse than PostgreSQL?

For analytical queries (aggregations, GROUP BY, window functions), ClickHouse is typically 10-100x faster. The speedup increases with dataset size: at 1 million rows the gap might be 15x, but at 1 billion rows it can exceed 100x. For point lookups and single-row operations, PostgreSQL is 100x+ faster. The performance difference comes from columnar storage, vectorized execution, and compression -- architectural advantages that compound as data grows.

What is ReplacingMergeTree and when do I use it?

ReplacingMergeTree is a ClickHouse table engine that handles upserts by keeping only the latest version of each row (determined by a version column). During background merges, duplicate rows with the same ORDER BY key are collapsed to the newest version. Use it for dimension tables synced from PostgreSQL where rows are updated -- customer profiles, product catalogs, configuration data. Note that deduplication is eventual, not immediate. Use the FINAL keyword in queries when you need guaranteed deduplicated results.

How much does ClickHouse cost to run in production?

Self-hosted ClickHouse on 3 nodes (8 vCPU, 32 GB RAM each) costs roughly $400-800/month on AWS depending on instance type and storage. ClickHouse Cloud starts with pay-per-query pricing and scales based on compute and storage usage. For comparison, the same analytical workload on PostgreSQL would require significantly more hardware to achieve acceptable query times at scale, often costing more in compute than a dedicated ClickHouse deployment. The real cost is operational: CDC pipelines, monitoring, and team expertise.


Written by

Abhishek Patel

Infrastructure engineer with 10+ years building production systems on AWS, GCP, and bare metal. Writes practical guides on cloud architecture, containers, networking, and Linux for developers who want to understand how things actually work under the hood.
