Interesting links - March 2025

The problem with publishing February’s interesting links at the beginning of the month and now getting around to publishing March’s at the end is that I have nearly two months' worth of links to share 😅 So with no further ado, let’s crack on.

DuckDB 🔗

I’ve been using DuckDB a lot recently and wrote up a few articles:
The release of a built-in UI got the DuckDB community a flappin' and quackin'. I wrote two blog posts about it:
- Kicking the Tyres on the New DuckDB UI
- Exporting Notebooks from DuckDB UI
DuckDB also added support for reading from Amazon S3 Tables (Amazon’s managed Apache Iceberg offering).
Tobias Müller has started a Learning DuckDB newsletter.
DeepSeek released smallpond, a distributed DuckDB implementation that uses 3FS (also from DeepSeek) and Ray. This blog post gives a nice overview and assessment of it, including the summary "Is smallpond for me? tl;dr: probably not" :) Daniel Beach also takes a look at it with some code examples.
Robin Linacre writes about why DuckDB is his first choice for data processing.

Kafka (and other Event Streaming technologies) 🔗

Apache Kafka 4.0 has been released. See release notes, blog post, and download.
Iggy has joined the Apache Incubator. They also published an interesting post about benchmarking.
Speaking of benchmarking, Redpanda and StreamNative had a dust-up over a benchmark.
Stonemq is "A high-performance and efficient message queue developed in Rust" that "aims to outperform Kafka in scenarios with massive-scale queue clusters".
A post about the challenges of using Kafka as a queue. As the author notes, this starts to become moot once KIP-932 (Queues for Kafka) is released, which it now is with Apache Kafka 4.0. Gunnar Morling takes a look at the new functionality here.
An example of Kafka Record Encryption using Kroxylicious.
A bit of housekeeping from Trendyol in this blog post about Detecting and Managing Unused Topics in Kafka Clusters.
A very nice tool from Renato Mefi that visually simulates Kafka traffic. Similar to the one from SoftwareMill a few years ago.

Data Viz 🔗

A nice article from Amanda Makulec and Elijah Meeks about the "Fourth Wave" in Data Viz
Maybe box plots aren’t such a good idea after all.

Stream Processing 🔗

Apache Flink 2.0 has been released - check out the download and blog post.
Nearly ten years old and still a really useful article: Tyler Akidau’s Streaming 101: The world beyond batch
Yaroslav Tkachenko writes about how Apache DataFusion (a query engine) could be used to build a stream processing framework (just as Arroyo and others have already done).
Lalamove’s talk from Flink Forward Asia last year about their Flink architecture has been written up as a blog post.
Alibaba published their updated Lakehouse Storage Design for Fluss.
📅 Save the date - Flink Forward 2025 will be in Barcelona, October 15 & 16.
Flink and AI is a recurring theme, including in the recent Flink 2.0 release. This post gives an example of using LLMs with Flink.
Kafka Streams Topology Design (KSTD) released a new version.
I’ve been learning more about Flink SQL, and wrote three articles:

General Data Stuff 🔗

Analysis from El Reg on Microsoft’s release of the pg_documentdb_api and pg_documentdb_core Postgres extensions as open source.
Tons of fascinating detail in The Internals of PostgreSQL.
I wrote about Creating an HTTP Source connector on Confluent Cloud from the CLI.
A nifty blog post about using pydbzengine to interact with Debezium from Python, building out a CDC pipeline.
Debezium recently added support for SMTs written in Go.
Debezium 3.1.0.Beta1 includes the first release of Debezium Platform, which is designed to make it easier to run Debezium on Kubernetes.
Clue migrated from Redshift to Iceberg (with Trino and Spark), saving 60% in costs.
Apache Amoro is an open source project providing management for lakehouses. This blog post tries it out.
A really useful set of blog posts explaining what Apache Arrow is and why it’s so fast.

Data Architectures 🔗

A clear and thoughtful article from Jack Vanlightly looking at how data virtualization enabled by OTFs fits into data architectures now and in the future.
A useful primer and follow-up on data modelling.
Roche’s Maxim of Data Transformation: Data should be transformed as far upstream as possible, and as far downstream as necessary.
My colleague Matthew O’Keefe takes up the theme of data modelling in his blog post, and it is something that Joe Reis also talks about in his excellent post about The Tension of Orthodoxy and Speed.
I’m a big fan of BlueSky, and was interested to read this post about how they implement timelines and the tradeoffs involved.
Andrew Jones writes a newsletter and wrote about Data Contracts (1) (2) (3)
A punchy article from Ian Miell about why he is no longer talking to architects about microservices.

Other stuff 🔗

The trend for "landscape" posts/infographics in recent years can sometimes seem like an exercise in trying to shape reality to suit the world-view of a vendor, not to mention overwhelming the reader with the number of projects and technologies to try and comprehend. However, the Open Source Data Engineering Landscape that Alireza Sadeghi has put together is a pretty decent and comprehensive list, with solid analysis of each category.
Gunnar Morling is a good friend of mine—and an excellent blogger. He was recently interviewed about technical blogging and shares some useful tips.
Troubleshooting is a core skill. Learning how to do it properly, in a considered and logical manner, will benefit you.
Joe Reis recently opened a Discord server Practical Data that’s a friendly and lively place to chat about data stuff. Join here.
If you have a Garmin device you’ll find this fun. It lets you download all your data and analyse it yourself. It’s based on SQLite, and I’m keen to see if I can use it with DuckDB :)

And finally… 🔗

If you’ve never seen the Floppotron, it’s a thing of wonder.