Not got time for all this? I’ve marked 🔥 for my top reads of the month :)
First up, allow me a shameless plug for my blog posts this month:
And with that, on to this month’s set of Interesting Links.
Note: I’m linking out to Freedium versions of Medium posts this month, because Medium seems to be pay-walling a bunch of otherwise-freely accessible content. Yay for the open internet 🙄.
Data Engineering
-
A nice hands-on example from Sanchit Vijay showing how to use Spark to move data from AWS S3 to Cloudflare R2.
-
Interesting details on the Apache DataFusion blog about embedding User-Defined Indexes in Apache Parquet Files.
-
🔥 My colleague Gilles Philippart wrote up a good guide on getting Apache Iceberg, Polaris, Trino, and MinIO running together locally.
-
In Lakehouse 2.0: The Open System That Lakehouse 1.0 Was Meant to Be, Animesh Kumar & Travis Thompson discuss the history of the lakehouse and its evolution from closed formats and ecosystems to open formats and interchangeable engines (Part 1 & Part 2).
-
Data contracts are a good idea, as is standardising them—of which the Open Data Contract Standard (ODCS) is an example.
-
Jaehyeon Kim looks at how Apache Kyuubi provides a gateway between end users and applications, and multiple database engines including Flink, Spark, and Trino.
Open Table Formats & Catalogs
-
Daniel Beach shows how you can use Apache Iceberg on Databricks.
-
Amazon S3 now supports compaction for Avro and ORC file formats in Apache Iceberg tables.
-
🔥 Thomas Kejser a.k.a. The Database Doctor gives us his spicy take on Iceberg, The Right Idea - The Wrong Spec - Part 1: History / Part 2: The Spec. Some will argue that at times veracity takes the back seat to telling a rollicking good story—but it’s a fun read regardless of which side of the debate you sit.
-
Rahul Joshi explains Delta Lake transaction logs.
-
Badal Prasad Singh tells us all about Iceberg Partitioning and Partitioning Writing Strategies.
-
Apache Polaris hit the big one-oh release (1.0.0), and Apache Iceberg got a dot release (1.9.2) with a release candidate (RC) for 1.10 in the works.
CDC
-
🔥 Debezium is widely used these days, making Vojtěch Juránek’s article about improving Debezium performance a useful reference.
-
Abhishek Vishnoi has a hands-on guide that shows how to implement a custom converter in Debezium.
-
An excellent blog post as always from Gunnar Morling, this time looking at Postgres Replication Slots: Preventing WAL Bloat and Other Production Issues.
-
Debezium 3.2.0.Final Released
Kafka and Event Streaming
-
Is it even an edition of Interesting Links if there’s not a new Kafka clone, either in a different language or deployed using a different architecture? This month, Ravi Atluri writes about xkafka — Kafka, but Simpler (for Go).
-
🔥 Mark Teehan has built a fancy way to send a file: Streaming Files Through Kafka topics.
Stream Processing
-
André Santos writes about a new connector for Flink that does HTTP lookups and supports caching.
-
There are several interesting papers that have been published recently:
-
Streaming Time Series Subsequence Anomaly Detection: A Glance and Focus Approach (VLDB Volume 18, No. 6).
-
Oceanus: Enable SLO-Aware Vertical Autoscaling for Cloud-Native Streaming Services in Tencent (SIGMOD/PODS June 2025).
-
Streaming Democratized: Ease Across the Latency Spectrum with Delayed View Semantics and Snowflake Dynamic Tables (SIGMOD/PODS June 2025).
RDBMS and General Data Stuff
-
Some people read manuals to learn, but if you’re like me and like learning through doing, the SQL Noir online game is a thriller in which you solve puzzles using SQL skills that you develop along the way. If this is your kind of thing you’ll also like this list of SQL games that SQL Noir published.
-
Are data warehouses a good idea? Definitely. Does everyone need one on day one? Nope. Aleksei Aleinikov has some wise words on when the right time is—and isn’t—to build one.
-
🔥 Despite the "listicle" title—which would normally have me clicking away faster than Andy Byron can hide from a camera—this article from Bernd Wessely has some excellent points in it: Unlearning Data Architecture: 10 Myths Worth Killing.
-
Ben Dicken explains caching in this article with nice animations.
-
Dominik Tornow has a good analysis and commentary on the findings of the recent Jepsen testing of TigerBeetle.
-
Interesting papers:
-
Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility (VLDB Volume 18, No. 6).
-
CRDV: Conflict-free Replicated Data Views (Proc. ACM Manag. Data, Vol. 3, No. 1).
-
Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? (Proc. ACM Manag. Data, Vol. 3, No. 3).
Data in Action
-
Cloudflare - How TimescaleDB helped us scale analytics and reporting.
-
Klaviyo - Our Experience with Amazon Aurora Blue/Green Deployments.
-
Netflix - Netflix Tudum Architecture: from CQRS with Kafka to CQRS with RAW Hollow.
-
Atlassian - Migrating the Jira Database Platform to AWS Aurora.
-
Peloton - Modernizing Data Infrastructure.
-
Stifel - Building a modern data platform using AWS Glue and an event-driven domain architecture.
-
Pinterest - Next Gen Data Processing at Massive Scale (Part 1 of 2).
-
🔥 Datadog - How we built reliable log delivery to thousands of unpredictable endpoints.
-
Lion - How We Built the AWS Data & Analytics Platform (Part 1).
And finally…
Nothing to do with data, but stuff that I’ve found interesting or has made me smile.
-
Elena Verna writes up her playbook for the first 30, 60, 90 days of a new job.
-
🔥 Charity Majors is one of my favourite writers, and her recent blog post is a great example. It’s genuine, it’s articulate (and it reminds me of real blogging that used to be the norm and is getting swamped these days in AI slop and SEO-chasing bullshit).
-
If you’re as old as me you’ll enjoy this blast of nostalgia courtesy of the Internet Archive’s GeoCities GIF search engine (and FTR, it’s always /ɡɪf/, never /dʒɪf/ 😜).
Tip: If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs).