Interesting links - July 2025

First up, allow me a shameless plug for my blog posts this month:

And with that, on to this month’s set of Interesting Links.

🔥 Not got time for all this? I’ve marked my top reads of the month :)
📧 Want to receive this monthly round-up as an email? Subscribe to my Substack where I cross-post the same content
🔗 Medium posts often skulk behind a gate, so I’ve hyperlinked to the Freedium version. You’ll see [Medium ↗] next to each link if you prefer the original.

Data Engineering

A nice hands-on example from Sanchit Vijay showing how to use Spark to move data from AWS S3 to Cloudflare R2.
Interesting details on the Apache DataFusion blog about embedding User-Defined Indexes in Apache Parquet Files.
🔥 My colleague Gilles Philippart wrote up a good guide on getting Apache Iceberg, Polaris, Trino, and MinIO running together locally.
In Lakehouse 2.0: The Open System That Lakehouse 1.0 Was Meant to Be, Animesh Kumar & Travis Thompson discuss the history of the lakehouse and its evolution from closed formats and ecosystems to open formats and interchangeable engines (Part 1 & Part 2).
Data contracts are a good idea, as is standardising them—of which the Open Data Contract Standard (ODCS) is an example.
Jaehyeon Kim looks at how Apache Kyuubi provides a gateway that connects end users and applications to multiple engines, including Flink, Spark, and Trino.

Daniel Beach shows how you can use Apache Iceberg on Databricks.
Amazon S3 now supports compaction for Avro and ORC file formats in Apache Iceberg tables.
🔥 Thomas Kejser a.k.a. The Database Doctor gives us his spicy take on Iceberg, The Right Idea - The Wrong Spec - Part 1: History / Part 2: The Spec. Some will argue that at times veracity takes the back seat to telling a rollicking good story, but it’s a fun read regardless of which side of the debate you sit on.
Rahul Joshi explains Delta Lake transaction logs.
Badal Prasad Singh tells us all about Iceberg Partitioning and Partitioning Writing Strategies.
Apache Polaris hit the big one-oh release (1.0.0), and Apache Iceberg got a dot release (1.9.2) with a release candidate (RC) for 1.10 in the works.

🔥 Debezium is widely used these days, making Vojtěch Juránek’s article about improving Debezium performance a useful reference.
Abhishek Vishnoi has a hands-on guide that shows how to implement a custom converter in Debezium.
An excellent blog post as always from Gunnar Morling, this time looking at Postgres Replication Slots: Preventing WAL Bloat and Other Production Issues.
Debezium 3.2.0.Final has been released.

Is it even an edition of Interesting Links if there’s not a new Kafka clone, either in a different language or deployed using a different architecture? This month, Ravi Atluri writes about xkafka — Kafka, but Simpler (for Go).
🔥 Mark Teehan has built a fancy way to send a file: Streaming Files Through Kafka topics.

Some people read manuals to learn, but if you’re like me and prefer learning by doing, the SQL Noir online game is a thriller in which you solve puzzles using SQL skills that you develop along the way. If this is your kind of thing you’ll also like this list of SQL games that SQL Noir has published.
Are data warehouses a good idea? Definitely. Does everyone need one on day one? Nope. Aleksei Aleinikov has some wise words on when the right time is—and isn’t—to build one.
🔥 Despite the "listicle" title—which would normally have me clicking away faster than Andy Byron can hide from a camera—this article from Bernd Wessely has some excellent points in it: Unlearning Data Architecture: 10 Myths Worth Killing.
Ben Dicken explains caching in this article with nice animations.
Dominik Tornow has a good analysis and commentary on the findings of the recent Jepsen testing of TigerBeetle.
Interesting papers:
- Anarchy in the Database: A Survey and Evaluation of Database Management System Extensibility (VLDB Volume 18, No. 6).
- CRDV: Conflict-free Replicated Data Views (Proc. ACM Manag. Data, Vol. 3, No. 1).
- Low-Latency Transaction Scheduling via Userspace Interrupts: Why Wait or Yield When You Can Preempt? (Proc. ACM Manag. Data, Vol. 3, No. 3).

Nothing to do with data, but stuff that I’ve found interesting or that has made me smile.

Elena Verna writes up her playbook for the first 30, 60, 90 days of a new job.
🔥 Charity Majors is one of my favourite writers, and her recent blog post is a great example. It’s genuine, it’s articulate (and it reminds me of real blogging that used to be the norm and is getting swamped these days in AI slop and SEO-chasing bullshit).
If you’re as old as me you’ll enjoy this blast of nostalgia courtesy of the Internet Archive’s GeoCities GIF search engine (and FTR, it’s always /ɡɪf/, never /dʒɪf/ 😜).

If you like these kinds of links you might like to read about How I Try To Keep Up With The Data Tech World (A List of Data Blogs)